Does sampling design impact socio-economic classification?
Doing research in developing countries is not easy. However, with a bit of care and planning, one can do quality work that expands what we know about public health in poor countries and provides quality data where data is sadly scarce.
The root of a survey, however, is sampling. A good sample does its best to represent a population of interest and can at least qualify all of the ways in which it does not. A bad sample either 1) fails to represent the population (bias) with no way to account for it, or 2) represents no identifiable population at all.
Without being a hater, my least favorite study design is the “school based survey.” Researchers like this design for a number of reasons.
First, it is logistically simple to conduct. If one is interested in kids, it helps to have a large number of them in one place. Visiting households individually is time-consuming and expensive, and one only has a small window of opportunity to catch kids at home since they are probably at school!
Second, since the time required to conduct a school based survey is short, researchers aren’t required to make extensive time commitments in developing countries. They can simply helicopter in for a couple of days and run away to the safety of wherever. Also, there is no need to manage large teams of survey workers over the long term. Data can be collected within a few days under the supervision of foreign researchers.
Third, school based surveys don’t require teams to lug around large diagnostic or sampling supplies (e.g. coolers for serum samples).
However, from a sampling perspective, assuming that one wishes to say something about the greater community, the “school based survey” is a TERRIBLE design.
The biases should be obvious. Schools tend to concentrate students who are similar to one another. Students tend to share socio-economic background, ethnicity or religion. Given the fee-based structure of most schools in most African countries, sampling from schools will necessarily exclude the absolute poorest of the poor. Moreover, unless one goes out of one's way to select the more privileged private schools, one will also exclude the wealthy, an important control if one wants to draw conclusions about socio-economic status and health.
Further, school based surveys are terrible for studies of health since the sickest kids won't attend school. School based surveys are biased in favor of healthy children.
So, after this long intro (assuming anyone has read this far) how does this work in practice?
I have a full dataset of socio-economic indicators for approximately 17,000 households in an area of western Kenya. We collect information on basic household assets such as possession of TVs, cars and radios, and type of house construction (a la DHS). I boiled these down into a single continuous measure, where each household gets a wealth "score," so that we can compare one or more households to others in the community (a la Filmer & Pritchett).
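As a rough illustration, here is a minimal sketch of how such an asset-based wealth score can be constructed. The composite in this post was actually built with multiple correspondence analysis; the sketch below substitutes the first principal component of dummy-coded assets, in the spirit of Filmer & Pritchett, and the file and column names are hypothetical.

```python
# Minimal sketch: an asset-based wealth score from household indicators.
# The post's actual composite used multiple correspondence analysis (MCA);
# here the first principal component of dummy-coded assets stands in for it,
# in the spirit of Filmer & Pritchett. File and column names are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA

households = pd.read_csv("household_assets.csv")  # hypothetical input
asset_cols = ["has_tv", "has_car", "has_radio", "wall_material", "roof_material"]

# One-hot encode the asset variables so every indicator is 0/1
X = pd.get_dummies(households[asset_cols], drop_first=True)

# Take the first component as each household's wealth "score"
pca = PCA(n_components=1)
households["ses_score"] = pca.fit_transform(X)[:, 0]
```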
We also have a data set of school based samples from a malaria survey comprising ~800 primary school kids. I compared the SES scores for the school based survey against the entire data set to see how the distribution of wealth for the school based sample compares with the distribution of wealth for the entire community. If they are the same, we have no problem of socio-economic bias.
We can see, however, from the above plot that the distributions differ. The distribution of SES scores for the school based survey is far more bottom-heavy than that of the greater community; the school based survey excludes wealthier households. The mean wealth score for the school based survey is well under that of the community as a whole (-.025 vs. -.004, t=-19.32, p<.0001).
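A comparison of this sort could be run roughly as follows. The exact test used above isn't spelled out, so this sketch assumes Welch's t-test, and the in_school_sample flag linking the two data sets is an assumption.

```python
# Hypothetical sketch: compare school-sample SES scores with the full
# community. Continues from the wealth-score sketch above; the boolean
# column 'in_school_sample' is an assumed way of flagging the ~800 kids.
from scipy import stats

community = households["ses_score"]
school = households.loc[households["in_school_sample"], "ses_score"]

# Welch's t-test for a difference in mean wealth score
t, p = stats.ttest_ind(school, community, equal_var=False)
print(f"school mean={school.mean():.3f}, community mean={community.mean():.3f}")
print(f"t={t:.2f}, p={p:.4g}")
```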
Just from this, we can see that the school based survey is likely NOT representative of the community and that the school based sample is far more homogeneous than the community from which the kids are drawn.
Researchers find working with a continuous measure of SES unwieldy and difficult to present. To solve this problem, they will often place households into socio-economic "classes" by dividing the data set into quantiles. These represent households ranging from "ultra poor" to "wealthy." The problem is that these classifications may not be the same from sample to sample, and only some of them will accurately reflect the true population-level classification.
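In code, the classification step might look like the sketch below (continuing from the hypothetical objects above); the quintile split and class labels are illustrative, not necessarily the ones used here.

```python
# Sketch: carve the continuous wealth score into SES "classes" by quintile.
# Labels are illustrative; the post's exact cut points are not given.
labels = ["ultra poor", "poor", "middle", "less poor", "wealthy"]
households["ses_class_full"] = pd.qcut(households["ses_score"], q=5, labels=labels)

# Re-cutting within the school sample alone produces its own break points,
# which is where classes can drift apart from the population-level ones.
school_df = households[households["in_school_sample"]].copy()
school_df["ses_class_school"] = pd.qcut(school_df["ses_score"], q=5, labels=labels)
```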
In this case, when looking at a table of how these classes correspond to one another, we find the following:

Assuming that these SES “classes” are at all meaningful (another discussion), we can see that for all but the wealthiest class more than 80% of households have been misclassified! Further, due to the sensitivity of the method (multiple correspondence analysis) used to create the composite, 17 households classified as “ultra poor” in the full survey have suddenly become “wealthy.”
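The table behind this claim is essentially a cross-tabulation of the two class assignments, something like the sketch below, which continues from the hypothetical objects defined earlier.

```python
# Sketch: cross-tabulate full-survey classes against school-sample classes
# for the kids in the school sample. Off-diagonal cells are misclassified.
xtab = pd.crosstab(school_df["ses_class_full"],
                   school_df["ses_class_school"],
                   normalize="index")  # row proportions
print(xtab.round(2))
```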
Now, whether these misclassifications impact the results of the study remains to be seen. It may be that they do not. It may also be that investigators are not interested in drawing conclusions about the community and only want to say something about children who attend particular types of schools (though this distinction is often vague in practice). Regardless, sampling matters. A properly designed survey can vastly improve data quality.
Hello, thank you for the interesting post!
Though the distributions look fairly similar in shape to me, I am wondering whether you have considered the possible increase in the standard error of the estimate (for your target parameter) due to the reduced variance in the predictors/covariates. Another issue is the sampling unit: using the household is getting harder everywhere in telephone sampling because of the widespread use of mobile phones. Another way to get around a non-representative sample is to redefine the target population. In any case, the process of study and sample design requires time and background research, and I hope researchers understand that…
The biggest problem is that, since the distribution of wealth in most communities is highly skewed, restricting the sample to certain schools (which are self-selected entities) increases the chances of excluding the extremes (or even just specific groups, as you imply with the example of telephone sampling). Without a concerted attempt to account for this source of bias, the results of the study will have to be treated with some level of suspicion.
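A toy simulation makes the point concrete; the skewed distribution and the exclusion thresholds below are made up purely for illustration.

```python
# Toy simulation (made-up numbers): when wealth is right-skewed and a
# school-based sample drops both extremes, the sample mean and variance
# both drift away from the community values.
import numpy as np

rng = np.random.default_rng(0)
community = rng.lognormal(mean=0.0, sigma=1.0, size=17_000)  # skewed "wealth"

# Crude stand-in for school-based selection: drop the poorest 5%
# (never enrolled) and the richest 10% (attend elite private schools).
lo, hi = np.quantile(community, [0.05, 0.90])
school_like = community[(community > lo) & (community < hi)]

print(f"community:   mean={community.mean():.2f}, sd={community.std():.2f}")
print(f"school-like: mean={school_like.mean():.2f}, sd={school_like.std():.2f}")
```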
Unfortunately, it is unclear whether researchers consider these sources of bias, given that they often aren't mentioned at all in published research.
Taking a bit of care before a study begins can vastly improve the quality of the final product (and even save a bit of money and time).
I am also wondering whether researchers even compute the sample size for a given power in order to test their hypotheses, or account for selection/nonresponse rates.
I feel like they assume a Simple Random Sample and data Missing Completely at Random…
However, SRS can be the most expensive way to sample unless you have a population frame and the cost of interviews/travel is trivial (which is very unlikely in practice).
I think PIs had better know that the textbook STATS 100 sample is the goal, and that a method to adjust for deviation from that goal does not necessarily erase the flaw. Besides, such methods are often studied across various kinds of cases, with the effect concluded from the majority of the results, so one particular case at hand may not yield the expected adjustment.
Here, sampling variation comes in again…
I hardly understand why most people associate complexity/unclarity with being smart. Is this Plato’s cave syndrome?