Does sampling design impact socio-economic classification?
Doing research in developing countries is not easy. However, with a bit of care and planning, one can do quality work which can have an impact on how much we know about the public health in poor countries and provide quality data where data is sadly scarce.
The root of a survey, however, is sampling. A good sample does its best to represent a population of interest and can at least qualify all of the ways in which it does not. A bad sample either 1) does not represent the population (bias) and offers no way to account for that or 2) comes with no idea of what it represents.
Without being a hater, my least favorite study design is the “school based survey.” Researchers like this design for a number of reasons.
First, it is logistically simple to conduct. If one is interested in kids, it helps to have a large number of them in one place. Visiting households individually is time consuming, expensive and one only has a small window of opportunity to catch kids at home since they are probably at school!
Second, since the time required to conduct a school based survey is short, researchers aren’t required to make extensive time commitments in developing countries. They can simply helicopter in for a couple of days and run away to the safety of wherever. Also, there is no need to manage large teams of survey workers over the long term. Data can be collected within a few days under the supervision of foreign researchers.
Third, school based surveys don’t require teams to lug around large diagnostic or sampling supplies (e.g. coolers for serum samples).
However, from a sampling perspective, assuming that one wishes to say something about the greater community, the “school based survey” is a TERRIBLE design.
The biases should be obvious. Schools tend to concentrate students who are similar to one another. Students are of similar socio-economic backgrounds, ethnicity or religion. Given the fee based structure of most schools in most African countries, sampling from schools will necessarily exclude the absolute poorest of the poor. Moreover, unless one goes out of the way to select more privileged private schools, one will exclude the wealthy, an important control if one wants to draw conclusions about socio-economic status and health.
Further, school based surveys are terrible for studies of health since the sickest kids won’t attend school. School based surveys are biased in favor of healthy children.
So, after this long intro (assuming anyone has read this far) how does this work in practice?
I have a full dataset of socio-economic indicators for approximately 17,000 households in an area of western Kenya. We collect information on basic household assets such as possession of TVs, cars, radios and type of house construction (a la DHS). I boiled these down into a single continuous measure, where each household gets a wealth “score” so that we can compare one or more households to others in the community (a la Filmer & Pritchett).
We also have a data set of school based samples from a malaria survey which comprises ~800 primary school kids. I compared the SES scores for the school based survey to the entire data set to see if the distribution of wealth for the school based sample compared with the distribution of wealth for the entire community. If they are the same, we have no problems of socio-economic bias.
We can see, however, from the above plot that the distributions differ. The distribution of SES scores for the school based survey is far more bottom heavy than that of the greater community; the school based survey excludes wealthier households. The mean wealth score for the school based survey is well under that of the community as a whole (-.025 vs. -.004, t=-19.32, p<.0001).
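For anyone who wants to run the same kind of comparison, here is a minimal sketch of a two-sample t statistic. I am assuming Welch's unequal-variance version here, which may not be the exact test behind the numbers above, and the scores are made up for illustration; they are not the Kenya data. It also treats the two groups as independent, which is only roughly true when the school kids come from the same community.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard error of each mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical wealth scores, for illustration only.
school_scores = [-0.05, -0.04, -0.03, -0.03, -0.02, -0.02, -0.01, 0.00]
community_scores = [-0.06, -0.04, -0.02, -0.01, 0.00, 0.01, 0.03, 0.05, 0.08, 0.12]

t_stat, dof = welch_t(school_scores, community_scores)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A negative t here just means the first group (the school sample) has the lower mean score.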
Just from this, we can see that the school based survey is likely NOT representative of the community and that the school based sample is far more homogeneous than the community from which the kids are drawn.
Researchers find working with continuous measures of SES unwieldy and difficult to present. To solve this problem, they will often place households into socio-economic "classes" by dividing the data set up into quantiles. These will represent households which range from "ultra poor" to "wealthy." A problem with samples is that these classifications may not be the same over the range of samples, and only some of them will accurately reflect the true population level classification.
In this case, when looking at a table of how these classes correspond to one another, we find the following:

Assuming that these SES “classes” are at all meaningful (another discussion), we can see that for all but the wealthiest households more than 80% of households have been misclassified! Further, due to the sensitivity of the method (multiple correspondence analysis) used to create the composite, 17 of households classified as “ultra poor” in the full survey have suddenly become “wealthy.”
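To see how the cut points alone can produce this kind of mismatch, here is a small sketch with invented numbers. It assumes five classes (the three middle labels are my own placeholders) and a sample drawn only from the poorer half of a hypothetical population. It only illustrates the shift in quantile cut points; in the real analysis the composite weights are also recomputed on the sample, which this toy example ignores.

```python
import bisect
import statistics

# Only "ultra poor" and "wealthy" come from the text; the rest are placeholders.
LABELS = ["ultra poor", "poor", "middle", "better off", "wealthy"]

def quintile_cutoffs(scores):
    """Four cut points that split the scores into five equal-sized groups."""
    return statistics.quantiles(scores, n=5)

def classify(score, cutoffs):
    """Map a wealth score to a class label using the given cut points."""
    return LABELS[bisect.bisect_right(cutoffs, score)]

# Hypothetical population: 1,000 evenly spaced wealth scores in [0, 1).
population = [i / 1000 for i in range(1000)]
# Hypothetical biased sample: every fifth household from the poorer half only.
sample = population[:500:5]

pop_cuts = quintile_cutoffs(population)
sample_cuts = quintile_cutoffs(sample)

# Classify the same sampled households twice and count the agreements.
agree = sum(classify(s, pop_cuts) == classify(s, sample_cuts) for s in sample)
share_misclassified = 1 - agree / len(sample)
print(f"misclassified: {share_misclassified:.0%}")
```

The point is simply that "wealthy" within a truncated sample can be "middle" in the community it was drawn from.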
Now, whether these misclassifications impact the results of the study remains to be seen. It may be that they do not. It also may be the case that investigators may not be interested in drawing conclusions about the community and may only want to say something about children who attend particular types of schools (though this distinction is often vague in practice). Regardless, sampling matters. A properly designed survey can improve data quality vastly.
What are we talking about when we discuss socio-economic position and health in developing countries?
A wide body of literature has found that socio-economic position (SEP) has profound impacts on the health status of individuals. Poor people are sicker than rich people. We find this relationship all over the world and in countries like the United States, it couldn’t be more apparent.
Poor people, particularly poor minorities, are more likely to see their children die, are more likely to be obese, have worse cardiac outcomes, develop cancer more often, are disproportionately afflicted by infectious diseases and die earlier than people who are not poor. There is ample evidence to support this.
However, the exact factors which lead to this disparity are up for debate. Some focus on issues of lifestyle, diet, neighborhood effects and access to health care. Poor people, particularly minorities, live hard, eat worse, live in dangerous or toxic environments and have low access to quality care, all contributing to a perfect storm of dangerous health risks.
However, even when controlling for all or any of these factors, we still find that poor people, and particularly African-Americans, still get sick more often, get sicker and die earlier. This leads us to speculate that health disparities are not simply a matter of access to material goods which promote good health, but are tightly related to something less tangible, such as social marginalization and racism, which are both incredibly difficult to measure. Though difficult to quantify, however, we do have plenty of well documented qualitative and historical data which indicate that these relationships are entirely plausible.
The awful history of slavery and apartheid, however, is somewhat (but not completely) unique to the United States. Further, our ideas of class come from another Western idea, the Marxist concept of one privileged group exploiting the weak for their own financial gain, particularly in the context of manufacturing.
Yet, though these ways of conceiving of race and class are so specific to the West, they are applied liberally to analyses of developing country health, with little consideration of their validity.
It is not uncommon to see studies of socio-economic status and health. The typical method of measuring socio-economic status in developing countries is to examine the collection of household assets such as TVs, radios, bicycles, etc. and, using statistically derived weights, sum up all of the things a household owns and call that sum a total measure of wealth. The collection of total measures for each household are then divided into categories, with the implication that they roughly approximate our conception of class.
Not surprisingly, it is usually found that people who don’t own much are, compared with people who do, at higher risk for malaria, TB, diarrheal disease, infant and maternal mortality and a host of other things that one wouldn’t wish on anyone.
But this measure is problematic. First, there is often little care taken to parse out which items are related to the disease of interest. For example, we would expect that better housing conditions are associated with a decreased risk for malaria, since mosquitoes aren’t able to enter a house at night. We would also expect that people with access to clean water would be more likely to not get cholera. If we find relationships of SEP with malaria or diarrheal disease which include these items, these associations should be treated with suspicion.
Second, if we do find a relationship of “class” with health, can we view it in the same way in which we might view this relationship in the United States? A Marxist approach, with a few exploiting the many for profit, in sub-Saharan Africa doesn’t make a whole lot of sense. The manufacturing capacity of African countries is tiny, and most people are sole entrepreneurs operating in an economy that hasn’t changed appreciably from pre-colonial times. Stripping away any requirements of legal protection of property rights, Africa looks incredibly libertarian.
Further, the elite in Africa hardly profit financially from the poor, receiving their cash flows mainly from abroad in the form of foreign aid or bribery. Foreign activity is mostly limited to resource exploitation, which doesn’t make a dent in Africa’s vast levels of unemployment. While the West is certainly complicit in Africa’s economic woes, post slavery, the West rarely engages Africans themselves.
So, is it valid to attempt to apply the same ideas of class to African health problems? Is there a way to attribute health disparities to class in societies with limited economic capacity and where the “citizenry” is only marginally engaged and groups suffer mainly from a reluctance to cooperate and engage people of other tribes or neighboring countries?
Certainly, the causes of poverty and marginalization in Africa need to be examined, but I don’t think that we can approach them in the same way we do in the States.
The mismeasurement of humans: classification as “othering”
I was part of a short, but interesting discussion last night regarding this very good article on the political implications of data analysis. The argument made (assuming I understood it correctly) was simply that statistical measures are inherently ideological since they impose a particular view of the world from one social group (us, the elite) on another (the non-elite). She takes this further, stating that though the voice of the elite can be heard through anecdotes (and opinionated blog posts), the experience of the non-elite relies on statistics and numbers. Statistics, then, is the language of power.
The conversation went further to discuss the implications of statistical methods themselves, particularly the measures of central tendency: the mean, median and mode. With perfectly symmetrical data, these measures are all the same, but, of course, no set of data is perfectly symmetrical, so that the application of each will produce different results. Though any responsible statistician would make statements of assumptions, limitations and appropriateness, with politics, these statements are overlooked and the method chosen is often that which best supports one’s political position, asking for trouble.
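A quick illustration of how far apart the three measures can land on skewed data. The incomes below are invented and the units are arbitrary; this is only a sketch of the arithmetic, not data from any study.

```python
import statistics

# Hypothetical, right-skewed household incomes (arbitrary units).
incomes = [1, 1, 1, 2, 2, 3, 4, 6, 10, 50]

mean = statistics.mean(incomes)      # pulled upward by the single large value
median = statistics.median(incomes)  # midpoint of the sorted values
mode = statistics.mode(incomes)      # most frequent value

print(mean, median, mode)  # prints: 8 2.5 1
```

Three defensible "averages" from the same ten numbers, which is exactly the room for maneuver the argument is worried about.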
Moreover, the measure of central tendency itself is inherently flawed since it concentrates on the center and silences the extremes, supporting the status quo, or so it was argued. The choice of measure, I would argue, depends on the goals of the particular study. For example, a study which sought to determine if average graduation rates are lower for blacks than whites would necessarily use a measure of central tendency, while a study on which students in a particular school are the least likely to graduate might look at outliers and extremes.
Either way, I agreed with the writer that, no matter what, we are influenced by our ideology. However, there is a difference between performing a study which seeks to maintain impartiality for the greater good and one which seeks to deceive in order to merely win a political battle, particularly among those who benefit from marginalizing, for example, the poor and disenfranchised.
However, I found this passage quite interesting and it can be applied to a post on this blog regarding what we do and don’t know about the poor:
Perhaps statistics should be considered a technology of mistrust—statistics are used when personal experience is in doubt because the analyst has no intimate knowledge of it. Statistics are consistently used as a technology of the educated elite to discuss the lower classes and subaltern populations, those individuals that are considered unknowable and untrustworthy of delivering their own accounts of their daily life. A demand for statistical proof is blatant distrust of someone’s lived experience. The very demand for statistical proof is otherizing because it defines the subject as an outsider, not worthy of the benefit of the doubt.
Part of my academic work focuses on the refinement of measurements of poverty. I am keenly aware of the “othering” of this process and how these measurements use a language of the educated elite (me) to speak for the daily experiences of people not like me.
This “othering” is not limited to statistics at all. Even merely referring to “the poor” is a condescending labeling of a group of people who are mostly powerless to speak for themselves within global power structures. Moreover, “the poor” ignores the diverse and varied experiences of most of humanity.
When I first entered the School of Public Health at UM, I was extremely uncomfortable with the language used in studies of ethnicity and public health in the United States. Studies would simply throw people into simplistic categories of black, white, hispanic, asian and “other” (whatever that is), ignoring the great diversity of people within, for example, urban slums. The method of categorization seemed to be a horrible anachronism and brought back awful memories of Mississippi. Simply putting people into neat categories risked continuing an already divisive view of the world.
However, the more I thought about it, the method is justified since we are looking at the effects of a racist view of the world on the very people who are the most burdened by it. Certainly, there are better ways of viewing the world, but when criticizing social power structures, it can be advantageous to speak its language. I still don’t like it, but I’m at least more understanding of it.
It’s a fine line to walk. On the one hand, as advocates for “the poor,” we have to work within the very structures which oppress, exploit and ignore them. To succeed, however uncomfortable it may be, we may be required to adopt the language of those structures. On the other, we must remain aware of the potentially dire implications of the ways in which we describe those we advocate for and how they can be misused.
We have no idea what poor people do
I was just reading a post from development economist Ed Carr’s blog, where he reflects on a book he wrote almost five years ago. Reflection is a pretty depressing exercise for any academic, but Carr seems to remain positive about his book.
He sums it up in three points:
“1. Most of the time, we have no idea what the global poor are doing or why they are doing it.
2. Because of this, most of our projects are designed for what we think is going on, which rarely aligns with reality
3. This is why so many development projects fail, and if we keep doing this, the consequences will get dire”
Well, yeah. This is a huge problem. In academics, we filter the experiences of the poor through a lens of academic frameworks, which we haphazardly impose with often no consultation with our subjects. Granted, this is likely inevitable, but when designing public health interventions, it helps to have some idea of what the poorest of the poor do and why, or our efforts are doomed to fail.
I remember a set of arguments a few years back on bed nets. Development and public health people were all upset because people were seen using nets for fishing. The reaction, particularly from in country workers, was that poor people are stupid and will shoot themselves in the foot at any opportunity.
I couldn’t really understand the condescension and was rather fascinated that people were taking a new product and adapting it to their own needs. Business would see this as an opportunity and would seek to figure out why people were using nets for things other than malaria prevention and attempt to develop some new strategy to satisfy both needs (fishing and malaria prevention) at once. Academics simply weren’t interested.
To work with the poor, we have to understand them and understanding them requires that we respect their agency. If we don’t do this, we risk alienating the people we seek to help.
Measuring socio-economic status in Kenya
I was just screwing around with some data we collected a bit ago. In a nutshell, I’m working to try to improve the way we measure household wealth in developing countries. For the past 15 years, researchers have relied on a composite index based upon easily observable household assets (a la Filmer/Pritchett, 2001).
Enumerators enter a household and quickly note the type of house construction, toilet facilities and the presence of things like radios, TVs, cars, bicycles, etc. Principal Components Analysis (PCA) is then used to create a single continuous measure of household wealth, which is then often broken into quantiles to somewhat appeal to our sense of class and privilege or lack thereof.
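For readers who have not seen one, here is a rough sketch of how such an index is typically built. The asset columns and households are hypothetical, and the standardize-then-take-the-first-component recipe is my assumption of the usual approach rather than any survey program's exact code.

```python
import numpy as np

def pca_asset_index(assets):
    """Score each household on the first principal component of its assets."""
    X = np.asarray(assets, dtype=float)
    # Standardize each indicator so no single asset dominates by scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decompose the correlation matrix; eigh sorts eigenvalues ascending.
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    weights = eigvecs[:, -1]
    # A component's sign is arbitrary; orient it so owning more scores higher.
    if weights.sum() < 0:
        weights = -weights
    explained = eigvals[-1] / eigvals.sum()
    return Z @ weights, weights, explained

# Hypothetical 0/1 indicators; columns are radio, TV, bicycle, car.
households = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
scores, weights, explained = pca_asset_index(households)

# Break the continuous score into quintile "classes" (0 = poorest).
quintile_edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
wealth_class = np.digitize(scores, quintile_edges)
print(scores.round(2), wealth_class, round(float(explained), 2))
```

The `explained` value is the share of total variance carried by the first component, which is worth reporting alongside any index like this.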
It’s a quick and dirty measure that’s almost universally used in large surveys in developing countries. It is the standard for quantifying wealth for Measure DHS, a USAID funded group which does large surveys in developing countries everywhere.
First, I take major issue with the use of PCA to create the composite. PCA assumes that inputs are continuous and normally distributed but the elements of the asset index are often dichotomous (yes/no) or categorical. Further, PCA is extremely sensitive to variations in the level of normality of the elements used, so that results will vary wildly depending on whether you induce normality in your variables or not.
It’s silly to use PCA on this kind of data, but people do it anyway and feel good about it. I’m sure that some of the reason for this is the inclusion of PCA in SPSS (why would anyone ever use SPSS (or PASW or whatever it is now)? a question for another day…)
So… we collected some data. I created a 220 question survey which asked questions typical of the DHS surveys, in addition to non-sensitive questions on household expenditures, income sources, non-observable assets like land and access to banking services and financial activity.
The DHS focuses exclusively on material assets mainly out of convenience, but also because of the assumption that assets held today represent purchases in the past, which can act as pretty rough indicators of household income. So I started there and collected what they collect in addition to all my other stuff.
This time, however, I abandoned PCA and opted for Multiple Correspondence Analysis, a technique similar to PCA but intended for categorical data. The end result is similar. You get a set of weights for each item, which (in this case) are then tallied up to create a single continuous measure of wealth (or something like it) for each household in the data set.
Like PCA, the results are somewhat weak. The method only captured about 12% of the variation in the data set, which sort of raises the question as to what is happening with the other 88%. However, we got a cool graph which you can see up on the left. If you look closely, you can see that the variables used tend to follow an intuitive gradient of wealth, running from people who don’t have anything at all and shit in the shrubs to people who have cars and flush toilets.
We surveyed three areas, representing differing levels of development. Looking at how wealth varies by area, we can see that there is one very poor area, with very little variation, while the others have somewhat more spread and a mean level of wealth that is considerably higher. All of this agreed with intuition.
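If you want to see the mechanics, here is a bare-bones sketch of MCA done as a correspondence analysis of the 0/1 indicator matrix. The categories and households are invented, the function and the `positive_col` argument are my own naming, and a dedicated MCA package would do more (for example, inertia corrections), so treat this as an illustration rather than the code behind my results.

```python
import numpy as np

def mca_first_dimension(indicator, positive_col):
    """First MCA dimension from a 0/1 indicator matrix (one column per category)."""
    Z = np.asarray(indicator, dtype=float)
    P = Z / Z.sum()                # correspondence matrix
    r = P.sum(axis=1)              # household (row) masses
    c = P.sum(axis=0)              # category (column) masses
    expected = np.outer(r, c)      # what independence would predict
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    share = sing[0] ** 2 / np.sum(sing ** 2)  # share of inertia on dimension 1
    weights = Vt[0] / np.sqrt(c)              # category weights (standard coords)
    scores = U[:, 0] * sing[0] / np.sqrt(r)   # household scores (principal coords)
    # The sign is arbitrary; flip so the chosen "rich" category weighs positive.
    if weights[positive_col] < 0:
        weights, scores = -weights, -scores
    return scores, weights, share

# Hypothetical categories, in column order:
# toilet none / pit / flush, car no / yes, TV no / yes
indicator = [
    [1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 1],
]
scores, weights, share = mca_first_dimension(indicator, positive_col=4)
print(scores.round(2), round(float(share), 2))
```

Each household's score works out to the average of the weights of the categories it falls into, which is the "tallying up" described above.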
“Area A” is known to be very rural, isolated and quite poor. Areas B and C are somewhat better though they are somewhat different contextually.
My biggest question, though, was whether a purely asset based index can truly represent a household’s financial status. I wondered whether large expenditures on things like school fees and health care might actually depress the amount of money available to buy material items.
Thus, we also collected data on common expenditures such as school fees and health care, but also on weekly purchases of cell phone airtime. Interestingly, we found that overall, expenditures and assets were positively correlated with one another, suggesting that higher expenses do not depress the ability of households to make purchases, but found that this relationship does not hold among very poor households.
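A toy version of that check, with invented numbers: compute the correlation across all households, then again within the poorest group only. The score threshold used to define "poorest" is a placeholder of mine; in the actual data the comparison was by area.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (asset score, weekly expenditure) pairs.
households = [
    # poorest households: spending varies but does not track assets
    (-1.0, 5), (-0.9, 2), (-0.8, 6), (-0.7, 2), (-0.6, 5),
    # better-off households: spending rises with assets
    (0.2, 8), (0.5, 11), (0.9, 13), (1.4, 18), (2.0, 22),
]

assets = [a for a, _ in households]
spending = [s for _, s in households]
r_all = pearson_r(assets, spending)

# Placeholder definition of "very poor": asset score below zero.
poor = [(a, s) for a, s in households if a < 0]
r_poor = pearson_r([a for a, _ in poor], [s for _, s in poor])

print(f"overall r = {r_all:.2f}, poorest r = {r_poor:.2f}")
```

A strong pooled correlation can sit on top of a subgroup where there is no relationship at all, which is why it is worth splitting the data before reading much into the overall number.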
There is nothing to suggest that high expenses are having a negative effect on material assets among extremely poor households located in Area A at all. It might be the case that there is no relationship at all. This could indicate something else. Though overall there might not be a depressive effect of health care and school costs on material purchases, they might be preventing households from improving their situation. It might only be after a certain point that the two diverge from one another and households are then able to handle paying for both effectively.
Also of interest were the similar patterns found in the three areas.