A complex sampling journey

Sampling scheme

When I have been a part of surveys in the past, little attention has been paid to following any sort of intelligent sampling plan. The time allotted to collect the data has always been too short, resources extremely limited and the conditions of the field mostly unknown.

Mostly what we’re left with is a convenience sample of some kind, usually determined by introductions from the survey workers themselves. It is absolutely the worst way to run a survey and the data is usually crap, but, worse yet, unverifiable crap.

Ideally, in a household level survey, we’d establish target areas for sampling, do a complete census of those areas and then perhaps take a random sample within them. At the minimum this would be a relatively decent approach.

Unfortunately, I often encounter one of two situations. The first is the convenience sample I mentioned above, which is inherently biased toward the social connections, and thus the demographic, of the survey workers themselves. If you want to do a sample of someone’s friends and family, this might be a good start; otherwise it’s completely awful.

The second is the “school based survey,” a design I think I hate more than all others. This travesty of sample design depends on the good graces of families which send their children to school, on the luck of having the kids you are interested in show up to school the day of the survey, and on reasonable connections with school administrations. Worse yet, if you’re doing a survey on health, the chances that you’ll find the kids you’re really interested in at school are really low. People love this awful design because it’s convenient, cheap, can be done in a short time and has the added benefit of providing one with warm feelings.

I’ve resolved to do neither of these again. As the manager of a Health and Demographic Surveillance System which monitors more than 100,000 people in two regions of Kenya, I decided I have a unique opportunity to do something a little more interesting.

In gearing up for a pilot survey to improve measurement of socio-economic status in developing country contexts, I realized that I had an incredible set of resources at my disposal. I have a full sampling frame on two sets of 50,000 people in two areas of Kenya, basic demographic information and a competent staff with enough time to take on a project which would otherwise interfere with their regular duties.

With some help from a friend (well, much more than a friend), I worked my way through the basics of complex survey design and came up with something that might work relatively well for my purposes.

The DSS in the area of Kwale, Kenya, where I’m working, is divided into nine areas, each delegated to a single field interviewer who visits each of the households three times a year. Each field interviewer area is then divided into a number of subgrids, the number of which arbitrarily follows the population surveyed and the logistics of the survey rounds. Some areas are easier to survey than others. Each grid then has a number of households within it, which varies depending on population density.

I want to target three areas, each of which ostensibly will represent different levels of economic development, but in reality represent different types of economic activities and lifestyles. One is relatively urbanized, another is purely agricultural and the third is occupied by agro-pastoralists who keep larger herds of large animals.

I then decided to choose 20 grids in each area at random, and then to select up to 10 households from each selected grid, again at random. The reason for choosing this strategy was purely a logistic one. Survey workers can do about 10 households in a day and I’ve given them a month (20 working days) to do it before they have to start on their next round of regular duties, which works out to roughly one grid per day. Normally, I’d like to do something fancier, but without any previous data on the variables I’m interested in, it just wasn’t possible.
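The two-stage draw described above can be sketched in a few lines. This is a minimal Python illustration, not the code used for the actual survey: the frame layout (`{area: {grid_id: [household_id, ...]}}`), the toy household counts, and the function names `make_toy_frame` and `two_stage_sample` are all assumptions made up for the example, and both stages are assumed to be simple random samples without replacement.

```python
import random


def make_toy_frame(areas=("urban", "agricultural", "agro-pastoral"),
                   grids_per_area=30, seed=1):
    """Build a made-up sampling frame: {area: {grid_id: [household_id, ...]}}."""
    rng = random.Random(seed)
    frame = {}
    for area in areas:
        frame[area] = {}
        for grid in range(grids_per_area):
            # Household counts vary by grid, standing in for population density.
            n_households = rng.randint(4, 40)
            frame[area][grid] = [f"{area}-{grid}-{h}" for h in range(n_households)]
    return frame


def two_stage_sample(frame, n_grids=20, n_households=10, seed=0):
    """Draw grids within each area, then households within each drawn grid."""
    rng = random.Random(seed)
    sample = {}
    for area, grids in frame.items():
        # Stage 1: simple random sample of grids within the stratum (area).
        chosen = rng.sample(sorted(grids), min(n_grids, len(grids)))
        sample[area] = {}
        for grid in chosen:
            households = grids[grid]
            # Stage 2: up to n_households households from the chosen grid;
            # grids with fewer households are taken in full.
            k = min(n_households, len(households))
            sample[area][grid] = rng.sample(households, k)
    return sample


frame = make_toy_frame()
sample = two_stage_sample(frame)
for area, grids in sample.items():
    total = sum(len(h) for h in grids.values())
    print(area, len(grids), "grids,", total, "households")
```

Fixing the seed makes the draw reproducible, which helps when the list of selected households has to be handed to field staff and later audited.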

I have discovered that this design is called a stratified two stage cluster design, which makes it all sound fancier than I really believe it to be. The advantage of using this design is that I know the selection probabilities and can account for them; left unaccounted for, they can bias the results of statistical tests. I have no doubt that the piss poor strategies I’ve used in the past and the dreaded “school based survey” I mentioned above are horribly biased and don’t really tell us a whole lot about whatever it is we’re trying to find out.

I used the survey package in R to determine the selection probabilities and, as I suspected, found that the probability of selection is not uniform across the sampling frame. Some households are more likely to be included in the survey, biasing the data in favor of, for example, people in more densely populated areas.
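To see why the probabilities are not uniform, a hand calculation helps. The sketch below is Python rather than the R survey package mentioned above, and it assumes simple random sampling without replacement at both stages; the function names and the example counts (an area with 40 grids, grids of 10 and 50 households) are hypothetical. Under those assumptions a household's selection probability is the chance its grid is drawn times the chance the household is drawn within that grid, and the sampling weight is the inverse of that probability.

```python
def selection_probability(n_grids_in_area, n_households_in_grid,
                          n_grids_sampled=20, n_households_sampled=10):
    """Probability that one household is selected under the two-stage design."""
    # Stage 1: chance that the household's grid is among the sampled grids.
    p_grid = min(n_grids_sampled, n_grids_in_area) / n_grids_in_area
    # Stage 2: chance that the household is picked within its grid.
    p_household = (min(n_households_sampled, n_households_in_grid)
                   / n_households_in_grid)
    return p_grid * p_household


def sampling_weight(n_grids_in_area, n_households_in_grid, **kwargs):
    """Inverse-probability weight: households each sampled one stands for."""
    return 1.0 / selection_probability(n_grids_in_area, n_households_in_grid,
                                       **kwargs)


# Hypothetical area with 40 grids: one grid has 10 households, another has 50.
p_small = selection_probability(40, 10)
p_large = selection_probability(40, 50)
print(p_small, sampling_weight(40, 10))
print(p_large, sampling_weight(40, 50))
```

Under these assumptions, a household's chance of selection depends on how many grids its area has and on how many households share its grid, so which groups end up over-represented depends on how the grids were drawn up. Weighting each response by the inverse of its selection probability is the usual way to correct for this in the analysis.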


Alright, enough for now….


About Pete Larson

Assistant Professor of Epidemiology at the Nagasaki University Institute for Tropical Medicine
