Archive | Pointless Data Exercises

The Jigger flea: a neglected scourge

Jigger infestation of the hands. I picked the least awful picture I could find. Note the deformity of the hands. This person has likely been suffering from infections since childhood.

I just learned about probably one of the most horrible diseases I’ve ever seen: the jigger. Tunga penetrans is one of the smallest fleas around, less than 1 mm in length. The gravid female attaches itself to a mammalian host and burrows into the skin head first, leaving its rear end exposed for breathing and defecation. It feeds on blood from the subcutaneous capillaries and proceeds to produce anywhere from 20 to 200 eggs. Under the skin it can grow to nearly 1 cm in width.

Tunga penetrans is native to South America and was brought to West Africa through the slave trade. In the mid-19th century it arrived on an English shipping vessel, spread along trade routes, and is now found throughout the continent.

Bacteria opportunistically invade the site, and super-infections (multiple pathogens) are common. Victims suffer from itching and pain, and infestations with multiple fleas are typical. Because of the location of the bites, people often have trouble walking, and because of the disgusting nature of the infection, victims are stigmatized and marginalized. Worse yet, the site can become gangrenous, and auto-amputation of digits and feet, and eventually death, are not uncommon.

The Parliaments of both Kenya and Uganda have introduced bills in the past calling for the arrest of people suffering from jiggers. Of course, these ridiculous bills don’t come with public health actions to control the disease.

Jiggers are entirely preventable and treatable, either through surgical excision or through various medications, but the risk factors for infestation are mostly unknown and the data on them contradictory and largely inconclusive.

It sometimes occurs in travelers and is easily treated in a clinic on an outpatient basis but is a debilitating infection for poor communities. Thus, it is not taken seriously by international public health groups who choose to focus on big issues like HIV and malaria.

Jiggers are a classic example of the neglected tropical disease: it devastates the poorest of the poor but gets almost no attention from donors or the international press.

We gathered some data on jiggers back in 2011 along the coast of Kenya. These results are by no means official, but I was drawn to the attached map.

Animals of various species have been implicated as reservoirs for the disease, most notably pigs and dogs. Less understood is the role of wildlife in maintaining transmission. On the map below, the large yellow dots represent cases. Note that they are nearly all located along the Shimba Hills Wildlife Reserve. I calculated the distance of each household to the park’s border (see the funny graph at the bottom), and found a graded relationship between distance and jiggers infections. Past 5km away from the park, the risk of jiggers is nearly zero.
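For what it’s worth, the distance calculation itself is straightforward. Here is a minimal sketch using the sf package, assuming a data frame of household coordinates called households (with lon, lat and a 0/1 case column) and a polygon of the reserve boundary called park_boundary; both names are placeholders, not our actual data files.

library(sf)

# Put households and the park polygon into a metric projection (UTM 37S covers coastal Kenya)
hh     <- st_as_sf(households, coords = c("lon", "lat"), crs = 4326)
hh_utm <- st_transform(hh, 32737)
park   <- st_transform(park_boundary, 32737)

# Distance (km) from each household to the park's edge
d_m <- st_distance(hh_utm, st_boundary(park))
hh_utm$dist_km <- as.numeric(d_m[, 1]) / 1000

# Crude risk by distance band, essentially the graph at the bottom of this post
hh_utm$band <- cut(hh_utm$dist_km, breaks = c(0, 1, 2, 3, 4, 5, Inf), include.lowest = TRUE)
tapply(hh_utm$case, hh_utm$band, mean)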

What does this mean? I have ruled out domesticated animals, at least as a primary reservoir. People in this area tend to own the same types and numbers of animals. The area is predominantly Muslim, so there are no pigs here, but dogs are found everywhere. Despite this, there are distinct spatial patterns associated with the park. Note that all of the cases are found between the park’s border and a set of lakes, perhaps implying that certain wild animals are traveling there for water and food.

The ecology of jiggers is very poorly understood and, as with many pathogens (Ebola, for example), wildlife probably play an important role.

It’s worth paying me a lot of money to study it.

Locations of jiggers cases. Note the proximity to the park.

Distance to wildlife reserve and jiggers risk. Note that risk drops until 5km, then becomes nearly zero.

A complex sampling journey

Sampling scheme

When I have been part of surveys in the past, little attention has been paid to following any sort of intelligent sampling plan. The time allotted to collect the data has always been too short, resources extremely limited and the conditions of the field mostly unknown.

Mostly what we’re left with is a convenience sample of some kind, usually determined by introductions from the survey workers themselves. It is absolutely the worst way to run a survey and the data is usually crap, but, worse yet, unverifiable crap.

Ideally, in a household-level survey, we’d establish target areas for sampling, do a complete census of those areas and then perhaps take a random sample within them. At a minimum this would be a relatively decent approach.

Unfortunately, I often encounter one of two situations. The first is the convenience sample I mentioned above, which is inherently biased toward the social connections, and thus the demographic, of the survey workers themselves. If you want a sample of someone’s friends and family, this might be a good start; otherwise it’s completely awful.

The second is the “school based survey,” a design I think I hate more than all others. This travesty of sample design depends on the good graces of families who send their children to school, on being lucky enough that the kids you are interested in show up to school on the day of the survey, and on reasonable connections with school administrations. Worse yet, if you’re doing a survey on health, the chances that you’ll find the kids you’re really interested in at school are really low. People love this awful design because it’s convenient, cheap, can be done in a short time and has the added benefit of providing one with warm feelings.

I’ve resolved to do neither of these again. As the manager of a Health and Demographic Surveillance System in Kenya which monitors more than 100,000 people in two regions, I decided I have a unique opportunity to do something a little more interesting.

In gearing up for a pilot survey to improve the measurement of socio-economic status in developing-country contexts, I realized that I had an incredible set of resources at my disposal: a full sampling frame covering two sets of 50,000 people in two areas of Kenya, basic demographic information and a competent staff with sufficient time to do a project that would otherwise interfere with their regular duties.

With some help from a friend (well, much more than a friend), I maneuvered through the basics of complex survey design and came up with something that might work relatively well for my purposes.

The DSS area in Kwale, Kenya that I’m working in is divided into nine areas, each delegated to a single field interviewer who visits each of the households three times a year. Each field interviewer area is then divided into a number of subgrids; how many depends, somewhat arbitrarily, on the population surveyed and the logistics of the survey rounds, since some areas are easier to survey than others. Each grid then contains a number of households, which varies with population density.

I want to target three areas, each of which ostensibly will represent different levels of economic development, but in reality represent different types of economic activities and lifestyles. One is relatively urbanized, another is purely agricultural and the third is occupied by agro-pastoralists who keep larger herds of large animals.

I then decided to choose 20 grids in each area at random, and then to select up to 10 households at random from each selected grid. The reason for choosing this strategy was purely logistic. Survey workers can do about 10 households in a day and I’ve given them a month (20 working days) to finish before they have to start on their next round of regular duties. Normally, I’d like to do something fancier, but without any previous data on the variables I’m interested in, it just wasn’t possible.
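The selection itself is nothing more than two nested random draws. A rough sketch in base R, assuming a household-level frame called frame with columns area, grid and houseid (placeholder names, not the actual DSS database fields):

set.seed(2013)

sample_one_area <- function(d, n_grids = 20, n_hh = 10) {
  picked_grids <- sample(unique(d$grid), n_grids)          # stage 1: grids
  d <- d[d$grid %in% picked_grids, ]
  # stage 2: up to 10 households per selected grid (fewer if the grid is small)
  do.call(rbind, lapply(split(d, d$grid), function(g) {
    g[sample(seq_len(nrow(g)), min(n_hh, nrow(g))), ]
  }))
}

selected <- do.call(rbind, lapply(split(frame, frame$area), sample_one_area))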

I have discovered that this design is called a stratified two-stage cluster design, which makes it all sound fancier than I really believe it to be. The advantage of using this design is that I’m able to account for the selection probabilities, which can bias the results when doing statistical tests. I have no doubt that the piss-poor strategies I’ve used in the past and the dreaded “school based survey” I mentioned above are horribly biased and don’t really tell us a whole lot about whatever it is we’re trying to find out.

I used the survey package in R to determine the selection probabilities and, as I suspected, found that the probability of selection is not uniform across the sampling frame. Some households are more likely to be included in the survey, biasing the data in favor of, for example, people in more densely populated areas.

dstrat1 <- svydesign(id = ~grid + houseid, strata = ~fiarea,
                     fpc = ~fiareagridcounts + gridcounts, data = out)
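To actually see how uneven the inclusion probabilities are, the design object can be queried directly. A quick check along these lines (dstrat1 as defined above, everything else standard survey-package output):

summary(weights(dstrat1))   # sampling weights, i.e. 1 / probability of selection
hist(dstrat1$prob,
     main = "Selection probabilities across the frame",
     xlab = "Probability of inclusion")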

Alright, enough for now….

Measuring socio-economic status in Kenya

I was just screwing around with some data we collected a bit ago. In a nutshell, I’m working to try to improve the way we measure household wealth in developing countries. For the past 15 years, researchers have relied on a composite index based upon easily observable household assets (a la Filmer/Pritchett, 2001).

Enumerators enter a household and quickly note the type of house construction, the toilet facilities and the presence of things like radios, TVs, cars, bicycles, etc. Principal Components Analysis (PCA) is then used to create a single continuous measure of household wealth, which is often broken into quantiles to somewhat appeal to our sense of class and privilege, or the lack thereof.

It’s a quick and dirty measure that’s almost universally used in large surveys in developing countries. It is the standard for quantifying wealth for Measure DHS, a USAID funded group which does large surveys in developing countries everywhere.
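For reference, the standard Filmer/Pritchett-style construction boils down to a few lines. A sketch, assuming a data frame called assets containing only numeric 0/1 indicator columns (the names and the quintile labels are mine, not the DHS’s):

# First principal component of the standardized asset indicators
pc1 <- prcomp(assets, center = TRUE, scale. = TRUE)$x[, 1]

# Break the continuous score into the usual wealth quintiles
quintile <- cut(pc1,
                breaks = quantile(pc1, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE,
                labels = c("poorest", "poorer", "middle", "richer", "richest"))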

First, I take major issue with the use of PCA to create the composite. PCA assumes that inputs are continuous and normally distributed but the elements of the asset index are often dichotomous (yes/no) or categorical. Further, PCA is extremely sensitive to variations in the level of normality of the elements used, so that results will vary wildly depending on whether you induce normality in your variables or not.

It’s silly to use PCA on this kind of data, but people do it anyway and feel good about it. I’m sure that some of the reason for this is the inclusion of PCA in SPSS (why would anyone ever use SPSS (or PASW or whatever it is now)? a question for another day…)

So… we collected some data. I created a 220 question survey which asked questions typical of the DHS surveys, in addition to non-sensitive questions on household expenditures, income sources, non-observable assets like land and access to banking services and financial activity.

The DHS focuses exclusively on material assets, mainly out of convenience, but also because of the assumption that assets held today represent purchases in the past, which can act as rough indicators of household income. So I started there and collected what they collect, in addition to all my other stuff.

This time, however, I abandoned PCA and opted for Multiple Correspondence Analysis, a technique similar to PCA but intended for categorical data. The end result is similar. You get a set of weights for each item, which (in this case) are then tallied up to create a single continuous measure of wealth (or something like it) for each household in the data set.
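Roughly what that looks like in R, as a sketch rather than the exact code I ran, using the MCA function from the FactoMineR package and assuming a data frame hh_assets in which every column has been converted to a factor:

library(FactoMineR)

mca_fit <- MCA(hh_assets, graph = FALSE)

# Share of variation captured by each dimension
head(mca_fit$eig)

# The first-dimension coordinate does the tallying and serves as the wealth score
hh_assets$wealth_score <- mca_fit$ind$coord[, 1]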

Like PCA, the results are somewhat weak. The method only captured about 12% of the variation in the data set, which raises the question of what is happening with the other 88%. However, we got a cool graph, which you can see up on the left. If you look closely, you can see that the variables used tend to follow an intuitive gradient of wealth, running from people who don’t have anything at all and shit in the shrubs to people who have cars and flush toilets.

We surveyed three areas, representing differing levels of development. Looking at how wealth varies by area, we can see that there is one very poor area with very little variation, while the other two have somewhat more spread and a mean level of wealth that is considerably higher. All of this agreed with intuition.

“Area A” is known to be very rural, isolated and quite poor. Areas B and C are somewhat better off, though they differ contextually.

My biggest question, though, was whether a purely asset-based index can truly represent a household’s financial status. I wondered whether large expenditures on things like school fees and health care might actually depress the amount of money available to buy material items.

Thus, we also collected data on common expenditures such as school fees and health care, but also on weekly purchases of cell phone airtime. Interestingly, we found that overall the two were positively correlated with one another, suggesting that higher expenses do not depress households’ ability to make purchases, but we also found that this relationship does not hold among very poor households.
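The comparison itself is just a correlation computed overall and then within each area. A sketch, with invented column names (wealth_score, expenditure, area) on a household data frame hh:

# Overall association between the wealth index and reported expenditures
cor(hh$wealth_score, hh$expenditure, method = "spearman", use = "complete.obs")

# The same association within each of the three study areas
by(hh, hh$area, function(d)
  cor(d$wealth_score, d$expenditure, method = "spearman", use = "complete.obs"))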

Among the extremely poor households in Area A, there is nothing to suggest that high expenses have a negative effect on material assets; there may simply be no relationship at all. But this could indicate something else. Though overall there might not be a depressive effect of health care and school costs on material purchases, those costs might be preventing the poorest households from improving their situation. It might only be after a certain point that the two diverge from one another and households are able to handle paying for both effectively.

Also of interest were the similar patterns found in the three areas.

Expenses

(Mostly) Vindicated: Euclidean measures of distance are just as good as high priced, fancy measures

In my seminal paper, “Distance to health services influences insecticide-treated net possession and use among six to 59 month-old children in Malawi,” I indicated that Euclidean (straight line) measures of distance were just as good as more complicated, network based measures.

I didn’t include the graph showing how correlated the two were, but I wish I had, and I can’t find it here on my computer.

Every time I’ve presented research on the association between distance to various things and health outcomes, someone inevitably asks why I didn’t use a more complex measure of actual travel paths. The idea is that no one walks in a straight line anywhere, but rather follows a road network, or even uses a number of transportation options which might be lost in a simple measure.

I always respond that a straight line distance is as good as any other when investigating relationships on a coarse scale. Inevitably, audiences are never convinced.

A new paper came out today, “Methods to measure potential spatial access to delivery care in low- and middle-income countries: a case study in rural Ghana” which compared the Euclidean measure with a number of more complex measurements.

The conclusion confirmed what I already knew: the Euclidean measure is just as good in most cases, and the pain and cost of producing sexy and complicated ways of calculating distance just isn’t worth it.

It’s a pretty decent paper, but I wish they had put some graphs in to illustrate their points. It would be good to see exactly where the measures disagree.
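For anyone sitting on both kinds of measures, the comparison is cheap to make yourself. A sketch, assuming a data frame dist_df with one row per household and made-up columns euclidean_km and network_km:

# Agreement between the two measures, in the same terms the paper uses (Spearman rank correlation)
cor(dist_df$euclidean_km, dist_df$network_km, method = "spearman")

# And the graph I wish they had included: where do the measures disagree?
plot(dist_df$euclidean_km, dist_df$network_km,
     xlab = "Euclidean distance (km)", ylab = "Road network distance (km)")
abline(0, 1, lty = 2)   # points far above this line are where straight lines mislead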

Background
Access to skilled attendance at childbirth is crucial to reduce maternal and newborn mortality. Several different measures of geographic access are used concurrently in public health research, with the assumption that sophisticated methods are generally better. Most of the evidence for this assumption comes from methodological comparisons in high-income countries. We compare different measures of travel impedance in a case study in Ghana’s Brong Ahafo region to determine if straight-line distance can be an adequate proxy for access to delivery care in certain low- and middle-income country (LMIC) settings.

Methods
We created a geospatial database, mapping population location in both compounds and village centroids, service locations for all health facilities offering delivery care, land-cover and a detailed road network. Six different measures were used to calculate travel impedance to health facilities (straight-line distance, network distance, network travel time and raster travel time, the latter two both mechanized and non-mechanized). The measures were compared using Spearman rank correlation coefficients, absolute differences, and the percentage of the same facilities identified as closest. We used logistic regression with robust standard errors to model the association of the different measures with health facility use for delivery in 9,306 births.

Results
Non-mechanized measures were highly correlated with each other, and identified the same facilities as closest for approximately 80% of villages. Measures calculated from compounds identified the same closest facility as measures from village centroids for over 85% of births. For 90% of births, the aggregation error from using village centroids instead of compound locations was less than 35 minutes and less than 1.12 km. All non-mechanized measures showed an inverse association with facility use of similar magnitude, an approximately 67% reduction in odds of facility delivery per standard deviation increase in each measure (OR = 0.33).

Conclusion
Different data models and population locations produced comparable results in our case study, thus demonstrating that straight-line distance can be reasonably used as a proxy for potential spatial access in certain LMIC settings. The cost of obtaining individually geocoded population location and sophisticated measures of travel impedance should be weighed against the gain in accuracy.

A network visualization of international migration

The UN keeps data on migration patterns around the world, tracking origin and destination countries and the number of migrants (Trends in International Migrant Stock: Migrants by Destination and Origin). I took some time out and created this network visualization of origin and destination countries from 2010. Other years were available, but this is all I had time for.

The size of each node represents the number of countries from which migrants arrive. By far, the most connected country is the United States, accepting more people from more countries than any other place on the planet. Most areas of the network represent geographic regions. Note that Africa is clustered at the top, and Pacific island countries are clustered at the bottom.
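The construction is simple with the igraph package. A rough sketch, assuming the UN tables have been reshaped into a data frame called flows with origin, destination and migrants columns (my names, not the UN’s):

library(igraph)

g <- graph_from_data_frame(flows[, c("origin", "destination", "migrants")], directed = TRUE)

# Size each node by the number of countries sending migrants to it (in-degree)
V(g)$size <- 2 + degree(g, mode = "in") / 10

plot(g, edge.arrow.size = 0.2, vertex.label.cex = 0.5)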

An interesting result is that countries tend to send migrants to other countries which are only slightly better off than they are. For example, Malawi sends most of its migrants to Zambia and Mozambique, and Zambians go to South Africa, but those destination countries do not send migrants back to countries poorer than themselves. Wealthy countries tend to be more cosmopolitan in their acceptance of migrants.

Click on the picture to explore a larger version of the graphic.

Network visualization of international migration, 2010.

Food Prices and Riots in South Africa: One Year Later

I noticed this morning that gold is tanking. This is no surprise, as the rally in gold through the 2000s was fueled by excessive commodity speculation following the loosening of financial regulations at the end of Clinton’s term. On the down side, this commodity bubble drove food prices up around the world, but, more positively, it injected a decent amount of cash into Sub-Saharan African economies. Much of the economic growth that Africa has experienced over the past decade has been financed by this bubble.

I’m wondering if Glenn Beck and Ron Paul foresaw this event, talking up the market in gold to help drive a price rally so that they, and other savvy individuals, might jump out at the right moment. That smacks of conspiracy, however, and I’m not sure that either is smart enough to think that far ahead. It’s a fun thought to consider.

I was curious, however, whether the collapse of the commodity bubble (though it is really less a collapse than a “cooling down” of a slow-burn market) might also be having an impact on social unrest. As I’ve shown on this blog, and as other, more learned individuals have confirmed, food prices are associated with social unrest around the world.

I have written before on how Wall Street speculation is contributing to the problem of rising food prices around the world.

Fortunately, food prices are declining with the real or expected tightening of rules on commodity speculation. Most notably, the Volcker Rule, part of the Dodd-Frank set of reforms, aims to prevent banks from engaging in risky speculation in commodity derivatives. The Commodity Futures Trading Commission has also introduced several new rules on commodity speculation, most notably limits on taking large positions in commodity futures with the aim of manipulating prices. These rules, which place specific restrictions on agricultural futures and are thought to affect only approximately seventy traders worldwide, have been fought vociferously by Wall Street.

Most relevant to developing countries, corn prices, whose fluctuations have been shown to be tightly associated with speculative behavior, are returning to pre-2006 levels and could even fall back to 1996 levels. This is bad for corn exporters (or maybe a return to normal), but great for corn buyers, particularly families living on pennies a day.

As before, I counted (or rather, the computer did) the number of newspaper articles which mentioned protests in South Africa, arguably the protest capital of the world, and combined them with the FAO’s food price index to see if a dip in prices was associated with a dip in protests. As the graph below shows, there is evidence to suggest that it is.

Newspaper mentions of protests in South Africa plotted against the FAO food price index.
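The plot itself is a simple merge of two monthly series. A sketch with hypothetical data frames protests (month, articles) and fao (month, food_price_index), where month is a Date:

monthly <- merge(protests, fao, by = "month")

# Standardize both series so they can share one axis
plot(monthly$month, scale(monthly$articles), type = "l",
     xlab = "Month", ylab = "Standardized value")
lines(monthly$month, scale(monthly$food_price_index), lty = 2)
legend("topleft", legend = c("Protest articles", "FAO food price index"), lty = 1:2)

# A quick check of whether the two move together
cor(monthly$articles, monthly$food_price_index, method = "spearman")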

The Dodd-Frank reforms are welcome, but they aren’t enough. Starving kids and violence that result from excessively high food prices should be considered a major human rights priority. Fortunately, some groups like the World Development Movement are putting pressure on the UK government to enact and support badly needed reforms.

Economics, Austerity and the Selective Use of Data

Reinhart and Rogoff

Recently, I’ve been following the “Reinhart and Rogoff Austerity Debacle.” Reinhart and Rogoff (RR) are two well-known economists who penned a paper entitled “Growth in a Time of Debt.” They showed that when the debt-to-GDP ratio of a given country exceeds a 90% threshold, economic growth stalls. They offer that economies may even contract.

Policy makers in the US and Europe seized on the paper as proof that cutting stimulus and social programs was a good idea, and proceeded to do so with abandon. Of course, right wingers wanted to cut money to social programs anyway, and would have done so regardless, but the paper was held out as scientific proof that it was a solid plan of action.

I won’t comment on how strange it was that Republicans were interested in science at all, given recent efforts to politicize the NSF and micromanage the grant decision process.

The trouble was that the results presented in RR were shown to be based on the selective use of data. Thomas Herndon, a 28-year-old graduate student, obtained the dataset from RR themselves and couldn’t reproduce the results.

In fact, he found that the only way to reproduce the results in RR’s paper showing that high debt restrained economic growth was to exclude important cases. When the missing data were included, high debt was associated with consistently positive, though modestly slower, growth.

Originally, I took the view that this was a case of sloppy science. RR had a dataset, got some results which fit the narrative they were pushing and didn’t pursue the matter any further. Reading Herndon’s paper, however, I changed my mind.

Economic growth vs. debt to GDP Ratio (1946-2010). Loess curve included to illustrate trend.

Herndon took the data and did what any analyst would do when starting exploratory analysis: he plotted it (see the figure on the right). Debt-to-GDP ratios and growth are both continuous measures, so we can do a simple scatterplot and see if there’s any evidence that the two are related.

To me, this is a pretty fuzzy result. Though the loess curve (an interpolation method to illustrate trend) suggests that there is *some* decline in growth overall, I’d still ding any intro stats student for trying to claim there’s any relationship at all. There is no way that RR, both trained PhDs and likely having the help of a paid research assistant, didn’t produce such a plot.
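The plot Herndon made takes only a few lines. A sketch, assuming a data frame rr with debt_ratio and growth columns (his actual variable names surely differ):

rr <- na.omit(rr[, c("debt_ratio", "growth")])

plot(rr$debt_ratio, rr$growth,
     xlab = "Debt to GDP ratio (%)", ylab = "Real GDP growth (%)")

# Loess curve to illustrate the (weak) trend
fit <- loess(growth ~ debt_ratio, data = rr)
ord <- order(rr$debt_ratio)
lines(rr$debt_ratio[ord], predict(fit)[ord], lwd = 2)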

Noting that the loess curve drops past approximately 120%, I calculated the median growth for each country represented. Only seven countries have had debt-to-GDP ratios greater than 120% in the past 60+ years: Australia, Belgium, Canada, Japan, New Zealand, the UK and the United States. Out of these, only two had negative median growth: Belgium (-0.69%, effectively zero) and the United States (-10.94%), which has had a debt-to-GDP ratio greater than 120% only once. All the other countries had positive growth under high debt, even beleaguered Japan. New Zealand can even claim a strong 9.8% growth under high debt. The US, then, is a major outlier, possibly dragging the entire curve down.

As this doesn’t fit their story, RR’s solution was to categorize debt-to-GDP ratios into five rough classifications and calculate the mean growth within each group. This is a common trick for extracting results from bad data. It’s highly tempting for researchers (and epidemiologists do it far too often), but it’s a bad idea to present it without all the caveats and warnings that should go with it.
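The trick itself is just a cut() followed by a group mean, the kind of thing that hides a noisy scatter behind a handful of tidy averages. A sketch, reusing the hypothetical rr data frame from above; the cut points are illustrative, not necessarily the ones RR used:

rr$debt_cat <- cut(rr$debt_ratio,
                   breaks = c(0, 30, 60, 90, 120, Inf),
                   labels = c("0-30%", "30-60%", "60-90%", "90-120%", ">120%"),
                   include.lowest = TRUE)

# Mean growth per category: the scatter's messiness conveniently disappears
aggregate(growth ~ debt_cat, data = rr, FUN = mean)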

I’m not surprised that ideologues such as RR would be so keen to produce the result they did. After all, they published the popular economics work “This Time Is Different: Eight Centuries of Financial Folly” where they try to suggest that budget policy of the US in 2013 should somehow be informed by the economy of 14th century Spain.

I am, however, surprised that reviewers let this pass. If I had been a reviewer, I would have:

1) pointed out the problems of categorization, where data doesn’t require it
2) noted that categorizing the data (or even plotting it) tears out temporal correlation. For example, one data point from 2008 (stimulus) may be put in the high debt category, but another from 2007 (crash) in the low debt category. While budgets of one year may have little to do with the budget of another, the economy of one year is likely related to the economy of the previous year.
3) questioned the causal mechanisms behind debt and growth. This is obviously a deep question for economists (and not epidemiologists), but of particular import. When does the economy start to react to debt? I’m pretty sure that there is a lag effect as spending bills tend to space disbursements over the course of the fiscal year.

The RR debacle should be a lesson, not only to economists, but to all scientists. While we may always be under pressure to produce results and hope that those results fit and support whatever position we take, shoddy methods don’t get us off the hook. In RR’s case, I would call this fabrication. A good many studies are merely guilty of wishful thinking, but the chance always exists that someone will come out of the woodwork and expose our flaws. After all, that’s what science is all about.
