Archive | Pointless Data Exercises

Terror in the Mid-East: It’s never been worse

We are entering one of the most chaotic chapters of modern history, though the geographic space of this chaos is smaller than it has ever been. While most countries are experiencing less terror, Mid-Eastern terrorists have never been busier or more successful.

I downloaded data from the Global Terrorism Database, which comprises more than 125,000 individual acts of terror, and found that, since 2010, the number of weekly terror events went from somewhere around 10 to more than 40, and the trend doesn’t look like it’s ending anytime soon.
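For anyone who wants to reproduce this kind of weekly tally, here is a minimal sketch in Python. The dates below are made up for illustration and are not rows from the Global Terrorism Database; the function name `weekly_counts` is likewise just an assumption for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical event dates standing in for database rows (not real data).
events = [
    date(2010, 1, 4), date(2010, 1, 6), date(2010, 1, 13),
    date(2014, 6, 2), date(2014, 6, 3), date(2014, 6, 5), date(2014, 6, 7),
]

def weekly_counts(dates):
    """Count events per ISO (year, week) pair."""
    counts = Counter()
    for d in dates:
        iso = d.isocalendar()          # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)

counts = weekly_counts(events)
for week in sorted(counts):
    print(week, counts[week])
```

Plotting these counts against time is what shows the trend; weeks with no events would need to be filled in with zeros first.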

Moreover, while terror events are becoming more frequent, they are becoming more and more unpredictable.

While the world was shocked over Charlie Hebdo, the troubling scale-up in the number of terror events seems to have mostly gone unnoticed. Terrorists strike Islamic countries far more than they do France, and kill more than just cartoonists and policemen.

It is unproductive to view all terror groups and even acts of terror as being the same. Terror has turned into a morass of competing groups with differing political aims, and the loose nature of Al Qaeda has led to an outsourcing of terror to any local thug with a gun.

It is also unproductive to view Mid-East terror as simply restricted to the angry victims of drone attacks. Islamic terrorism has a deep history with roots going back decades, a history which seems to be widely ignored. It is also important to note that ISIS’ membership consists of a frighteningly large number of Westerners and a careful watch of their videos reveals that English, rather than Arabic, is a common language among its followers.

Where will this go? No one knows, but Charlie Hebdo will be just a blip on the pattern of terror.


A brief thought on evolution: multi-generational survival


Often people will mention that we are “adapted” to do this or another thing, either indicating some crime of modernity (of course, ignoring the fact that a larger percentage of babies are surviving and people are living longer and healthier than at any time in human history) or trying to point out some example of the glaring perfection of our creation, with either an implicit or vocal reference to divine creation.

For example, obesity is attributed to fat and protein rich modern diets since we aren’t “adapted” to eat these types of foods (despite having found the food in East Africa so unpalatable that we had to learn to crush or cook it to digest it efficiently). Our bad disposition is blamed on a lack of sleep since we aren’t “adapted” to sleep as little as we do (this might be true). Most recently, a book writer blamed our problems with depression on a divorced relationship to nature, given that we are “adapted” to hunt and kill for food and then revel over the blood stained corpse (of course, the writer doesn’t consider that people in antiquity might have been depressed, too).

There may be some truth to some of this. However, “adaptation” implies something about the individual, when evolution, in fact, is about reproduction. We aren’t “adapted” to anything. Rather, certain traits are selected for based on the survival of at least two generations of living things, at least for complex social animals like ourselves.

Simply surviving as an individual does not ensure the survival of a species. Living things must first survive long enough to reproduce and then, at least in humans, ensure that the children make it to reproductive age. Human babies are horribly weak in contrast to sharks, which are ready to go even before they leave the mother. Further, in the case of humans, a full three generations must live at once to ensure long term survival.

Thus, we maintain a tenuous relationship with our environment, where traits do not necessarily favor a single individual, but rather an entire family unit, and these traits may or may not imply perfect harmony with an environment, but rather do the job at least satisfactorily.

Nature cares little for quality as numerous examples throughout our physiology show. To claim that we are somehow “perfectly suited” to a specific environment is just simply wrong. Merely, we have come to a brokered peace (after generations of brutal trial and error, what we eat today is thanks to the deaths of millions, mostly children, who had to die to allow us to do so) with wherever we live in order to allow a few of our kids and grandkids to survive.

This, of course, has deep implications for public health. Some public health problems are known to be passed down from parents to children, but in a multi-generational evolutionary framework, it is possible that certain public health problems can be passed through 3 or more generations at a time, complicating interventions. Certainly, the multi-generational health problems of the descendants of African slaves can be an example of this. How can we intervene to protect the public health over a full century?

OK, back to work.

2014 in review: I’m not sure what this all really says

It is pretty obvious that after July, something happened and I stopped posting with any sort of regularity. I really need to fix this or whatever is keeping me from posting. I don’t get a whole lot of traffic on this blog, but it seems that every day I don’t post is a missed opportunity for me.

Anyway, to all of you who read this blog in 2014, I thank you. It’s great to have you around. I wish everyone a great 2015.


The stats helper monkeys prepared a 2014 annual report for this blog.

Here's an excerpt:

Madison Square Garden can seat 20,000 people for a concert. This blog was viewed about 62,000 times in 2014. If it were a concert at Madison Square Garden, it would take about 3 sold-out performances for that many people to see it.


Does the environment cause poverty?

African countries are blessed with ample cropland and resources, but suffer from crippling and unforgivable levels of poverty, have some of the shortest lifespans on the planet and the highest rates of infant mortality in the world. Meanwhile, Japan, Korea, Sweden, Switzerland and Singapore are wholly the opposite, yet mostly lacking in everything that Africa has. Clearly, the picture is more complicated than merely having access to natural resources.

However, within countries, the picture might be different. African countries are complex and diverse places. Poverty is often confined to the most unproductive regions, areas with poor soils, poor rainfalls or dangerous terrains.

I was just working with some socio-economic data from one of our field sites, and noticed some interesting patterns (note the map up top). In Kwale, a small area along the Coast, socio-economic levels vary widely, but neighbors tend to be like neighbors and patterns of socio-economic clustering emerge.

Note that the poorest of the poor are concentrated in an area in the middle, which I know to be extremely dry, difficult to get to, difficult to farm and generally tough to live in.

I tried to see if socio-economic status (as measured through a composite material wealth index a la Filmer and Pritchett but using multiple correspondence analysis rather than PCA) was related to any environmental variables that I might have data for.

I fit a generalized additive model using the continuous measure of wealth from the MCA as an outcome. Knowing that very few things in nature or human societies are linear, I also applied smoothing to the predictors to relax these assumptions. The results can be seen in the plot at the bottom.

A few interesting things came out. While it is hard to tell much about the poorest of the poor, we can tell something about the most wealthy. The richest in this poor area tend to live in areas with the richest vegetation (possibly representing water), a high altitude (low temperature), high relief (no standing water) and in locations distant from a wildlife reserve (far from annoying and dangerous wildlife).

I’m not sure the wildlife reserve is meaningful (unless the reserve was an area undesirable for human habitation to begin with), but the others might be and represent a trend seen in other Sub-Saharan contexts. Areas that have ample farm land and no malarious swamps tend to do the best. Central Province, one of the most developed areas of Kenya, would be an example.

But the question has to be, does a harsh environment doom people to poverty, or do people self shuffle into and compete for access to more favorable areas? Is environmentally determined poverty (or wealth) an accident of birth, or the result of competitive selection?

Alright, back to work. Oh wait, this is my work. Well….

Results of GAM model of SES in Kwale. Y axis is the continuous measure of socio-economic status.


The Jigger flea: a neglected scourge

Jigger infestation of the hands. I picked the least awful picture I could find. Note the deformity of the hands. This person has likely been suffering from infections since childhood.


I just learned about probably one of the most horrible diseases I’ve ever seen: the jigger. Tunga penetrans is one of the smallest fleas around, less than 1 mm in length. The gravid female attaches itself to a mammalian host and burrows into the skin head first, leaving its rear end exposed for breathing and defecation. It feeds on blood from the subcutaneous capillaries and proceeds to produce anywhere from 20-200 eggs. Under the skin it can grow to nearly 1 cm in width.

Tunga penetrans is native to South America and was brought to West Africa through the slave trade. In the mid 19th century it was brought on an English shipping vessel and made its way through trade routes and is now found everywhere throughout the continent.

Bacteria opportunistically invade the site and super-infections (multiple pathogens) are common. Victims suffer from itching and pain and multiple fleas are common. Due to the location of the bite, people often have trouble walking and due to the disgusting nature of the infection, victims are stigmatized and marginalized. Worse yet, the site can become gangrenous and auto-amputations of digits and feet and eventually death are not uncommon.

The Parliaments of both Kenya and Uganda have introduced bills in the past calling for the arrest of people suffering from jiggers. Of course, these ridiculous bills don’t come with public health actions to control the disease.

Jiggers are entirely preventable and treatable through either surgical excision or various medications, but risk factors for infestation are mostly unknown and the data contradictory and mostly inconclusive.

It sometimes occurs in travelers and is easily treated in a clinic on an outpatient basis but is a debilitating infection for poor communities. Thus, it is not taken seriously by international public health groups who choose to focus on big issues like HIV and malaria.

Jiggers are a classic example of the neglected tropical disease: they devastate the poorest of the poor but get almost no attention from donors or the international press.

We gathered some data on jiggers back in 2011 along the coast of Kenya. Without presenting these results as official, I was drawn to the attached map.

Animals of various species have been implicated as reservoirs for the disease, most notably pigs and dogs. Less understood is the role of wildlife in maintaining transmission. On the map below, the large yellow dots represent cases. Note that they are nearly all located along the Shimba Hills Wildlife Reserve. I calculated the distance of each household to the park’s border (see the funny graph at the bottom), and found a graded relationship between distance and jiggers infections. Past 5km away from the park, the risk of jiggers is nearly zero.
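The distance calculation behind that graph can be sketched roughly as follows. This is only an illustration in Python: the household records are invented, the 1 km bin width is an assumption, and the function name `risk_by_distance` is hypothetical rather than anything from the actual analysis.

```python
# Hypothetical households: (distance to park border in km, has a jiggers case).
households = [
    (0.4, True), (0.9, True), (1.5, False), (1.8, True),
    (2.2, False), (3.1, True), (3.7, False), (4.6, False),
    (5.3, False), (6.8, False), (8.0, False), (9.5, False),
]

def risk_by_distance(rows, bin_width_km=1.0):
    """Return {bin lower edge in km: proportion of households with a case}."""
    totals, cases = {}, {}
    for distance_km, has_case in rows:
        lower = int(distance_km // bin_width_km) * bin_width_km
        totals[lower] = totals.get(lower, 0) + 1
        cases[lower] = cases.get(lower, 0) + (1 if has_case else 0)
    return {b: cases[b] / totals[b] for b in sorted(totals)}

risk = risk_by_distance(households)
for lower_edge, proportion in risk.items():
    print(f"{lower_edge:.0f}-{lower_edge + 1:.0f} km: {proportion:.2f}")
```

With real data the bins would hold many more households each, and a smoothed curve would probably be a better summary than raw proportions per bin.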

What does this mean? I have ruled out domesticated animals, at least as a primary reservoir. People in this area tend to all own the same types and numbers of animals. Being Islamic, there are no pigs here, but dogs are found everywhere. Despite this, there are distinct spatial patterns which are associated with the park. Note that all of the cases are found between the park’s border and a set of lakes, perhaps implying that certain wild animals are traveling there for water and food.

The ecology of jiggers is very poorly understood and, like many pathogens (like Ebola, for example), wildlife probably play an important role.

It’s worth paying me a lot of money to study it.

Locations of jiggers cases. Note the proximity to the park.


Distance to wildlife reserve and jiggers risk. Note that risk drops until 5km, then becomes nearly zero.


A complex sampling journey

Sampling scheme


When I have been a part of surveys in the past, little attention has been paid toward following any sort of intelligent sampling plan. The time allotted to collect the data has always been too short, resources extremely limited and the conditions of the field mostly unknown.

Mostly what we’re left with is a convenience sample of some kind, usually determined by introductions from the survey workers themselves. It is absolutely the worst way to run a survey and the data is usually crap, but, worse yet, unverifiable crap.

Ideally, in a household level survey, we’d establish target areas for sampling, do a complete census of those areas and then perhaps take a random sample within them. At the minimum this would be a relatively decent approach.

Unfortunately, I often encounter one of two situations. The first is the convenience sample I mentioned above, which is inherently biased toward the social connections and thus the demographic of the survey workers themselves. If you want to do a sample of someone’s friends and family, this might be a good start, otherwise it’s completely awful.

The second is the “school based survey,” a design I think I hate more than all others. This travesty of sample design depends on the good graces of families which send their children to school, on being lucky that the kids you are interested in show up to school the day of the survey, and on reasonable connections with school administrations. Worse yet, if you’re doing a survey on health, the chances that you’ll find the kids you’re really interested in at school are really low. People love this awful design because it’s convenient, cheap, can be done in a short time and has the added benefit of providing one with warm feelings.

I’ve resolved to do neither of these again. As the manager of a Health and Demographic Surveillance System based in Kenya which monitors more than 100,000 people in two regions of Kenya, I decided I have a unique opportunity to do something a little more interesting.

In gearing up for a pilot survey to improve measurement of socio-economic status in developing country contexts, I realized that I had an incredible set of resources at my disposal. I have a full sampling frame on two sets of 50,000 people in two areas of Kenya, basic demographic information and a competent staff with sufficient time to do a project which otherwise would interfere with their regular duties.

With some help from a friend (well, much more than a friend), I maneuvered the basics of complex survey design and came up with something that might work relatively well for my purposes.

The DSS of the area of Kwale, Kenya I’m working in is divided into nine areas, each delegated to a single field interviewer who visits each of the households three times a year. Each field interviewer area is then divided into a number of subgrids, the number of which arbitrarily follows the population surveyed and the logistics of the survey rounds. Some areas are easier to survey than others. Each grid then has a number of households within it, the number of which varies depending on population density.

I want to target three areas, each of which ostensibly will represent different levels of economic development, but in reality represent different types of economic activities and lifestyles. One is relatively urbanized, another is purely agricultural and the third is occupied by agro-pastoralists who keep larger herds of large animals.

I then decided to choose 20 grids in each area at random, and then to select up to 10 households from each selected grid, again at random. The reason for choosing this strategy was purely a logistic one. Survey workers can do about 10 households in a day and I’ve given them a month (20 working days) to do it before they have to start on their next round of regular duties. Normally, I’d like to do something fancier, but without any previous data on the variables I’m interested in, it just wasn’t possible.
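The draw described above (20 grids per area, then up to 10 households per grid) can be sketched in a few lines of Python. Everything here is illustrative: the sampling frame is randomly generated, the area and grid names are placeholders, and `two_stage_sample` is a hypothetical helper, not the code actually used in the field.

```python
import random

def two_stage_sample(frame, n_grids=20, max_households=10, seed=None):
    """frame maps area -> {grid_id: [household_id, ...]}.

    Stage 1: simple random sample of grids within each area (the stratum).
    Stage 2: simple random sample of up to max_households in each chosen grid.
    """
    rng = random.Random(seed)
    sample = {}
    for area, grids in frame.items():
        grid_ids = sorted(grids)
        chosen = rng.sample(grid_ids, min(n_grids, len(grid_ids)))
        sample[area] = {}
        for grid_id in chosen:
            households = grids[grid_id]
            take = min(max_households, len(households))
            sample[area][grid_id] = rng.sample(households, take)
    return sample

# Hypothetical frame: 3 areas, 30 grids each, 5 to 40 households per grid.
frame_rng = random.Random(1)
frame = {
    area: {
        f"{area}-g{i}": [f"{area}-g{i}-h{j}" for j in range(frame_rng.randint(5, 40))]
        for i in range(30)
    }
    for area in ("urban", "agricultural", "agropastoral")
}

sample = two_stage_sample(frame, seed=42)
for area, grids in sample.items():
    print(area, len(grids), "grids,", sum(len(h) for h in grids.values()), "households")
```

Fixing the seed makes the draw reproducible, which helps when the list of selected households has to be handed to field staff and audited later.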

I have discovered that this design is called a stratified two stage cluster design, which makes it all sound fancier than I really believe it to be. The advantage to using this design is that I’m able to control for the selection probabilities, which can bias the results when doing statistical tests. I have no doubt that the piss poor strategies I’ve used in the past and the dreaded “school based survey” I mentioned above are horribly biased and don’t really tell us a whole lot about whatever it is we’re trying to find out.

I used the survey package in R to determine the selection probabilities and, as I suspected, found that the probability of selection is not uniform across the sampling frame. Some households are more likely to be included in the survey, biasing the data in favor of, for example, people in more densely populated areas.
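To see why the probabilities come out uneven, here is a back-of-the-envelope version in Python. It is not the R survey package and does not reproduce its output; it only multiplies the two stage probabilities under the assumption that grids and households are each drawn by simple random sampling. The area size of 40 grids and the grid sizes are made up.

```python
def selection_probability(grids_in_area, households_in_grid,
                          grids_sampled=20, max_households=10):
    """Chance that one household ends up in the sample.

    Stage 1: probability that its grid is among the grids drawn in its area.
    Stage 2: probability that it is drawn within that grid.
    """
    stage1 = min(grids_sampled, grids_in_area) / grids_in_area
    stage2 = min(max_households, households_in_grid) / households_in_grid
    return stage1 * stage2

# Hypothetical area with 40 grids of different sizes.
for size in (8, 10, 25, 50):
    p = selection_probability(40, size)
    print(f"grid of {size:2d} households: p = {p:.3f}, weight = {1 / p:.1f}")

p_small = selection_probability(40, 10)
p_large = selection_probability(40, 50)
```

The inverse of each probability is the sampling weight a weighted analysis would use. Under these simplified assumptions the probability depends on both the number of grids in an area and the number of households in a grid, so which households end up over-represented in the real frame depends on how the grids were laid out.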


Alright, enough for now….

Measuring socio-economic status in Kenya

I was just screwing around with some data we collected a bit ago. In a nutshell, I’m working to try to improve the way we measure household wealth in developing countries. For the past 15 years, researchers have relied on a composite index based on easily observable household assets (a la Filmer/Pritchett, 2001).

Enumerators enter a household and quickly note the type of house construction, toilet facilities and the presence of things like radios, TVs, cars, bicycles, etc. Principal Components Analysis (PCA) is then used to create a single continuous measure of household wealth, which is then often broken into quantiles to somewhat appeal to our sense of class and privilege or lack thereof.

It’s a quick and dirty measure that’s almost universally used in large surveys in developing countries. It is the standard for quantifying wealth for Measure DHS, a USAID funded group which does large surveys in developing countries everywhere.

First, I take major issue with the use of PCA to create the composite. PCA assumes that inputs are continuous and normally distributed but the elements of the asset index are often dichotomous (yes/no) or categorical. Further, PCA is extremely sensitive to variations in the level of normality of the elements used, so that results will vary wildly depending on whether you induce normality in your variables or not.

It’s silly to use PCA on this kind of data, but people do it anyway and feel good about it. I’m sure that some of the reason for this is the inclusion of PCA in SPSS (why would anyone ever use SPSS (or PASW or whatever it is now)? a question for another day…)

So… we collected some data. I created a 220 question survey which asked questions typical of the DHS surveys, in addition to non-sensitive questions on household expenditures, income sources, non-observable assets like land and access to banking services and financial activity.

The DHS focuses exclusively on material assets, mainly out of convenience, but also on the assumption that assets held today represent purchases in the past, which can act as pretty rough indicators of household income. So I started there and collected what they collect in addition to all my other stuff.

This time, however, I abandoned PCA and opted for Multiple Correspondence Analysis, a technique similar to PCA but intended for categorical data. The end result is similar. You get a set of weights for each item, which (in this case) are then tallied up to create a single continuous measure of wealth (or something like it) for each household in the data set.
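The tallying step can be illustrated with a toy example in Python. The weights below are invented for illustration and are not the ones the MCA produced; a real analysis would take them from the first dimension of the fitted MCA, and the exact scaling of household scores differs between software packages.

```python
# Hypothetical category weights, standing in for first-dimension MCA output.
weights = {
    ("toilet", "none"): -1.2,
    ("toilet", "pit latrine"): 0.1,
    ("toilet", "flush"): 1.8,
    ("radio", "no"): -0.4,
    ("radio", "yes"): 0.3,
    ("car", "no"): -0.1,
    ("car", "yes"): 2.5,
}

def wealth_score(household, weights):
    """Sum the weight of each answer a household gave."""
    return sum(weights[(item, answer)] for item, answer in household.items())

# Two made-up households at opposite ends of the gradient.
poor = {"toilet": "none", "radio": "no", "car": "no"}
rich = {"toilet": "flush", "radio": "yes", "car": "yes"}

poor_score = wealth_score(poor, weights)
rich_score = wealth_score(rich, weights)
print(round(poor_score, 2), round(rich_score, 2))
```

Sorting households by this score and cutting the list into equal groups is how the usual wealth quintiles are made.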

Like PCA, the results are somewhat weak. The method only captured about 12% of the variation in the data set, which sort of begs the question as to what is happening with the other 88%. However, we got a cool graph which you can see up on the left. If you look closely, you can see that the variables used tend to follow an intuitive gradient of wealth, running from people who don’t have anything at all and shit in the shrubs to people who have cars and flush toilets.

We surveyed three areas, representing differing levels of development. Looking at how wealth varies by area, we can see that there is one very poor area with very little variation, compared to the others, which have somewhat more spread and a mean level of wealth that is considerably higher. All of this agreed with intuition.

“Area A” is known to be very rural, isolated and quite poor. Areas B and C are somewhat better though they are somewhat different contextually.

My biggest question, though, was whether a purely asset based index can truly represent a household’s financial status. I wondered whether large expenditures on things like school fees and health care might actually depress the amount of money available to buy material items.

Thus, we also collected data on common expenditures such as school fees and health care, but also on weekly purchases of cell phone airtime. Interestingly, we found that overall expenditures and the asset index were positively correlated with one another, suggesting that higher expenses do not depress the ability of households to make purchases, but found that this relationship does not hold among very poor households.
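A comparison like this boils down to computing a correlation separately within each area. Here is a minimal Python sketch using a plain Pearson correlation; the numbers are invented to mimic the pattern described and are not the survey data, and the actual analysis may well have used a different measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical rows: (area, monthly expenditure, asset index score).
rows = [
    ("A", 2, -1.0), ("A", 5, -1.1), ("A", 5, -0.9), ("A", 6, -1.0),
    ("B", 10, 0.5), ("B", 20, 1.0), ("B", 30, 1.4), ("B", 40, 2.1),
]

by_area = {}
for area, spend, assets in rows:
    xs, ys = by_area.setdefault(area, ([], []))
    xs.append(spend)
    ys.append(assets)

correlations = {area: pearson(xs, ys) for area, (xs, ys) in by_area.items()}
for area, r in sorted(correlations.items()):
    print(area, round(r, 2))
```

With only a handful of households per area a coefficient like this is very noisy, so the real comparison needs the full sample and some measure of uncertainty.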

There is nothing to suggest that high expenses are having a negative effect on material assets among extremely poor households located in Area A at all. It might be the case that there is no relationship at all. This could indicate something else. Though overall there might not be a depressive effect of health care and school costs on material purchases, they might be preventing households from improving their situation. It might only be after a certain point that the two diverge from one another and households are then able to handle paying for both effectively.

Also of interest were the similar patterns found in the three areas.


