While other blog posts do a much better job of explaining the DataExplorer package in R, it still seemed useful to mention it here.
Data cleaning is a huge hurdle in data analysis, and a basic snapshot of the data helps in developing an efficient strategy for preparing it.
Enter the DataExplorer package, a set of tools that provides basic descriptive information with very little effort. With a single command, you can take a raw dataset and produce a useful report that you can use to start planning your data cleaning attack.
I downloaded a portion of the Social Indicators Survey from Columbia University, and picked a small subset of variables.
With this small bit of code, I produced the report below.
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, hispanic, pearn,
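The snippet above is cut off in this copy of the post. As a rough sketch of the workflow it describes, and assuming the report came from DataExplorer's create_report(), something like the following would do the job. The toy data frame and the make_report() helper are made up for illustration; they are not the original survey data or the original code.

```r
# Toy stand-in for the survey subset; the values are invented for illustration.
set.seed(1)
toy <- data.frame(
  sex   = factor(sample(1:2, 50, replace = TRUE)),
  r_age = round(runif(50, 18, 80)),
  pearn = sample(c(NA, 2400, 9600, 14400), 50, replace = TRUE)
)

# Hypothetical helper: render the profiling report only if DataExplorer is installed.
make_report <- function(df, file = "report.html", dir = tempdir()) {
  if (!requireNamespace("DataExplorer", quietly = TRUE)) {
    message("DataExplorer is not installed; skipping the report.")
    return(invisible(FALSE))
  }
  # create_report() knits an HTML profiling report for the data frame.
  DataExplorer::create_report(df, output_file = file, output_dir = dir)
  invisible(TRUE)
}
```

Calling make_report(toy) should write report.html to a temporary directory, provided DataExplorer, rmarkdown and pandoc are available.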
Data Profiling Report
The data is 34.8 Kb in size. There are 453 rows and 12 columns (features). Of all 12 columns, 9 are discrete, 3 are continuous, and 0 are all missing. There are 1,245 missing values out of 5,436 data points.
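The figures in that summary are simple counts: 453 rows times 12 columns gives the 5,436 data points. A minimal base R sketch of the same bookkeeping follows. The summarise_frame() helper and the toy data are hypothetical, and the discrete/continuous split only roughly mirrors what DataExplorer does (numeric columns counted as continuous, everything else as discrete).

```r
# Hypothetical helper that roughly mirrors the report's summary sentence.
summarise_frame <- function(df) {
  is_cont <- vapply(df, is.numeric, logical(1))
  all_na  <- vapply(df, function(col) all(is.na(col)), logical(1))
  list(
    rows           = nrow(df),
    columns        = ncol(df),
    discrete       = sum(!is_cont & !all_na),
    continuous     = sum(is_cont & !all_na),
    all_missing    = sum(all_na),
    missing_values = sum(is.na(df)),
    data_points    = nrow(df) * ncol(df)
  )
}

# Invented toy data, not the survey itself.
toy <- data.frame(
  sex   = factor(c("1", "2", "2", NA)),
  r_age = c(40, 28, NA, 24),
  poor  = factor(c("0", "1", "1", "0"))
)
s <- summarise_frame(toy)
```

If DataExplorer is installed, DataExplorer::introduce(toy) should return a comparable one-row summary.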
Data Structure (Text)
## 'data.frame': 453 obs. of 12 variables:
## $ sex     : Factor w/ 2 levels "1","2": 2 1 2 1 2 2 1 2 2 1 ...
## $ race    : Factor w/ 4 levels "1","2","3","4": 3 1 1 2 3 3 3 4 1 4 ...
## $ educ_r  : Factor w/ 4 levels "1","2","3","4": 4 4 2 2 2 1 1 4 4 2 ...
## $ r_age   : num 40 28 22 24 31 42 36 63 69 24 ...
## $ hispanic: Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
## $ pearn   : num 14400 14400 12000 15000 8000 9600 2400 9600 NA NA ...
## $ assets  : num 5000 50000 4000 NA NA 6000 NA 1250 100000 NA ...
## $ poor    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
## $ read    : Factor w/ 4 levels "1","2","3","4": NA NA NA NA NA NA NA NA NA NA ...
## $ homework: Factor w/ 4 levels "1","2","3","4": NA NA NA NA 4 1 1 NA NA NA ...
## $ black   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ police  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 NA 2 2 ...
Data Structure (Network Graph)
The following graph shows the distribution of missing values.
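The numbers behind a missing-value chart like this are easy to tabulate by hand. Here is a small base R sketch that counts missing values per column and sorts by share missing. The missing_profile() helper and the toy data are invented for illustration; as far as I know, DataExplorer's own plot_missing() and profile_missing() produce the real thing.

```r
# Hypothetical helper: count and share of missing values for each column.
missing_profile <- function(df) {
  n_missing <- vapply(df, function(col) sum(is.na(col)), integer(1))
  out <- data.frame(
    feature     = names(df),
    num_missing = unname(n_missing),
    pct_missing = unname(n_missing) / nrow(df)
  )
  # Worst offenders first.
  out[order(-out$pct_missing), ]
}

# Invented toy data, not the survey itself.
toy <- data.frame(
  read  = factor(c(NA, NA, NA, "2")),
  pearn = c(14400, NA, 12000, 15000),
  sex   = factor(c("2", "1", "2", "1"))
)
prof <- missing_profile(toy)
```

Columns that are mostly missing, like read in the structure listing above, float to the top of the table and are the first candidates for dropping or a closer look.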
Continuous Features (Histogram)
Discrete Features (Bar Chart)
I am always looking for free alternatives to ArcGIS for making pretty maps. R is great for graphics and the new-to-me ggmap package is no exception.
I’m working with some data from Botswana for a contract and needed to plot maps for several years of count-based data, where the GPS coordinates of the facilities were known. ArcGIS is unwieldy for creating multiple maps of the same type of data at different time points, so R is an ideal choice. The trouble is that the maps I can easily make don’t look all that good (though with tweaking they can be made to look better).
ggmap offered me an easy solution. It downloads a topographic base map from Google, and I can easily overlay proportionally sized points representing counts at various geo-located points. This is just a map of Botswanan health facilities (downloaded from Humanitarian Data Exchange), with counts set to the squares of draws from a normal distribution. The results are rather nice.
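library(rgdal)    # readOGR
library(ggplot2)  # fortify, geoms and scales
library(ggmap)    # get_map, ggmap
library(scales)   # pretty_breaks
# read in geographic extent and boundary for Botswana
btw <- admin <- readOGR("GIS Layers/Admin", "BWA_adm2") # from DIVA-GIS
# fortify Botswana boundary for ggplot
btw_df <- fortify(btw)
# get a basemap
btw_basemap <- get_map(location = "botswana", zoom = 6)
# get the hf data (the line that loads HFs.open.street.map is missing from this copy of the post)
# create random counts (the line that fills the Counts column is missing from this copy of the post)
# Plot this dog
ggmap(btw_basemap) + # assumed first layer; the opening line of the plot call is missing from this copy
geom_polygon(data=btw_df, aes(x=long, y=lat, group=group), fill="red", alpha=0.1) +
geom_point(data=HFs.open.street.map, aes(x=X, y=Y, size=Counts, fill=Counts), shape=21, alpha=0.8) +
scale_size_continuous(range = c(2, 12), breaks=pretty_breaks(5)) +
scale_fill_distiller(breaks = pretty_breaks(5))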
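The lines that load the facilities and create the random counts did not survive in this copy of the post. As a stand-in, here is a minimal sketch of the proportional-points idea on invented data. The coordinates are random points inside a rough bounding box for Botswana (an assumption, not the real facility locations), and the mean and standard deviation of the normal draws are arbitrary choices. No Google basemap is involved, so it runs offline.

```r
set.seed(42)

# Invented facility locations inside a rough bounding box for Botswana.
hf <- data.frame(
  X = runif(20, min = 20, max = 29),    # longitude
  Y = runif(20, min = -27, max = -18)   # latitude
)

# Counts as squares of normal draws; mean and sd are arbitrary here.
hf$Counts <- rnorm(nrow(hf), mean = 10, sd = 4)^2

# Build the plot only if ggplot2 is installed.
p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(hf, ggplot2::aes(x = X, y = Y, size = Counts, fill = Counts)) +
    ggplot2::geom_point(shape = 21, alpha = 0.8) +
    ggplot2::scale_size_continuous(range = c(2, 12)) +
    ggplot2::scale_fill_distiller() +
    ggplot2::coord_quickmap()
}
```

Printing p draws the points on a blank panel. Swapping the ggplot() call for ggmap(btw_basemap) and passing hf as the data argument of geom_point() would put them on the downloaded base map.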
I keep staring at this picture, which appeared on “Economist’s View” last March, and wondering exactly what I’m supposed to learn from it, aside from the obvious fact that health care in the US is too expensive.
We have known that health care in the US is too expensive for a long while now. We are also pretty sure of the reasons why, none of which are easily solved.
But we shouldn’t assume that there is a causal relationship between health care expenditures and life expectancy. The message here seems to be that other countries increase their health budgets and their citizens live progressively longer, but for some reason it doesn’t work in the US. Well, I don’t think it works anywhere.
There’s no evidence to suggest that extra spending this year will increase life expectancy this year. If anything, it is long past expenditures and improvements to health care that will increase life expectancy today. I think that if we looked at overall economic growth and life expectancy, we would see the same trend. Most of us will live longer, because we were born under better conditions than our grandparents, not because of government spending for health care, the vast majority of which goes to the elderly.
What this does tell us is two things. First, health care in the US costs too much and its cost seems to be increasing without bound (math talk). Second, life expectancy in the US is shorter than in these other countries. This is true, but the US is a fundamentally different place from any of the countries on that list. Some of the difference has to do with social problems (racism), and some likely has to do with the fact that we take in larger numbers of immigrants from countries with low life expectancies than any country on that list does. These places aren’t comparable. While solving the problem of racism is noble, I don’t think that many people (except our President and his bigoted minions) want to suggest that we increase US life expectancy by deporting immigrants or closing the door to people from, say, Africa.
But we should be careful not to take home the message that there is an intrinsic relationship between spending and lifespan, because in my opinion that would be misleading.
Currently, I’m doing a research project on snakebites and found this gem in the literature, of which there is little:
“Snake bites are common in many regions of the world. Snake envenomation is relatively uncommon in Egypt; such unfortunate events usually attract much publicity. Snake bite is almost only accidental, occurring in urban areas and desert. Few cases were reported to commit suicide by snake. Homicidal snake poisoning is so rare. It was known in ancient world by executing capital punishment by throwing the victim into a pit full of snakes. Another way was to ask the victim to put his hand inside a small basket harboring a deadly snake. Killing a victim by direct snake bite is so rare. There was one reported case where an old couple was killed by snake bite. Here is the first reported case of killing three children by snake bite. It appeared that the diagnosis of such cases is so difficult and depended mainly on the circumstantial evidences.”
When does a person “ask” someone to “put his hand inside a small basket harboring a deadly snake?” Does that ever happen? Apparently so.
Apparently a man killed his three children using a snake.
It gets better:
“In deep police office investigations, it was found that the father disliked these three children as they were girls. He married another woman and had a male baby. The father decided to get rid of his girl children. To achieve his plan, he trained to become snake charmer and bought a snake (Egyptian cobra). The father forced the snake to bite the three children several times and left them to die. At last, he burned the snake.”
Paulis, M. G. and Faheem, A. L. (2016), Homicidal Snake Bite in Children. J Forensic Sci, 61: 559–561. doi:10.1111/1556-4029.12997
I was reading Chris Blattman’s blog this morning, where he had a cool post on the increasing use of development jargon in published material. Words like “impact,” “stakeholder,” and “capacity” are all over the place here on the continent.
These terms are so pervasive that people drop them in everyday conversation, almost creating a language of their own.
Honestly, I’m not really sure what “capacity” is supposed to mean, let alone who is and who isn’t a “stakeholder.” The cynical me says that a “stakeholder” is a person who is able to scrape off development funds into their own pockets, which seems to be a national pastime here. “Capacity” is as condescending as it sounds. Who decides who has the “capacity” to do things anyway? Are people who lack skills “incapacitated?”
The most annoying to me are “self help groups,” which are, in essence, simply small business cooperatives. I’m not sure why their existence has to be treated as righting some past individual wrong. Given that it is mostly illegal to have a business here in Kenya (due to onerous laws on trade left over from the Brits and overzealous bureaucrats looking for bribes), it is possible that a “self help group” simply avoids many of the most costly permitting laws, but it is more likely that a development group felt the need to give a fancy name to something completely normal.
That, however, is an aside.
If Google Trends is to be believed, interest in the development industry is waning in Kenya. I searched for trends in four terms, “capacity,” “sustainable development,” “stakeholder,” and the almighty “per diem.”
Development organizations often pay people to attend “seminars” on this or that topic in the form of “per diems,” which are often not small. A fairly educated Kenyan can make a decent wage from attending these seminars on a regular basis. Harri Englund of Churchill College wrote a cool book on the subject called “Prisoners of Freedom.”
Anyway, here’s the graph. I found it kind of reassuring. Countries like Kenya can’t claim independence while holding out their hands waiting for development money to come through. Kenya is not a poor country. It doesn’t need many of these development projects when it is perfectly able to stand on its own. If these trends are to be believed, there is reason to be hopeful.