While other blog posts will do a much better job of explaining the DataExplorer package in R, it still seemed useful to mention it here.
Data cleaning is a huge hurdle in data analysis, and a basic snapshot of the data helps in developing an efficient strategy for preparing it.
Enter the DataExplorer package, a set of tools that provides basic descriptive information for very little effort. With a single command, you can take a raw dataset and produce a report that helps you start planning your data cleaning.
I downloaded a portion of the Social Indicators Survey from Columbia University, and picked a small subset of variables.
Using this small bit of code, I produced the report below.
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, hispanic, pearn,
Data Profiling Report
The data is 34.8 Kb in size. There are 453 rows and 12 columns (features). Of all 12 columns, 9 are discrete, 3 are continuous, and 0 are all missing. There are 1,245 missing values out of 5,436 data points.
Data Structure (Text)
## 'data.frame': 453 obs. of 12 variables:
##  $ sex     : Factor w/ 2 levels "1","2": 2 1 2 1 2 2 1 2 2 1 ...
##  $ race    : Factor w/ 4 levels "1","2","3","4": 3 1 1 2 3 3 3 4 1 4 ...
##  $ educ_r  : Factor w/ 4 levels "1","2","3","4": 4 4 2 2 2 1 1 4 4 2 ...
##  $ r_age   : num 40 28 22 24 31 42 36 63 69 24 ...
##  $ hispanic: Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
##  $ pearn   : num 14400 14400 12000 15000 8000 9600 2400 9600 NA NA ...
##  $ assets  : num 5000 50000 4000 NA NA 6000 NA 1250 100000 NA ...
##  $ poor    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
##  $ read    : Factor w/ 4 levels "1","2","3","4": NA NA NA NA NA NA NA NA NA NA ...
##  $ homework: Factor w/ 4 levels "1","2","3","4": NA NA NA NA 4 1 1 NA NA NA ...
##  $ black   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ police  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 NA 2 2 ...
Data Structure (Network Graph)
The following graph shows the distribution of missing values.
Continuous Features (Histogram)
Discrete Features (Bar Chart)
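The headline numbers in the report's opening paragraph (rows, columns, missing values) are easy to check by hand. Here is a minimal base R sketch of that check. The helper name `basic_profile` is made up for illustration and is not part of DataExplorer, the discrete/continuous split is a simplification (anything non-numeric counts as discrete), and the built-in `airquality` dataset stands in for the survey extract.

```r
# Summarise a data frame roughly the way the report's first paragraph does.
# `basic_profile` is a hypothetical helper, not a DataExplorer function.
basic_profile <- function(df) {
  # Simplifying assumption: factor, character and logical columns are "discrete"
  is_discrete <- vapply(
    df,
    function(col) is.factor(col) || is.character(col) || is.logical(col),
    logical(1)
  )
  # Columns in which every value is NA
  all_missing <- vapply(df, function(col) all(is.na(col)), logical(1))

  list(
    rows           = nrow(df),
    columns        = ncol(df),
    discrete       = sum(is_discrete),
    continuous     = sum(!is_discrete),
    all_missing    = sum(all_missing),
    missing_values = sum(is.na(df)),
    data_points    = nrow(df) * ncol(df)
  )
}

# airquality ships with R and stands in for the survey extract here
profile <- basic_profile(airquality)
str(profile)
```

For the survey subset in the report, the same arithmetic gives 453 × 12 = 5,436 data points, of which 1,245 (about 22.9%) are missing.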
Not sure why, but over lunch I got interested in old labor songs. This one is particularly bleak. Apparently, it is meant to be sung to the tune of “My Bonnie Lies Over the Ocean.” As our administration erodes labor and environmental protections for the inexplicable sake of bringing back coal mining, it pays to look back at how bad it really was.
Song: My Children are Seven in Number
Lyrics: Eleanor Kellogg
Music: to the tune of “My Bonnie Lies Over the Ocean”
My children are seven in number,
We have to sleep four in a bed;
I’m striking with my fellow workers,
To get them more clothes and more bread.
Shoes, shoes, we’re striking for pairs of shoes,
Shoes, shoes, we’re striking for pairs of shoes.
Pellagra is cramping my stomach,
My wife is sick with TB;
My babies are starving for sweet milk,
Oh, there is so much sickness for me.
Milk, milk, we’re striking for gallons of milk,
Milk, milk, we’re striking for gallons of milk.
I’m needing a shave and a haircut,
But barbers I cannot afford;
My wife cannot wash without soapsuds,
And she had to borrow a board.
This song was originally posted on protestsonglyrics.net
Soap, soap, we’re striking for bars of soap,
Soap, soap, we’re striking for bars of soap.
My house is a shack on the hillside,
Its doors are unpainted and bare;
I haven’t a screen to my windows,
And carbide cans do for a chair.
Homes, homes, we’re striking for better homes,
Homes, homes, we’re striking for better homes.
They shot Barney Graham our leader,
His spirit abides with us still;
The spirit of strength for justice,
No bullets have power to kill.
Barney, Barney, we’re thinking of you today,
Barney, Barney, we’re thinking of you today.
Oh, miners, go on with the union,
Oh, miners, go on with the fight;
For we’re in the struggle for justice,
And we’re in the struggle for right.
Justice, justice, we’re striking for justice for all,
Justice, justice, we’re striking for justice for all.
I am always looking for free alternatives to ArcGIS for making pretty maps. R is great for graphics and the new-to-me ggmap package is no exception.
I’m working with some data from Botswana for a contract and needed to plot maps for several years of count-based data, where the GPS coordinates of the facilities were known. ArcGIS is unwieldy for creating multiple maps of the same type of data at different time points, so R is an ideal choice. The trouble is that the maps I can easily make don’t look all that good (though with tweaking they can be made to look better).
ggmap offered me an easy solution. It downloads a topographic base map from Google, and I can easily overlay proportionally sized points representing counts at various geo-located points. This is just a map of Botswanan health facilities (downloaded from Humanitarian Data Exchange), with counts generated as the square of draws from a normal distribution. The results are rather nice.
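# load the packages used below
library(rgdal)   # readOGR
library(ggplot2) # fortify, geoms and scales
library(ggmap)   # get_map, ggmap
library(scales)  # pretty_breaks
# read in geographic extent and boundary for bots
btw <- admin <- readOGR("GIS Layers/Admin", "BWA_adm2") # from DIVA-GIS
# fortify bots boundary for ggplot
btw_df <- fortify(btw)
# get a basemap
btw_basemap <- get_map(location = "botswana", zoom = 6)
# get the hf data
# (the code that loads HFs.open.street.map is not shown in this post)
# create random counts
# (the code that fills HFs.open.street.map$Counts is not shown in this post)
# Plot this dog
# (the opening ggmap() call is presumed; the listing as posted starts at geom_polygon)
ggmap(btw_basemap) +
  geom_polygon(data = btw_df, aes(x = long, y = lat, group = group), fill = "red", alpha = 0.1) +
  geom_point(data = HFs.open.street.map, aes(x = X, y = Y, size = Counts, fill = Counts), shape = 21, alpha = 0.8) +
  scale_size_continuous(range = c(2, 12), breaks = pretty_breaks(5)) +
  scale_fill_distiller(breaks = pretty_breaks(5))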
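The counts on the map are simulated, and the step that creates them is not shown in the listing. Here is a minimal base R sketch of one way to do it, on the assumption that "square of a normal draw" means what it says. The data frame `facilities`, its made-up coordinates, and the mean and standard deviation are all placeholders for illustration. Only the column names X, Y and Counts come from the plotting code.

```r
set.seed(42)  # make the simulated counts reproducible

# Hypothetical stand-in for the facility table. The coordinates are uniform
# draws inside a rough bounding box for Botswana, not real facility locations.
n_facilities <- 50
facilities <- data.frame(
  X = runif(n_facilities, min = 20, max = 29),   # longitude
  Y = runif(n_facilities, min = -27, max = -18)  # latitude
)

# Square a normal draw so every count is non-negative, then round to whole numbers.
# The mean and sd are arbitrary choices, not values from the post.
facilities$Counts <- round(rnorm(n_facilities, mean = 10, sd = 3)^2)

summary(facilities$Counts)
```

Squaring keeps the values non-negative and skews them to the right, which gives the point sizes on the map some visible spread.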
I keep staring at this picture, which appeared on “Economist’s View” last March, and wondering exactly what I’m supposed to learn from it, aside from the obvious fact that health care in the US is too expensive.
We have known that health care in the US is too expensive for a long while now. We are also pretty sure of the reasons why, none of which are easily solved.
But we shouldn’t assume that there is a causal relationship between health care expenditures and life expectancy. The message here seems to be that other countries increase their health budgets and their citizens live progressively longer, but for some reason it doesn’t work in the US. Well, I don’t think it works anywhere.
There’s no evidence to suggest that extra spending this year will increase life expectancy this year. If anything, it is long-past expenditures and improvements to health care that increase life expectancy today. I think that if we looked at overall economic growth and life expectancy, we would see the same trend. Most of us will live longer because we were born under better conditions than our grandparents, not because of government spending on health care, the vast majority of which goes to the elderly.
What this tells us, though, is two things. First, health care in the US costs too much, and the cost seems to be increasing without bound (math talk). Second, life expectancy in the US is shorter than in these other countries. This is true, but the US is a fundamentally different place from any of the countries on that list. Some of the difference has to do with social problems (racism), and some likely has to do with the fact that we take in larger numbers of immigrants from countries with low life expectancies than any country on that list. These places aren’t comparable. While solving the problem of racism is noble, I don’t think that many people (except our President and his bigoted minions) want to suggest that we increase US life expectancy by deporting immigrants or closing the door to people from, say, Africa.
But we should be careful not to take home the message that there is an intrinsic relationship between spending and lifespan, because that would, in my opinion, be misleading.
Currently, I’m doing a research project on snakebites and found this gem in the literature, of which there is little:
“Snake bites are common in many regions of the world. Snake envenomation is relatively uncommon in Egypt; such unfortunate events usually attract much publicity. Snake bite is almost only accidental, occurring in urban areas and desert. Few cases were reported to commit suicide by snake. Homicidal snake poisoning is so rare. It was known in ancient world by executing capital punishment by throwing the victim into a pit full of snakes. Another way was to ask the victim to put his hand inside a small basket harboring a deadly snake. Killing a victim by direct snake bite is so rare. There was one reported case where an old couple was killed by snake bite. Here is the first reported case of killing three children by snake bite. It appeared that the diagnosis of such cases is so difficult and depended mainly on the circumstantial evidences.”
When does a person “ask” someone to “put his hand inside a small basket harboring a deadly snake?” Does that ever happen? Apparently so.
Apparently a man killed his three children using a snake.
It gets better:
“In deep police office investigations, it was found that the father disliked these three children as they were girls. He married another woman and had a male baby. The father decided to get rid of his girl children. To achieve his plan, he trained to become snake charmer and bought a snake (Egyptian cobra). The father forced the snake to bite the three children several times and left them to die. At last, he burned the snake.”
Paulis, M. G. and Faheem, A. L. (2016), Homicidal Snake Bite in Children. J Forensic Sci, 61: 559–561. doi:10.1111/1556-4029.12997