DataExplorer package in R

Other blog posts explain the DataExplorer package in R much better than I can, but it still seemed useful to mention it here.

Data cleaning is a huge hurdle in data analysis, and a basic snapshot of the data helps in developing an efficient strategy for preparing it.

Enter the DataExplorer package, a set of tools that provides basic descriptive information for very little effort. With a single command, you can take a raw dataset and produce a report that helps you plan your data cleaning attack.

I downloaded a portion of the Social Indicators Survey from Columbia University and picked a small subset of variables.

Using this short piece of code, I produced the report below.

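# Load the package that provides create_report()
library(DataExplorer)

# Read the survey data (the file must be in the working directory)
sis <- read.csv("siswave3v4impute3.csv")

# Keep a small subset of variables; cbind() builds a matrix first,
# and as.data.frame() converts it back to a data frame
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, hispanic, pearn,
                                        assets, poor, read, homework, black, police)))

# Generate the report
create_report(sis_sm)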

Basic Statistics

The data is 34.8 Kb in size. There are 453 rows and 12 columns (features). Of all 12 columns, 9 are discrete, 3 are continuous, and 0 are all missing. There are 1,245 missing values out of 5,436 data points.
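The counts in this summary are simple to reproduce by hand: 453 rows times 12 columns gives the 5,436 data points. As a rough sketch of the same arithmetic in base R, here is how the numbers could be computed. It uses the built-in airquality dataset as a stand-in (an assumption for illustration, not the SIS data), and the variable names are my own.

```r
# Toy example on a built-in dataset (airquality), not the SIS data from the post.
df <- airquality

n_rows    <- nrow(df)          # number of observations
n_cols    <- ncol(df)          # number of columns (features)
n_cells   <- n_rows * n_cols   # total data points
n_missing <- sum(is.na(df))    # total missing values

# Number of columns in which every single value is missing
n_all_missing <- sum(colSums(is.na(df)) == n_rows)

cat(n_rows, "rows,", n_cols, "columns,",
    n_missing, "missing values out of", n_cells, "data points,",
    n_all_missing, "columns entirely missing\n")
```

Swapping `airquality` for `sis_sm` should give the figures from the report.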

Data Structure (Text)

## 'data.frame':    453 obs. of  12 variables:
##  $ sex     : Factor w/ 2 levels "1","2": 2 1 2 1 2 2 1 2 2 1 ...
##  $ race    : Factor w/ 4 levels "1","2","3","4": 3 1 1 2 3 3 3 4 1 4 ...
##  $ educ_r  : Factor w/ 4 levels "1","2","3","4": 4 4 2 2 2 1 1 4 4 2 ...
##  $ r_age   : num  40 28 22 24 31 42 36 63 69 24 ...
##  $ hispanic: Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
##  $ pearn   : num  14400 14400 12000 15000 8000 9600 2400 9600 NA NA ...
##  $ assets  : num  5000 50000 4000 NA NA 6000 NA 1250 100000 NA ...
##  $ poor    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
##  $ read    : Factor w/ 4 levels "1","2","3","4": NA NA NA NA NA NA NA NA NA NA ...
##  $ homework: Factor w/ 4 levels "1","2","3","4": NA NA NA NA 4 1 1 NA NA NA ...
##  $ black   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ police  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 NA 2 2 ...

Data Structure (Network Graph)

Missing Values

The following graph shows the distribution of missing values.
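If you want the numbers behind a graph like this, a per-column tally of missing values is easy to build in base R. This is only a sketch: it uses the built-in airquality dataset rather than the SIS subset, and `missing_summary` is a name I made up for the example.

```r
# Toy example on a built-in dataset (airquality), not the SIS data from the post.
df <- airquality

# One row per column: how many values are missing, and what share of the rows
missing_summary <- data.frame(
  feature     = names(df),
  num_missing = colSums(is.na(df)),
  pct_missing = colMeans(is.na(df)),
  row.names   = NULL
)

# Put the columns with the most missing values first
missing_summary <- missing_summary[order(-missing_summary$num_missing), ]
print(missing_summary)
```

Columns near the top of this table are the ones to look at first when planning the cleaning.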

Data Distribution

Continuous Features (Histogram)

Discrete Features (Bar Chart)

Correlation Analysis
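The sections of the report can also be produced one at a time, which is handy when you only want a single plot. The sketch below calls the individual DataExplorer functions on the built-in airquality dataset, with Month converted to a factor so that there is one discrete feature. That dataset is a stand-in for illustration, not the SIS subset, and the choice of pairwise-complete correlations is my assumption rather than what the report does by default.

```r
library(DataExplorer)

# Toy data: built-in airquality with Month treated as a discrete feature.
# This is a stand-in, not the SIS subset used in the post.
df <- transform(airquality, Month = factor(Month))

# Basic statistics: a one-row data frame with rows, columns, missing counts, etc.
intro <- introduce(df)
print(intro)

plot_missing(df)     # share of missing values per feature
plot_histogram(df)   # histograms of the continuous features
plot_bar(df)         # bar charts of the discrete features

# Correlations among continuous features; pairwise-complete observations
# keep missing values from turning the correlations into NA
plot_correlation(df, type = "continuous",
                 cor_args = list(use = "pairwise.complete.obs"))
```

Replacing `df` with `sis_sm` should give plots along the lines of the report sections above.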

About Pete Larson

Researcher at the University of Michigan Institute for Social Research. Lecturer in the University of Michigan School of Public Health and at the University of Massachusetts Amherst. I do epidemiology, public health, GIS, health disparities and environmental justice. I also do music and weird stuff.
