Archive | February 2018

The DataExplorer package in R

While other blog posts will do a much better job of explaining the DataExplorer package in R, it still seemed useful to mention it here.

A huge hurdle in data analysis is data cleaning. Before you can plan how to prepare a dataset for analysis, it helps to have a basic snapshot of the data.

Enter the DataExplorer package, a set of tools that provides basic descriptive information with very little effort. With a single command, you can take a raw dataset and produce a report that helps you start planning your data cleaning.

I downloaded a portion of the Social Indicators Survey from Columbia University, and picked a small subset of variables.

With this short piece of code, I produced the report below.

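library(DataExplorer)

# Read the survey data (the file must be in the working directory)
sis <- read.csv("siswave3v4impute3.csv")

# Keep a small subset of variables
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, hispanic, pearn,
                                        assets, poor, read, homework, black, police)))

# Generate the report
create_report(sis_sm)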

Basic Statistics

The data is 34.8 Kb in size. There are 453 rows and 12 columns (features). Of all 12 columns, 9 are discrete, 3 are continuous, and 0 are all missing. There are 1,245 missing values out of 5,436 data points.
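The counts in this summary are easy to check by hand with base R. Here is a minimal sketch that uses a small made-up data frame (the name toy and its values are only for illustration, not the survey data) and then repeats the arithmetic on the figures reported above.

```r
# Toy data frame standing in for sis_sm (values are made up for illustration)
toy <- data.frame(
  sex    = factor(c(2, 1, 2, 1)),
  r_age  = c(40, 28, 22, 24),
  assets = c(5000, 50000, 4000, NA),
  police = factor(c(1, 1, 0, NA))
)

n_rows       <- nrow(toy)
n_cols       <- ncol(toy)
n_cells      <- n_rows * n_cols            # total number of data points
n_missing    <- sum(is.na(toy))            # total number of missing values
n_discrete   <- sum(vapply(toy, is.factor, logical(1)))
n_continuous <- sum(vapply(toy, is.numeric, logical(1)))

cat(n_rows, "rows,", n_cols, "columns,",
    n_missing, "missing out of", n_cells, "data points\n")

# The same arithmetic applied to the figures in the report
report_cells         <- 453 * 12           # rows times columns
report_missing_share <- 1245 / report_cells
cat("Share of missing values in the report:",
    round(100 * report_missing_share, 1), "%\n")
```

So roughly 23% of the data points in this subset are missing, which is worth knowing before any modelling starts.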

Data Structure (Text)

## 'data.frame':    453 obs. of  12 variables:
##  $ sex     : Factor w/ 2 levels "1","2": 2 1 2 1 2 2 1 2 2 1 ...
##  $ race    : Factor w/ 4 levels "1","2","3","4": 3 1 1 2 3 3 3 4 1 4 ...
##  $ educ_r  : Factor w/ 4 levels "1","2","3","4": 4 4 2 2 2 1 1 4 4 2 ...
##  $ r_age   : num  40 28 22 24 31 42 36 63 69 24 ...
##  $ hispanic: Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
##  $ pearn   : num  14400 14400 12000 15000 8000 9600 2400 9600 NA NA ...
##  $ assets  : num  5000 50000 4000 NA NA 6000 NA 1250 100000 NA ...
##  $ poor    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
##  $ read    : Factor w/ 4 levels "1","2","3","4": NA NA NA NA NA NA NA NA NA NA ...
##  $ homework: Factor w/ 4 levels "1","2","3","4": NA NA NA NA 4 1 1 NA NA NA ...
##  $ black   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ police  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 NA 2 2 ...

Data Structure (Network Graph)

[Figure: network graph of the data structure]

Missing Values

The following graph shows the distribution of missing values.

[Figure: missing values by feature]
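If you want the numbers behind a chart like this, the share of missing values per column can be computed in base R. This is a minimal sketch on a made-up data frame (toy and its values are hypothetical), not the code the report itself runs.

```r
# Toy data frame standing in for sis_sm (values are made up for illustration)
toy <- data.frame(
  r_age    = c(40, 28, 22, 24, 31),
  pearn    = c(14400, 14400, 12000, NA, NA),
  homework = factor(c(NA, NA, NA, NA, 4))
)

# Share of missing values per column, sorted from most to least missing
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 2))
```

Columns near the top of this list are the first candidates for dropping or imputing. DataExplorer's plot_missing() draws the same kind of per-column summary as a chart.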

Data Distribution

Continuous Features (Histogram)

[Figure: histograms of the continuous features]

Discrete Features (Bar Chart)

[Figure: bar charts of the discrete features]

Correlation Analysis

[Figure: correlation matrix]
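The plots in the report can also be drawn one at a time with the package's individual plotting functions. The sketch below is only an illustration: it assumes DataExplorer is installed, uses R's built-in airquality data instead of the survey file, and I have not checked that these calls are exactly what create_report() runs internally.

```r
# Built-in example data; Month is turned into a factor so there is
# at least one discrete feature to plot
df <- airquality
df$Month <- factor(df$Month)

# Only run the plots if DataExplorer is available
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  library(DataExplorer)

  plot_missing(df)      # share of missing values per column
  plot_histogram(df)    # histograms of the continuous features
  plot_bar(df)          # bar charts of the discrete features

  # Drop incomplete rows first so the correlations are not NA
  plot_correlation(na.omit(df), type = "continuous")
}
```

Running the pieces separately is handy when you only need one plot, or when you want to look at a single section again after cleaning a few variables.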

 

 
