Exploratory data analysis aims to explore and describe the patterns in your data without testing any particular hypothesis.

Good introductions on this topic

Some examples in R

Exploring categorical data

Summary statistics

A summary statistic is a piece of information that gives a quick and simple description of the data.

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
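# extract the transmission variable am (0 = automatic, 1 = manual) from the dataset with "$"
# and store it as a labelled factor, both in the data frame and as a standalone vector
mtcars$fam <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
fam <- mtcars$fam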
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                         fam
## Mazda RX4            manual
## Mazda RX4 Wag        manual
## Datsun 710           manual
## Hornet 4 Drive    automatic
## Hornet Sportabout automatic
## Valiant           automatic
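
The listing above can be reproduced with a short sketch along the following lines. It assumes the built-in mtcars dataset and works on a copy called `cars` (a name chosen here purely for illustration) so that the original data frame stays untouched. The cross-tabulation at the end is an added sanity check and is not part of the original output.

```r
# Work on a copy of the built-in mtcars data (the name `cars` is illustrative)
cars <- mtcars

# Recode am (0 = automatic, 1 = manual) as a labelled factor
cars$fam <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# First six rows, now including the new fam column
head(cars)

# Sanity check: every 0 should map to "automatic" and every 1 to "manual"
recode_check <- table(am = cars$am, fam = cars$fam)
recode_check
```

If the recoding worked, the off-diagonal cells of `recode_check` are zero.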

Frequency table of the Transmission variable

## fam
## automatic    manual 
##        19        13

Relative frequencies (proportions)

## fam
## automatic    manual 
##   0.59375   0.40625
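
The commands behind these two tables are not shown, so the following is only a sketch of one way to obtain them. It assumes the transmission factor is built directly from mtcars, and it stores the proportions under the name `percent` on the assumption that this is the object later passed to barplot.

```r
# Recode transmission as a labelled factor (same recoding as earlier)
fam <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Absolute frequencies: number of cars per transmission type
freq <- table(fam)
freq

# Relative frequencies: each count divided by the total number of cars
percent <- prop.table(freq)   # equivalent to freq / sum(freq)
percent

# Multiply by 100 to express the proportions as percentages
round(100 * percent, 1)
```

Note that `prop.table` returns proportions between 0 and 1, which is why a y-axis limit of `c(0, 1)` suits a bar chart of these values.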

Bar charts

Bar charts are appropriate for summarizing the distribution of a categorical variable.

# percent holds the relative frequencies (proportions) of each transmission type
barplot(percent, main="Proportion of cars by transmission type", xlab="transmission", ylab="proportion", las=1, ylim=c(0,1), names.arg=c("auto transm", "manual transm"))

Exploring continuous / categorical data

Summary statistics

For a numerical variable, like “mpg”

## [1] 20.09062
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
sd(mpg) #standard deviation
## [1] 6.026948
var(mpg) #variance
## [1] 36.3241
sqrt(var(mpg)) # same as sd(mpg)
## [1] 6.026948
sd(mpg)^2 # same as var(mpg)
## [1] 36.3241
## [1] 33.9
## automatic    manual 
##  17.14737  24.39231
##                  3      4     5
## automatic 16.10667 21.050    NA
## manual          NA 26.275 21.38
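
Several of the outputs in this section appear without the commands that produced them. The sketch below shows calls that are consistent with those outputs. It is an assumption about the original code, not a reproduction of it, and it takes `mpg` and the transmission factor straight from mtcars instead of relying on an attached data frame.

```r
# Assumption: mpg and the transmission factor come directly from mtcars
mpg <- mtcars$mpg
fam <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

mean(mpg)      # arithmetic mean
summary(mpg)   # minimum, quartiles, median, mean, maximum
max(mpg)       # largest value

# Mean mpg within each transmission type
by_trans <- tapply(mpg, fam, mean)
by_trans

# Mean mpg for each transmission / gear combination;
# NA marks combinations with no cars in the data
by_gear <- tapply(mpg, list(fam, mtcars$gear), mean)
by_gear
```

In the two-way table the rows are transmission types and the columns are the number of forward gears (3, 4, 5).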


Boxplots

Boxplots are appropriate for summarizing the distribution of a numerical variable, optionally split by the levels of a categorical variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
##     0%    25%    50%    75%   100% 
## 10.400 15.425 19.200 22.800 33.900
##    0%   20%   40%   60%   80%  100% 
## 10.40 15.20 17.92 21.00 24.08 33.90
boxplot(mpg~fam, main="mpg by transmission")
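
The quantile tables above are shown without their commands. A minimal sketch of calls that would give tables of this shape, assuming `mpg` is taken from mtcars, is:

```r
# Assumption: mpg comes directly from the built-in mtcars data
mpg <- mtcars$mpg

# Default quantiles: minimum, quartiles and maximum (0%, 25%, 50%, 75%, 100%)
quartiles <- quantile(mpg)
quartiles

# Custom cut points: quintiles, in steps of 20%
quintiles <- quantile(mpg, probs = seq(0, 1, by = 0.2))
quintiles
```

The box in a boxplot spans roughly the 25% to 75% range, with the median drawn as the line inside it.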


Histograms

Histograms are appropriate for summarizing the distribution of a numerical variable.
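
No histogram code survives in this section, so the following is only a sketch of how one might be drawn for `mpg`. The title and axis label are illustrative choices, and the bin boundaries are left to R's default rule.

```r
# Assumption: mpg comes directly from the built-in mtcars data
mpg <- mtcars$mpg

# Draw the histogram and keep the returned object for inspection
h <- hist(mpg, main = "Distribution of mpg", xlab = "Miles/(US) gallon", las = 1)

# Bin boundaries and the number of cars falling into each bin
h$breaks
h$counts
```

Passing a different `breaks` argument to `hist` changes the number of bins, which can noticeably change the apparent shape of the distribution.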


Stem and Leaf Plots

Stem and leaf plots are appropriate for summarizing the distribution of a numerical variable when the sample size is small.

##   The decimal point is at the |
##   10 | 44
##   12 | 3
##   14 | 3702258
##   16 | 438
##   18 | 17227
##   20 | 00445
##   22 | 88
##   24 | 4
##   26 | 03
##   28 | 
##   30 | 44
##   32 | 49

See ?stem for more information. Reading the plot: the first row shows two observations of 10.4, and the last row shows one observation of 32.4 and one of 32.9.
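
The plot above matches what `stem` prints for the mpg column of mtcars. As a sketch, assuming that is how it was produced, the call and a variant with a longer plot look like this:

```r
# Stem and leaf plot of mpg with the default settings
stem(mtcars$mpg)

# scale controls the plot length: 2 makes it roughly twice as long,
# so the observations are spread over more stems
stem(mtcars$mpg, scale = 2)
```

A larger `scale` is useful when many leaves pile up on a few stems.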


Scatterplots

Scatterplots are appropriate for summarizing the relationship between two numerical variables.

Relationship between horsepower (hp) and fuel economy (mpg, miles per gallon)

plot(mpg~hp) # y~x

plot(hp, mpg) # x,y

plot(hp, mpg, xlab="Gross horsepower", ylab="Miles/(US) gallon", las=1, col="red", xlim=c(0,400), cex=2)

# cex=2 doubles the size of the plotting characters
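
To make the trend in the scatterplot easier to see, one option is to overlay a smoothed line and to compute the correlation as a single-number summary. This goes slightly beyond the plots above and is only a sketch: the smoother (`lowess`) and the line colour are illustrative choices, and `hp` and `mpg` are taken directly from mtcars.

```r
# Assumption: both variables come directly from the built-in mtcars data
hp  <- mtcars$hp
mpg <- mtcars$mpg

# Scatterplot with horsepower on the x-axis and mpg on the y-axis
plot(hp, mpg, xlab = "Gross horsepower", ylab = "Miles/(US) gallon", las = 1)

# Overlay a locally weighted smoother to show the overall trend
lines(lowess(hp, mpg), col = "blue")

# Pearson correlation between the two variables
r <- cor(hp, mpg)
r
```

A negative value of `r` indicates that cars with more horsepower tend to travel fewer miles per gallon.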