Hypothesis testing is one of the most widely used approaches of statistical inference.

The idea of hypothesis testing (more formally: null hypothesis significance testing - NHST) is the following: if we have some data observed, and we have a statistical model, we can use this statistical model to specify a fixed hypothesis about how the data did arise. For the example with the plants and music, this hypothesis could be: music has no influence on plants, all differences we see are due to random variation between individuals.

The null hypothesis H0 and the alternative hypothesis H1

Such a scenario is called the null hypothesis H0. Although it is very typical to use the assumption of no effect as null-hypothesis, note that it is really your choice, and you could use anything as null hypothesis, also the assumption: “classical music doubles the growth of plants”. The fact that it’s the analyst’s choice what to fix as null hypothesis is part of the reason why there are are a large number of tests available. We will see a few of them in the following chapter about important hypothesis tests.

The hypothesis that H0 is wrong, or !H0, is usually called the alternative hypothesis, H1

Given a statistical model, a “normal” or “simple” null hypothesis specifies a single value for the parameter of interest as the “base expectation”. A composite null hypothesis specifies a range of values for the parameter.

p-value

If we have a null hypothesis, we calculate the probability that we would see the observed data or data more extreme under this scenario. This is called a hypothesis tests, and we call the probability the p-value. If the p-value falls under a certain level (the significance level $\alpha$) we say the null hypothesis was rejected, and there is significant support for the alternative hypothesis. The level of $\alpha$ is a convention, in ecology we chose typically 0.05, so if a p-value falls below 0.05, we can reject the null hypothesis.

Test Statistic

Type I and II error

Significance level, Power

Misinterpretations

A problem with hypothesis tests and p-values is that their results are notoriously misinterpreted. The p-value is NOT the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false, although many authors have made the mistake of interpreting it like that \citep[][]{Cohen-earthisround-1994}. Rather, the idea of p-values is to control the rate of false positives (Type I error). When doing hypothesis tests on random data, with an $\alpha$ level of 0.05, one will get exactly 5\% false positives. Not more and not less.

Further readings

  • The Essential Statistics lecture notes
  • http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html

Examples in R

Recall statistical tests, or more formally, null-hypothesis significance testing (NHST) is one of several ways in which you can approach data. The idea is that you define a null-hypothesis, and then you look a the probability that the data would occur under the assumption that the null hypothesis is true.

Now, there can be many null hypothesis, so you need many tests. The most widely used tests are given here.

library(knitr)
opts_knit$set(global.par=TRUE) 
opts_chunk$set(cache.extra = rand_seed,fig.align='center')
set.seed(12)

t-test

The t-test can be used to test whether one sample is different from a reference value (e.g. 0: one-sample t-test), whether two samples are different (two-sample t-test) or whether two paired samples are different (paired t-test).

The t-test assumes that the data are normally distributed. It can handle samples with same or different variances, but needs to be “told” so.

t-test for 1 sample (PARAMETRIC TEST)

The one-sample t-test compares the MEAN score of a sample to a known value, usually the population MEAN (the average for the outcome of some population of interest).

data = read.table("../Data/Simone/das.txt",header=T)
attach(data)
boxplot(y)

Our null hypothesis is that the mean of the sample is not less than 2.5 (real example: weight data of 200 lizards collected for a research, we want to compare it with the known average weights available in the scientific literature)

t.test(y,mu=2.5,alt="less",conf=0.95)  # mean = 2.5, alternative hypothesis one-sided; we get a one-sided 95% CI for the mean 
## 
## 	One Sample t-test
## 
## data:  y
## t = -3.3349, df = 99, p-value = 0.000601
## alternative hypothesis: true mean is less than 2.5
## 95 percent confidence interval:
##      -Inf 2.459557
## sample estimates:
## mean of x 
##  2.419456
t.test(y,mu=2.5,alt="two.sided",conf=0.95) #2 sided-version
## 
## 	One Sample t-test
## 
## data:  y
## t = -3.3349, df = 99, p-value = 0.001202
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
##  2.371533 2.467379
## sample estimates:
## mean of x 
##  2.419456
detach(data)

t-test for 1 sample (NON-PARAMETRIC TEST)

One-sample Wilcoxon signed rank test is a non-parametric alternative method of one-sample t-test, which is used to test whether the location (MEDIAN) of the measurement is equal to a specified value

Create fake data log-normally distributed and verify data distribution

x<-exp(rnorm(15))
plot(x)

boxplot(x)

qqnorm(x)
qqline(x,lty=2,col=2,lwd=3)

shapiro.test(x) # not normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  x
## W = 0.5674, p-value = 1.297e-05
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1357  0.3913  0.5335  0.9005  0.8788  4.8410

Our null hypothesis is that the median of x is not different from 1

wilcox.test(x, alternative="two.sided", mu=1) # high p value-> median of x is not different from 1
## 
## 	Wilcoxon signed rank test
## 
## data:  x
## V = 23, p-value = 0.03534
## alternative hypothesis: true location is not equal to 1

Two Independent Samples T-test (PARAMETRIC TEST)

Parametric method for examining the difference in MEANS between two independent populations. The t-test should be preceeded by a graphical depiction of the data in order to check for normality within groups and for evidence of heteroscedasticity (= differences in variance), like so:

Reshape the data:

Y1 <- rnorm(20)
Y2 <- rlnorm(20)
Y <- c(Y1, Y2)
groups <- as.factor(c(rep("Y1", length(Y1)), rep("Y2", length(Y2))))

Now plot them as points (not box-n-whiskers):

plot.default(Y ~ groups)

The points to the right scatter similar to those on the left, although a bit more asymmetrically. Although we know that they are from a log-normal distribution (right), they don’t look problematic.

If data are not normally distributed, we sometimes succeed making data normal by using transformations, such as square-root, log, or alike (see section on transformations).

While t-tests on transformed data now actually test for differences between these transformed data, that is typically fine. Think of the pH-value, which is only a log-transform of the proton concentration. Do we care whether two treatments are different in pH or in proton concentrations? If so, then we need to choose the right data set. Most likely, we don’t and only choose the log-transform because the data are actually lognormally distributed, not normally.

A non-parametric alternative is the Mann-Whitney-U-test, or, the ANOVA-equivalent, the Kruskal-Wallis test. Both are available in R and explained later, but instead we recommend the following:

Use rank-transformations, which replaces the values by their rank (i.e. the lowest value receives a 1, the second lowest a 2 and so forth). A t-test of rank-transformed data is not the same as the Mann-Whitney-U-test, but it is more sensitive and hence preferable (Ruxton 2006) or at least equivalent (Zimmerman 2012).

t.test(rank(Y) ~ groups)
## 
## 	Welch Two Sample t-test
## 
## data:  rank(Y) by groups
## t = -3.9252, df = 35.266, p-value = 0.000384
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.811615  -5.988385
## sample estimates:
## mean in group Y1 mean in group Y2 
##             14.3             26.7

To use the rank, we need to employ the “formula”-invokation of t.test! In this case, results are the same, indicating that our hunch about acceptable skew and scatter was correct.

(Note that the original t-test is a test for differences between means, while the rank-t-test becomes a test for general differences in values between the two groups, not specifically of the mean.)

Cars example:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$fam=factor(mtcars$am, levels=c(0,1), labels=c("automatic","manual")) #transform "am" into a factor (fam) [Transmission (0 = automatic, 1 = manual)]

attach(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                         fam
## Mazda RX4            manual
## Mazda RX4 Wag        manual
## Datsun 710           manual
## Hornet 4 Drive    automatic
## Hornet Sportabout automatic
## Valiant           automatic
summary(mtcars$fam)
## automatic    manual 
##        19        13

Test the difference in car consumption depending on the transmission type. Check wherever the 2 ‘independent populations’ are normally distributed

par(mfrow=c(1,2))
qqnorm(mpg[fam=="manual"]);qqline(mpg[fam=="manual"])
qqnorm(mpg[fam=="automatic"]); qqline(mpg[fam=="automatic"])

shapiro.test(mpg[fam=="manual"])  #normal distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  mpg[fam == "manual"]
## W = 0.9458, p-value = 0.5363
shapiro.test(mpg[fam=="automatic"])   #normal distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  mpg[fam == "automatic"]
## W = 0.9768, p-value = 0.8987
par(mfrow=c(1,1))

Graphic representation

boxplot(mpg~fam, ylab="Miles/gallon",xlab="Transmission",las=1)

alt text

We have two ~normally distributed populations. In order to test for differences in means, we applied a t-test for independent samples.

Any time we work with the t-test, we have to verify whether the variance is equal betwenn the 2 populations or not, then we fit the t-test accordingly.
Our Ho or null hypothesis is that the consumption is the same irrespective to transmission. We assume non-equal variances

t.test(mpg~fam,mu=0, alt="two.sided",conf=0.95,var.eq=F,paired=F)
## 
## 	Welch Two Sample t-test
## 
## data:  mpg by fam
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

From the output: please note that CIs are the confidence intervales for differences in means

Same results if you run the following (meaning that the other commands were all by default)

t.test(mpg~fam) 
## 
## 	Welch Two Sample t-test
## 
## data:  mpg by fam
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

The alternative could be one-sided (greater, lesser) as we discussed earlier for one-sample t-tests

If we assume equal variance, we run the following

t.test(mpg~fam,var.eq=TRUE,paired=F)
## 
## 	Two Sample t-test
## 
## data:  mpg by fam
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

Ways to check for equal / not equal variance

1) To examine the boxplot visually

boxplot(mpg~fam, ylab="Miles/gallon",xlab="Transmission",las=1)

2) To compute the actual variance

var(mpg[fam=="manual"])
## [1] 38.02577
var(mpg[fam=="automatic"])
## [1] 14.6993

There is 2/3 times difference in variance.

3) Levene’s test

library(car)
leveneTest(mpg~fam) # Ho is that the population variances are equal
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  4.1876 0.04957 *
##       30                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Ho rejected, non-equal variances

Mann-Whitney U test/Wilcoxon rank-sum test for two independent samples (NON-PARAMETRIC TEST)

We change the response variable to hp (Gross horsepower)

qqnorm(hp[fam=="manual"]);qqline(hp[fam=="manual"])

qqnorm(hp[fam=="automatic"]);qqline(hp[fam=="automatic"])

shapiro.test(hp[fam=="manual"])  #not normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  hp[fam == "manual"]
## W = 0.7676, p-value = 0.00288
shapiro.test(hp[fam=="automatic"])  #normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  hp[fam == "automatic"]
## W = 0.9583, p-value = 0.5403

The ‘population’ of cars with manual transmission has a hp not normally distributed, so we have to use a test for independent samples - non-parametric

We want to test a difference in hp depending on the transmission Using a non-parametric test, we test for differences in MEDIANS between 2 independent populations

boxplot(hp~fam)

Our null hypothesis will be that the medians are equal (two-sided)

wilcox.test(hp~fam,mu=0,alt="two.sided",conf.int=T,conf.level=0.95,paired=F,exact=F) ## non parametric conf.int are reported
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  hp by fam
## W = 176, p-value = 0.0457
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  8.624237e-05 9.199995e+01
## sample estimates:
## difference in location 
##               55.00007
detach(mtcars)

Wilcoxon signed rank test for two dependend samples (NON PARAMETRIC)

This is a non-parametric method appropriate for examining the median difference in 2 populations observations that are paired or dependent one of the other.

This is a dataset about some water measurements taken at different levels of a river: ‘up’ and ‘down’ are water quality measurements of the same river taken before and after a water treatment filter, respectively

streams = read.table("../Data/Simone/streams.txt",header=T)
head(streams)
##   down up
## 1   20 23
## 2   15 16
## 3    6 10
## 4    5  4
## 5   20 22
## 6   15 15
attach(streams)
summary(streams)
##       down             up       
##  Min.   : 5.00   Min.   : 4.00  
##  1st Qu.: 5.75   1st Qu.: 9.25  
##  Median :12.50   Median :13.00  
##  Mean   :12.00   Mean   :13.38  
##  3rd Qu.:15.75   3rd Qu.:17.25  
##  Max.   :20.00   Max.   :23.00
plot(up,down)
abline(a=0,b=1) #add a line with intercept 0 and slope 1

The line you see in the plot corresponds to x=y, that is, the same water measuremets before and after the water treatment (it seems to be true in 2 rivers only, 5 and 15)

Our null hypothesis is that the median before and after the treatment are not different

shapiro.test(down) #not normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  down
## W = 0.866, p-value = 0.02367
shapiro.test(up) #normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  up
## W = 0.9361, p-value = 0.3038

the assumption of normality is certainly not met for the measurements after the treatment

summary(up)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    9.25   13.00   13.38   17.25   23.00
summary(down)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.75   12.50   12.00   15.75   20.00
wilcox.test(up,down,mu=0,paired=T,conf.int=T,exact=F) #paired =T, low p ->reject Ho, medians are different
## 
## 	Wilcoxon signed rank test with continuity correction
## 
## data:  up and down
## V = 97, p-value = 0.004971
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  0.9999931 2.4999530
## sample estimates:
## (pseudo)median 
##            1.5
detach(streams)

Paired T-test for two dependend samples test. (PARAMETRIC)

This parametric method examinates the difference in means for two populations that are paired or dependent one of the other

fish = read.table("../Data/Simone/fishing.txt",header=T)

This is a dataset about the density of a fish prey species (fish/km2) in 121 lakes before and after removing a non-native predator

attach(fish)
head(fish)
##   lakes    before     after
## 1     1 19.508582 20.508582
## 2     2  5.297289  7.297289
## 3     3 26.495652 27.495652
## 4     4  3.928250  5.928250
## 5     5 12.955881 15.955881
## 6     6  9.776376 15.776376
boxplot(before,after,ylab="Fish Density",
        names=c("before", "after"))

shapiro.test(before) #normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  before
## W = 0.9958, p-value = 0.9777
shapiro.test(after)  #normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  after
## W = 0.9946, p-value = 0.9258
plot(before,after)
abline(a=0,b=1) 

t.test(before,after,mu=0,paired=T)
## 
## 	Paired t-test
## 
## data:  before and after
## t = -12.0616, df = 120, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.513985 -1.805017
## sample estimates:
## mean of the differences 
##               -2.159501
t.test(after,before,mu=0,paired=T)
## 
## 	Paired t-test
## 
## data:  after and before
## t = 12.0616, df = 120, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.805017 2.513985
## sample estimates:
## mean of the differences 
##                2.159501

changing the order of variables, we have a change in the sign of the t-test estimated mean of differences

low p ->reject Ho, means are equal

detach(fish)

Testing for normality

The normal distribution is the most important and most widely used distribution in statistics. We can say that a distribution is normally distributed when: 1) is symmetric around their mean. 2) the mean, median, and mode of a normal distribution are equal. 3) the area under the normal curve is equal to 1.0. 4) distributions are denser in the center and less dense in the tails. 5) distributions are defined by two parameters, the mean and the standard deviation (sd). 6) 68% of the area of a normal distribution is within one standard deviation of the mean. 7) Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.

Normal distribution

#Simulation of the weight of 1570 adult males normally distributed
data1=rnorm(1570,mean=75,sd=8)
hist(data1,main="Adult male weights",xlab="adult male weight",col="red",las=1)

Load example data

data = read.table("../Data/Simone/das.txt",header=T)

Visualize example data

attach(data) #command search() helps to verify what is/is not attached)
par(mfrow=c(2,2)) #to divide the plot window
plot(y)
boxplot(y)
hist(y,breaks=20)
y2=y
y2[52]=21.75 # to change the 52nd value for 21.75 instead of 2.175:
plot(y2)     #very good to spot mistakes, outliers

par(mfrow=c(1,1)) #back to one plot window

Visual Check for Normality: quantile-quantile plot

This one plots the ranked samples from our distribution against a similar number of ranked quantiles taken from a normal distribution. If our sample is normally distributed then the line will be straight. Exceptions from normality show up different sorts of non-linearity (e.g. S-shapes or banana shapes).

qqnorm(y)
qqline(y,lty=2,col=2,lwd=3)

Normality test: the shapiro.test

shapiro.test(y)  # p-value=0.753, these data are normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  y
## W = 0.9911, p-value = 0.753
detach(data)

As an example we will create a fake data log-normally distributed and verify the assumption of normality

x=exp(rnorm(30))  #rnorm without specification (normal distributed or not) picks data from the standard normal, mean = 0, sd = 1
plot(x)

boxplot(x)

alt text

hist(x,breaks=50)

qqnorm(x)
qqline(x,lty=2,col=2,lwd=3)

shapiro.test(x)  #p-value=8.661e-07, not normally distributed
## 
## 	Shapiro-Wilk normality test
## 
## data:  x
## W = 0.8531, p-value = 0.0007207

Correlations tests

Correlation tests measure the relationship between variables. This relationship can goes from +1 to -1, where 0 means no relation. Some of the tests that we can use to estimate this relationship are the following:

-Pearson’s correlation is a parametric measure of the linear association between 2 numeric variables (PARAMETRIC TEST)

-Spearman’s rank correlation is a non-parametric measure of the monotonic association between 2 numeric variables (NON-PARAMETRIC TEST)

-Kendall’s rank correlation is another non-parametric measure of the associtaion, based on concordance or discordance of x-y pairs (NON-PARAMETRIC TEST)

attach(mtcars)  
plot(hp,wt, main="scatterplot",las=1, xlab ="gross horse power", ylab="Weight (lb/1000)")

Compute the three correlation coefficients

cor(hp,wt,method="pearson")
## [1] 0.6587479
cor(hp,wt)#Pearson is the default method; the order of variables is not important
## [1] 0.6587479
cor(hp,wt,method="spearman")
## [1] 0.7746767
cor(hp,wt,method="kendal")
## [1] 0.6113081

`

Test the null hypothesis, that means that the correlation is 0 (there is no correlation)

cor.test(hp,wt,method="pearson") #Pearson correlation test
## 
## 	Pearson's product-moment correlation
## 
## data:  hp and wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4025113 0.8192573
## sample estimates:
##       cor 
## 0.6587479
cor.test(hp,wt,method="spearman") #Spearmn is a non-parametric, thus it is not possible to get CIs. There is a error message because R cannot compute exact p values (the test is based on ranks, we have few cars with the same hp or wt).We can get rid off the warning letting R know that approximate values are fine
## Warning in cor.test.default(hp, wt, method = "spearman"): Cannot compute
## exact p-value with ties
## 
## 	Spearman's rank correlation rho
## 
## data:  hp and wt
## S = 1229.364, p-value = 1.954e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7746767
cor.test(hp,wt,method="spearman",exact=F) 
## 
## 	Spearman's rank correlation rho
## 
## data:  hp and wt
## S = 1229.364, p-value = 1.954e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7746767
cor.test(hp,wt,method="kendal",exact=F) #same happens with Kendal correlation test
## 
## 	Kendall's rank correlation tau
## 
## data:  hp and wt
## z = 4.845, p-value = 1.266e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.6113081

When we have non-parametric data and we do not know which correlation method to choose, as a rule of thumb, if the correlation looks non-linear, Kendall tau should be better than Spearman Rho.

Further handy functions for correlations

Plot all possible combinations with “pairs”

pairs(mtcars)  # all possible pairwise plots

To make it simpler we select what we are interested

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb" "fam"
pairs(mtcars[,c(1,4,6)]) # subsetting the categories we will use

Building a correlation matrix

#cor(mtcars)
cor(mtcars[,c(1,4,6)])
##            mpg         hp         wt
## mpg  1.0000000 -0.7761684 -0.8676594
## hp  -0.7761684  1.0000000  0.6587479
## wt  -0.8676594  0.6587479  1.0000000
detach(mtcars)

References

  • Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student??????s t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688-690.
  • Zimmerman, D. W. (2012). A note on consistency of non-parametric rank tests and related rank transformations. British Journal of Mathematical and Statistical Psychology, 65, 122-44.
  • http://www.uni-kiel.de/psychologie/rexrepos/rerDescriptive.html