This document is summarised in the table below. It shows the linear models underlying common parametric and nonparametric tests. Formulating all the tests in the same language highlights the many similarities between them. Get it as an image or as a PDF.
Most of the common statistical models (ttest, correlation, ANOVA; chisquare, etc.) are special cases of linear models or a very close approximation. This beautiful simplicity means that there is less to learn. In particular, it all comes down to (y = a cdot x + b) which most students know from highschool. Unfortunately, stats intro courses are usually taught as if each test is an independent tool, needlessly making life more complicated for students and teachers alike.
This needless complexity multiplies when students try to rote learn the parametric assumptions underlying each test separately rather than deducing them from the linear model.
For this reason, I think that teaching linear models first and foremost and then namedropping the special cases along the way makes for an excellent teaching strategy, emphasizing understanding over rote learning. Since linear models are the same across frequentist, Bayesian, and permutationbased inferences, I’d argue that it’s better to start with modeling than pvalues, type1 errors, Bayes factors, or other inferences.
Concerning the teaching of nonparametric tests in introcourses, I think that we can justify lyingtochildren and teach nonparametric tests as if they are merely ranked versions of the corresponding parametric tests. It is much better for students to think “ranks!” than to beleive that you can magically throw away assumptions. Indeed, the Bayesian equivalents of nonparametric tests implemented in JASP literally just do (latent) ranking and that’s it. For the frequentist nonparametric tests considered here, this approach is highly accurate for N > 15.
Use the menu to jump to your favourite section. There are links to lots of similar (though more scattered) stuff under sources and teaching materials. I hope that you will join in suggesting improvements or submitting improvements yourself in the Github repo to this page. Let’s make it awesome!
Unfold this if you want to see functions and other settings for this notebook:
# Load packages for data handling and plotting
library(tidyverse)
library(patchwork)
library(broom)
# Reproducible "random" results
set.seed(40)
# To show tables. Rounds
print_df = function(D, decimals=4, navigate=FALSE) {
DT::datatable(mutate_if(D, is.numeric, round, decimals),
rownames = FALSE,
options = list(
searching=FALSE,
lengthChange=FALSE,
ordering=FALSE,
autoWidth=TRUE,
bPaginate=navigate,
bInfo=navigate,
paging=navigate
)
)
}
# Generate normal data with known parameters
rnorm_fixed = function(N, mu=0, sd=1) scale(rnorm(N))*sd + mu
# Plot style.
theme_axis = function(P, jitter=FALSE, xlim=c(0.5, 2), ylim=c(0.5, 2), legend.position=NULL) {
P = P + theme_bw(15) +
geom_segment(x=1000, xend=1000, y=0, yend=0, lty=2, color='dark gray', lwd=0.5) +
geom_segment(x=0, xend=0, y=1000, yend=1000, lty=2, color='dark gray', lwd=0.5) +
coord_cartesian(xlim=xlim, ylim=ylim) +
theme(axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
legend.position = legend.position)
# Return jittered or nonjittered plot?
if(jitter) {
P + geom_jitter(width=0.1, size=2)
}
else {
P + geom_point(size=2)
}
}
For a start, we’ll keep it simple and play with three standard normals in wide (a
, b
, c
) and long format (value
, group
):
# Wide format (sort of)
y = rnorm_fixed(50, mu=0.3, sd=2) # Almost zero mean
x = rnorm_fixed(50, mu=0, sd=1) # Used in correlation where this is on xaxis
y2 = rnorm_fixed(50, mu=0.5, sd=1.5) # Used in two means
# Long format data with indicator
value = c(y, y2)
group = rep(c('y1', 'y2'), each = 50)
# We'll need the signed rank function for a lot of the "nonparametric" tests
signed_rank = function(x) sign(x) * rank(abs(x))
Theory: As linear models
Model: the recipe for (y) is a slope ((beta_1)) times (x) plus an intercept ((beta_0), aka a straight line).
(y = beta_0 + beta_1 x qquad mathcal{H}_0: beta_1 = 0)
… which is a mathy way of writing the good old (y = ax + b) (here ordered as (y = b + ax)). In R we are lazy and write y ~ 1 + x
which R reads like y = 1*number + x*othernumber
and the task of ttests, lm, etc., is simply to find the numbers that best predict (y).
Either way you write it, it’s an intercept ((beta_0)) and a slope ((beta_1)) yielding a straight line:
# Fixed correlation
D_correlation = data.frame(MASS::mvrnorm(30, mu=c(0.9, 0.9), Sigma=matrix(c(1, 0.8, 1, 0.8), ncol=2), empirical=TRUE)) # Correlated data
# Add labels (for next plot)
D_correlation$label_num = sprintf('(%.1f,%.1f)', D_correlation$X1, D_correlation$X2)
D_correlation$label_rank = sprintf('(%i,%i)', rank(D_correlation$X1), rank(D_correlation$X2))
# Plot it
fit = lm(I(X2*0.5+0.4) ~ I(X1*0.5+0.2), D_correlation)
intercept_pearson = coefficients(fit)[1]
P_pearson = ggplot(D_correlation, aes(x=X1*0.5+0.2, y=X2*0.5+0.4)) +
geom_smooth(method=lm, se=FALSE, lwd=2, aes(colour='beta_1')) +
geom_segment(x=100, xend=100,
y=intercept_pearson, yend=intercept_pearson,
lwd=2, aes(color="beta_0")) +
scale_color_manual(name=NULL, values=c("blue", "red"), labels=c(bquote(beta[0]*" (intercept)"), bquote(beta[1]*" (slope)")))
theme_axis(P_pearson, legend.position=c(0.4, 0.9))
This is often simply called a regression model which can be extended to multiple regression where there are several (beta)s and on the righthand side multiplied with the predictors. Everything below, from onesample ttest to twoway ANOVA are just special cases of this system. Nothing more, nothing less.
As the name implies, the Spearman rank correlation is a Pearson correlation on ranktransformed (x) and (y):
(rank(y) = beta_0 + beta_1 cdot rank(x) qquad mathcal{H}_0: beta_1 = 0)
The correlation coefficient of the linear model is identical to a “real” Pearson correlation, but pvalues are an approximation which is is appropriate for samples greater than N=10 and almost perfect when N > 20. Such a nice and nonmysterious equivalence that many students are left unaware of! Visualizing them side by side including data labels, we see this ranktransformation in action:
# Spearman intercept
intercept_spearman = coefficients(lm(rank(X2) ~ rank(X1), D_correlation))[1]
# Spearman plot
P_spearman = ggplot(D_correlation, aes(x=rank(X1), y=rank(X2))) +
geom_smooth(method=lm, se=FALSE, lwd=2, aes(color='beta_1')) +
geom_text(aes(label=label_rank), nudge_y=1, size=3, color='dark gray') +
geom_segment(x=100, xend=100,
y=intercept_spearman, yend=intercept_spearman,
lwd=2, aes(color='beta_0')) +
scale_color_manual(name=NULL, values=c("blue", "red"), labels=c(bquote(beta[0]*" (intercept)"), bquote(beta[1]*" (slope)")))
# Stich together using patchwork
(theme_axis(P_pearson, legend.position=c(0.5, 0.1)) + geom_text(aes(label=label_num), nudge_y=0.1, size=3, color='dark gray') + labs(title=' Pearson')) + (theme_axis(P_spearman, xlim=c(7.5, 30), ylim=c(7.5, 30), legend.position=c(0.5, 0.1)) + labs(title=' Spearman'))
R code: Pearson correlation
It couldn’t be much simpler to run these models in R. They yield identical p
and t
, but there’s a catch: lm
gives you the slope and even though that is usually much more interpretable and informative than the correlation coefficient r, you may still want r. Luckily, the slope becomes r
if x
and y
have a standard deviation of exactly 1. You can do this using scale(x)
or I(x/sd(x))
:
a = cor.test(y, x, method = "pearson") # Builtin
b = lm(y ~ 1 + x) # Equivalent linear model: y = Beta0*1 + Beta1*x
c = lm(scale(y) ~ 1 + scale(x)) # On scaled vars to recover r
Results:
##
## Pearson's productmoment correlation
##
## data: y and x
## t = 1.6507, df = 48, pvalue = 0.1053
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.47920849 0.04978276
## sample estimates:
## cor
## 0.2317767
##
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Residuals:
## Min 1Q Median 3Q Max
## 3.3393 1.6593 0.3349 1.3629 3.5214
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.3000 0.2780 1.079 0.286
## x 0.4636 0.2808 1.651 0.105
##
## Residual standard error: 1.966 on 48 degrees of freedom
## Multiple Rsquared: 0.05372, Adjusted Rsquared: 0.03401
## Fstatistic: 2.725 on 1 and 48 DF, pvalue: 0.1053
##
##
## Call:
## lm(formula = scale(y) ~ 1 + scale(x))
##
## Residuals:
## Min 1Q Median 3Q Max
## 1.6697 0.8297 0.1675 0.6815 1.7607
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 1.341e17 1.390e01 0.000 1.000
## scale(x) 2.318e01 1.404e01 1.651 0.105
##
## Residual standard error: 0.9828 on 48 degrees of freedom
## Multiple Rsquared: 0.05372, Adjusted Rsquared: 0.03401
## Fstatistic: 2.725 on 1 and 48 DF, pvalue: 0.1053
The CIs are not exactly identical, but very close.
R code: Spearman correlation
Note that we can interpret the slope which is the number of ranks (y) change for each rank on (x). I think that this is a pretty interesting number. However, the intercept is less interpretable since it lies at (rank(x) = 0) which is impossible since x starts at 1.
See the identical r
(now “rho”) and p
:
# Spearman correlation
a = cor.test(y, x, method = "spearman") # Builtin
b = lm(rank(y) ~ 1 + rank(x)) # Equivalent linear model
Let’s look at the results:
##
## Spearman's rank correlation rho
##
## data: y and x
## S = 25544, pvalue = 0.1135
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2266026
##
##
## Call:
## lm(formula = rank(y) ~ 1 + rank(x))
##
## Residuals:
## Min 1Q Median 3Q Max
## 26.4655 11.5603 0.4458 11.5628 25.6921
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 31.2784 4.1191 7.593 9.11e10 ***
## rank(x) 0.2266 0.1406 1.612 0.114
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.35 on 48 degrees of freedom
## Multiple Rsquared: 0.05135, Adjusted Rsquared: 0.03159
## Fstatistic: 2.598 on 1 and 48 DF, pvalue: 0.1135
One sample ttest and Wilcoxon signedrank
Theory: As linear models
ttest model: A single number predicts (y).
(y = beta_0 qquad mathcal{H}_0: beta_0 = 0)
In other words, it’s our good old (y = beta_0 + beta_1*x) where the last term is gone since there is no (x) (essentially (x=0), see left figure below).
The same is to a very close approximately true for Wilcoxon signedrank test, just with the signed ranks of (y) instead of (y) itself (see right panel below and caveat in the end of this section):
(signed_rank(y) = beta_0)
# Ttest
D_t1 = data.frame(y=rnorm_fixed(20, 0.5, 0.6),
x=runif(20, 0.93, 1.07)) # Fix mean and SD
P_t1 = ggplot(D_t1, aes(y=y, x=0)) +
stat_summary(fun.y=mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y.., color='beta_0'), lwd=2) +
scale_color_manual(name=NULL, values=c("blue"), labels=c(bquote(beta[0]*" (intercept)"))) +
geom_text(aes(label=round(y, 1)), nudge_x = 0.2, size=3, color='dark gray') +
labs(title=' Ttest')
# Wilcoxon
D_t1_rank = data.frame(y = signed_rank(D_t1$y))
P_t1_rank = ggplot(D_t1_rank, aes(y=y, x=0)) +
stat_summary(fun.y=mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y.., color='beta_0'), lwd=2) +
scale_color_manual(name=NULL, values=c("blue"), labels=c(bquote(beta[0]*" (intercept)"))) +
geom_text(aes(label=y), nudge_x=0.2, size=3, color='dark gray') +
labs(title=' Wilcoxon')
# Stich together using patchwork
theme_axis(P_t1, ylim=c(1, 2), legend.position=c(0.6, 0.1)) +
theme_axis(P_t1_rank, ylim=NULL, legend.position=c(0.6, 0.1))
One interesting implication is that many “nonparametric tests” are precisely as parametric as their parametric counterparts with means, standard deviations, homogeneity of variance, etc. – just on transformed data.
R code: Onesample ttest
Try running the R code below and see that the linear model (lm
) produces the same (t), (p), and (r) as the builtin t.test
. The confidence interval is not presented in the output of lm
but is also identical if you use confint(lm(...))
:
# Builtin ttest
a = t.test(y)
# Equivalent linear model: interceptonly
b = lm(y ~ 1)
Results:
##
## One Sample ttest
##
## data: y
## t = 1.0607, df = 49, pvalue = 0.294
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.2683937 0.8683937
## sample estimates:
## mean of x
## 0.3
##
##
## Call:
## lm(formula = y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## 3.521 1.673 0.481 1.427 3.795
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.3000 0.2828 1.061 0.294
##
## Residual standard error: 2 on 49 degrees of freedom
R code: Wilcoxon signedrank test
In addition to matching p
values, lm
also gives us the mean signed rank, which I find to be an informative number.
# Builtin
a = wilcox.test(y)
# Equivalent linear model
b = lm(signed_rank(y) ~ 1) # See? Same as above, just on signed ranks
# Bonus: of course also works for onesample ttest
c = t.test(signed_rank(y))
Results:
##
## Wilcoxon signed rank test with continuity correction
##
## data: y
## V = 731, pvalue = 0.3693
## alternative hypothesis: true location is not equal to 0
##
##
## Call:
## lm(formula = signed_rank(y) ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## 49.74 25.49 4.76 22.76 46.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 3.740 4.151 0.901 0.372
##
## Residual standard error: 29.36 on 49 degrees of freedom
##
##
## One Sample ttest
##
## data: signed_rank(y)
## t = 0.90088, df = 49, pvalue = 0.3721
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.60275 12.08275
## sample estimates:
## mean of x
## 3.74
Paired samples ttest and Wilcoxon matched pairs
Theory: As linear models
ttest model: a single number (intercept) predicts the pairwise differences.
(y_2y_1 = beta_0 qquad mathcal{H}_0: beta_0 = 0)
This means that there is just one (y = y_2 – y_1) to predict and it becomes a onesample ttest on the pairwise differences. The visualization is therefore also the same as for the onesample ttest. At the risk of overcomplicating a simple substraction, you can think of these pairwise differences as slopes (see left panel of the figure), which we can represent as yoffsets (see right panel of the figure):
# Data for plot
N = nrow(D_t1)
start = rnorm_fixed(N, 0.2, 0.3)
D_tpaired = data.frame(x = rep(c(0, 1), each=N), y = c(start, start + D_t1$y), id=1:N)
# Plot
P_tpaired = ggplot(D_tpaired, aes(x=x, y=y)) +
geom_line(aes(group=id)) +
labs(title=' Pairs')
# Use patchwork to put them sidebyside
theme_axis(P_tpaired) + theme_axis(P_t1, legend.position=c(0.6, 0.1))
Similarly, the Wilcoxon matched pairs only differ from Wilcoxon signedrank in that it’s testing the signed ranks of the pairwise (yx) differences.
(signed_rank(y_2y_1) = beta_0 qquad mathcal{H}_0: beta_0 = 0)
R code: Paired sample ttest
a = t.test(y, y2, paired = TRUE) # Builtin paired ttest
b = lm(y  y2 ~ 1) # Equivalent linear model
Results:
##
## Paired ttest
##
## data: y and y2
## t = 0.52642, df = 49, pvalue = 0.601
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9634894 0.5634894
## sample estimates:
## mean of the differences
## 0.2
##
##
## Call:
## lm(formula = y  y2 ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## 6.5253 1.5642 0.0844 1.9715 5.0361
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.2000 0.3799 0.526 0.601
##
## Residual standard error: 2.686 on 49 degrees of freedom
R code: Wilcoxon matched pairs
Again, we do the signedranks trick. This is still an approximation, but a close one:
# Builtin Wilcoxon matched pairs
a = wilcox.test(y, y2, paired = TRUE)
# Equivalent linear model:
b = lm(signed_rank(y  y2) ~ 1)
# Bonus: identical to onesample ttest ong signed ranks
c = t.test(signed_rank(y  y2))
Results:
##
## Wilcoxon signed rank test with continuity correction
##
## data: y and y2
## V = 614, pvalue = 0.8243
## alternative hypothesis: true location shift is not equal to 0
##
##
## Call:
## lm(formula = signed_rank(y  y2) ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## 49.06 23.81 2.56 26.19 46.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.940 4.184 0.225 0.823
##
## Residual standard error: 29.58 on 49 degrees of freedom
##
##
## One Sample ttest
##
## data: signed_rank(y  y2)
## t = 0.22469, df = 49, pvalue = 0.8232
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 9.347227 7.467227
## sample estimates:
## mean of x
## 0.94
For large sample sizes (N >> 100), this approaches the sign test to a reasonable degree, but this approximation is too inaccurate to flesh out here.
Independent ttest and MannWhitney U
Theory: As linear models
Independent ttest model: two means predict (y).
(y_i = beta_0 + beta_1 x_i qquad mathcal{H}_0: beta_1 = 0)
where (x_i) is an indicator (0 or 1) saying whether data point (i) was sampled from one or the other group. Indicator variables (also called “dummy coding”) underly a lot of linear models and we’ll take an aside to see how it works in a minute.
MannWhitney U (also known as Wilcoxon signedrank test for two independent groups) is the same model to a very close approximation, just on the ranks of (x) and (y) instead of the actual values:
(rank(y_i) = beta_0 + beta_1 x_i qquad mathcal{H}_0: beta_1 = 0)
To me, equivalences like this make “nonparametric” statistics much easier to understand. The approximation is appropriate when the sample size is larger than 11 in each group and virtually perfect when N > 30 in each group.
Theory: Dummy coding
Dummy coding can be understood visually. The indicator is on the xaxis so data points from the first group are located at (x = 0) and data points from the second group is located at (x = 1). Then (beta_0) is the intercept (red line) and (beta_1) is the slope between the two means (green line). Why? Because when (Delta x = 1) the slope equals the difference because:
(slope = Delta y / Delta x = Delta y / 1 = Delta y = difference)
Magic! Even categorical differences can be modelled using linear models! It’s a true Swizz army knife.
# Data
N = 20 # Number of data points per group
D_t2 = data.frame(
x = rep(c(0, 1), each=N),
y = c(rnorm_fixed(N, 0.3, 0.3), rnorm_fixed(N, 1.3, 0.3))
)
# Plot
P_t2 = ggplot(D_t2, aes(x=x, y=y)) +
stat_summary(fun.y=mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y.., color='something'), lwd=2) +
geom_segment(x=10, xend=10, y=0.3, yend=0.3, lwd=2, aes(color='beta_0')) +
geom_segment(x=0, xend=1, y=0.3, yend=1.3, lwd=2, aes(color='beta_1')) +
scale_color_manual(name=NULL, values=c("blue", "red", "darkblue"), labels=c(bquote(beta[0]*" (group 1 mean)"), bquote(beta[1]*" (slope = difference)"), bquote(beta[0]+beta[1]%.%1*" (group 2 mean)")))
#scale_x_discrete(breaks=c(0.5, 1.5), labels=c('1', '2'))
theme_axis(P_t2, jitter=TRUE, xlim=c(0.3, 2), legend.position=c(0.53, 0.08))
Theory: Dummy coding (continued)
If you feel like you get dummy coding now, just skip ahead to the next section. Here is a more elaborate explanation of dummy coding:
If a data point was sampled from the first group, i.e., when (x_i = 0), the model simply becomes (y = beta_0 + beta_1 cdot 0 = beta_0). In other words, the model predicts that that data point is (beta_0). It turns out that the (beta) which best predicts a set of data points is the mean of those data points, so (beta_0) is the mean of group 1.
On the other hand, data points sampled from the second group would have (x_i = 1) so the model becomes (y_i = beta_0 + beta_1cdot 1 = beta_0 + beta_1). In other words, we add (beta_1) to “shift” from the mean of the first group to the mean of the second group. Thus (beta_1) becomes the mean difference between the groups.
As an example, say group 1 is 25 years old ((beta_0 = 25)) and group 2 is 28 years old ((beta_1 = 3)), then the model for a person in group 1 is (y = 25 + 3 cdot 0 = 25) and the model for a person in group 2 is (y = 25 + 3 cdot 1 = 28).
Hooray, it works! For firsttimers it takes a few moments to understand dummy coding, but you only need to know addition and multiplication to get there!
R code: independent ttest
As a reminder, when we write y ~ 1 + x
in R, it is shorthand for (y = beta_0 cdot 1 + beta_1 cdot x) and R goes on computing the (beta)s for you. Thus (y ~ 1 + x) is the Rway of writing (y = a cdot x + b).
Notice the identical t
, df
, p
, and estimates. We can get the confidence interval by running confint(lm(...))
.
# Builtin independent ttest on wide data
a = t.test(y, y2, var.equal = TRUE)
# Be explicit about the underlying linear model by handdummycoding:
group_y2 = ifelse(group == 'y2', 1, 0) # 1 if group == y2, 0 otherwise
b = lm(value ~ 1 + group_y2) # Using our handmade dummy regressor
# Note: We could also do the dummycoding in the model
# specification itself. Same result.
c = lm(value ~ 1 + I(group=='y2'))
Results:
##
## Two Sample ttest
##
## data: y and y2
## t = 0.56569, df = 98, pvalue = 0.5729
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9016152 0.5016152
## sample estimates:
## mean of x mean of y
## 0.3 0.5
##
##
## Call:
## lm(formula = value ~ 1 + group_y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## 3.5211 1.1259 0.2124 0.9151 4.9268
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.3000 0.2500 1.200 0.233
## group_y2 0.2000 0.3536 0.566 0.573
##
## Residual standard error: 1.768 on 98 degrees of freedom
## Multiple Rsquared: 0.003255, Adjusted Rsquared: 0.006916
## Fstatistic: 0.32 on 1 and 98 DF, pvalue: 0.5729
##
##
## Call:
## lm(formula = value ~ 1 + I(group == "y2"))
##
## Residuals:
## Min 1Q Median 3Q Max
## 3.5211 1.1259 0.2124 0.9151 4.9268
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.3000 0.2500 1.200 0.233
## I(group == "y2")TRUE 0.2000 0.3536 0.566 0.573
##
## Residual standard error: 1.768 on 98 degrees of freedom
## Multiple Rsquared: 0.003255, Adjusted Rsquared: 0.006916
## Fstatistic: 0.32 on 1 and 98 DF, pvalue: 0.5729
R code: MannWhitney U
# Wilcoxon / MannWhitney U
a = wilcox.test(y, y2)
# As linear model with our dummycoded group_y2:
b = lm(rank(value) ~ 1 + group_y2) # compare to linear model above
##
## Wilcoxon rank sum test with continuity correction
##
## data: y and y2
## W = 1211, pvalue = 0.7907
## alternative hypothesis: true location shift is not equal to 0
##
##
## Call:
## lm(formula = rank(value) ~ 1 + group_y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## 48.72 24.64 0.78 24.22 48.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 49.720 4.122 12.061 <2e16 ***
## group_y2 1.560 5.830 0.268 0.79
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 98 degrees of freedom
## Multiple Rsquared: 0.0007302, Adjusted Rsquared: 0.009466
## Fstatistic: 0.07161 on 1 and 98 DF, pvalue: 0.7896
Welch’s ttest
This is identical to the (Student’s) independent ttest above except that Student’s assumes identical variances and Welch’s ttest does not. So the linear model is the same and the trick is in the variances, which I won’t go further into here.
# Builtin
a = t.test(y, y2, var.equal=FALSE)
# As linear model with pergroup variances
b = nlme::gls(value ~ 1 + group_y2, weights = nlme::varIdent(form=~1group), method="ML")
Results:
##
## Welch Two Sample ttest
##
## data: y and y2
## t = 0.56569, df = 90.875, pvalue = 0.573
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9023034 0.5023034
## sample estimates:
## mean of x mean of y
## 0.3 0.5
##
## Generalized least squares fit by maximum likelihood
## Model: value ~ 1 + group_y2
## Data: NULL
## AIC BIC logLik
## 399.6287 410.0493 195.8143
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1  group
## Parameter estimates:
## y1 y2
## 1.00 0.75
##
## Coefficients:
## Value Std.Error tvalue pvalue
## (Intercept) 0.3 0.2828427 1.0606602 0.2915
## group_y2 0.2 0.3535534 0.5656854 0.5729
##
## Correlation:
## (Intr)
## group_y2 0.8
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## 1.7784113 0.7177541 0.1430665 0.5635189 3.3178730
##
## Residual standard error: 1.979899
## Degrees of freedom: 100 total; 98 residual
ANOVAs are linear models with (only) categorical predictors so they simply extend everything we did above, relying heavily on dummy coding. Do make sure to read the section on dummy coding if you haven’t already.
Oneway ANOVA and KruskalWallis
Theory: As linear models
Model: One mean for each group predicts (y).
(y = beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3 +… qquad mathcal{H}_0: y = beta_1)
where (x_i) are indicators ((x=0) or (x=1)) where at most one (x_i=1) while all others are (x_i=0).
Notice how this is just “more of the same” of what we already did in other models above. When there are only two groups, this model is (y = beta_0 + beta_1*x), i.e. the independent ttest. If there is only one group, it is (y = beta_0), i.e. the onesample ttest. This is easy to see in the visualization below – just cover up a few groups and see that it matches the other visualizations above, though I did omit adding green lines from the intercept (red) to the group means (blue) for visual clarity.
# Figure
N = 15
D_anova1 = data.frame(
y=c(rnorm_fixed(N, 0.5, 0.3),
rnorm_fixed(N, 0, 0.3),
rnorm_fixed(N, 1, 0.3),
rnorm_fixed(N, 0.8, 0.3)),
x=rep(0:3, each=15)
)
ymeans = aggregate(y~x, D_anova1, mean)$y
P_anova1 = ggplot(D_anova1, aes(x=x, y=y)) +
stat_summary(fun.y=mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y.., color='intercepts'), lwd=2) +
geom_segment(x=10, xend=100, y=0.5, yend=0.5, lwd=2, aes(color='beta_0')) +
geom_segment(x=0, xend=1, y=ymeans[1], yend=ymeans[2], lwd=2, aes(color='betas')) +
geom_segment(x=1, xend=2, y=ymeans[1], yend=ymeans[3], lwd=2, aes(color='betas')) +
geom_segment(x=2, xend=3, y=ymeans[1], yend=ymeans[4], lwd=2, aes(color='betas')) +
scale_color_manual(
name=NULL, values=c("blue", "red", "darkblue"),
labels=c(
bquote(beta[0]*" (group 1 mean)"),
bquote(beta[1]*", "*beta[2]*", etc. (slopes/differences to "*beta[0]*")"),
bquote(beta[0]*"+"*beta[1]*", "*beta[0]*"+"*beta[2]*", etc. (absolute intercepts)")
)
)
theme_axis(P_anova1, xlim=c(0.5, 4), legend.position=c(0.7, 0.1))
A oneway ANOVA has a loglinear counterpart called goodnessoffit test which we’ll return to. By the way, since we now regress on more than one (x), the oneway ANOVA is a multiple regression model.
The KruskalWallis test is simply a oneway ANOVA on the ranktransformed (y) (value
):
(rank(y) = beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3 +…)
This approximation is good enough for 12 or more data points. Again, if you do this for just one or two groups, we’re already acquainted with those equations, i.e. the Wilcoxon signedrank test or the MannWhitney U test respectively.
Example data
We make a threelevel factor with the levels a
, b
, and c
so that the oneway ANOVA basically becomes a “threesample ttest”. Then we manually do the dummy coding of the groups.
# Three variables in "long" format
N = 20 # Number of samples per group
D = data.frame(
value = c(rnorm_fixed(N, 0), rnorm_fixed(N, 1), rnorm_fixed(N, 0.5)),
group = rep(c('a', 'b', 'c'), each=N),
# Explicitly add indicator/dummy variables
# Could also be done using model.matrix(~D$group)
group_a = rep(c(1, 0, 0), each=N),
group_b = rep(c(0, 1, 0), each=N),
group_c = rep(c(0, 0, 1), each=N)) # N of each level
See? Exactly one parameter predicts a value
in a given row while the others are not included in the modeling of that value
.
R code: oneway ANOVA
OK, let’s see the identity between the builtin ANOVA (car::Anova
) and the dummycoded inyourface linear model in lm
. Actually, car::Anova
and aov
are just wrappers around lm
so the identity comes as no surprise. The latter returns parameter estimates as well (bonus!), but we’ll just look at the overall model statistics for now. Note that I do not use the aov
function because it computes typeI sum of squares. There is a BIG polarized debate about whether to use typeII (as car::Anova
does by default) or typeIII sum of squares, but let’s skip that for now.
# Compare builtin and linear model
a = car::Anova(aov(value ~ group, D)) # Builtin ANOVA
b = lm(value ~ 1 + group_a + group_b + group_c, data=D) # As inyourface linear model
Results:
## Anova Table (Type II tests)
##
## Response: value
## Sum Sq Df F value Pr(>F)
## group 10 2 5 0.009984 **
## Residuals 57 57
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = value ~ 1 + group_a + group_b + group_c, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## 2.4402 0.6427 0.1393 0.8060 1.8574
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.5000 0.2236 2.236 0.0293 *
## group_a 0.5000 0.3162 1.581 0.1194
## group_b 0.5000 0.3162 1.581 0.1194
## group_c NA NA NA NA
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 57 degrees of freedom
## Multiple Rsquared: 0.1493, Adjusted Rsquared: 0.1194
## Fstatistic: 5 on 2 and 57 DF, pvalue: 0.009984
R code: KruskalWallis
a = kruskal.test(value ~ group, D) # Builtin
b = lm(rank(value) ~ 1 + group_a + group_b + group_c, D) # As linear model
c = car::Anova(aov(rank(value) ~ group, D)) # Of course the same using the builtin ANOVA, which is just a wrapper around lm
Results:
##
## KruskalWallis rank sum test
##
## data: value by group
## KruskalWallis chisquared = 6.6777, df = 2, pvalue = 0.03548
##
##
## Call:
## lm(formula = rank(value) ~ 1 + group_a + group_b + group_c, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## 28.950 12.213 0.675 15.912 26.850
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>t)
## (Intercept) 30.950 3.741 8.272 2.43e11 ***
## group_a 7.800 5.291 1.474 0.146
## group_b 6.450 5.291 1.219 0.228
## group_c NA NA NA NA
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.73 on 57 degrees of freedom
## Multiple Rsquared: 0.1132, Adjusted Rsquared: 0.08206
## Fstatistic: 3.637 on 2 and 57 DF, pvalue: 0.03261
##
## Anova Table (Type II tests)
##
## Response: rank(value)
## Sum Sq Df F value Pr(>F)
## group 2036.7 2 3.6374 0.03261 *
## Residuals 15958.3 57
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Twoway ANOVA (plot in progress!)
Theory: As linear models
Model: one mean per group (main effects) plus these means multiplied across factors (interaction effects). The main effects are the oneway ANOVAs above, though in the context of a larger model. The interaction effect is harder to explain in the abstract even though it’s just a few numbers multiplied with each other. I will leave that to the teachers to keep focus on equivalences here 🙂
Switching to matrix notation:
(y = beta_0 + beta_1 X_1 + beta_2 X_2 + beta_3 X_1 X_2 qquad mathcal{H}_0: beta_3 = 0)
Here (beta_i) are vectors of betas of which only one is selected by the indicator vector (X_i). The (mathcal{H}_0) shown here is the interaction effect. Note that the intercept (beta_0), to which all other (beta)s are relative, is now the mean for the first level of all factors.
Continuing with the dataset from the oneway ANOVA above, let’s add a crossing factor mood
so that we can test the group:mood
interaction (a 3×2 ANOVA). We also do the dummy coding of this factor needed for the linear model.
# Crossing factor
D$mood = c('happy', 'sad')
# Dummy coding
D$mood_happy = ifelse(D$mood == 'happy', 1, 0) # 1 if mood==happy. 0 otherwise.
D$mood_sad = ifelse(D$mood == 'sad', 1, 0) # Same, but we won't be needing this
(beta_0) is now the happy guys from group a!
# Add intercept line
# Add cross...
# Use other data?
means = lm(value ~ mood * group, D)$coefficients
P_anova2 = ggplot(D, aes(x=group, y=value, color=mood)) +
geom_segment(x=10, xend=100, y=means[1], yend=0.5, col='blue', lwd=2) +
stat_summary(fun.y=mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..), lwd=2)
theme_axis(P_anova2, xlim=c(0.5, 3.5)) + theme(axis.text.x = element_text())
R code: Twoway ANOVA
Now let’s turn to the actual modeling in R. We compare the builtin ANOVA function to the linear model using lm
. Notice that in ANOVA, we are testing a full factor interaction all at once which involves many parameters (two in this case), so we can’t look at the overall model fit nor any particular parameter for the result. Therefore, I use a likelihoodratio test to compare a full twoway ANOVA model (“saturated”) to one without the interaction effect(s). We do so using the anova
function. Even though that looks like cheating, it’s just computing likelihoods, pvalues, etc. on the models that were already fitted, so it’s legit!
# Builtin twoway ANOVA.
a = car::Anova(aov(value ~ mood * group, D), type='II') # Normal notation. "*" both multiplies and adds main effects
b = car::Anova(aov(value ~ mood + group + mood:group, D)) # Identical but more verbose about main effects and interaction
# Testing the interaction terms as linear model.
full = lm(value ~ 1 + group_a + group_b + mood_happy + group_a:mood_happy + group_b:mood_happy, D) # Full model
null = lm(value ~ 1 + group_a + group_b + mood_happy, D) # Without interaction
c = anova(null, full) # whoop whoop, same F, p, and Dfs
Results:
## Anova Table (Type II tests)
##
## Response: value
## Sum Sq Df F value Pr(>F)
## mood 0.658 1 0.6314 0.43032
## group 10.000 2 4.8003 0.01206 *
## mood:group 0.096 2 0.0459 0.95520
## Residuals 56.247 54
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: value ~ 1 + group_a + group_b + mood_happy
## Model 2: value ~ 1 + group_a + group_b + mood_happy + group_a:mood_happy +
## group_b:mood_happy
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 56 56.342
## 2 54 56.247 2 0.095565 0.0459 0.9552
Below, I present approximate main effect models, though exact calculation of ANOVA main effects is more involved if it is to be accurate and furthermore depend on whether typeII or typeIII sum of squares are used for inference.
Look at the model summary statistics to find values comparable to the Anova
estimated main effects above.
# Main effect of group.
e = lm(value ~ 1 + group_a + group_b, D)
# Main effect of mood.
f = lm(value ~ 1 + mood_happy, D)
##
## Call:
## lm(formula = value ~ 1 + group_a + group_b, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## 2.4402 0.6427 0.1393 0.8060 1.8574
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.5000 0.2236 2.236 0.0293 *
## group_a 0.5000 0.3162 1.581 0.1194
## group_b 0.5000 0.3162 1.581 0.1194
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 57 degrees of freedom
## Multiple Rsquared: 0.1493, Adjusted Rsquared: 0.1194
## Fstatistic: 5 on 2 and 57 DF, pvalue: 0.009984
##
##
## Call:
## lm(formula = value ~ 1 + mood_happy, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## 2.65275 0.70847 0.00213 0.63391 1.88950
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.6047 0.1953 3.097 0.00301 **
## mood_happy 0.2094 0.2761 0.758 0.45136
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 58 degrees of freedom
## Multiple Rsquared: 0.009816, Adjusted Rsquared: 0.007256
## Fstatistic: 0.575 on 1 and 58 DF, pvalue: 0.4514
ANCOVA
This is simply ANOVA with a continuous regressor added so that it now contains continuous and (dummycoded) categorical predictors. For example, if we continue with the oneway ANOVA example, we can add age
and it is now called a oneway ANCOVA:
(y = beta_0 + beta_1 x_1 + beta_2 x_2 + … + beta_3 age)
… where (x_i) are our usual dummycoded indicator variables. (beta_0) is now the mean for the first group at (age=0). You can turn all ANOVAs into ANCOVAs this way, e.g. by adding (beta_N cdot age) to our twoway ANOVA in the previous section. But let us go ahead with our oneway ANCOVA, starting by adding (age) to our dataset:
# Update data with a continuous covariate
D$age = D$value + rnorm_fixed(nrow(D), sd=3) # Correlated to value
This is best visualized using colors for groups instead of xposition. The (beta)s are still the average (y)offset of the data points, only now we model each group using a slope instead of an intercept. In other words, the oneway ANOVA is sort of onesample ttests model for each group ((y = beta_0)) while the oneway ANCOVA is sort of Pearson correlation model for each group ((y_i = beta_0 + beta_i + beta_1 cdot age)):
# For linear model plot
D$pred = predict(lm(value ~ age + group, D))
# Plot
P_ancova = ggplot(D, aes(x=age, y=value, color=group, shape=group)) +
geom_line(aes(y=pred), lwd=2)
# Theme it
theme_axis(P_ancova, xlim=NULL, ylim=NULL, legend.position=c(0.8, 0.2)) + theme(axis.title=element_text())
And now some R code to run the oneway ANCOVA as a linear model:
# Builtin ANCOVA. The order of factors matter in pureaov (typeI variance).
# Use typeII or typeIII instead; implemented in car::Anova
a = car::Anova(aov(value ~ group + age, D))
#a = aov(value ~ group + age, D) # Predictor order matters. Not nice!
# As dummycoded linear model.
full = lm(value ~ 1 + group_a + group_b + age, D)
# Testing main effect of age using Likelihoodratio test
null_age = lm(value ~ 1 + group_a + group_b, D) # Full without age. Oneway ANOVA!
result_age = anova(null_age, full)
# Testing main effect of groupusing Likelihoodratio test
null_group = lm(value ~ 1 + age, D) # Full without group. Pearson correlation!
result_group = anova(null_group, full)
Results:
## Anova Table (Type II tests)
##
## Response: value
## Sum Sq Df F value Pr(>F)
## group 10.118 2 6.4258 0.0030738 **
## age 12.910 1 16.3967 0.0001595 ***
## Residuals 44.090 56
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: value ~ 1 + group_a + group_b
## Model 2: value ~ 1 + group_a + group_b + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 57 57.00
## 2 56 44.09 1 12.91 16.397 0.0001595 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: value ~ 1 + age
## Model 2: value ~ 1 + group_a + group_b + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 58 54.209
## 2 56 44.090 2 10.118 6.4258 0.003074 **
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F)
## 1 58 14870
## 2 56 12768 2 2102.3 4.6105 0.01401 *
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
“`
“`r
sm::sm.ancova(x=D$age, y=D$value, group=D$group, display=’none’, model=’equal’)
“`
“`
## Test of equality : h = 1.08418 pvalue = 0.3322
“`
–>
Recall that when you take the logarithm, you can easily make statements about proportions, i.e., that for every increase in (x), (y) increases a certain percentage. This turns out to be one of the simplest (and therefore best!) ways to make count data and contingency tables intelligible. See this nice introduction to ChiSquare tests as linear models.
Goodness of fit
Theory: As loglinear model
Model: a single intercept predicts (log(y)).
I’ll refer you to take a look at the section on contingency tables which is basically a “twoway goodness of fit”.
Example data
For this, we need some wide count data:
# Data in long format
D = data.frame(mood = c('happy', 'sad', 'meh'),
counts = c(60, 90, 70))
# Dummy coding for the linear model
D$mood_happy = ifelse(D$mood == 'happy', 1, 0)
D$mood_sad = ifelse(D$mood == 'sad', 1, 0)
R code: Goodness of fit
Now let’s see that the Goodness of fit is just a loglinear equivalent to a oneway ANOVA. We set family = poisson()
which defaults to setting a logarithmic link function (family = poisson(link='log')
).
# Builtin test
a = chisq.test(D$counts)
# As loglinear model, comparing to an interceptonly model
full = glm(counts ~ 1 + mood_happy + mood_sad, data=D, family=poisson())
null = glm(counts ~ 1, data=D, family=poisson())
b = anova(null, full, test='Rao')
# Note: glm can also do the dummy coding for you:
c = glm(counts ~ mood, data=D, family=poisson())
Let’s look at the results:
##
## Chisquared test for given probabilities
##
## data: D$counts
## Xsquared = 6.3636, df = 2, pvalue = 0.04151
##
## Analysis of Deviance Table
##
## Model 1: counts ~ 1
## Model 2: counts ~ 1 + mood_happy + mood_sad
## Resid. Df Resid. Dev Df Deviance Rao Pr(>Chi)
## 1 2 6.2697
## 2 0 0.0000 2 6.2697 6.3636 0.04151 *
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note the strange anova(..., test='Rao')
which merely states that pvalues should be computed using the (Rao) score test. We could also have jotted in test='Chisq'
or test='LRT'
which would have yielded approximate pvalues. You may think that we’re cheating here, sneaking in some sort of ChiSquare model posthoc. However, anova
only specifies how pvalues are calculated whereas all the loglinear modeling happened in glm
.
By the way, if there are only two counts and a large sample size (N > 100), this model begins to approximate the binomial test, binom.test
, to a reasonable degree. But this sample size is larger than most use cases, so I won’t raise to a ruleofthumb and won’t dig deeper into it here.
Contingency tables
Theory: As loglinear model
The theory here will be a bit more convoluted, and I mainly write it up so that you can get the feeling that it really is just a loglinear twoway ANOVA model. Let’s get started…
For a twoway contingency table, the model of the count variable (y) is a modeled using the marginal proportions of a contingency table. Why this makes sense, is too involved to go into here, but see the relevant slides by Christoph Scheepers here for an excellent exposition. The model is composed of a lot of counts and the regression coefficients (A_i) and (B_i):
(y_i = N cdot x_i(A_i/N) cdot z_j(B_j/N) cdot x_{ij}/((A_i x_i)/(B_j z_j)/N))
What a mess!!! Here, (i) is the row index, (j) is the column index, (x_{something}) is the sum of that row and/or column, (N = sum(y)). Remember that (y) is a count variable, so (N) is just the total count.
We can simplify the notation by defining the proportions: (alpha_i = x_i(A_i/N)), (beta_i = x_j(B_i/N)) and (alpha_ibeta_j = x_{ij}/(A_i x_i)/(B_j z_j)/N). Let’s write the model again:
(y_i = N cdot alpha_i cdot beta_j cdot alpha_ibeta_j)
Ah, much prettier. However, there is still lot’s of multiplication which makes it hard to get an intuition about how the actual numbers interact. We can make it much more intelligible when we remember that (log(A cdot B) = log(A) + log(B)). Doing logarithms on both sides, we get:
(log(y_i) = log(N) + log(alpha_i) + log(beta_j) + log(alpha_ibeta_j))
Snuggly! Now we can get a better grasp on how the regression coefficients (which are proportions) independently contribute to (y). This is why logarithms are so nice for proportions. Note that this is just the twoway ANOVA model with some logarithms added, so we are back to our good old linear models – only the interpretation of the regression coefficients have changed! And we cannot use lm
anymore in R.
Example data
Here we need some long data and we need it in table format for chisq.test
:
# Contingency data in long format for linear model
D = data.frame(mood = c('happy', 'happy', 'meh', 'meh', 'sad', 'sad'),
sex = c('male', 'female', 'male', 'female', 'male', 'female'),
Freq = c(100, 70, 30, 32, 110, 120))
# ... and as table for chisq.test
D_table = D %>%
spread(key=mood, value=Freq) %>% # Mood to columns
select(sex) %>% # Remove sex column
as.matrix()
# Dummy coding of D for linear model (skipping mood=="sad" and gender=="female")
# We could also use model.matrix(D$Freq~D$mood*D$sex)
D$mood_happy = ifelse(D$mood == 'happy', 1, 0)
D$mood_meh = ifelse(D$mood == 'meh', 1, 0)
D$sex_male = ifelse(D$sex == 'male', 1, 0)
R code: Chisquare test
Now let’s show the equivalence between a chisquare model and a loglinear model. This is very similar to our twoway ANOVA above:
# Builtin chisquare. It requires matrix format.
a = chisq.test(D_table)
# Using glm to do a loglinear model, we get identical results when testing the interaction term:
full = glm(Freq ~ 1 + mood_happy + mood_meh + sex_male + mood_happy*sex_male + mood_meh*sex_male, data=D, family=poisson())
null = glm(Freq ~ 1 + mood_happy + mood_meh + sex_male, data=D, family=poisson())
b = anova(null, full, test='Rao') # Could also use test='LRT' or test='Chisq'
# Note: let glm do the dummy coding for you
full = glm(Freq ~ mood * sex, family=poisson(), data=D)
c = anova(full, test='Rao')
# Note: even simpler syntax using MASS:loglm ("loglinear model")
d = MASS::loglm(Freq ~ mood + sex, D)
##
## Pearson's Chisquared test
##
## data: D_table
## Xsquared = 5.0999, df = 2, pvalue = 0.07809
##
## Analysis of Deviance Table
##
## Model 1: Freq ~ 1 + mood_happy + mood_meh + sex_male
## Model 2: Freq ~ 1 + mood_happy + mood_meh + sex_male + mood_happy * sex_male +
## mood_meh * sex_male
## Resid. Df Resid. Dev Df Deviance Rao Pr(>Chi)
## 1 2 5.1199
## 2 0 0.0000 2 5.1199 5.0999 0.07809 .
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Freq
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Rao Pr(>Chi)
## NULL 5 111.130
## mood 2 105.308 3 5.821 94.132 < 2e16 ***
## sex 1 0.701 2 5.120 0.701 0.40235
## mood:sex 2 5.120 0 0.000 5.100 0.07809 .
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## MASS::loglm(formula = Freq ~ mood + sex, data = D)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 5.119915 2 0.07730804
## Pearson 5.099859 2 0.07808717
##
## Call:
## glm(formula = Freq ~ mood * sex, family = poisson(), data = D)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 4.2485 0.1195 35.545 < 2e16 ***
## moodmeh 0.7828 0.2134 3.668 0.000244 ***
## moodsad 0.5390 0.1504 3.584 0.000339 ***
## sexmale 0.3567 0.1558 2.289 0.022094 *
## moodmeh:sexmale 0.4212 0.2981 1.413 0.157670
## moodsad:sexmale 0.4437 0.2042 2.172 0.029819 *
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1.1113e+02 on 5 degrees of freedom
## Residual deviance: 3.9968e15 on 0 degrees of freedom
## AIC: 48.254
##
## Number of Fisher Scoring iterations: 3
If you unfold the raw R output, I’ve included summary(full)
so that you can see the raw regression coefficients. Being a loglinear model, these are the percentage increasein (y) over and above the intercept if that category obtains.
Here are links to other sources who have exposed bits and pieces of this puzzle, including many further equivalences not covered here:
Most advanced stats books (and some introbooks) take the “everything is GLMM” approach as well. However, the “linear model” part often stay at the conceptual level. I wanted to make linear models the tool in a really concise way. Luckily, more beginnierfriendly materials have emerged lately:
Here are my own thoughts on what I’d do. I’ve done parts of this with great success already, but not the whole lot since I’m not assigned to do a full course yet.
I would spend 50% of the time on linear modeling of data (bullet 1 below) since this contains 70% of what students need to know. The rest of the course is just fleshing out what happens if you have one group, two groups, etc.
Note that whereas the understanding of sampling and hypothesis testing is usually the first focus of mainstream stats courses, it is saved for later here to make way for modeling.

Fundamentals of regression:

Recall from highschool: (y = a cdot x + b), and getting a really good intuition about slopes and intercepts. Understanding that this can be written using all variable names, e.g.,
money = profit * time + starting_money
or (y = beta_1x + beta_2*1) or, suppressing the coefficients, asy ~ x + 1
. If the audience is receptive, convey the idea of these models as a solution to differential equations, specifying how (y) changes with (x). 
Extend to a few multiple regression as models. Make sure to include plenty of reallife examples and exercises at this point to make all of this really intuitive. Marvel at how briefly these models allow us to represent large datasets.

Introduce the idea of ranktransforming nonmetric data and try it out.

Teach the three assumptions: independence of data points, normality of residuals, and homoscedasticity.

Confidence/credible intervals on the parameters. Stress that the MaximumLikelihood estimate is extremely unlikely, so intervals are more important.

Briefly introduce (R^2) for the simple regression models above. Mention in passing that this is called the Pearson and Spearman correlation coefficients.


Special case #1: One or two means (ttests, Wilcoxon, MannWhitney):

*One mean:** When there is only one xvalue, the regression model simplifies to (y = b). If (y) is nonmetric, you can ranktransform it. Apply the assumptions (homoscedasticity doesn’t apply since there is only one (x)). Mention in passing that these interceptonly models are called onesample ttest and Wilcoxon Signed Rank test respectively.

Two means: If we put two variables 1 apart on the xaxis, the difference between the means is the slope. Great! It is accessible to our swizz army knife called linear modeling. Apply the assumption checks to see that homoscedasticity reduces to equal variance between groups. This is called an independent ttest. Do a few worked examples and exercises, maybe adding Welch’s test, and do the ranktransformed version, called MannWhitney U.

Paired samples: Violates the independence assumption. After computing pairwise differences, this is equivalent to 2.1 (one intercept), though it is called the paired ttest and Wilcoxon’s matched pairs.


Special case #2: Three or more means (ANOVAs)

Dummy coding of categories: How one regression coefficient for each level of a factor models an intercept for each level when multiplied by a binary indicator. This is just extending what we did in 2.1. to make this data accessible to linear modeling.

Means of one variable: Oneway ANOVA.

Means of two variables: Twoway ANOVA.


Special case #3: Three or more proportions (ChiSquare)

Logarithmic transformation: Making multiplicative models linear using logarithms, thus modeling proportions. See this excellent introduction to the equivalence of loglinear models and ChiSquare tests as models of proportions. Also needs to introduce (log)odds ratios. When the multiplicative model is made summative using logarithms, we just add the dummycoding trick from 3.1, and see that the models are identical to the ANOVA models in 3.2 and 3.3, only the interpretation of the coefficients have changed.

Proportions of one variable: Goodness of fit.

Proportions of two variables: Contingency tables.


Hypothesis testing: Hypothesis testing is the act of choosing between a full model and one where a parameter is set to zero (effectively excluded from the model) instead of being estimated. For example, when set one of the two means in the ttest to be zero, we study how well the remaining mean explains all the data from both groups. If it does a good job, we prefer this model over the twomean model because it is simpler. So hypothesis testing is just comparing linear models to make more qualitative statements than the truly quantitative statements which were covered in bullets 14 above. Therefore, hypothesis testing is less interesting and is mostly covered as an introduction to the general literature. Mention Pvalues (and misconceptions about them), Bayes Factors, and crossvalidation.
I have made a few simplifications for clarity:

I have not covered assumptions in the examples. This will be another post! But all assumptions of all tests come down to the usual three: a) independence of data points, b) normally distributed residuals, and c) homoscedasticity.

I assume that all null hypotheses are the absence of an effect, but everything works the same for nonzero null hypotheses.

I have not discussed inference. I am only including pvalues in the comparisons as a crude way to show the equivalences between the underlying models since people care about pvalues. Parameter estimates will show the same equivalence. How to do inference is another matter. Personally, I’m a Bayesian, but going Bayesian here would render it less accessible to the wider audience. Also, doing robust models would be preferable, but fail to show the equivalence.

Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RMANOVA on
rank(y)
, McNemar, and Binomial/Multinomial. See stuff on these in the section on links to further equivalences. If you think that they should be included here, feel free to submit “solutions” to the github repo of this doc!
<!–
# Visualization of models (in progress)
These will eventually be included under the respective headers and in the infographic.
–>