Stats-in-R/ANOVA.qmd at gh-pages · mgimond/Stats-in-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
---
title: "ANOVA"
editor_options:
  chunk_output_type: console
---

*Last modified on `r Sys.Date()`*

```{r, echo = FALSE, message = FALSE}
source("libs/Common.R")
```

```{r Load fonts, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE}
library(extrafont)
font_import(prompt=FALSE, pattern="[A/a]rchitects")
loadfonts(device="postscript")

```

------------------------------------------------------------------------

Package used in this tutorial:

```{r packages}
library(tidyr) # Used to reshape a data table
library(ggplot2)
```

# Introduction

Analysis of Variance (ANOVA) seeks to compare the means between two or more batches of numbers. You can think of an ANOVA as an extension of the t-test where three or more batches need to be compared. The name may seem misleading since it suggests that we are comparing variances and not some central value, but in fact, we compare the variances (spreads) between batches to assess if the central values are significantly different from one another. For example, let's compare the following batches of numbers:

```{r, echo=FALSE, fig.height=2.5, fig.width=5}
set.seed(3)
x <- sample(c(23.5,24,24.2,25,25.3))
y <- sample(c(24.5, 25.2, 25.8, 26.0, 26.5))
z <- sample(c(23.1,23.6,24.2,24.6,25))

x.x <- seq(0.5,1.3,.2)
y.x <- seq(2,2.8,.2)
z.x <- seq(3.2,4.0,.2)

mean.grand <- mean(c(x,y,z))

OP <- par(mar=c(2,3,1,4))
plot(c(x,y,z)~ jitter(c(rep(1,length(x)),rep(2,length(y)),rep(3,length(z))),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,3.5), ylim=c(23,27))
axis(1, at=c(1,2,3),label=c("Batch 1", "Batch 2", "Batch 3"),col="grey",
     family="Architects Daughter" ,tick=FALSE, cex.axis=0.8)
axis(2, at=c(23,24,25,26,27), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=0.8)
par(OP)
```

Given the very small overlap in spread between batch 2 and the two other batches, it's obvious that batch 2 is significantly different from batches 1 and 3, but can we say with a similar level of certainty that batch 1 is significantly different from batch 3? Probably not, the slight offset in spread between batch 1 and 3 may be due to chance alone.

A one-way ANOVA compares measurement means between a single group of levels or batches. ANOVAs can be extended to include multiple groups (each having different levels). An ANOVA that compares means between two groups (each having their own set of levels) is referred to a two-way ANOVA.

# How a one-way ANOVA is calculated

This section focuses on one group of levels (hence a one-way ANOVA).

## The variance-ratio method

An ANOVA test seeks to compare the spread between the batches (technically referred to as *levels*).

The first step is to sum the square of the *distances* between each value (from all levels) to the *grand* mean computed from *all* values (plotted as a dark dashed line in the following graphic). We'll call this value the *total sum of squares* for the mean ($SSE_{mean}$). It's calculated as follows:

$$
SSE_{mean} = \sum (y - \bar{\bar y})^2
$$

```{r, echo=FALSE}
mean.grand <- mean( c(x,y,z) )
SSE.m <- sum( (c(x,y,z) - mean.grand)^2 ) # Total sum of squares
```

where $\bar{\bar{y}}$ is the mean for *all* values. In this example, $\bar{\bar y}$ equals `r round(mean.grand,1)`. In the following plot, we spread out the values in each level (for clarity) and measure their distances to the grand mean. Each level is assigned a unique color to distinguish the different batches.

```{r, echo=FALSE, fig.height=3, fig.width=5}
OP <- par(mar=c(2,3,1,4))
plot(c(x,y,z)~ jitter(c(x.x,y.x,z.x),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,5.5), ylim=c(23,27))
axis(1, at=c(0.8,2.3,3.75),label=c("Batch 1", "Batch 2", "Batch 3"),
     family="Architects Daughter", tick=FALSE , cex.axis=.8)
axis(2, at=c(23,24,25,26,27), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=.8)
abline(h=mean(c(x,y,z)),lty=2)
for (i in 1:length(x)){
  lines(rep(x.x[i],2), c(x[i],mean.grand), col="red")
}
for (i in 1:length(y)){
  lines(rep(y.x[i],2), c(y[i],mean.grand), col="green")
}
for (i in 1:length(z)){
  lines(rep(z.x[i],2), c(z[i],mean.grand), col="blue")
}
text(4.1,26.5, "SSE-mean\n(Difference between\nall values\nand grand mean)",
     family="Architects Daughter", cex=0.8)
text(4.8, mean.grand+.15, mean.grand, family="Architects Daughter", cex=0.8)

par(OP)
```

Next, we compare the values in each level to their respective level means. Similar to the way we compute the $SSE_{mean}$, we sum the squared differences for each level as follows:

$$
SSE = \sum (y_{batch1} - \bar y_{batch1})^2 + \sum (y_{batch2} - \bar y_{batch2})^2 + \sum (y_{batch3} - \bar y_{batch3})^2
$$

Where $SSE$ is the **error sum of squares**.

```{r, echo=FALSE, fig.height=3, fig.width=5}
b1.mean <- mean(x)
b2.mean <- mean(y)
b3.mean <- mean(z)

OP <- par(mar=c(2,3,1,4))
plot(c(x,y,z)~ jitter(c(x.x,y.x,z.x),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,5.5), ylim=c(23,27))
axis(1, at=c(0.8,2.3,3.75),label=c("Batch 1", "Batch 2", "Batch 3"),
     family="Architects Daughter", tick=FALSE , cex.axis=.8)
axis(2, at=c(23,24,25,26,27), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=.8)

abline(h=mean(c(x,y,z)),lty=2, col="grey40")

lines(c(x.x[1],x.x[5]), c(b1.mean,b1.mean), col="red")
lines(c(y.x[1],y.x[5]), c(b2.mean,b2.mean), col="green")
lines(c(z.x[1],z.x[5]), c(b3.mean,b3.mean), col="blue")

for (i in 1:length(x)){
  lines(rep(x.x[i],2), c(x[i],b1.mean), col="red")
}
for (i in 1:length(y)){
  lines(rep(y.x[i],2), c(y[i],b2.mean), col="green")
}
for (i in 1:length(z)){
  lines(rep(z.x[i],2), c(z[i],b3.mean), col="blue")
}
text(4.0,26.5, "SSE\n(Difference between\nbatch values\nand batch means)",
     family="Architects Daughter", cex=0.8)
text(4.8, mean.grand+.15, mean.grand, family="Architects Daughter", cex=0.8)

text(x.x[1]-.35, b1.mean, b1.mean, family="Architects Daughter", col="red", cex=.8)
text(y.x[1]-.35, b2.mean, b2.mean, family="Architects Daughter", col="green", cex=.8)
text(z.x[1]-.35, b3.mean, b3.mean, family="Architects Daughter", col="blue", cex=.8)


par(OP)
```

If the mean values of all three levels are the same, their horizontal lines should line up with the grand mean and both $SSE_{mean}$ and $SSE$ should be equal, if not, $SSE$ will be less than $SSE_{mean}$ (the distance between the points and their respective level mean will always be equal or shorter than their distances to the overall mean). The difference between $SSE_{mean}$ and $SSE$ is called the **treatment sum of squares** (also referred to as the **model sum of squares**):

$$
SSR = SSE_{mean} - SSE
$$

If SSR is close to 0, then the differences between the levels is small, if SSR is large, then two or more of the levels are significantly different from one another. SSR is, in fact, the squared difference between each group's mean and the grand mean. Think of two competing models to predict the observed values: the grand mean, and each level's mean. If the level means are close to the grand mean, then both the level means and grand mean will generate the same expected values. If they are different, then they will generate different expected values. So the greater the SSR, the more different the predicted values (and therefore the more different the means).

```{r echo=FALSE, fig.height=3, fig.width=5}
OP <- par(mar=c(2,3,1,4))
plot(c(x,y,z)~ jitter(c(x.x,y.x,z.x),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,5.5), ylim=c(23,27))
axis(1, at=c(0.8,2.3,3.75),label=c("Batch 1", "Batch 2", "Batch 3"),
     family="Architects Daughter", tick=FALSE , cex.axis=.8)
axis(2, at=c(23,24,25,26,27), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=.8)

abline(h=mean(c(x,y,z)),lty=2, col="grey40")

lines(c(x.x[1],x.x[5]), c(b1.mean,b1.mean), col="red")
lines(c(y.x[1],y.x[5]), c(b2.mean,b2.mean), col="green")
lines(c(z.x[1],z.x[5]), c(b3.mean,b3.mean), col="blue")

for (i in 1:length(x)){
  lines(rep(x.x[i],2), c(mean.grand,b1.mean), col="red")
}
for (i in 1:length(y)){
  lines(rep(y.x[i],2), c(mean.grand,b2.mean), col="green")
}
for (i in 1:length(z)){
  lines(rep(z.x[i],2), c(mean.grand,b3.mean), col="blue")
}
text(4.0,26.5, "SSR\n(Difference between\nmodeled values\nand grand mean)",
     family="Architects Daughter", cex=0.8)
text(4.8, mean.grand+.15, mean.grand, family="Architects Daughter", cex=.8)

text(x.x[1]-.35, b1.mean, b1.mean, family="Architects Daughter", col="red", cex=.8)
text(y.x[1]-.35, b2.mean, b2.mean, family="Architects Daughter", col="green", cex=.8)
text(z.x[1]-.35, b3.mean, b3.mean, family="Architects Daughter", col="blue", cex=.8)
par(OP)

```

The goal is to compare the variability in $SSE$ to that in $SSR$, however, $SSR$ can take on much larger values than $SSA$ (because $SSR$ is measuring from the group means and not the individual values). To remedy this bias, we compute the **mean squares** from both values by dividing the sum of squares by the **degrees of freedom**. For $SSR$, this gives us: $$
MSR = \frac{SSR}{p-1}
$$

where $MSR$ is the **mean square for treatments** (or **mean square for model**), and $p$ is the number of levels (3 in this example).

Likewise, we can compute the **mean square for error** as:

$$
MSE = \frac{SSE}{n-p}
$$

where $n$ is the total number of observations.

Next, we compute the $F$-statistic (or $F$-ratio) as:

$$
F = \frac{MSR}{MSE}
$$

$F$ gives us the proportion of the overall spread in the measurements that can be explained by the levels vs. the proportion of the overall spread in the measurements *not* explained by the levels. A value of $F$ that approaches $1$ indicates that little to none of the variability in the measurements can be explained by the levels suggesting that differences in their mean may be due to random noise only. If $F$ is much larger than one, then the spreads between levels are quite different (meaning that these differences are large contributors to the overall spread) suggesting that the observed differences in mean values are significant too.

The following graphic shows the spread for *all* values in the data (right-side plot) and its spread broken down by levels (left-side plot). In this case, it seems that a good chunk of the overall spread can be explained by the differences in spread between levels. If this is still unclear, picture the plots with the measurements for batch 2 removed; what you would note is that the overall spread will be noticeably reduced suggesting that the measurements in level two alone were an important contributor to the overall spread.

```{r fig.width=7, fig.height=2.5, fig.show='hold', echo=FALSE}
set.seed(4)
OP <- par(mar=c(2,2,1,1), mfrow=c(1,2))
plot(c(x,y,z)~ jitter(c(rep(1,length(x)),rep(2,length(y)),rep(3,length(z))),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,3.5), col=c(rep("red",5),rep("green",5),rep("blue",5)), ylim=c(23,27))
axis(1, at=c(1,2,3),label=c("Batch 1", "Batch 2", "Batch 3"),
     family="Architects Daughter", tick=FALSE )
axis(2, at=c(23,24,25,26,27), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=.8)

plot(c(x,y,z) ~ jitter(rep(1,length(c(x,y,z))), 2),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,3.5), col=c(rep("red",5),rep("green",5),rep("blue",5)), ylim=c(23,27))
axis(1, at=c(1),label=c("All values"), family="Architects Daughter", tick=FALSE )
text(2,25,"F >> 1", family="Architects Daughter")
par(OP)


```

Contrast this last example with the following dataset where the differences in spread (and mean) between levels are negligible. Removing anyone of the batches from the dataset would have a negligible impact on the overall spread in the measurements. We would say that little of the spread (variability) in the overall measurements can be explained by differences in measurements between levels; this would result in an $F$-ratio close to $1$.

```{r fig.width=7, fig.height=2.5, fig.show='hold', echo=FALSE}
set.seed(4)
x1 <- x
y1 <- y-1.1
z1 <- z

OP <- par(mar=c(2,2,1,1), mfrow=c(1,2))
plot(c(x1,y1,z1)~ jitter(c(rep(1,length(x1)),rep(2,length(y1)),rep(3,length(z1))),0.1),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,3.5), col=c(rep("red",5),rep("green",5),rep("blue",5)), ylim=c(23,26))
axis(1, at=c(1,2,3),label=c("Batch 1", "Batch 2", "Batch 3"),
     family="Architects Daughter", tick=FALSE)
axis(2, at=c(23,24,25,26), family="Architects Daughter", las=2,
     col="grey80", col.axis="grey80", cex.axis=.8)

plot(c(x1,y1,z1) ~ jitter(rep(1,length(c(x1,y1,z1))),2),
     pch=16, ylab=NA,xlab=NA,axes = F,family="Architects Daughter", cex=1.3,
     xlim=c(0,3.5), col=c(rep("red",5),rep("green",5),rep("blue",5)), ylim=c(23,26))
axis(1, at=c(1),label=c("All values"), family="Architects Daughter", tick=FALSE )
text(2,25,"F = 1", family="Architects Daughter")
par(OP)
```

To assess whether a value of $F$ is significantly greater than 1, we must compare our observed $F$ to a distribution of $F$'s we would expect to get if the means between levels were *not* different (we refer to the hypothesized $F$ values as $F_{Ho}$).

In the following plot, a hypothetical $F$ ratio (plotted as a vertical red line) is located to the right of the distribution of $F_{Ho}$ values (delineated in a black line in the following plot). The shaded pink area to the right of the hypothetical $F$ represents the fraction of $F_{Ho}$ that would be more extreme than our observed $F$. The goal is to assess whether or not we feel comfortable in concluding that our observed $F$ is significantly different from an $F$ we would expect to get if the means between batches were all the same.

```{r echo=FALSE, message=FALSE, fig.width=7, fig.height=3.0, warning=FALSE}
OP <- par("mar"=c(2,4,3,1), xpd=NA)
df1 = 7
df2 = 5
Ft  = 2.12
xmin = 0
xmax = 10
p10  = qf(0.10,df1,df2)
p025  = qf(0.025,df1,df2)
p975  = qf(0.975,df1,df2)
ncurve.x = seq(xmin, xmax,length.out= 200)
ncurve.y = df(ncurve.x, df1, df2 )
plot( ncurve.y ~ ncurve.x, type="l", ylab=NA, xlab=NA,axes=FALSE,main=NA)
axis(1, family= "Architects Daughter")
lines(x = c(Ft,Ft), y=c(0,df(Ft,df1,df2)), col="red"  , lty=1, lw=2)
lines(x = c(p975,p975) , y=c(0,df( p975,df1,df2)) , col="grey", lty=2, lw=2)

text(x=Ft+.7, y = df(Ft,df1,df2)+.3, "Distribution of F(Ho)", col="black", pos=3, family= "Architects Daughter", adj=10.5)
text(x=Ft+1.3, y = df(Ft,df1,df2), "Observed F-value", col="red", pos=3, family= "Architects Daughter", adj=10.5)

ncurve.x.mSE <- seq(Ft, xmax, length.out= 100)
ncurve.y.mSE <- df(ncurve.x.mSE, df1, df2 )
ncurve.x.mSE <- c(ncurve.x.mSE, max(ncurve.x.mSE), min(ncurve.x.mSE))
ncurve.y.mSE <- c(ncurve.y.mSE, 0,0)
polygon( ncurve.x.mSE, ncurve.y.mSE, col=rgb(1, 0, 0,.3), border=NA)
par(OP)
```

## An example in R

In this working example, we compare fecal coliform counts (represented as the log10 of organisms per 100 ml) in the Illinois River between seasons (summer, fall, winter and spring) across six years (data from Millard and Neerchal, 2001).

| Year | Summer | Fall | Winter | Spring |
|------|--------|------|--------|--------|
| 1971 | 2.00   | 1.45 | 1.45   | 1.34   |
| 1972 | 2.34   | 2.08 | 1.76   | 1.72   |
| 1973 | 2.48   | 2.32 | 2.08   | 2.04   |
| 1974 | 2.63   | 2.45 | 2.36   | 2.15   |
| 1975 | 2.81   | 2.70 | 2.49   | 2.51   |
| 1976 | 3.20   | 3.04 | 2.70   | 3.11   |

Let's first generate the data. We will create two data frames of the same data. One will be in a wide format, the other in a long format. We will use the former to generate a plot and the latter will be used in the ANOVA analysis.

```{r, tidy=FALSE}
# Create a data frame
dat <- data.frame( Year = c(1971,1972,1973,1974,1975,1976),
                   Summer = c(2.00, 2.34, 2.48, 2.63, 2.81, 3.20),
                   Fall = c(1.45, 2.08, 2.32, 2.45, 2.70, 3.04),
                   Winter = c(1.45, 1.76, 2.08, 2.36, 2.49, 2.70),
                   Spring = c(1.34, 1.72, 2.04, 2.15, 2.51, 3.11))

# However, to run an ANOVA, the table need to be in long form
library(tidyr)
dat.long <- pivot_longer(dat, names_to = "Season", values_to = "Value", -Year)
```

Next, let's plot the points then add the grand mean (black line) and each season's mean (red lines) to the plot.

```{r plot.example1, fig.width=4, fig.height=3, eval=FALSE}

# Create an empty plot
plot( NULL, xlim= c(0.5,4.5), ylim=c(1, 3.5), axes=FALSE, xlab=NA,ylab="Fecal Coliform")
box()
axis(1, labels = names(dat)[-1], at=c(1,2,3,4))
axis(2, las=2, )
points(rep(1,nrow(dat)), dat$Summer, pch=1)
points(rep(2,nrow(dat)), dat$Fall, pch=2)
points(rep(3,nrow(dat)), dat$Winter, pch=3)
points(rep(4,nrow(dat)), dat$Spring, pch=4)

# Add the grand mean
abline(h = mean(dat.long$Value), lty=3)

# Add each group's mean
lines(c(.8,1.2), rep(mean(dat$Summer),2), col="red" )
lines(c(1.8,2.2), rep(mean(dat$Fall),2), col="red" )
lines(c(2.8,3.2), rep(mean(dat$Winter),2), col="red" )
lines(c(3.8,4.2), rep(mean(dat$Spring),2), col="red" )
```

```{r fig.width=4, fig.height=3, eval=TRUE, echo=FALSE}
OP <-par(mar=c(2,3,1,1))
 <<plot.example1>>
par(OP)
```

A quick glance at the plot suggests that the variance within each season is large relative to the differences in means between each seasons. But could summer values be slightly higher than those of other seasons? We check this assumption with an ANOVA test.

### Computing ANOVA the hard way

```{r, tidy=FALSE}
SSE.m <- sum( (dat.long$Value - mean(dat.long$Value))^2 )
SSE <- sum( (dat$Summer - mean(dat$Summer))^2 +
            (dat$Fall - mean(dat$Fall))^2 +
            (dat$Winter - mean(dat$Winter))^2 +
            (dat$Spring - mean(dat$Spring))^2 )
SSR <- SSE.m - SSE
MSR <- SSR / 3   # (p - 1) or 3 degrees of freedom
MSE <- SSE / 20  # (n - p) or 20 degrees of freedom
Fratio <- MSR/MSE
p.val  <- pf(Fratio, 3, 20,lower.tail=FALSE)
```

The function `pf()` compares our $F$-ratio to the distribution of $F_{Ho}$ values one would expect to get if the means between seasons were not different. The two numbers following the $F$-ratio value are the degrees of freedom for the $MSR$ and the $MSE$ calculated from $(p-1)$ and $(n-p)$ respectively.

```{r echo=FALSE, message=FALSE, fig.width=7, fig.height=2.0, warning=FALSE}
OP <- par("mar"=c(2,4,1,1), xpd=NA)
df1 = 3
df2 = 20
Ft  = Fratio
xmin = 0
xmax = 10
ncurve.x = seq(xmin, xmax,length.out= 200)
ncurve.y = df(ncurve.x, df1, df2 )
plot( ncurve.y ~ ncurve.x, type="l", ylab=NA, xlab=NA,axes=FALSE,main=NA)
axis(1, family= "Architects Daughter")
lines(x = c(Ft,Ft), y=c(0,df(Ft,df1,df2)), col="red"  , lty=1, lw=2)
lines(x = c(p975,p975) , y=c(0,df( p975,df1,df2)) , col="grey", lty=2, lw=2)


text(x=Ft+1.2, y = df(Ft,df1,df2), "Our F-value", col="red", pos=3, family= "Architects Daughter", adj=10.5)

ncurve.x.mSE <- seq(Ft, xmax, length.out= 100)
ncurve.y.mSE <- df(ncurve.x.mSE, df1, df2 )
ncurve.x.mSE <- c(ncurve.x.mSE, max(ncurve.x.mSE), min(ncurve.x.mSE))
ncurve.y.mSE <- c(ncurve.y.mSE, 0,0)
polygon( ncurve.x.mSE, ncurve.y.mSE, col=rgb(1, 0, 0,.3), border=NA)
par(OP)
```

The computed statistics are:

| Statistic | Value               |
|-----------|---------------------|
| SSE(mean) | `r round(SSE.m,2)`  |
| SSE       | `r round(SSE,2)`    |
| SSR       | `r round(SSR,2)`    |
| MSR       | `r round(MSR,2)`    |
| MSE       | `r round(MSE,2)`    |
| F         | `r round(Fratio,2)` |
| p         | `r round(p.val,3)`  |

Our calculated p-value is `r round(p.val,3)` indicating that about `r round(p.val * 100)`% of the $F_{Ho}$ values are more extreme than ours. So it would be unwise to dismiss the chance that the means between all four seasons are equal to one another.

### Computing ANOVA the easy way

We can use the `anova` function to compute the $F$-ratio and the $p$-value. The function takes as argument a model (a linear regression model in this case) where the dependent variable $y$ is the measurement value and the independent variable $x$ is the level (or seasons in our example). This implementation of ANOVA requires that the season values be in one column (the $x$ column) and that the measurements be in another column (the $y$ column). This requires that we use the *long* version of our table, `dat.long`, where the $x$ column is labeled `Season` and the $y$ column is labeled `Value`. The first few rows of `dat.long` look like this:

```{r}
head(dat.long)
```

The ANOVA analysis is thus computed as:

```{r}
anova(lm(Value ~ Season, dat.long))
```

The column `Mean Sq` displays the mean sum-of-squares for treatment $SSR$, and the error sum-of-square, $SSE$. The $F$-ratio and $p$-value are the same as those computed in the last subsection. Again, there is no evidence that the seasons have an influence on the mean concentrations of fecal coliform counts.

### ANOVA as a regression

You'll note that this approach in computing the ANOVA makes use of the linear regression function `lm`. This is because a one-way ANOVA is nothing more than a regression between all values in the batches and their levels expressed as categorical values where the number of categorical values is the number of levels minus $1$. In essence the ANOVA is generating the following model:

$Coliform\ count = a + b(Spring) + c(Summer) + d(Winter)$

You'll note the apparent omission of the $Fall$ category. It's actually wrapped in the $a$ coefficient. (Note that the function assigns the category to $a$ based on the alphanumeric order of the category names, or based on level order if the categories are stored as a [factor](https://mgimond.github.io/ES218/data_objects.html#factors_xtr)). So, if a value belongs to the $Fall$ batch, the model looks like this:

$Coliform\ count = a + b(0) + c(0) + d(0)$

If the value belongs to the SUMMER batch, the model looks like this:

$Coliform\ count = a + b(0) + c(1) + d(0)$

It follows that the coefficients $b$, $c$ and $d$ are differences in mean values between fall measurements and spring, summer and winter measurements respectively:

$a$ = `mean(dat$Fall)` = `r round(mean(dat$Fall),2)`

$b$ = `mean(dat$Spring) - mean(dat$Fall)` = `r round(mean(dat$Spring)- mean(dat$Fall),2)`

$c$ = `mean(dat$Summer) - mean(dat$Fall)` = `r round(mean(dat$Summer)- mean(dat$Fall),2)`

$d$ = `mean(dat$Winter) - mean(dat$Fall)` = `r round(mean(dat$Winter)- mean(dat$Fall),2)`

These coefficients can be extracted from the `lm` model via the `coefficients` function.

```{r, tidy=FALSE}
coefficients( lm( Value ~ Season, dat.long) )
```

> Note that R has a built-in function called `aov()` that can be used in lieu of `lm()`. The computed model will be the same but the functions differ slightly in their methods. Examples of its use are presented later in the tutorial.

# Assumptions of ANOVA

There are three assumptions that should be met when computing an ANOVA:

1.  The response variable must be quantitative (i.e. measured on a continuous scale). It should not be a count (i.e. the result of qualitative outcome such as yes/no).
2.  The variance between the batches (homogeneity of variance) should be similar.
3.  Observations should be independent of one another.
4.  The distribution of values within each group should be normally distributed. If a transformation is applied to the data, it should be applied to *all* batches. Note that the distribution of values for all combined batches need not be normally distributed.

You can check these assumptions graphically by plotting the results of the ANOVA analysis.

```{r fig.width=6, fig.height=3}
OP <- par(mfrow=c(1,2))
plot(lm(Value ~ Season, dat.long), 1:2)
par(OP)
```

The *Residuals-Fit* plot checks that the variance is constant across groups. What you want to avoid is a funnel like shape to the data (which may be present in this example). The *Normal Q-Q* plot checks for normality in the residuals--the closer the points fit the diagonal line, the better.

If normality in the residuals seems to be an issue, then you might want to consider re-expressing your data (note that we already re-expressed the coliform count by logging its values--had we worked with the original data the assumption of normality would have probably been violated).

If the assumption of equal variance is violated, you might want to run the analysis using `oneway.test()` (which applies **Welch's F** to the data):

```{r}
oneway.test(Value ~ Season, dat.long)
```

Note the slightly higher F-value and p-value (but not enough to change the analysis outcome)

# Identifying which levels are different

So far, we have seen that the ANOVA test tells us if the means between batches (levels) are similar, but it does not tell us *which* batch(es) of numbers may be different. One approach in assessing which level(s) is(are) significantly different from the mean is to implement a **post hoc** procedure (aka **pairwise comparison**).

There are many *post hoc* procedures to choose from. Two popular methods are presented here: the **Bonferroni** procedure and **Tukey's HSD** procedure. Both tend to generate slightly different results depending on the data characteristics. It's therefore worthwhile to run both to assess if the levels are significantly different from one another. But note that in many cases there is no substitute for a visual assessment of the differences between batches.

## Bonferroni

The Bonferroni method can be implemented using the `pairwise.t.test` function with the first two parameters pointing to the column of values and the column of variables respectively, and the third parameter pointing to the pairwise method.

```{r tidy=FALSE}
pairwise.t.test( dat.long$Value, dat.long$Season, p.adjust.method = "bonferroni")
```

The output is a matrix of $p$-values for different pairs of levels (Fall-Summer, Winter-Summer, etc...). A low $P$-value indicates significant differences between levels. In our working example, all $P$-values are relatively high indicating that the means between levels are not statistically significant.

## Tukey's HSD

The Tukey method is implemented using the `TukeyHSD` method. The first parameter is an `aov` object (`aov` is nearly identical to the `anova(lm(...))` method used thus far), and the second variable is the *levels* column from the data.

```{r tidy=FALSE}
TukeyHSD( aov( Value ~ Season, dat.long), "Season")
```

The Tukey method generates a table of $P$-values (as opposed to a matrix) for all pairs of levels. Note the slightly lower $P$-values than those generated with the Bonferroni method. The Tukey HSD method tends to be more conservative than the Bonferri method and is thus more likely to reject the null hypothesis (that the means are equal) than the Bonferri method. This comes with an advantage, however, in that the Tukey test tends to have greater statistical *power* when there are many different levels (or batches) in our dataset.

The Tukey test also generates the 95% confidence intervals (`lwr` = lower bound and `upr` = upper bound). If the lower bound is less than $0$ and the upper bound is greater than $0$ we cannot say that the means between both levels are not significantly different at a confidence level of 0.05.

## Assumptions of the post hoc test

The *post* hoc procedure assumes that we are only interested in knowing which effect (if any) is different from the grand mean. It makes no assumption about the direction of this difference (i.e. if the level mean is greater than or less than the overall mean). This is analogous to implementing a two-tailed test.

If knowing whether the effect mean is greater than or less than the overall mean, a **planned contrast** procedure (aka **planned comparison** procedure) should be adopted instead. This procedure is not covered in this tutorial, but can be found in most introductory stats books.

# Another working example

The coliform dataset was an example of effects that did not differ. Here, we'll look at an example with differences in effect means. We'll use a dataset that comes installed with `R`, the `ChickWeight` data. This dataset tracks the weights of chicks using different diet types. We will grab a subset of the data and only focus on day 21 measurements. We will also restrict the columns to `Weight` and `Diet`.

```{r}
dat2 <- ChickWeight[ChickWeight$Time==21,c("weight", "Diet")]
```

The group column is `Diet` which has four levels. These levels are identified as numbers ($1$ through $4$). One needs to be careful when running an ANOVA using levels encoded as numbers since the `lm` model may interpret such values as continuous and not categorical. It's therefore good practice to force the numbers as `factors` by using the `as.factor` function:

```{r}
dat2$Diet <- as.factor(dat2$Diet)
```

Note that in this particular case, the conversion was not necessary since the `chickweight` table from which we extracted the values stored the `Diet` field as a factor (and this attribute was carried over to our `dat` data frame).

Next, let's plot the points along with the grand mean (dashed line) and each levels' mean value (symbolized as a filled circle).

```{r fig.width=4, fig.height=4}
plot(weight ~ as.numeric(Diet), dat2, col=Diet, pch=3,cex=.6)
abline( h = mean(dat2$weight), lty=2)
points( unique(dat2$Diet) , by(dat2$weight, dat2[,"Diet"], mean) , pch=19, col=unique(dat2$Diet))
```

One level, diet 3, seems to be a bit different from the others, but we will need to run and ANOVA to make sure of this.

```{r}
anova(lm(weight ~ Diet, dat2))
```

The small $p$-value suggests that not all levels have the same mean value. We can run a *post hoc* test to identify which level may be different

```{r}
pairwise.t.test( dat2$weight, dat2$Diet, p.adjust.method = "bonferroni")
TukeyHSD( aov( weight ~ Diet, dat2), "Diet")
```

Both tests suggest that diet 3 is significantly different from diet 1, but the differences between the other levels are not significant.

# Two-way ANOVA

A two-way ANOVA adds another factor (explanatory variable) to the analysis. For example, in the earlier fecal coliform count dataset (Section 2.2), coliform counts were only compared across season yet, another piece of information is made available to us: year. We could choose to treat year as a second factor. This would add a *second* hypothesis to our analysis, that being *"can **year** help explain the variability in coliform count"* (the first hypothesis asks if seasons can explain the variability in the coliform count).

While a one-way ANOVA does not make sense if the data are devoid of replicates, a two-way ANOVA can be conducted if the data do not have replicates (referred to as an **unreplicated two-way ANOVA**). The implementation does not differ between both scenarios except when testing for interaction (described further below).

The `dat.long` dataframe is a *replicatless* dataset. What follows is a *replicated* version of the dataset (used later in demonstrating for interactions). Note that this dataset is used for instructional purposes only and is *not* the way you would create replicate measurements!

```{r}
set.seed(321)
datr.long <- data.frame( Year   = rep(rep(c(1971,1972,1973,1974,1975,1976),each=3),4),
                         Season = rep(c("Summer","Fall","Winter","Spring"),each=18),
                         Value  = rep(dat.long$Value,each=3) + rnorm(72,0,0.1))
```

In the following examples, we will use the `dat.long` dataset but note that the same code chunks can be applied to the `datr.long` dataset.

First we'll visualize the variability across each factor via boxplots:

```{r fig.height=2.3, echo=2:3}
OP <- par(mar=c(3,3,1,1), mfrow=c(1,2))
boxplot(Value ~ Year, dat.long)
boxplot(Value ~ Season, dat.long)
par(OP)
```

The differences in (logged) coliform counts across years appear significant from the plots. More so than the differences across seasons.

A two-way ANOVA test assesses how much of an effect (if any) each factor has on the variable. The implementation in R is simple--just add the `Year` variable to the one-way ANOVA function shown in Section 2.2.2.

```{r}
anova(lm(Value ~ Season + as.factor(Year), dat.long))
```

Note that we are wrapping the `Year` variable with the `as.factor` function to force R to treat that variable as a factor and not a continuous variable (it's currently stored in R as a `numeric` and *not* a `factor`).

The p-values appear significant for *both* factors (you'll recall that an earlier one-way analysis suggested that seasons had little influence on the variability on coliform counts). This discrepancy can be partially explained by looking at a boxplot of the *residuals* vs. `Season` (i.e. the residuals that remain after accounting for `Year` alone in the model).

```{r fig.height=2.3, fig.width=4, echo=2:3}
OP <- par(mar=c(3,3,1,1))
res <- residuals(lm(Value ~ as.factor(Year), dat.long))
boxplot(res ~ dat.long$Season)
par(OP)
```

Accounting for variability across years reveals a greater spread in values across seasons (hence the drop in that factor's p-value in a two-way analysis).

The two-factor model coefficients can be extracted as follows:

```{r}
coefficients(lm(Value ~ Season + as.factor(Year), dat.long))
```

The model is thus,

$Coliform_{log} = 1.60 - 0.19 Spring +0.24 Summer -0.20 Winter + \\ 0.42 Year_{1972} +0.67 Year_{1973} +0.840 Year_{1974} + 1.07 Year_{1975} + 1.45 Year_{1976}$

So if we want to predict (log) Coliform counts for the Fall season of 1976, the model would take the form:

$Coliform_{log} = 1.60 - 0.19(0) +0.24(0) -0.20(0) + 0.42 (0) + 0.67(0) +0.840(0) + 1.07(0) + 1.45(1)$

or

$Coliform_{log} = 1.60 + 1.45(1)$

Note that the Fall coefficient, like the 1971 coefficient, is wrapped in the intercept term.

## Checking for interaction (non-additive model)

> Note that it makes no sense to test for interactivity if you do not have replicates (i.e. if you have just one case per factor level combinations) since you may not be able to distinguish the error terms from the residuals. In the examples that follow, we'll make use of the `datr.long` dataset which *has* (simulated) replicates.

We need to check that the additive model we've chosen is appropriate for this dataset. A graphical approach involves plotting the values as a function of one factor using a line graph (where each line represents the levels of the other factor). This is sometimes referred to as an *interactive plot*. We'll make use of the built-in `interaction.plot` function where one of the factors (`Year`) will be mapped to the x-axis and the other factor (`Season`) will be mapped to the line plots. This function does not have a `data=...` parameter telling it which data frame to work off of so we'll wrap the function with the `with()` function which defines the data frame from which the variables are to be read from.

```{r fig.height=2.5, fig.width=5, echo=2}
OP <- par(mar=c(3,3,1,3))
with(datr.long, interaction.plot(as.factor(Year), Season, Value))
par(OP)
```

We are seeking parallel lines. The more *non-parallel* the lines, the more unlikely an additive model is a proper fit. If the lines are clearly not parallel, then there is interaction between the factors and an interaction term is probably needed in the model. If we chose to add an interaction component to our model, we would simply augment the formula by adding `as.factor(Year):Season`:

```{r}
anova(lm(Value ~ Season + as.factor(Year) + as.factor(Year):Season, datr.long))
```

The p-value associated with the interactive term `as.factor(Year):Season` is small suggesting that the interaction should not be ignored. Note that had the two-way ANOVA been run on the replicate-free version of the data (`dat.long`), the code chunk would not have generated p-values.

Coefficients can be extracted using the `coefficients()` function.

```{r}
coefficients(lm(Value ~ Season + as.factor(Year) + as.factor(Year):Season, datr.long))
```

Note that one disadvantage in having so many terms in the model is that you reduce the *power* of the model. In this example we now have 23 variables and a little more than three times as many measurements (72 to be exact). It's always good practice to seek the most parsimonious model possible. This may involve, for example, grouping factor levels if theory supports it.

## Extracting effects for each level

The regression model will wrap one level's effect from each factor into the intercept term however, you might wish to know each level's effect on the measured variable. This requires decomposing the model's intercept into an overall mean measurement and two level effects (one for each factor). This can be done by running the ANOVA analysis using `aov()` then wrapping its output with the `model.tables()` function as follows:

```{r}
M1 <- aov(Value ~ Season + as.factor(Year), dat.long)
model.tables(M1)
```

The overall mean (and new model intercept) can be extracted from the original table as follows:

```{r}
mean(dat.long$Value)
```

While this approach expands the model by adding two more variables, it does allow one to quantify the effect each level has in explaining the coliform counts. For example, the model suggests that the overall mean (log) coliform count is `2.3` but the Spring seasons lower that value by about `-0.155`. Likewise, we can state that the year 1975 increases the overall mean by about `0.327`. It's also clear from the output that 1971 had the greatest effect on coliform count by dropping the overall value by `-0.7404`.

Note, however, that if the model has a significant cross-factor interaction effect, you should not attempt to draw anything too meaningful from the main level effects. In such a scenario, the most meaningful result to glean from the model is the interaction between the factors.

# References

Millard S.P, Neerchal N.K., *Environmental Statistics with S-Plus*, 2001.\
McClave J.T., Dietrich F.H., *Statistics*, 4th edition, 1988.\
Vik P., *Regression, ANOVA, and the General Linear Model: A Statistics Primer*, 2013.