islr/islr.qmd at main · jeffreyCarlLong/islr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Introduction to Statistcal Learning"
subtitle: "with Applications in R (ISLR) abstracted to Biomarker Discovery and Prediction"
author: "Jeffrey Long"
format: html
editor: visual
---

## A Quarto Experiment

Quarto enables you to weave together content and executable code into a finished document.

ISLR second edition is available for the author in [a free pdf format](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf).

This is my biomarker experiment with Quarto and ISLR.

## History

The Elements of Statistical Learning (ESL, by Hastie, Tibshirani, and Friedman) --- was published in 2001. The authors of ISLR2 (the resource of this experiment) are Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

## Contents

1.  Introduction
2.  Statistical Learning
3.  Linear Regression
4.  Classification
5.  Resampling Methods
6.  Linear Model Selection and Regularization
7.  Moving Beyond Linearity
8.  Tree-Based Methods
9.  Support Vector Machines
10. Deep Learning
11. Survival Analysis and Censored Data
12. Unsupervised Learning
13. Multiple Testing

## Introduction

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for pre- dicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and struc- ture from such data.

**The `Wage` data** involves predicting a continuous or quantitative output value. This is often referred to as a regression problem. However, in certain cases we may instead wish to predict a non-numerical value---that is, a categorical or qualitative output.

An example of predicting non-numerical values, such as up/down is explored with **the `Smarket` data** are the Standard & Poor's 500 (S&P) stock index over a 5-year period between 2001 and 2005.

Imagine the first steps of a search for biomarkers in gene expression data. We might have demographic information for a number of patients with an indication. We may wish to understand which types of patients are similar to each other by grouping individuals according to observed characteristics. This is a *clustering* problem. Unlike `Wage` and `Smarket` data, here we are not trying to predict an output variable. We will use **the NCI60 data set** to determine groups or clusters based on 6,830 gene expression measurements for 64 cell lines.

In this particular data set, it turns out that the cell lines correspond to 14 different types of cancer. There is clear evidence that cell lines with the same cancer type tend to be located near each other in clustering methods.

We will use *n* to represent the number of distinct data points, or observations, in our sample. We will let *p* denote the number of variables that are available for use in making predictions. For example, the `Wage` data set consists of 11 variables for 3,000 people, so we have *n* = 3,000 observations and *p* = 11 variables (such as year, age, race, and more). We indicate variable names using the font: `Variable Name`.

We will let $x_{ij}$ represent the value of the *j*th variable for the *i*th observation, where *i* = 1,2,...,*n* and *j* = 1,2,...,*p*. Throughout this experiment, *i* will be used to index the samples or observations (from 1 to *n*) and j will be used to index the variables (from 1 to *p*). We let $\chi$ denote an $n × p$ matrix whose (*i*, *j*)th element is $x_{ij}$ .

The product of **A** and **B** is denoted **AB**. The (*i*,*j*)th element of **AB** is computed by multiplying each element of the *i*th row of **A** by the corresponding element of the *j*th column of **B**. That is, $(AB)_{ij} = \Sigma^d_{k=1} a_{ik}b_{kj}$. As an example,

consider $A = \begin{pmatrix}1 & 2\\3 & 4\end{pmatrix}$ and $B = \begin{pmatrix}5 & 6\\7 & 8\end{pmatrix}$ .

$AB = \begin{pmatrix}1 & 2\\3 & 4\end{pmatrix}\begin{pmatrix}5 & 6\\7 & 8\end{pmatrix} = \begin{pmatrix}1*5+2*7 & 1*6+2*8\\3*5+4*7 & 3*6+4*8\end{pmatrix} = \begin{pmatrix}19 & 22\\43 & 50\end{pmatrix}$

Note that this operation produces an $r × s$ matrix. It is only possible to compute **AB** if the number of columns of **A** is the same as the number of rows of **B**.

```{r}
# install.packages("ISLR2")
library(ISLR2)
packageDescription("ISLR2")

```

The website for ISLR2 is [www.statlearning.com](https://www.statlearning.com/).

## Statistical Learning

The input variables are typically denoted using the symbol X, with a subscript to distinguish them. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables, and is typically denoted using the symbol Y.

Suppose that we observe a quantitative response Y and p different predictors, X1, X2, ... , Xp. We assume that there is some relationship between Y and X = (X1,X2,...,Xp), which can be written in the very general form

$Y =f(X)+\epsilon$.

Here f is some fixed but unknown function of $X_1, \hdots, X_p$ , and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the systematic information that $X$ provides about $Y$ . However, the function $f$ that connects the input variable to the output variable is in general unknown. In this situation one must estimate $f$ based on the observed points. Overall, the errors have approximately mean zero. The function $f$ may involve more than one input variable.

### Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.

#### Prediction

In many situations, a set of inputs $X$ are readily available, but the output $Y$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $Y$ using

$\hat{Y} = f(X)$ ,

where $\hat{f}$ represents our estimate for $f$ , and $\hat{Y}$ represents the resulting prediction for $Y$ . In this setting, $\hat{f}$ is often treated as a *black box*, in the sense that one is not typically concerned with the exact form of $\hat{f}$ , provided that it yields accurate predictions for $Y$ .

The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the *reducible error* and the *irreducible error*. In general, $\hat{f}$ will not be a perfect estimate for \$f\$, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$ . However, even if it were possible to form a perfect estimate for $f$ , so that our estimated response took the form $\hat{Y} = f(X)$ , our prediction would still have some error in it! This is because $Y$ is also a function of $\epsilon$ , which, by definition, cannot be predicted using $X$ . Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate $f$ , we cannot reduce the error introduced by $\epsilon$ .

#### Inference

We are often interested in understanding the association between $Y$ and $X_1,...,X_p$ . In this situation we wish to estimate $f$ , but our goal is not necessarily to make predictions for $Y$ . Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with $Y$ . Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.

• What is the relationship between the response and each predictor? Some predictors may have a positive relationship with $Y$ , in the sense that larger values of the predictor are associated with larger values of $Y$ . Other predictors may have the opposite relationship. Depending on the complexity of $f$ , the relationship between the response and a given predictor may also depend on the values of the other predictors.

• Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating $f$ have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.

### How Do We Estimate $f$ ?

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$ .

-   training data - observations and responses

-   parametric methods - model-based approach that makes an assumption about $f$ i.e. *linear model*, then fits or trains the model i.e. *least squares*. Beware of overfitting.

-   non-parametric - methods do not make explicit assumptions about the functional form of $f$ . Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for $f$ , they have the potential to accurately fit a wider range of possible shapes for $f$ . Any parametric approach brings with it the possibility that the functional form used to estimate $f$ is very different from the true $f$ , in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of $f$ is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$ .

### The Trade-Off Between Prediction Accuracy and Model Interpretability

Some methods such as linear regression are less flexible, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate $f$ , such as linear function. Other methods such as thin plate splines are considerably more flexible because they can generate a much wider range of possible shapes to estimate $f$ .

![A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.](interpretability_v_flexibility.png)

However, if we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between $Y$ and $X_1,X_2,…,X_p$ .

Least squares linear regression, is relatively inflexible but is quite interpretable. The lasso, relies upon the linear model but uses an alternative fitting procedure for estimating the coefficients $\beta_0, \beta_1, . . . , \beta_p$ . The *lasso* is more restrictive in es- timating the coefficients, and sets a number of them to exactly zero. Hence in this sense the lasso is a less flexible approach than linear regression. It is also more interpretable than linear regression, because in the final model the response variable will only be related to a small subset of the predictors---namely, those with nonzero coefficient estimates. Generalized additive models (GAMs) instead extend the linear model to allow for certain non-linear relationships. Consequently, GAMs are more flexible than linear regression. They are also somewhat less interpretable than linear regression, because the relationship between each predictor and the response is now modeled using a curve. Finally, fully non-linear methods such as bagging, boosting, support vector machines with non-linear kernels, and neural networks (deep learning), are highly flexible approaches that are harder to interpret.

When prediction is the goal, one may suspect the most flexible method is the best choice as it would give the most accurate prediction. However, overfitting often leads us to less flexible methods with improved performance.

### Supervised Versus Unsupervised Learning

In supervised learning problems, for each observation of the predictor measurement(s) $x_i, i = 1, . . . , n$ there is an associated response measurement $y_i$ . We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression,as well as more modern approaches such as GAM, boosting, and support vector machines, operate in the supervised learning domain.

Unsupervised learning describes the somewhat more challenging situation in which for every observation $i = 1,…,n$ , we observe a vector of measurements $x_i$ but no associated response $y_i$ . It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis.

One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of $x_1,…,x_n$ , whether the observations fall into relatively distinct groups. Identifying such groups can be of interest because it might be that the groups differ with respect to some property of interest.

### Regression Versus Classification Problems

Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. In contrast, qualitative variables take on values in one of K different classes, or categories. We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response. K-nearest neighbors and boosting can be used in the case of either quantitative or qualitative responses.

We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important.

## Assessing Model Accuracy

No one method dominates all others over all possible data sets. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

### Measuring Quality of Fit

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by

$MSE = \frac{1}{n}\Sigma^n_{i=1}(y_i-\hat{f}(x_i))^2$ ,

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation.

The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.

The MSE in is computed using the training data that was used tofit the model, and so should more accurately be referred to as the training MSE. But in general, we do not really care how well the method works training on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. Suppose that we have clinical measurements (e.g. weight, blood pressure, height, age, family history of disease) for a number of patients, as well as information about whether each patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for future patients based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.

We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. If we had a large number of test observations, we could compute

$Ave(y_0 - \hat{f}(x_0))^2$ ,

the average squared prediction error for these test observations $(x_0,y_0)$ . We'd like to select the model for which this quantity is as small as possible.

The degrees of freedom is a quantity that summarizes the flexibility of a curve. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. Regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.

Plotting of MSE versus degrees of freedom as model flexibility can guide model selection. Cross-validation can estimate the minimum test MSE using the training data.

### The Bias-Variance Trade-Off

It is possible to show that the **expected test MSE**, for a given value $x_0$ , can always be decomposed into the sum of three fundamental quantities: the variance of \$f(x_0)\$, the squared bias of $f(x_0)$ and the variance of the error terms $\epsilon$ .

$E(y_0-\hat{f}(x_0))2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))^2 + Var(\epsilon)$ .

we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below $Var(\epsilon)$ , the irreducible error.

Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. In general, more flexible statistical methods have higher variance. Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.

![Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.](var_bias_error.png)

In a real-life situation in which $f$ is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the **bias-variance trade-off** in mind.

### The Classification Setting

When response variable $y$ is qualitative, the most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the training error rate, the proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations:

$\frac{1}{n}\Sigma^n_{i=1}I(y_i!=\hat{y}_i)$ .

Here $\hat{y}_i$ is the predicted class label for the $i$ th observation using $f$ . And $I(y_i != \hat{y}_i)$ is an *indicator variable* that equals 1 if $y_i ̸!= \hat{y}_i$ and zero if $y_i = \hat{y}_i$ . If $I(y_i ̸!= \hat{y}_i) = 0$ then the $i$ th observation was classified correctly by our classification method; otherwise it was misclassified. Hence the training error computes the fraction of incorrect classifications.

The test error rate associated with a set of test observations of the form $(x_0, y_0)$ is given by

$Ave(I(y_0 ̸!=ˆ\hat{y}_0))$ ,

where $\hat{y}_0$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_0$. A good classifier is one for which the test error is smallest.

### The Bayes Classifier

The test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector $x_0$ to the class $j$ for which

$Pr(Y = j|X = x_0)$

is largest. This conditional probability is the probability that $Y = j$ , given the observed predictor vector $x_0$ . This very simple classifier is called the *Bayes classifier*. In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if $Pr(Y = 1|X = x_0) > 0.5$ , and class two otherwise.

![A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.](bayes_classifier.png)

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which is largest, the error rate will be $1−max_j Pr(Y = j|X = x_0) at X = x_0$ . In general, the overall Bayes error rate is given by

$1−E(max_jPr(Y =j|X))$ ,

where the expectation averages the probability over all possible values of $X$ . If the Bayes error rate is greater than zero, the classes overlap in the true population so $max_j Pr(Y = j|X = x_0) < 1$ for some values of $x_0$ . The Bayes error rate is analogous to the irreducible error.

### K-Nearest Neighbors

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of $Y$ given $X$ , and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of $Y$ given $X$ , and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer $K$ and a test observation $x_0$ , the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$ , represented by $N_0$ . It then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$ :

$Pr(Y =j|X =x_0)= \frac{1}{K} \Sigma_{i\element N_0} I(y_i =j)$ .

Finally, KNN classifies the test observation $x_0$ to the class with the largest probability.

![Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). -from Wikipedia.](KnnClassification.svg)

![The black curve indicates the KNN decision boundary on data using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.](knn_k10.png)

![A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on data. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.](knn_k1_k100.png)

![The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations) on the data, as the level of flexibility (assessed using 1/K on the log scale) increases, or equivalently as the number of neighbors K decreases. The black dashed line indicates the Bayes error rate. The jumpiness of the curves is due to the small size of the training data set.](knn_training_error_rate.png)

The test error exhibits a characteristic U-shape, declining at first (with a minimum at approximately K = 10) before increasing again when the method becomes excessively flexible and overfits.

## Lab in R

Posit (formerly known as RStudio), provides an R integrated development environment (IDE).

```{r}
# Basic Commands
x <- c(1, 3, 2, 5)
x = c(1, 6, 2)
y = c(1, 4, 3)
length(x)
length(y)
x+y
ls()
rm(x, y)
ls()
x = c(1, 6, 2)
y = c(1, 4, 3)
rm(list = ls())
?matrix
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
sqrt(x)
x <- rnorm(50)
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x, y)
set.seed(1303)
rnorm(50)
set.seed(3)
y <- rnorm(100)
mean(y)
var(y)
sqrt(var(y)) # SD
sd(y)
# Graphics
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
plot(x, y, xlab = "this is the x-axis",
ylab = "this is the y-axis", main = "Plot of X vs Y")
pdf("Figure.pdf")
plot(x, y, col = "green")
dev.off()
x <- seq(1, 10)
x
x <- 1:10
x
x <- seq(-pi, pi, length = 50)
x
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
fa <- (f - t(f)) / 2
contour(x, y, fa, nlevels = 15)
image(x, y, fa)
persp(x, y, fa)
persp(x, y, fa, theta = 30)
persp(x, y, fa, theta = 30, phi = 20)
persp(x, y, fa, theta = 30, phi = 70)
persp(x, y, fa, theta = 30, phi = 40)
# Indexing Data
A <- matrix(1:16, 4, 4)
A
A[2,3]
A[c(1, 3), c(2, 4)]
A[1:3, 2:4]
A[1:2, ]
A[-c(1, 3), ]
dim(A)
# Loading Data
library(ISLR2)
head(Auto)
write.table(Auto, "Auto.data")
Auto <- read.table("Auto.data")
Auto <- read.table("Auto.data", header = T, na.strings = "?", stringsAsFactors = T)
# Auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = T)
Auto <- na.omit(Auto)
names(Auto)
# Additional Graphical and Numerical Summaries
plot(Auto$cylinders, Auto$mpg)
attach(Auto)
plot(cylinders, mpg)
cylinders <- as.factor(cylinders)
plot(cylinders, mpg)
plot(cylinders, mpg, col = "red")
plot(cylinders, mpg, col = "red", varwidth = T)
plot(cylinders, mpg, col = "red", varwidth = T, horizontal = T)
plot(cylinders, mpg, col = "red", varwidth = T, xlab = "cylinders", ylab = "MPG")
hist(mpg)
hist(mpg, col = 2)
hist(mpg, col = 2, breaks = 15)
plot(horsepower, mpg)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
pairs(iris, log = 1:4, # log the first four
      main = "Lengths and Widths in [log]", line.main=1.5, oma=c(2,2,3,2))
names(iris)
plot(iris$Sepal.Length,iris$Petal.Length)
summary(iris)

college <- College
names(college)
rownames(college)
summary(college)
pairs(college[,1:5])
?Boston
```

## Linear Regression

Linear regression is a useful tool for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$ . It assumes that there is approximately a linear relationship between $X$ and $Y$ .

$Y ≈ β_0 + β_1X$ .

Patient survival rate may be approximately modeled as response to biomarker A quantity. Then we regress patient survival rate onto biomarker A quantity by fitting the model

$patient\_survival\_rate ≈ β_0 + β_1 * biomarker\_A\_quantity$ .

$β_0$ and $β_1$ are the intercept and slope terms, unknown constants. They are the model coefficients or parameters. These will be derived by the training data, then we may predict patient survival rate of a biomarker A quantity by

$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$ ,

where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$.

### Estimating Coefficients

In practice, $\beta_0$ and $\beta_1$ are unknown. So we must use data to estimate the coefficients. Let

$(x_1,y_1), (x_2,y_2),…, (x_n,y_n)$

represent n observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$ . Our goal is to obtain coefficient estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ such that the linear model fits the available data well---that is, so that $y_i ≈ \hat{\beta_0} + \hat{\beta_1}x_i$ for \$i = 1, . . . , n\$. In other words, we want to find an intercept $\hat{\beta_0}$ and a slope $\hat{\beta_1}$ such that the resulting line is as close as possible to the data points. There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.

Let $\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$ be the prediction for $Y$ based on the $i$th value of $X$ . Then $e_i = y_i − \hat{y_i}$ represents the $i$ th residual---this is the difference between the $i$ th observed response value and the $i$ th response value that is predicted by our linear model. We define the residual sum of squares (RSS) as

$RSS=e^2_1 +e^2_2 +···+e^2_n$ ,

or equivalently as

$RSS=(y_1−\hat{\beta_0}−\hat{\beta_1}x_1)^2+(y_2−\hat{\beta_0}−\hat{\beta_1}x_2)^2+···+(y_n−\hat{\beta_0}−\hat{\beta_1}x_n)^2$.

The least squares approach chooses $\hat{\beta_0}$ and $\hat{\beta_1}$ to minimize the RSS. Using some calculus, one can show that the minimizers are

$\hat{\beta_1}=\frac{\Sigma^n_{i=1}(x_i − \bar{x})(y_i − \bar{y})}{\Sigma^n_{i=1}(x_i - \bar{x})^2}$ ,

$\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$ ,

where $\bar{y} = \frac{1}{n}\Sigma^n_{i=1}y_i$ and $\bar{x} = \frac{1}{n}\Sigma^n_{i=1}x_i$ are the sample means. In other words, this defines the least squares coefficient estimates for simple linear regression.

### Assessing the Accuracy of the Coefficient Estimates

We assume that the \_true\_ relationship between $X$ and $Y$ takes the form $Y = f(X) + \epsilon$ for some unknown function \$f\$, where $\epsilon$ is a mean-zero random error term. If $f$ is to be approximated by a linear function, then we can write this relationship as

$Y = \beta_0 + \beta_1 X + \epsilon$ .

Here $\beta_0$ is the intercept term---that is, the expected value of $Y$ when $X = 0$ , and $\beta_1$ is the slope---the average increase in $Y$ associated with a one-unit increase in $X$ . The error term is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in $Y$ , and there may be measurement error. We typically assume that the error term is independent of $X$ .

This model defines the *population regression line*, which is the best linear approximation to the true relationship between $X$ and $Y$ . The least squares regression coefficient estimates characterize the least squares line.

The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In other words, in real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.

A reasonable estimate of the population mean $\hat{\mu} = \bar{y}$ , is the sample mean, where $\bar{y} = \frac{1}{n}\Sigma^n_{i=1}y_i$ . In the same way, the unknown coefficients $\beta_0$ and $\beta_1$ in linear regression define the population regression line. We seek to estimate these unknown coefficients using $\hat{\beta_0}$ and $\hat{\beta_1}$ . These coefficient estimates define the least squares line.

To quantify how accurate the sample mean $\hat{\mu}$ 's estimation is of $\mu$ , we compute the standard error of $SE(\hat{\mu})$ .

$Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$ ,

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$ . This holds provided that the $n$ observations are uncorrelated. The standard error tells us the average amount that this estimate $\hat{\mu}$ differs from the actual value of $\mu$ . Also note, this deviation shrinks with $n$ ---the more observations we have, the smaller the standard error of $\hat{\mu}$ . In a similar vein, we can wonder how close $\hat{\beta_0}$ and $\hat{\beta_1}$ are to the true values $\beta_0$ and $\beta_1$ . To compute the standard errors associated with $\hat{\beta_0}$ and $\hat{\beta_1}$ , we use the following formulas:

$SE(\hat{\beta_0})^2 = \sigma^2\big[\frac{1}{n} + \frac{\bar{x}^2}{\Sigma^n_{i=1}(x_i-\bar{x})^2}\big]$ ,

$SE(\hat{\beta_1})^2 = \sigma^2\big[\frac{1}{n} + \frac{\sigma^2}{\Sigma^n_{i=1}(x_i-\bar{x})^2}\big]$ ,

where $\sigma^2 = Var(\epsilon)$ . For these formulas to be strictly valid, we need to assume that the errors $\epsilon_i$ for each observation have common variance $\sigma^2$ and are uncorrelated. Notice in the formula that $SE(\hat{\beta_1})$ is smaller when the $x_i$ are more spread out; intuitively we have more *leverage* to estimate a slope when this is the case. We also see that $SE(\hat{\beta_0})$ would be the same as $SE(\hat{\mu})$ if $\bar{x}$ were zero (in which case $\hat{\beta_0}$ would be equal to $\bar{y}$ . In general, $\sigma^2$ is not known, but can be estimated from the data. This estimate of $\sigma$ is known as the *residual standard error* , and is given by $RSE = \sqrt{RSS/(n-2)}$ .

Standard errors can be used to compute *confidence intervals*. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data. A 95% confidence interval has the following prop- erty: if we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter. For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form

$\hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1})$

That is there is approximately a 95% chance that the interval

$\big[\hat{\beta_1} - 2 \cdot SE(\hat{\beta_1}), \hat{\beta_1} + 2 \cdot SE(\hat{\beta_1)}\big]$

will contain the true value of $\beta_1$ . Approximately for several reasons. The equation relies on the assumption that the errors are Gaussian. Also, the factor of 2 in front of the $SE(\hat{\beta_1})$ term will vary slightly depending on the number of observations $n$ in the linear regression. To be precise, rather than the number 2, the equation should contain the 97.5% quantile of a t-distribution with $n−2$ degrees of freedom. Similarly, a confidence interval for $\beta_0$ approximately takes the form

$\hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0})$ .

Standard errors can also be used to perform *hypothesis tests* on the coefficients. The most common hypothesis test involves testing the *null hypothesis* of

$H_0$ : There is no relationship between $X$ and $Y$

versus the alternative hypothesis

$H_a$ : There is some relationship between $X$ and $Y$ . Mathematically, this corresponds to testing

$H_0 : \beta_1 = 0$

versus $H_a : \beta_1 \neq 0$ ,

since if $\beta_1 =0$ then the model reduces to $Y = \beta_0+\epsilon$ ,and $X$ is not associated with $Y$ . To test the null hypothesis, we need to determine whether $\hat{\beta_1}$ , our estimate for $\beta_1$ , is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. How far is far enough? This of course depends on the accuracy of $\hat{\beta_1}$ ---that is, it depends on $SE(\hat{\beta}_1)$ . If $SE(\hat{\beta}_1)$ is small, then even relatively small values of $\hat{\beta}_1$ may provide strong evidence that $\beta_1 \neq 0$ , and hence that there is a relationship between $X$ and $Y$ . In contrast, if $SE(\hat{\beta}_1)$ is large, then $\hat{\beta}_1$ must be large in absolute value in order for us to reject the null hypothesis. In practice, we compute a t-statistic, given by

$t = \frac{\hat{\beta}_1 − 0} {SE(\hat{\beta}_1 )}$ ,

which measures the number of standard deviations that $\hat{\beta}_1$ is away from 0. If there really is no relationship between $X$ and $Y$ , then we expect a t-distribution with $n − 2$ degrees of freedom. The t-distribution has a bell shape and for values of $n$ greater than approximately 30 it is quite similar to the standard normal distribution. Consequently, it is a simple matter to compute the probability of observing any number equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$ . We call this probability the *p-value*. Roughly speaking, we interpret the *p-value* as follows: a small *p-value* indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small *p-value*, then we can infer that there is an association between the predictor and the response. We *reject the null hypothesis*---that is, we declare a relationship to exist between $X$ and $Y$ ---if the *p-value* is small enough. Typical *p-value* cutoffs for rejecting the null hypothesis are 5% or 1%. When n = 30, these correspond to t-statistics of around 2 and 2.75, respectively.

\## Assessing the Accuracy of the Model

Once we have rejected the null hypothesis (3.12) in favor of the alternative hypothesis (3.13), it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R2 statistic. R2

Table 3.2 displays the RSE, the R2 statistic, and the F-statistic (to be described in Section 3.2.2) for the linear regression of number of units sold on TV advertising budget.

Residual Standard Error

Recall from the model (3.5) that associated with each observation is an error term ε. Due to the presence of these error terms, even if we knew the true regression line (i.e. even if β0 and β1 were known), we would not be able to perfectly predict Y from X. The RSE is an estimate of the standard deviation of ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

􏰘 1 􏰙􏰚􏰚 1 􏰓n

RSE= n−2RSS=􏰛n−2 (yi −yˆi)2. (3.15)

i=1

Note that RSS was defined in Section 3.1.1, and is given by the formula

􏰓n i=1

In the case of the advertising data, we see from the linear regression output in Table 3.2 that the RSE is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients β0 and β1 were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Of course, whether or not 3,260 units is an acceptable prediction error depends on the problem context. In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3,260/14,000 = 23 %.

The RSE is considered a measure of the lack of fit of the model (3.5) to the data. If the predictions obtained using the model are very close to the true outcome values---that is, if yˆi ≈ yi for i = 1, . . . , n---then (3.15) will be small, and we can conclude that the model fits the data very well. On the other hand, if yˆi is very far from yi for one or more observations, then the RSE may be quite large, indicating that the model doesn't fit the data well.

R2 Statistic

The RSE provides an absolute measure of lack of fit of the model (3.5) to the data. But since it is measured in the units of Y , it is not always clear what constitutes a good RSE. The R2 statistic provides an alternative measure of fit. It takes the form of a proportion---the proportion of variance explained---and so it always takes on a value between 0 and 1, and is independent of the scale of Y .

To calculate R2, we use the formula

R2 = TSS − RSS = 1 − RSS (3.17)

where TSS = 􏰎(yi − y ̄)2 is the total sum of squares, and RSS is defined total sum of in (3.16). TSS measures the total variance in the response Y , and can be squares thought of as the amount of variability inherent in the response before the

regression is performed. In contrast, RSS measures the amount of variability

TSS TSS

that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R2 measures the proportion of variability in Y that can be explained using X. An R2 statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance σ2 is high, or both. In Table 3.2, the R2 was 0.61, and so just under two-thirds of the variability in sales is explained by a linear regression on TV.

The R2 statistic (3.17) has an interpretational advantage over the RSE (3.15), since unlike the RSE, it always lies between 0 and 1. However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (3.5) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic!

The R2 statistic is a measure of the linear relationship between X and Y . Recall that correlation, define􏰎d as

ni=1(xi − x)(yi − y)

Cor(X, Y ) = 􏰔􏰎ni=1(xi − x)2􏰔􏰎ni=1(yi − y)2 , (3.18)

is also a measure of the linear relationship between X and Y .5 This sug- gests that we might be able to use r = Cor(X, Y ) instead of R2 in order to assess the fit of the linear model. In fact, it can be shown that in the simple linear regression setting, R2 = r2. In other words, the squared correlation and the R2 statistic are identical. However, in the next section we will discuss the multiple linear regression problem, in which we use several pre- dictors simultaneously to predict the response. The concept of correlation between the predictors and the response does not extend automatically to this setting, since correlation quantifies the association between a single pair of variables rather than between a larger number of variables. We will see that R2 fills this role.

Multiple Linear Regression

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. For example, in the Advertising data, we have examined the relationship between sales and TV advertising. We also have data for the amount of money spent advertising on the radio and in newspapers, and we may want to know whether either of these two media is associated with sales. How can we extend our analysis of the advertising data in order to accommodate these two additional predictors?

One option is to run three separate simple linear regressions, each of which uses a different advertising medium as a predictor. For instance, we can fit a simple linear regression to predict sales on the basis of the amount spent on radio advertisements. Results are shown in Table 3.3 (top table). We find that a \$1,000 increase in spending on radio advertising is associated with an increase in sales of around 203 units. Table 3.3 (bottom table) contains the least squares coefficients for a simple linear regression of sales onto newspaper advertising budget. A \$1,000 increase in newspaper advertising budget is associated with an increase in sales of approximately 55 units.

However, the approach of fitting a separate simple linear regression model for each predictor is not entirely satisfactory. First of all, it is unclear how to make a single prediction of sales given the three advertising media budgets, since each of the budgets is associated with a separate regression equation. Second, each of the three regression equations ignores the other two media in forming estimates for the regression coefficients. We will see shortly that if the media budgets are correlated with each other in the 200 markets in our data set, then this can lead to very misleading estimates of the association between each media budget and sales.

Instead of fitting a separate simple linear regression model for each pre- dictor, a better approach is to extend the simple linear regression model (3.5) so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model. In general, suppose that we have p distinct predictors. Then the multiple linear regression model takes the form

Y =β0 +β1X1 +β2X2 +···+βpXp +ε, (3.19)

where Xj represents the jth predictor and βj quantifies the association between that variable and the response. We interpret βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed. In the advertising example, (3.19) becomes

sales=β0 +β1 ×TV+β2 ×radio+β3 ×newspaper+ε. (3.20) 3.2.1 Estimating the Regression Coefficients

As was the case in the simple linear regression setting, the regression coef- ficients β0, β1, . . . , βp in (3.19) are unknown, and must be estimated. Given estimates βˆ0, βˆ1, . . . , βˆp, we can make predictions using the formula

yˆ = βˆ0 + βˆ1x1 + βˆ2x2 + ··· + βˆpxp. (3.21)

The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression. We choose β0, β1, . . . , βp to minimize the sum of squared residuals

RSS = =

􏰓n i=1

􏰓n i=1

(yi−yˆi)2

(yi −βˆ0 −βˆ1xi1 −βˆ2xi2 −···−βˆpxip)2.

(3.22)

The values βˆ0 , βˆ1 , . . . , βˆp that minimize (3.22) are the multiple least squares regression coefficient estimates. Unlike the simple linear regression esti- mates given in (3.4), the multiple regression coefficient estimates have somewhat complicated forms that are most easily represented using ma- trix algebra. For this reason, we do not provide them here. Any statistical software package can be used to compute these coefficient estimates, and later in this chapter we will show how this can be done in R. Figure 3.4 illustrates an example of the least squares fit to a toy data set with p = 2 predictors.

Table 3.4 displays the multiple regression coefficient estimates when TV, radio, and newspaper advertising budgets are used to predict product sales using the Advertising data. We interpret these results as follows: for a given amount of TV and newspaper advertising, spending an additional \$1,000 on radio advertising is associated with approximately 189 units of additional sales. Comparing these coefficient estimates to those displayed in Tables 3.1 and 3.3, we notice that the multiple regression coefficient estimates for TV and radio are pretty similar to the simple linear regression coefficient estimates. However, while the newspaper regression coefficient estimate in Table 3.3 was significantly non-zero, the coefficient estimate for newspaper in the multiple regression model is close to zero, and the corresponding p- value is no longer significant, with a value around 0.86. This illustrates that the simple and multiple regression coefficients can be quite different. This difference stems from the fact that in the simple regression case, the slope term represents the average increase in product sales associated with a \$1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio. By contrast, in the multiple regression setting, the coefficient for newspaper represents the average increase in product sales associated with increasing newspaper spending by \$1,000 while holding TV and radio fixed.

Intercept

TV

radio

newspaper

Coefficient 2.939 0.046 0.189 −0.001

Std. error 0.3119 0.0014 0.0086 0.0059

t-statistic 9.42 32.81 21.89 −0.18

p-value \< 0.0001 \< 0.0001 \< 0.0001

0.8599

TABLE 3.4. For the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on TV, radio, and newspaper advertising budgets.

Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable, displayed in Table 3.5. Notice that the correlation between radio and newspaper is 0.35. This indicates that markets with high newspaper advertising tend to also have high ra- dio advertising. Now suppose that the multiple regression is correct and newspaper advertising is not associated with sales, but radio advertising is associated with sales. Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same mar- kets. Hence, in a simple linear regression which only examines sales versus newspaper, we will observe that higher values of newspaper tend to be as- sociated with higher values of sales, even though newspaper advertising is not directly associated with sales. So newspaper advertising is a surrogate for radio advertising; newspaper gets "credit" for the association between radio on sales.

This slightly counterintuitive result is very common in many real life situations. Consider an absurd example to illustrate the point. Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper. Of course no one has (yet) suggested that ice creams should be banned at beaches to reduce shark attacks. In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks. A multiple regression of shark attacks onto ice cream sales and temperature reveals that, as intuition implies, ice cream sales is no longer a significant predictor after adjusting for temperature.

3.2.2 Some Important Questions

When we perform multiple linear regression, we usually are interested in answering a few important questions.

1\. Is at least one of the predictors X1,X2,...,Xp useful in predicting the response?

2\. Do all the predictors help to explain Y , or is only a subset of the predictors useful?

3\. How well does the model fit the data?

4\. Given a set of predictor values, what response value should we predict,

and how accurate is our prediction?

We now address each of these questions in turn.

One: Is There a Relationship Between the Response and Predictors?

Recall that in the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether β1 = 0. In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = ··· = βp = 0. As in the simple linear regression setting, we use a hypothesis test to answer this question. We test the null hypothesis,

H0 :β1 =β2 =···=βp =0

versus the alternative

Ha : at least one βj is non-zero.

This hypothesis test is performed by computing the F-statistic,

TV

radio

newspaper

sales

1.0000 0.3541

1.0000 0.2283

3.2 Multiple Linear Regression 75

0.5762 1.0000

F = (TSS − RSS)/p , (3.23) RSS/(n−p−1)

w􏰎here, as with simple linear regression, TSS = 􏰎(yi − y ̄)2 and RSS = (yi −yˆi)2. If the linear model assumptions are correct, one can show that

E{RSS/(n − p − 1)} = σ2 and that, provided H0 is true,

E{(TSS − RSS)/p} = σ2.

Hence, when there is no relationship between the response and predictors, one would expect the F -statistic to take on a value close to 1. On the other hand, if Ha is true, then E{(TSS − RSS)/p} \> σ2, so we expect F to be greater than 1.

The F-statistic for the multiple linear regression model obtained by re- gressing sales onto radio, TV, and newspaper is shown in Table 3.6. In this example the F-statistic is 570. Since this is far larger than 1, it provides compelling evidence against the null hypothesis H0. In other words, the large F-statistic suggests that at least one of the advertising media must be related to sales. However, what if the F -statistic had been closer to 1? How large does the F-statistic need to be before we can reject H0 and conclude that there is a relationship? It turns out that the answer depends on the values of n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small. When H0 is true and the errors εi have a normal distribution, the F-statistic follows an F-distribution.6 For any given value of n and p, any statistical software package can be used to compute the p-value associated with the F -statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0. For the advertising data, the p-value associated with the F -statistic in Table 3.6 is essentially zero, so we have extremely strong evidence that at least one of the media is associated with increased sales.

In (3.23) we are testing H0 that all the coefficients are zero. Sometimes we want to test that a particular subset of q of the coefficients are zero. This corresponds to a null hypothesis

H0 : βp−q+1 =βp−q+2 =···=βp =0,

where for convenience we have put the variables chosen for omission at the end of the list. In this case we fit a second model that uses all the variables except those last q. Suppose that the residual sum of squares for that model is RSS0. Then the appropriate F-statistic is

F = (RSS0 − RSS)/q . (3.24) RSS/(n−p−1)

Notice that in Table 3.4, for each individual predictor a t-statistic and a p-value were reported. These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors. It turns out that each of these is exactly equivalent7 to the F- test that omits that single variable from the model, leaving all the others in---i.e. q=1 in (3.24). So it reports the partial effect of adding that variable to the model. For instance, as we discussed earlier, these p-values indicate that TV and radio are related to sales, but that there is no evidence that newspaper is associated with sales, when TV and radio are held fixed.

Given these individual p-values for each variable, why do we need to look at the overall F-statistic? After all, it seems likely that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response. However, this logic is flawed, especially when the number of predictors p is large.

For instance, consider an example in which p = 100 and H0 : β1 = β2 = · · · = βp = 0 is true, so no variable is truly associated with the response. In this situation, about 5% of the p-values associated with each variable (of the type shown in Table 3.4) will be below 0.05 by chance. In other words, we expect to see approximately five small p-values even in the absence of any true association between the predictors and the response.8 In fact, it is likely that we will observe at least one p-value below 0.05 by chance! Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship. However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors. Hence, if H0 is true, there is only a 5% chance that the F-statistic will result in a p- value below 0.05, regardless of the number of predictors or the number of observations.

The approach of using an F-statistic to test for any association between the predictors and the response works when p is relatively small, and cer- tainly small compared to n. However, sometimes we have a very large num- ber of variables. If p \> n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F- statistic cannot be used, and neither can most of the other concepts that we have seen so far in this chapter. When p is large, some of the approaches discussed in the next section, such as forward selection, can be used. This high-dimensional setting is discussed in greater detail in Chapter 6.

Two: Deciding on Important Variables

As discussed in the previous section, the first step in a multiple regression analysis is to compute the F-statistic and to examine the associated p- value. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones! We could look at the individual p-values as in Table 3.4, but as discussed (and as further explored in Chapter 13), if p is large we are likely to make some false discoveries.

It is possible that all of the predictors are associated with the response, but it is more often the case that the response is only associated with a subset of the predictors. The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection. The variable selection problem is studied extensively in Chapter 6, and so here we will provide only a brief outline of some classical approaches.

Ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of the predictors. For instance, if p = 2, then we can consider four models: (1) a model contain- ing no variables, (2) a model containing X1 only, (3) a model containing X2 only, and (4) a model containing both X1 and X2. We can then se- lect the best model out of all of the models that we have considered. How do we determine which model is best? Various statistics can be used to judge the quality of a model. These include Mallow's Cp, Akaike informa- tion criterion (AIC), Bayesian information criterion (BIC), and adjusted R2. These are discussed in more detail in Chapter 6. We can also deter- mine which model is best by plotting various model outputs, such as the residuals, in order to search for patterns.

Unfortunately, there are a total of 2p models that contain subsets of p variables. This means that even for moderate p, trying out every possible subsetofthepredictorsisinfeasible.Forinstance,wesawthatifp=2,then there are 22 = 4 models to consider. But if p = 30, then we must consider 230 = 1,073,741,824 models! This is not practical. Therefore, unless p is very small, we cannot consider all 2p models, and instead we need an automated

high- dimensional and efficient approach to choose a smaller set of models to consider. There are three classical approaches for this task:

• Forward selection. We begin with the null model---a model that con- tains an intercept but no predictors. We then fit p simple linear re- gressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.

• Backward selection. We start with all variables in the model, and remove the variable with the largest p-value---that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.

• Mixed selection. This is a combination of forward and backward se- lection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We con- tinue to add variables one-by-one. Of course, as we noted with the Advertising example, the p-values for variables can become larger as new predictors are added to the model. Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We con- tinue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Backward selection cannot be used if p \> n, while forward selection can always be used. Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.

Three: Model Fit

Two of the most common numerical measures of model fit are the RSE and R2, the fraction of variance explained. These quantities are computed and interpreted in the same fashion as for simple linear regression.

Recall that in simple regression, R2 is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals Cor(Y, Yˆ )2, the square of the correlation between the response and the fitted linear model; in fact one property of the fitted linear model is that it maximizes this correlation among all possible linear models.

An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable. As an example, we saw in Table 3.6

forward selection null model

backward selection

mixed selection

3.2 Multiple Linear Regression 79

80 3. Linear Regression

that for the Advertising data, the model that uses all three advertising me- dia to predict sales has an R2 of 0.8972. On the other hand, the model that uses only TV and radio to predict sales has an R2 value of 0.89719. In other words, there is a small increase in R2 if we include newspaper advertising in the model that already contains TV and radio advertising, even though we saw earlier that the p-value for newspaper advertising in Table 3.4 is not significant. It turns out that R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable always results in a decrease in the residual sum of squares on the training data (though not necessarily the testing data). Thus, the R2 statistic, which is also computed on the training data, must increase. The fact that adding newspaper advertising to the model containing only TV and radio advertising leads to just a tiny increase in R2 provides addi- tional evidence that newspaper can be dropped from the model. Essentially, newspaper provides no real improvement in the model fit to the training samples, and its inclusion will likely lead to poor results on independent test samples due to overfitting.

By contrast, the model containing only TV as a predictor had an R2 of 0.61 (Table 3.2). Adding radio to the model leads to a substantial improve- ment in R2. This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertis- ing. We could further quantify this improvement by looking at the p-value for the radio coefficient in a model that contains only TV and radio as predictors.

The model that contains only TV and radio as predictors has an RSE of 1.681, and the model that also contains newspaper as a predictor has an RSE of 1.686 (Table 3.6). In contrast, the model that contains only TV has an RSE of 3.26 (Table 3.2). This corroborates our previous conclusion that a model that uses TV and radio expenditures to predict sales is much more accurate (on the training data) than one that only uses TV spending. Furthermore, given that TV and radio expenditures are used as predictors, there is no point in also using newspaper spending as a predictor in the model. The observant reader may wonder how RSE can increase when newspaper is added to the model given that RSS must decrease. In general RSE is defined as 􏰘 1

RSE = n − p − 1 RSS, (3.25)

which simplifies to (3.15) for a simple linear regression. Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

In addition to looking at the RSE and R2 statistics just discussed, it can be useful to plot the data. Graphical summaries can reveal problems with a model that are not visible from numerical statistics. For example, Figure 3.5 displays a three-dimensional plot of TV and radio versus sales. We see that some observations lie above and some observations lie below the least squares regression plane. In particular, the linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio. It underestimates sales for instances where the budget was split between the two media. This pro- nouncednon-linearpatternsuggestsasynergyorinteractioneffectbetween the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium. In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.

Four: Predictions

Once we have fit the multiple regression model, it is straightforward to apply (3.21) in order to predict the response Y on the basis of a set of values for the predictors X1 , X2 , . . . , Xp . However, there are three sorts of uncertainty associated with this prediction.

1\. The coefficient estimates βˆ0, βˆ1, . . . , βˆp are estimates for β0, β1, . . . , βp. That is, the least squares plane

Yˆ = βˆ 0 + βˆ 1 X 1 + · · · + βˆ p X p

2\.

3\.

is only an estimate for the true population regression plane f(X)=β0 +β1X1 +···+βpXp.

The inaccuracy in the coefficient estimates is related to the reducible error from Chapter 2. We can compute a confidence interval in order to determine how close Yˆ will be to f(X).

Of course, in practice assuming a linear model for f(X) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias. So when we use a linear model, we are in fact estimating the best linear approximation to the true surface. However, here we will ignore this discrepancy, and operate as if the linear model were correct.

Even if we knew f(X)---that is, even if we knew the true values for β0,β1,...,βp---the response value cannot be predicted perfectly because of the random error ε in the model (3.20). In Chapter 2, we referred to this as the irreducible error. How much will Y vary from Yˆ? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

We use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities. For example, given that \$100,000 is spent on TV advertising and \$20,000 is spent on radio advertising in each city, the 95 % confidence interval is \[10,985, 11,528\]. We interpret this to mean that 95 % of intervals of this form will contain the true value of f(X).9 On the other hand, a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city. Given that \$100,000 is spent on TV advertising and \$20,000 is spent on radio advertising in that city the 95% prediction interval is \[7,930, 14,580\]. We interpret this to mean that 95 % of intervals of this form will contain the true value of Y for this city. Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

Other Considerations in the Regression Model

3.3.1 Qualitative Predictors

In our discussion so far, we have assumed that all variables in our linear regression model are quantitative. But in practice, this is not necessarily the case; often some predictors are qualitative.

For example, the Credit data set displayed in Figure 3.6 records variables for a number of credit card holders. The response is balance (average credit card debt for each individual) and there are several quantitative predictors: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating). Each panel of Figure 3.6 is a scatterplot for a pair of variables whose iden- tities are given by the corresponding row and column labels. For example, the scatterplot directly to the right of the word "Balance" depicts balance versus age, while the plot directly to the right of "Age" corresponds to age versus cards. In addition to these quantitative variables, we also have four qualitative variables: own (house ownership), student (student status), status (marital status), and region (East, West or South).

Predictors with Only Two Levels

Suppose that we wish to investigate differences in credit card balance be-

tween those who own a house and those who don't, ignoring the other vari-

ables for the moment. If a qualitative predictor (also known as a factor)

only has two levels, or possible values, then incorporating it into a regres- sionmodelisverysimple.Wesimplycreateanindicatorordummyvariable dummy that takes on two possible numerical values.10 For example, based on the

own variable, we can create a new variable that takes the form

xi = 􏰜1 if ith person owns a house (3.26)

0 if ith person does not own a house,

and use this variable as a predictor in the regression equation. This results

in the model

yi =β0+β1xi+εi =􏰜β0+β1+εi ifithpersonownsahouse (3.27)

β0 + εi if ith person does not.

Now β0 can be interpreted as the average credit card balance among those who do not own, β0 + β1 as the average credit card balance among those who do own their house, and β1 as the average difference in credit card balance between owners and non-owners.

Table 3.7 displays the coefficient estimates and other information asso- ciated with the model (3.27). The average credit card debt for non-owners is estimated to be \$509.80, whereas owners are estimated to carry \$19.73 in additional debt for a total of \$509.80 + \$19.73 = \$529.53. However, we notice that the p-value for the dummy variable is very high. This indicates that there is no statistical evidence of a difference in average credit card balance based on house ownership.

The decision to code owners as 1 and non-owners as 0 in (3.27) is ar- bitrary, and has no effect on the regression fit, but does alter the inter- pretation of the coefficients. If we had coded non-owners as 1 and own- ers as 0, then the estimates for β0 and β1 would have been 529.53 and −19.73, respectively, leading once again to a prediction of credit card debt of \$529.53 − \$19.73 = \$509.80 for non-owners and a prediction of \$529.53 for owners. Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable

xi = 􏰜1 if ith person owns a house

−1 if ith person does not own a house

and use this variable in the regression equation. This results in the model yi =β0+β1xi+εi =􏰜β0+β1+εi ifithpersonownsahouse

β0 − β1 + εi if ith person does not own a house.

Now β0 can be interpreted as the overall average credit card balance (ig- noring the house ownership effect), and β1 is the amount by which house owners and non-owners have credit card balances that are above and below the average, respectively. In this example, the estimate for β0 is \$519.665, halfway between the non-owner and owner averages of \$509.80 and \$529.53. The estimate for β1 is \$9.865, which is half of \$19.73, the average difference between owners and non-owners. It is important to note that the final pre- dictions for the credit balances of owners and non-owners will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.

Qualitative Predictors with More than Two Levels

When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the region variable we create two dummy variables. The first could be

xi1 = 􏰜1 0

and the second could be

xi2 = 􏰜1 0

if ith person is from the South

if ith person is not from the South,

if ith person is from the West

if ith person is not from the West.

(3.28)

(3.29)

Then both of these variables can be used in the regression equation, in

order to obtain the model

⎧⎪⎨⎪⎩β0+β1+εi yi = β0+β1xi1+β2xi2+εi = β0+β2+εi

β0+εi

if ith person is from the South if ith person is from the West if ith person is from the East.

(3.30) Now β0 can be interpreted as the average credit card balance for individuals from the East, β1 can be interpreted as the difference in the average balance between people from the South versus the East, and β2 can be interpreted as the difference in the average balance between those from the West versus the East. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable---East in this example---is

known as the baseline.

From Table 3.8, we see that the estimated balance for the baseline, East,

is \$531.00. It is estimated that those in the South will have \$18.69 less debt than those in the East, and that those in the West will have \$12.50 less debt than those in the East. However, the p-values associated with the coefficient estimates for the two dummy variables are very large, suggesting no statistical evidence of a real difference in average credit card balance between South and East or between West and East.11 Once again, the level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice. However, the coefficients and their p-values do depend on the choice of dummy variable coding. Rather than rely on the individual coefficients, we can use an F -test to test H0 : β1 = β2 = 0; this does not depend on the coding. This F-test has a p-value of 0.96, indicating that we cannot reject the null hypothesis that there is no relationship between balance and region.

Using this dummy variable approach presents no difficulties when in- corporating both quantitative and qualitative predictors. For example, to regress balance on both a quantitative variable such as income and a qual- itative variable such as student, we must simply create a dummy variable for student and then fit a multiple regression model using income and the dummy variable as predictors for credit card balance.

There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations,andaredesignedtomeasureparticularcontrasts.Thistopic is beyond the scope of the book.

3.3.2 Extensions of the Linear Model

The standard linear regression model (3.19) provides interpretable results and works quite well on many real-world problems. However, it makes sev- eral highly restrictive assumptions that are often violated in practice. Two of the most important assumptions state that the relationship between the predictors and response are additive and linear. The additivity assumption means that the association between a predictor Xj and the response Y does not depend on the values of the other predictors. The linearity assumption states that the change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj. In later chapters of this book, we examine a number of sophisticated methods that relax these two assumptions. Here, we briefly examine some common classical approaches for extending the linear model.

Removing the Additive Assumption

In our previous analysis of the Advertising data, we concluded that both TV and radio seem to be associated with sales. The linear models that formed the basis for this conclusion assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. For example, the linear model (3.20) states that the average increase in sales associated with a one-unit increase in TV is always β1, regardless of the amount spent on radio.

However, this simple model may be incorrect. Suppose that spending money on radio advertising actually increases the effectiveness of TV ad- vertising, so that the slope term for TV should increase as radio increases. In this situation, given a fixed budget of \$100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect. Figure 3.5 sug- gests that such an effect may be present in the advertising data. Notice that when levels of either TV or radio are low, then the true sales are lower than predicted by the linear model. But when advertising is split between the two media, then the model tends to underestimate sales.

Consider the standard linear regression model with two variables, Y =β0 +β1X1 +β2X2 +ε.

According to this model, a one-unit increase in X1 is associated with an average increase in Y of β1 units. Notice that the presence of X2 does not alter this statement---that is, regardless of the value of X2, a one- unit increase in X1 is associated with a β1-unit increase in Y . One way of extending this model is to include a third predictor, called an interaction term, which is constructed by computing the product of X1 and X2. This results in the model

Y =β0 +β1X1 +β2X2 +β3X1X2 +ε. (3.31) How does inclusion of this interaction term relax the additive assumption?

Notice that (3.31) can be rewritten as

Y = β0 +(β1 +β3X2)X1 +β2X2 +ε (3.32)

= β0 +β ̃1X1 +β2X2 +ε

where β ̃1 = β1 + β3X2. Since β ̃1 is now a function of X2, the association between X1 and Y is no longer constant: a change in the value of X2 will change the association between X1 and Y . A similar argument shows that a change in the value of X1 changes the association between X2 and Y .

For example, suppose that we are interested in studying the productiv- ity of a factory. We wish to predict the number of units produced on the basis of the number of production lines and the total number of workers. It seems likely that the effect of increasing the number of production lines will depend on the number of workers, since if no workers are available to operate the lines, then increasing the number of lines will not increase production. This suggests that it would be appropriate to include an inter- action term between lines and workers in a linear model to predict units. Suppose that when we fit the model, we obtain

units ≈ 1.2 + 3.4 × lines + 0.22 × workers + 1.4 × (lines × workers) = 1.2 + (3.4 + 1.4 × workers) × lines + 0.22 × workers.

In other words, adding an additional line will increase the number of units produced by 3.4 + 1.4 × workers. Hence the more workers we have, the stronger will be the effect of lines.

We now return to the Advertising example. A linear model that uses radio, TV, and an interaction between the two to predict sales takes the form

sales = β0 + β1 ×TV+ β2 ×radio+ β3 × (radio×TV) + ε

= β0 + (β1 + β3 ×radio) ×TV+ β2 ×radio+ ε. (3.33)

The results in Table 3.9 strongly suggest that the model that includes the interaction term is superior to the model that contains only main effects. The p-value for the interaction term, TV×radio, is extremely low, indicating that there is strong evidence for Ha : β3 ̸= 0. In other words, it is clear that the true relationship is not additive. The R2 for the model (3.33) is 96.8 %, compared to only 89.7 % for the model that predicts sales using TV and radio without an interaction term. This means that (96.8 − 89.7)/(100 − 89.7) = 69 % of the variability in sales that remains after fitting the ad- ditive model has been explained by the interaction term. The coefficient estimates in Table 3.9 suggest that an increase in TV advertising of \$1,000 is associated with increased sales of (βˆ1 +βˆ3 ×radio)×1,000 = 19+1.1×radio units. And an increase in radio advertising of \$1,000 will be associated with an increase in sales of (βˆ2 + βˆ3 × TV) × 1,000 = 29 + 1.1 × TV units.

In this example, the p-values associated with TV, radio, and the interac- tion term all are statistically significant (Table 3.9), and so it is obvious that all three variables should be included in the model. However, it is sometimes the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not. The hier- archical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction be- tween X1 and X2 seems important, then we should include both X1 and X2 in the model even if their coefficient estimates have large p-values. The rationale for this principle is that if X1 × X2 is related to the response, then whether or not the coefficients of X1 or X2 are exactly zero is of lit- tle interest. Also X1 × X2 is typically correlated with X1 and X2, and so leaving them out tends to alter the meaning of the interaction.

In the previous example, we considered an interaction between TV and radio, both of which are quantitative variables. However, the concept of interactions applies just as well to qualitative variables, or to a combination of quantitative and qualitative variables. In fact, an interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation. Consider the Credit data set from Section 3.3.1, and suppose that we wish to predict balance using the income (quantitative) and student (qualitative) variables. In the absence of an interaction term, the model takes the form

balancei

≈ β0 + β1 × incom􏰜ei + 􏰜β2 0 β0 + β2 = β1 ×incomei +

β0

if ith person is a student

if ith person is not a student

if ith person is a student

if ith person is not a student.

Notice that this amounts to fitting two parallel lines to the data, one for students and one for non-students. The lines for students and non-students have different intercepts, β0 + β2 versus β0, but the same slope, β1. This is illustrated in the left-hand panel of Figure 3.7. The fact that the lines are parallel means that the average effect on balance of a one-unit increase in income does not depend on whether or not the individual is a student. This represents a potentially serious limitation of the model, since in fact a change in income may have a very different effect on the credit card balance of a student versus a non-student.

This limitation can be addressed by adding an interaction variable, cre- ated by multiplying income with the dummy variable for student. Our model now becomes

β􏰜 + β × income + 􏰜β2 + β3 × incomei if student balance ≈

i01i

0 if not student

Once again, we have two different regression lines for the students and the non-students. But now those regression lines have different intercepts, β0+β2 versus β0, as well as different slopes, β1+β3 versus β1. This allows for the possibility that changes in income may affect the credit card balances of students and non-students differently. The right-hand panel of Figure 3.7 shows the estimated relationships between income and balance for students and non-students in the model (3.35). We note that the slope for students is lower than the slope for non-students. This suggests that increases in income are associated with smaller increases in credit card balance among students as compared to non-students.

Non-linear Relationships

As discussed previously, the linear regression model (3.19) assumes a linear relationship between the response and predictors. But in some cases, the true relationship between the response and the predictors may be non- linear. Here we present a very simple way to directly extend the linear model to accommodate non-linear relationships, using polynomial regression. In later chapters, we will present more complex approaches for performing non-linear fits in more general settings.

Consider Figure 3.8, in which the mpg (gas mileage in miles per gallon) versus horsepower is shown for a number of cars in the Auto data set. The orange line represents the linear regression fit. There is a pronounced rela- tionship between mpg and horsepower, but it seems clear that this relation- ship is in fact non-linear: the data suggest a curved relationship. A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors. For example, the points in Figure 3.8 seem to have a quadratic shape, suggesting that a model of the form

mpg = β0 + β1 × horsepower + β2 × horsepower2 + ε (3.36)

may provide a better fit. Equation 3.36 involves predicting mpg using a non-linear function of horsepower. But it is still a linear model! That is, (3.36) is simply a multiple linear regression model with X1 = horsepower and X2 = horsepower2 . So we can use standard linear regression software to estimate β0, β1, and β2 in order to produce a non-linear fit. The blue curve in Figure 3.8 shows the resulting quadratic fit to the data. The quadratic fit appears to be substantially better than the fit obtained when just the linear term is included. The R2 of the quadratic fit is 0.688, compared to 0.606 for the linear fit, and the p-value in Table 3.10 for the quadratic term is highly significant.

If including horsepower2 led to such a big improvement in the model, why not include horsepower3 , horsepower4 , or even horsepower5 ? The green curve in Figure 3.8 displays the fit that results from including all polynomials up to fifth degree in the model (3.36). The resulting fit seems unnecessarily wiggly---that is, it is unclear that including the additional terms really has led to a better fit to the data.

The approach that we have just described for extending the linear model to accommodate non-linear relationships is known as polynomial regres- sion, since we have included polynomial functions of the predictors in the regression model. We further explore this approach and other non-linear extensions of the linear model in Chapter 7.

Potential Problems

When we fit a linear regression model to a particular data set, many prob- lems may occur. Most common among these are the following:

1\. Non-linearity of the response-predictor relationships.

2\. Correlation of error terms.

3\. Non-constant variance of error terms. 4. Outliers.

5\. High-leverage points.

6\. Collinearity.

In practice, identifying and overcoming these problems is as much an art as a science. Many pages in countless books have been written on this topic. Since the linear regression model is not our primary focus here, we will provide only a brief summary of some key points.

1\. Non-linearity of the Data

The linear regression model assumes that there is a straight-line rela- tionship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity.

residual plot Given a simple linear regression model, we can plot the residuals, ei =

yi − yˆi, versus the predictor xi. In the case of a multiple regression model, since there are multiple predictors, we instead plot the residuals versus

the predicted (or fitted) values yˆi. Ideally, the residual plot will show no fitted discernible pattern. The presence of a pattern may indicate a problem with

some aspect of the linear model.

The left panel of Figure 3.9 displays a residual plot from the linear re- gression of mpg onto horsepower on the Auto data set that was illustrated in Figure 3.8. The red line is a smooth fit to the residuals, which is displayed in order to make it easier to identify any trends. The residuals exhibit a clear U-shape, which provides a strong indication of non-linearity in the data. In contrast, the right-hand panel of Figure 3.9 displays the residual plot that results from the model (3.36), which contains a quadratic term. There appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data.

If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as logX, √X, and X2, in the regression model. In the later chapters of this book, we will discuss other more advanced non-linear approaches for addressing this issue.

2\. Correlation of Error Terms

An important assumption of the linear regression model is that the error terms, ε1,ε2,...,εn, are uncorrelated. What does this mean? For instance, if the errors are uncorrelated, then the fact that εi is positive provides little or no information about the sign of εi+1. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p- values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

As an extreme example, suppose we accidentally doubled our data, lead- ing to observations and error terms identical in pairs. If we ignored this, our standard error calculations would be as if we had a sample of size 2n, when in fact we have only n samples. Our estimated parameters would be the same for the 2n samples as for the n samples, but the confidence intervals would be narrower by a factor of √2!

Why might correlations among the error terms occur? Such correlations frequently occur in the context of time series data, which consists of ob- servations for which measurements are obtained at discrete points in time.

In many cases, observations that are obtained at adjacent time points will have positively correlated errors. In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pat- tern. On the other hand, if the error terms are positively correlated, then we may see tracking in the residuals---that is, adjacent residuals may have similar values. Figure 3.10 provides an illustration. In the top panel, we see the residuals from a linear regression fit to data generated with uncorre- lated errors. There is no evidence of a time-related trend in the residuals. In contrast, the residuals in the bottom panel are from a data set in which adjacent errors had a correlation of 0.9. Now there is a clear pattern in the residuals---adjacent residuals tend to take on similar values. Finally, the center panel illustrates a more moderate case in which the residuals had a correlation of 0.5. There is still evidence of tracking, but the pattern is less clear.

Many methods have been developed to properly take account of corre- lations in the error terms in time series data. Correlation among the error terms can also occur outside of time series data. For instance, consider a study in which individuals' heights are predicted from their weights. The assumption of uncorrelated errors could be violated if some of the indi- viduals in the study are members of the same family, eat the same diet, or have been exposed to the same environmental factors. In general, the assumption of uncorrelated errors is extremely important for linear regres- sion as well as for other statistical methods, and good experimental design is crucial in order to mitigate the risk of such correlations.

3\. Non-constant Variance of Error Terms

Another important assumption of the linear regression model is that the error terms have a constant variance, Var(εi) = σ2. The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption.

Unfortunately, it is often the case that the variances of the error terms are non-constant. For instance, the variances of the error terms may increase with the value of the response. One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in theresidualplot.Anexampleisshownintheleft-handpanelofFigure3.11, in which the magnitude of the residuals tends to increase with the fitted values. When faced with this problem, one possible solution is√to trans- form the response Y using a concave function such as log Y or Y . Such a transformation results in a greater amount of shrinkage of the larger re- sponses, leading to a reduction in heteroscedasticity. The right-hand panel of Figure 3.11 displays the residual plot after transforming the response using log Y . The residuals now appear to have constant variance, though there is some evidence of a slight non-linear relationship in the data.

Sometimes we have a good idea of the variance of each response. For example, the ith response could be an average of ni raw observations. If each of these raw observations is uncorrelated with variance σ2, then their average has variance σi2 = σ2/ni. In this case a simple remedy is to fit our model by weighted least squares, with weights proportional to the inverse variances---i.e. wi = ni in this case. Most linear regression software allows for observation weights.

4\. Outliers

Y

−4 −2 0 2 4 6

Residuals

−1 0 1 2 3 4

An outlier is a point for which yi is far from the value predicted by the

outlier model. Outliers can arise for a variety of reasons, such as incorrect recording

of an observation during data collection.

The red point (observation 20) in the left-hand panel of Figure 3.12 illustrates a typical outlier. The red solid line is the least squares regression fit, while the blue dashed line is the least squares fit after removal of the outlier. In this case, removing the outlier has little effect on the least squares line: it leads to almost no change in the slope, and a miniscule reduction in the intercept. It is typical for an outlier that does not have an unusual predictor value to have little effect on the least squares fit. However, even if an outlier does not have much effect on the least squares fit, it can cause other problems. For instance, in this example, the RSE is 1.09 when the outlier is included in the regression, but it is only 0.77 when the outlier is removed. Since the RSE is used to compute all confidence intervals and p-values, such a dramatic increase caused by a single data point can have implications for the interpretation of the fit. Similarly, inclusion of the outlier causes the R2 to decline from 0.892 to 0.805.

Residual plots can be used to identify outliers. In this example, the out- lier is clearly visible in the residual plot illustrated in the center panel of Figure 3.12. But in practice, it can be difficult to decide how large a resid- ual needs to be before we consider the point to be an outlier. To address this problem, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in abso- lute value are possible outliers. In the right-hand panel of Figure 3.12, the outlier's studentized residual exceeds 6, while all other observations have studentized residuals between −2 and 2.

If we believe that an outlier has occurred due to an error in data collec- tion or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a

deficiency with the model, such as a missing predictor.

5\. High Leverage Points

studentized residual

We just saw that outliers are observations for which the response yi is unusual given the predictor xi. In contrast, observations with high leverage high have an unusual value for xi. For example, observation 41 in the left-hand

leverage panel of Figure 3.13 has high leverage, in that the predictor value for this observation is large relative to the other observations. (Note that the data displayed in Figure 3.13 are the same as the data displayed in Figure 3.12,

but with the addition of a single high leverage observation.) The red solid line is the least squares fit to the data, while the blue dashed line is the fit produced when observation 41 is removed. Comparing the left-hand panels of Figures 3.12 and 3.13, we observe that removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier. In fact, high leverage observations tend to have a sizable impact on the estimated regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit. For this reason, it is important to identify high leverage observations.

In a simple linear regression, high leverage observations are fairly easy to identify, since we can simply look for observations for which the predictor value is outside of the normal range of the observations. But in a multiple linear regression with many predictors, it is possible to have an observation that is well within the range of each individual predictor's values, but that is unusual in terms of the full set of predictors. An example is shown in the center panel of Figure 3.13, for a data set with two predictors, X1 and X2. Most of the observations' predictor values fall within the blue dashed ellipse, but the red observation is well outside of this range. But neither its value for X1 nor its value for X2 is unusual. So if we examine just X1 or just X2, we will fail to notice this high leverage point. This problem is more pronounced in multiple regression settings with more than two predictors, because then there is no simple way to plot all dimensions of the data simultaneously.

In order to quantify an observation's leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage

statistic

leverage. For a simple linear regression,

1 ( x i − x ̄ ) 2

hi=n+􏰎ni′=1(xi′ −x ̄)2. (3.37)

It is clear from this equation that hi increases with the distance of xi from x ̄. There is a simple extension of hi to the case of multiple predictors, though we do not provide the formula here. The leverage statistic hi is always between 1/n and 1, and the average leverage for all the observations is always equal to (p + 1)/n. So if a given observation has a leverage statistic that greatly exceeds (p+1)/n, then we may suspect that the corresponding point has high leverage.

The right-hand panel of Figure 3.13 provides a plot of the studentized residuals versus hi for the data in the left-hand panel of Figure 3.13. Ob- servation 41 stands out as having a very high leverage statistic as well as a high studentized residual. In other words, it is an outlier as well as a high leverage observation. This is a particularly dangerous combination! This plot also reveals the reason that observation 20 had relatively little effect on the least squares fit in Figure 3.12: it has low leverage.

6\. Collinearity

Collinearityreferstothesituationinwhichtwoormorepredictorvariables are closely related to one another. The concept of collinearity is illustrated in Figure 3.14 using the Credit data set. In the left-hand panel of Fig- ure 3.14, the two predictors limit and age appear to have no obvious rela- tionship. In contrast, in the right-hand panel of Figure 3.14, the predictors limit and rating are very highly correlated with each other, and we say that they are collinear. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the indi- vidual effects of collinear variables on the response. In other words, since limit and rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response, balance.

Figure 3.15 illustrates some of the difficulties that can result from collinear- ity. The left-hand panel of Figure 3.15 is a contour plot of the RSS (3.22) associated with different possible coefficient estimates for the regression of balance on limit and age. Each ellipse represents a set of coefficients that correspond to the same RSS, with ellipses nearest to the center tak- ing on the lowest values of RSS. The black dots and associated dashed lines represent the coefficient estimates that result in the smallest possible RSS---in other words, these are the least squares estimates. The axes for limit and age have been scaled so that the plot includes possible coeffi- cient estimates that are up to four standard errors on either side of the least squares estimates. Thus the plot includes all plausible values for the coefficients. For example, we see that the true limit coefficient is almost certainly somewhere between 0.15 and 0.20.

In contrast, the right-hand panel of Figure 3.15 displays contour plots of the RSS associated with possible coefficient estimates for the regression of balance onto limit and rating, which we know to be highly collinear. Now the contours run along a narrow valley; there is a broad range of values for the coefficient estimates that result in equal values for RSS. Hence a small change in the data could cause the pair of coefficient values that yield the smallest RSS---that is, the least squares estimates---to move anywhere along this valley. This results in a great deal of uncertainty in the coefficient estimates. Notice that the scale for the limit coefficient now runs from roughly −0.2 to 0.2; this is an eight-fold increase over the plausible range of the limit coefficient in the regression with age. Interestingly, even though the limit and rating coefficients now have much more individual uncertainty, they will almost certainly lie somewhere in this contour valley. For example, we would not expect the true value of the limit and rating coefficients to be −0.1 and 1 respectively, even though such a value is plausible for each coefficient individually.

Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for βˆj to grow. Recall that the t-statistic for each predictor is calculated by dividing βˆj by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject H0 : βj = 0. This means that the power of the hypothesis test---the probability of correctly detecting a non-zero coefficient---is reduced by collinearity.

Table 3.11 compares the coefficient estimates obtained from two separate multiple regression models. The first is a regression of balance on age and limit, and the second is a regression of balance on rating and limit. In the first regression, both age and limit are highly significant with very small p- values. In the second, the collinearity between limit and rating has caused the standard error for the limit coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701. In other words, the importance of the limit variable has been masked due to the presence of collinearity.

To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model.

A simple way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value indicates a pair of highly correlated variables, and therefore a collinearity problem in the data. Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinear- ity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multi- collinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of βˆj when fitting the full model divided by the variance of βˆj if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using the formula

multi- collinearity

variance inflation factor

VIF(βˆj) = 1 , 1−R2

Xj \|X−j

where R2 is the R2 from a regression of Xj onto all of the other

Xj\|X−j 2

predictors. If RXj\|X−j is close to one, then collinearity is present, and so

the VIF will be large.

In the Credit data, a regression of balance on age, rating, and limit