Regression/Inference.Rmd at master · shravan-kuchkula/Regression · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
---
title: "Inference"
author: "Shravan Kuchkula"
date: "10/19/2017"
output:
  github_document:
    toc: yes
  html_document:
    keep_md: yes
    theme: cosmo
    toc: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
---

## Foundations of inference
One of the foundational aspects of statistical analysis is inference, or the process of drawing conclusions about a larger population from a sample of data. Although counter intuitive, the standard practice is to attempt to disprove a research claim that is not of interest. For example, to show that one medical treatment is better than another, we can assume that the two treatments lead to equal survival rates only to then be disproved by the data. Additionally, we introduce the idea of a p-value, or the degree of disagreement between the data and the hypothesis. We also dive into confidence intervals, which measure the magnitude of the effect of interest (e.g. how much better one treatment is than another).

### Introduction to ideas of inference
In this chapter, you will investigate how repeated samples taken from a population can vary. It is the variability in samples that allow you to make claims about the population of interest. It is important to remember that the research claims of interest focus on the population while the information available comes only from the sample data.

> What is statistical inference ?
The process of making claims about a population based on information from a sample.

Typically, the data represent only a small portion of a larger group that we would like to summarize.

## Randomized distributions
The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true. For example, for east and west coasts, the cola preference is the same. As a way of summarizing each of the null samples, we calculate 'one statistic' from each sample. Here, the statistic is the difference in the proportion of west coast people who prefer cola and the proportion of east coast people who prefer cola. Where each of the proportion is denoted by p-hat. The difference in p-hat's changes with each sample. It could be zero, positive, negative. The point is, that it keeps changing.

We can build a distribution of the differences in proportions (assuming the null hypothesis) i.e assuming that there is no link b/w location and soda preference, is true.

> Generating a distribution of the statistic from the null population gives information about whether the observed data are inconsistent with the null hypothesis.


## NHANES dataset
Thoughout this chapter, you will use the `NHANES` dataset from the `NHANES` R package. These data are collected by the CDC and can be thought of as a random sample of US residents.

Before moving on to investigate particular variables, you'll have an opportunity to briefly explore the data in this exercise.

```{r message = FALSE, warning= FALSE}
# Load all the libraries
installRequiredPackages <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[,"Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

libs <- c("readr", "dplyr", "tidyr", "ggplot2",
          "magrittr", "markdown", "knitr", "yaml",
          "corrplot", "GGally", "broom", "psych", "car",
          "NHANES")

installRequiredPackages(libs)
```

```{r}
names(NHANES)
```

As you can see there are far too many variables in this dataset.

## Data set description
This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960's. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).

You can get more info about this dataset by using `?NHANES`

## Understanding the NULL DISTRIBUTION

Use R to randomly permute the observations and calculate a difference in `proportions` that could arise from a null distribution. Using the NHANES dataset, let's investigate the relationship between gender and home ownership.

HomeOwn: One of Home, Rent, or Other indicating whether the home of study participant or someone in their family is owned, rented or occupied by some other arrangement.

Gender: Gender (sex) of study participant	coded as male or female

Subset the NHANES dataset to consider only individuals whose home ownership status is either "Own" or "Rent". Save the result to homes.
```{r}
# Subset the data: homes
homes <- NHANES %>%
  select(Gender, HomeOwn) %>%
  filter(HomeOwn %in% c("Own", "Rent"))

nrow(homes)
nrow(subset(homes, Gender == "female"))
nrow(subset(homes, Gender == "male"))
```

Perform a single permutation to evaluate whether home ownership status (i.e. HomeOwn) differs between the "female" and "male" groups:

Let's start simple. First, let's take a permutation of HomeOwn variable and store it in HomeOwn_perm. Here, HomeOwn is the original data and HomeOwn_perm is just 1 permutation of this original data. Remember, we are trying to understand the logic of inference here.

```{r}
homes %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  head(10)
```

Next, let's group them by `Gender`.

```{r}
homes %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(Gender) %>%
  summarise(n = n())
```

Next, group them by gender and this time calculate the proportion of homeowners in the `HomeOwn_perm` column and the original column `HomeOwn`. Remember for homeowners the value of `HomeOwn` (and `HomeOwn_perm`) will be "Own".

In both the original data and in the permuted data, compute the proportion of individuals who own a home. Note that this will calculate proportions for both genders since you've grouped by the Gender variable in the line before it.

```{r}
# Perform one permutation
homes %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(Gender) %>%
  summarize(prop_own_perm = mean(HomeOwn_perm == "Own"),
            prop_own = mean(HomeOwn == "Own"))
```

Now the question was: "Perform a single permutation to evaluate whether home ownership status (i.e. HomeOwn) differs between the "female" and "male" groups"

Take the difference b/w female and male groups:

Using the diff() function, calculate the difference in proportion of home ownership for both prop_own_perm, the permuted data, and prop_own, the original data.

```{r}
homes %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(Gender) %>%
  summarize(prop_own_perm = mean(HomeOwn_perm == "Own"),
            prop_own = mean(HomeOwn == "Own")) %>%
  summarize(diff_perm = diff(prop_own_perm),
            diff_orig = diff(prop_own))
```

> Keep in mind, this is just a single random permutation. Next, you'll look at several permuted differences to see how they compare with the observed difference.

> A side note, this difference in proportion is nothing but our statistic.

Natural variability can be modeled from shuffling observations around to remove any relationship that might exist in the population. This is done with the `rep_sample_n()` function from the `oilabs` package. Within it, you must specify arguments for the data (tbl), the sample size, the number of samples to take (reps), and whether sampling should be done with or without replacement (replace). The output includes a new column, replicate, which indicates the sample number. For example,

```{r}
# Source: https://github.com/OpenIntroOrg/oilabs-r-package/blob/master/R/rep_sample_n.R

#' Repeating sampling.
#'
#' @param tbl tbl of data.
#' @param size The number of rows to select.
#' @param replace Sample with or without replacement?
#' @param reps The number of samples to collect.
#' @return A tbl_df that aggregates all created samples, with the addition of a \code{replicate} column that the tbl_df is also grouped by
#' @export

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1)
{
    n <- nrow(tbl)
    i <- unlist(replicate(reps, sample.int(n, size, replace = replace), simplify = FALSE))

    rep_tbl <- cbind(replicate = rep(1:reps,rep(size,reps)), tbl[i,])

    dplyr::group_by(rep_tbl, replicate)
}
```

Use the above function.
```{r}
homes %>%
  rep_sample_n(size = 5, reps = 3)
```

will return three samples of 5 observations from the homes dataset you created in the last exercise. The first 5 rows will have a value of 1 in the replicate column, the next 5 rows will have a value of 2, and so on. Note that the default value for the replace argument is FALSE.

The rep_sample_n() function is useful here because it adds the replicate column. This ensures that you can keep many different random samples in one table without getting them confused. For example, grouping by the values in the replicate column and using a summarise() call lets you effectively do a calculation on every one of the shuffled datasets that you've made. You can see why this makes our lives easier when we're making more than one or two samples!

In this exercise, you will permute the home ownership variable 10 times. By doing so, you will ensure that there is no relationship between home ownership and gender, so any difference in home ownership proportion for female versus male will be due only to natural variability.

```{r}
# Perform 10 permutations
homeown_perm <- homes %>%
  rep_sample_n(size = nrow(homes), reps = 10) %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(replicate, Gender) %>%
  summarize(prop_own_perm = mean(HomeOwn_perm == "Own"),
            prop_own = mean(HomeOwn == "Own")) %>%
  summarize(diff_perm = diff(prop_own_perm),
            diff_orig = diff(prop_own)) # male - female
```

print differences to the console
```{r}
print(homeown_perm)
```

Using geom_dotplot(), plot the differences in proportions obtained by shuffling the HomeOwn variable. Adjust the size of the dots by including binwidth = .001 in your call to geom_dotplot()

```{r}
# Dotplot of 10 permuted differences in proportions
ggplot(homeown_perm, aes(x = diff_perm)) +
  geom_dotplot(binwidth = .001)
```


By permuting the home ownership variable multiple times, you generate differences in proportions that are consistent with the assumption that the variables are unrelated.

```{r}
# Perform 100 permutations
homeown_perm <- homes %>%
  rep_sample_n(size = nrow(homes), reps = 100) %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(replicate, Gender) %>%
  summarize(prop_own_perm = mean(HomeOwn_perm == "Own"),
            prop_own = mean(HomeOwn == "Own")) %>%
  summarize(diff_perm = diff(prop_own_perm),
            diff_orig = diff(prop_own)) # male - female

# Dotplot of 100 permuted differences in proportions
ggplot(homeown_perm, aes(x = diff_perm)) +
  geom_dotplot(binwidth = .001)
```

### Randomization density
Using 100 repetitions allows you to understand the mechanism of permuting. However, 100 is not enough to observe the full range of likely values for the null differences in proportions.

In this exercise, you'll repeat the process 1000 times to get a sense for the complete distribution of null differences in proportions.

Generate 1000 differences in proportions by shuffling the HomeOwn variable and following the same procedure as before.
```{r}
# Perform 1000 permutations
homeown_perm <- homes %>%
  rep_sample_n(size = nrow(homes), reps = 1000) %>%
  mutate(HomeOwn_perm = sample(HomeOwn)) %>%
  group_by(replicate, Gender) %>%
  summarize(prop_own_perm = mean(HomeOwn_perm == "Own"),
            prop_own = mean(HomeOwn == "Own")) %>%
  summarize(diff_perm = diff(prop_own_perm),
            diff_orig = diff(prop_own)) # male - female

# Density plot of 1000 permuted differences in proportions
ggplot(homeown_perm, aes(x = diff_perm)) +
  geom_density(adjust=2)
```

Recall, that the logic of statistical inference is to compare the observed statistic to the distribution of statistics that come from a Null Distribution. You have now seen how to create the distribution with your own R code. The next question to ask is, how do we use the information in the Null Distribution. Remember that each dot is from a different permutation of the data. The goal is to show that our observed data are not consistent with the differences generated. We want our observed data to be different from the null, so we can claim the alternative research hypothesis is true.

> IS DATA CONSISTENT WITH NULL ?
> HOW EXTREME ARE THE OBSERVED DATA ?

38% of the data are more extreme than the observed data. This suggests that the data are consistent with the permutated distribution. Thus we say that we have no evidence that rates of cola differ by coast.

Do the data come from the population ?

Recall that the observed difference (i.e. the difference in proportions in the homes dataset, shown as the red vertical line) was around -0.0078, which seems to fall below the bulk of the density of shuffled differences. It is important to know, however, whether any of the randomly permuted differences were as extreme as the observed difference.

In this exercise, you'll re-create this dotplot as a density plot and count the number of permuted differences that were to the left of the observed difference.

Using geom_density(), plot the permuted differences. Add a vertical red line with geom_vline() where the observed statistic falls.

```{r}
# Plot permuted differences
ggplot(homeown_perm, aes(x = diff_perm)) +
  geom_density(adjust=2) +
  geom_vline(aes(xintercept = diff_orig),
          col = "red")
```

Count the number of permuted differences that were less than or equal to the observed difference.


```{r}
# Compare permuted differences to observed difference
homeown_perm %>%
  summarize(sum(diff_orig >= diff_perm))
```

220 permuted differences are more extreme than the observed difference. This only represents 22.0% of the null statistics, so you can conclude that the observed difference is consistent with the permuted distribution.

We have learned that our data is consistent with the hypothesis of no difference in home ownership across gender.

In this case, we say that the observed statistic was consistent with the null statistics. That is, 220 of the 1000 permutations were smaller than the original value. There is no evidence that the data are inconsistent with the null hypothesis


## Example 2: Gender discrimination
As the first step of any analysis, you should look at and summarize the data. Categorical variables are often summarized using proportions, and it is always important to understand the denominator of the proportion.

Do you want the proportion of women who were promoted or the proportion of promoted individuals who were women? Here, you want the first of these, so in your R code it's necessary to group_by() the sex variable.

```{r}
# Read in the dataset
disc <- readRDS("/Users/Shravan/Downloads/disc_new.rds")
```

Using `table()` function, summarize the data as a contingency table.

```{r}
table(disc)
```

Summarize the data by using group_by() on the sex variable and finding the proportion who were promoted. Call this variable promoted_prop. Note that with binary variables, the proportion of either value can be found using the mean() function (e.g. mean(variable == "value") )

```{r}
# Find proportion of each sex who were promoted
disc %>%
  group_by(sex) %>%
  summarise(promoted_prop = mean(promote == "promoted"))
```

Okay, so the difference in proportions promoted is almost 0.3.

To help you understand the code used to create the randomization distribution, this exercise will walk you through each step, one at a time.

Remember that when using the pipe notation (%>%) with so-called tidy functions like mutate() and summarize() from the dplyr package, each input is a data frame and each output is also a data frame. To keep the output manageable, we will use only 5 replicates here.

```{r}
# Sample the entire data frame 5 times
disc %>%
  rep_sample_n(size = nrow(disc), reps = 5)
```

Within each of the 5 replicates of the original data, shuffle the promote variable to break any relationship between promotion and gender.

```{r}
# Shuffle the promote variable within replicate
disc %>%
  rep_sample_n(size = nrow(disc), reps = 5) %>%
  mutate(prom_perm = sample(promote))
```

Building on the previous data frame, group by replicate and sex then find the proportion of individuals promoted in each grouped category.

```{r}
# Find the proportion of promoted in each replicate and sex
disc %>%
  rep_sample_n(size = nrow(disc), reps = 5) %>%
  mutate(prom_perm = sample(promote)) %>%
  group_by(replicate, sex) %>%
  summarize(prop_prom_perm = mean(prom_perm == "promoted"),
            prop_prom = mean(promote == "promoted"))
```


Continuing, find the difference in proportion of "promoted" across sex grouped by replicate.

```{r}
# Difference in proportion of promoted across sex grouped by gender
disc %>%
  rep_sample_n(size = nrow(disc), reps = 5) %>%
  mutate(prom_perm = sample(promote)) %>%
  group_by(replicate, sex) %>%
  summarize(prop_prom_perm = mean(prom_perm == "promoted"),
            prop_prom = mean(promote == "promoted"))  %>%
  summarize(diff_perm = diff(prop_prom_perm),
            diff_orig = diff(prop_prom))  # male - female
```

Let's quickly recap what you just did. Using rep_sample_n(), you took 5 repeated samples (i.e. Replications) of the disc data, then shuffled these using sample() to break any links between gender and getting promoted. Then for each replication, you calculated the proportions of promoted males and females in the dataset along with the difference in proportions.

### Randomizing gender discrimination

In this exercise, you'll create a randomization distribution of the null statistic with 1000 replicates as opposed to just 5 in the previous exercise. As a reminder, the statistic of interest is the difference in proportions promoted between genders (i.e. proportion for males minus proportion for females).

```{r}
# Create a data frame of differences in promotion rates
disc_perm <- disc %>%
  rep_sample_n(size = nrow(disc), reps = 1000) %>%
  mutate(prom_perm = sample(promote)) %>%
  group_by(replicate, sex) %>%
  summarize(prop_prom_perm = mean(prom_perm == "promoted"),
            prop_prom = mean(promote == "promoted")) %>%
  summarize(diff_perm = diff(prop_prom_perm),
            diff_orig = diff(prop_prom))  # male - female


# Histogram of permuted differences
ggplot(disc_perm, aes(x = diff_perm)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = diff_orig), col = "red")
```

> Remember the main point: We are interested in whether observed statistic is different from values obtained by shuffling.

One way to measure how far the observed statistic is from the null values is to calculate the `quantiles` of the null statistics. After we have generated a 100 different shuffles of the original data, we can see that the 5% quantile is -0.292, that is, 5% of the observations are at -0.292 or below.

> quantile : The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
Syntax : quantile(x, probs)
x : numeric vector whose sample quantiles are wanted
probs : numeric vector of probabilities with values in [0,1]

Similarly, the 95% quantile is 0.208, that is, 95% of the NULL observations are at 0.208 or below. Meaning that our observed statistic of 0.29 is larger than 95% of the NULL statistics (or observations). Using R, we can get the same quantile measurement that allow the comparison of the NULL statistics with the observed statistic. Given the earlier simulations, we can see that 95% of the observations are 0.208 or lower.

```{r}
disc_perm %>%
  summarize(q.05 = quantile(diff_perm, p = 0.05),
            q.95 = quantile(diff_perm, p = 0.95))
```

These simulations give further evidence that the observed statistic is consistent with the bulk of the null statistics.

Often the quantiles describing the tails of the NULL distribution determine what is called the `critical region`. The critical region determines which observed statistics are consistent with the NULL distribution.

## Why 0.05 ?

In the previous exercises you determined the quantiles of the Null distribution. The cut-off values allow you to know how big or how small the observed statistic should be, in order to reject the null hypothesis. The choice of 0.05 for the quantile cut-off is somewhat arbitrary. However, the reasons to use it are historical.

## Effect of sample size

Notice that the observed difference of 0.2917 is in the extreme right tail of the permuted differences. If the sample was ten times larger but the sample statistic was exactly the same (i.e. 0.2917), how would the distribution of permuted differences change ?

Answer: The statistic of 0.2917 would be much farther to the right of the permuted differences (completely off of the distribution)

The observed difference is consistent with differences you would see by chance if the sample size was small. The observed difference would virtually never be observed by chance if the sample size was big.


## p-value
Remember, our interest is how consistent the observed data are with the data taken from a population where gender and promotion are de-linked. By permuting the data repeatedly we can quantify how likely the observed data are in a situation where the null hypothesis is true. The null hypothesis is that promotion rates in the population do not differ for men and women.

> p-value is the probability of observing data as or more extreme than what we actually got given the null hypothesis is true.

In this instance, the p-value is the probability of observing a difference of 0.2917 or greater when promotion rates do not vary across gender. The p-value is 0.504


```{r}
# Calculate the p-value for the original dataset
disc_perm %>%
  summarize(mean(diff_orig <= diff_perm))
```