---
title: "Statistical foundations exercises"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "September 2, 2017"
output:
  html_document:
    fig_height: 5
    fig_width: 7
  pdf_document:
    fig_height: 5
    fig_width: 7
  word_document:
    fig_height: 3
    fig_width: 5
---

```{r, setup, include=FALSE}
library(mdsr)   # Load additional packages here 


# Some customization.  You can alter or delete as desired (if you know what you are doing).
trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice
knitr::opts_chunk$set(
  tidy=FALSE,     # display code as typed
  size="small")   # slightly smaller font for code
```

## Introduction
These exercises are taken from the statistical foundations chapter from **Modern Data Science with R**: http://mdsr-book.github.io.  Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.


## Gestation
Calculate and interpret a 95% confidence interval for the mean age of mothers from the
classic `Gestation` data set from the `mosaicData` package.

SOLUTION:

```{r}
library(mdsr)   
glimpse(Gestation)
# solution goes here
```


## More gestation
Use the bootstrap to generate and interpret a 95% confidence interval for the median age of mothers
for the classic `Gestation` data set from the `mosaicData` package.

SOLUTION:

```{r}
library(mdsr)   
# solution goes here
```


## Even more gestation
Use the bootstrap to generate a 95% confidence interval for the regression parameters in
a model for weight as a function of age for the `Gestation` data frame from the `mosaicData`
package.

SOLUTION:

```{r}
library(mdsr)   
# solution goes here
```


## Confidence intervals
We saw that a 95% confidence interval for a mean was constructed by taking the estimate and adding and subtracting
two standard deviations.  How many standard deviations should be used if a 99% confidence interval
is desired?  (Hint: see `xqnorm()`.)

SOLUTION:

```{r}
library(mdsr)   
# solution goes here
```


## Twins
In 2010, the Minnesota Twins played their first season at Target Field. However, up
through 2009, the Twins played at the Metrodome (an indoor stadium). In the Metrodome, air ventilator fans are used both to keep the roof up and to ventilate the stadium. Typically, the air is  blown from all directions into the center of the stadium.

According to a retired supervisor in the Metrodome, in the late innings
of some games the fans would be modified so that the ventilation
air would blow out from home plate toward the outfield. The idea is that the
air flow might increase the length of a fly ball. To see if manipulating
the fans could possibly make any difference, a group of students at the
University of Minnesota and their professor built a 'cannon' that used
compressed air to shoot baseballs. They then did the following experiment.


1. Shoot balls at angles around 50 degrees with velocity of around 150 feet per second.
2. Shoot balls under two different settings: headwind (air blowing from outfield toward
home plate) or tailwind (air blowing from home plate toward outfield).
3. Record other variables: weight of the ball (in grams), diameter of the ball (in cm), and
distance of the ball's flight (in feet).


Background: People who know little or nothing about baseball might find these basic facts useful.  The batter stands near "home plate" and tries to hit the ball toward the outfield.  A "fly ball" refers to a ball that is hit into the air.  It is desirable to hit
the ball as far as possible.  For reasons of basic physics, the distance is maximized when the ball is hit at an intermediate angle steeper than 45 degrees from the horizontal.

The variables are described in the following table.

```
Cond: the wind conditions, a categorical variable with levels Headwind, Tailwind
Angle: the angle of ball's trajectory
Velocity: velocity of ball in feet per second
BallWt: weight of ball in grams
BallDia: diameter of ball in inches
Dist: distance in feet of the flight of the ball
```

Here is the output of several models.
```
> lm1 <- lm(Dist ~ Cond, data=ds)  # FIRST MODEL
```
```
> summary(lm1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  350.768      2.179 160.967   <2e-16 ***
CondTail       5.865      3.281   1.788   0.0833 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.499 on 32 degrees of freedom
Multiple R-squared: 0.0908,     Adjusted R-squared: 0.06239
F-statistic: 3.196 on 1 and 32 DF,  p-value: 0.0833
```

```
> confint(lm1)
                2.5 %    97.5 %
(Intercept) 346.32966 355.20718
CondTail     -0.81784  12.54766
```
```
> # SECOND MODEL
> lm2 <- lm(Dist ~ Cond + Velocity + Angle + BallWt + BallDia, data=ds)
> summary(lm2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 181.7443   335.6959   0.541  0.59252
CondTail      7.6705     2.4593   3.119  0.00418 **
Velocity      1.7284     0.5433   3.181  0.00357 **
Angle        -1.6014     1.7995  -0.890  0.38110
BallWt       -3.9862     2.6697  -1.493  0.14659
BallDia     190.3715    62.5115   3.045  0.00502 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-squared: 0.5917,     Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF,  p-value: 7.81e-05
```

```
> confint(lm2)
                   2.5 %     97.5 %
(Intercept) -505.8974691 869.386165
CondTail       2.6328174  12.708166
Velocity       0.6155279   2.841188
Angle         -5.2874318   2.084713
BallWt        -9.4549432   1.482457
BallDia       62.3224999 318.420536
```

Consider the results from the model of `Dist` as a function
of `Cond` (first model).
Briefly summarize what this model says about the relationship between the wind conditions and the distance travelled by the ball.  Make sure to say something sensible about the strength of evidence that there is any relationship at all.

SOLUTION:


## More Twins
Briefly summarize the model that has `Dist` as the response variable and includes the other variables as explanatory variables (second model) by reporting and interpretating the `CondTail` parameter.
This second model suggests a somewhat different result for the
relationship between `Dist` and `Cond`  Summarize the differences and explain in statistical terms why the inclusion of the other explanatory variables has affected the results.

SOLUTION:


## Smoking and mortality
The `Whickham` data set in the `mosaicData` package includes data on age, smoking, and mortality from a one-in-six survey of the electoral roll in Whickham, a mixed urban and rural district near Newcastle upon Tyne, in the United Kingdom. The survey was conducted in 1972-1974 to study heart disease and thyroid disease. A follow-up on those in the survey was conducted twenty years later.
Describe the association between smoking status and mortality in this study.  Be sure to consider the role of age as a possible confounding factor.

SOLUTION:

```{r}
library(mdsr)   
Whickham <- mutate(Whickham,
  agegrp = cut(age, breaks=c(1, 44, 64, 100),
  labels=c("18-44", "45-64", "65+")))
glimpse(Whickham)
# solution goes here
```


## Missing income
A data scientist working for a company that sells mortgages for new home purchases might be interested in determining what factors might be predictive of defaulting on the loan.  Some of the mortgagees have missing income
in their data set.  Would it be reasonable for the analyst to drop these loans from their analytic data set?  Explain.

SOLUTION:

## Missing data and NHANES
The `NHANES` data set in the `NHANES` package includes survey data collected by the U.S. National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s.
An investigator is interested in fitting a model to predict the probability that a female subject will have a diagnosis
of diabetes.  Predictors for this model include age and BMI.  Imagine that only 1/10 of the data are available but that these data
are sampled randomly from the full set of observations (this mechanism is called "Missing Completely at Random", or MCAR).  What implications will this sampling have on the results?

SOLUTION:

```{r}
library(mdsr)   
library(NHANES)
# solution goes here
```


## Missing data and NHANES 2
Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that
missingness depends on age, with older subjects less likely to be observed than younger subjects.  
(this mechanism is called "Covariate Dependent Missingness", or CDM).  What implications will this sampling have on the results?


SOLUTION:


## Missing data and NHANES 3
Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that
missingness depends on diabetes status
(this mechanism is called ``Non-Ignorable Non-Response", or NINR).  What implications will this sampling have on the results?

SOLUTION: