---
title: "Statistical foundations exercises"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "September 2, 2017"
output:
  html_document:
    fig_height: 5
    fig_width: 7
  pdf_document:
    fig_height: 5
    fig_width: 7
  word_document:
    fig_height: 3
    fig_width: 5
---

```{r, setup, include=FALSE}
library(mdsr)   # Load additional packages here

# Some customization.  You can alter or delete as desired (if you know what you are doing).
trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice
knitr::opts_chunk$set(
  tidy=FALSE,     # display code as typed
  size="small")   # slightly smaller font for code
```

## Introduction

These exercises are taken from the statistical foundations chapter from **Modern Data Science with R**: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.

## Gestation

Calculate and interpret a 95% confidence interval for the mean age of mothers from the classic `Gestation` data set from the `mosaicData` package.

SOLUTION:

```{r}
library(mdsr)
glimpse(Gestation)
# solution goes here
```

## More gestation

Use the bootstrap to generate and interpret a 95% confidence interval for the median age of mothers for the classic `Gestation` data set from the `mosaicData` package.

SOLUTION:

```{r}
library(mdsr)
# solution goes here
```

## Even more gestation

Use the bootstrap to generate a 95% confidence interval for the regression parameters in a model for weight as a function of age for the `Gestation` data frame from the `mosaicData` package.

SOLUTION:

```{r}
library(mdsr)
# solution goes here
```

## Confidence intervals

We saw that a 95% confidence interval for a mean was constructed by taking the estimate and adding and subtracting two standard deviations. How many standard deviations should be used if a 99% confidence interval is desired? (Hint: see `xqnorm()`.)

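As a reminder of where the multiplier comes from, a normal-based interval can be written in the general form below. This is a sketch that assumes the sampling distribution of the mean is approximately normal. The symbols $\alpha$, $z_{1-\alpha/2}$, $\Phi$, and $\widehat{\mathrm{SE}}$ are notation introduced here for illustration, not taken from the exercise.

$$
\bar{x} \;\pm\; z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\bar{x}),
\qquad
z_{1-\alpha/2} = \Phi^{-1}\!\left(1 - \tfrac{\alpha}{2}\right)
$$

For a 95% interval, $\alpha = 0.05$ and $z_{0.975} \approx 1.96$, which is the "two" referred to above. The hint points to the function that evaluates the normal quantile $\Phi^{-1}$ for other confidence levels.
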
SOLUTION:

```{r}
library(mdsr)
# solution goes here
```

## Twins

In 2010, the Minnesota Twins played their first season at Target Field. However, up through 2009, the Twins played at the Metrodome (an indoor stadium). In the Metrodome, air ventilator fans are used both to keep the roof up and to ventilate the stadium. Typically, the air is blown from all directions into the center of the stadium.

According to a retired supervisor in the Metrodome, in the late innings of some games the fans would be modified so that the ventilation air would blow out from home plate toward the outfield. The idea is that the air flow might increase the length of a fly ball. To see if manipulating the fans could possibly make any difference, a group of students at the University of Minnesota and their professor built a 'cannon' that used compressed air to shoot baseballs. They then did the following experiment.

1. Shoot balls at angles around 50 degrees with velocity of around 150 feet per second.
2. Shoot balls under two different settings: headwind (air blowing from outfield toward home plate) or tailwind (air blowing from home plate toward outfield).
3. Record other variables: weight of the ball (in grams), diameter of the ball (in cm), and distance of the ball's flight (in feet).

Background: People who know little or nothing about baseball might find these basic facts useful. The batter stands near "home plate" and tries to hit the ball toward the outfield. A "fly ball" refers to a ball that is hit into the air. It is desirable to hit the ball as far as possible. For reasons of basic physics, the distance is maximized when the ball is hit at an intermediate angle steeper than 45 degrees from the horizontal.

The variables are described in the following table.

```
Cond: the wind conditions, a categorical variable with levels Headwind, Tailwind
Angle: the angle of ball's trajectory
Velocity: velocity of ball in feet per second
BallWt: weight of ball in grams
BallDia: diameter of ball in inches
Dist: distance in feet of the flight of the ball
```

Here is the output of several models.

```
> lm1 <- lm(Dist ~ Cond, data=ds)  # FIRST MODEL
```

```
> summary(lm1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  350.768      2.179 160.967   <2e-16 ***
CondTail       5.865      3.281   1.788   0.0833 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.499 on 32 degrees of freedom
Multiple R-squared: 0.0908,  Adjusted R-squared: 0.06239
F-statistic: 3.196 on 1 and 32 DF,  p-value: 0.0833
```

```
> confint(lm1)
                2.5 %    97.5 %
(Intercept) 346.32966 355.20718
CondTail     -0.81784  12.54766
```

```
> # SECOND MODEL
> lm2 <- lm(Dist ~ Cond + Velocity + Angle + BallWt + BallDia, data=ds)
> summary(lm2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 181.7443   335.6959   0.541  0.59252
CondTail      7.6705     2.4593   3.119  0.00418 **
Velocity      1.7284     0.5433   3.181  0.00357 **
Angle        -1.6014     1.7995  -0.890  0.38110
BallWt       -3.9862     2.6697  -1.493  0.14659
BallDia     190.3715    62.5115   3.045  0.00502 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-squared: 0.5917,  Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF,  p-value: 7.81e-05
```

```
> confint(lm2)
                   2.5 %     97.5 %
(Intercept) -505.8974691 869.386165
CondTail       2.6328174  12.708166
Velocity       0.6155279   2.841188
Angle         -5.2874318   2.084713
BallWt        -9.4549432   1.482457
BallDia       62.3224999 318.420536
```

Consider the results from the model of `Dist` as a function of `Cond` (first model). Briefly summarize what this model says about the relationship between the wind conditions and the distance travelled by the ball.
Make sure to say something sensible about the strength of evidence that there is any relationship at all.

SOLUTION:

## More Twins

Briefly summarize the model that has `Dist` as the response variable and includes the other variables as explanatory variables (second model) by reporting and interpreting the `CondTail` parameter. This second model suggests a somewhat different result for the relationship between `Dist` and `Cond`. Summarize the differences and explain in statistical terms why the inclusion of the other explanatory variables has affected the results.

SOLUTION:

## Smoking and mortality

The `Whickham` data set in the `mosaicData` package includes data on age, smoking, and mortality from a one-in-six survey of the electoral roll in Whickham, a mixed urban and rural district near Newcastle upon Tyne, in the United Kingdom. The survey was conducted in 1972-1974 to study heart disease and thyroid disease. A follow-up on those in the survey was conducted twenty years later.

Describe the association between smoking status and mortality in this study. Be sure to consider the role of age as a possible confounding factor.

SOLUTION:

```{r}
library(mdsr)
Whickham <- mutate(Whickham, agegrp = cut(age, breaks=c(1, 44, 64, 100), labels=c("18-44", "45-64", "65+")))
glimpse(Whickham)
# solution goes here
```

## Missing income

A data scientist working for a company that sells mortgages for new home purchases might be interested in determining what factors might be predictive of defaulting on the loan. Some of the mortgagees have missing income in their data set. Would it be reasonable for the analyst to drop these loans from their analytic data set? Explain.

SOLUTION:

## Missing data and NHANES

The `NHANES` data set in the `NHANES` package includes survey data collected by the U.S. National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s.
An investigator is interested in fitting a model to predict the probability that a female subject will have a diagnosis of diabetes. Predictors for this model include age and BMI. Imagine that only 1/10 of the data are available but that these data are sampled randomly from the full set of observations (this mechanism is called "Missing Completely at Random", or MCAR). What implications will this sampling have on the results?

SOLUTION:

```{r}
library(mdsr)
library(NHANES)
# solution goes here
```

## Missing data and NHANES 2

Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on age, with older subjects less likely to be observed than younger subjects (this mechanism is called "Covariate Dependent Missingness", or CDM). What implications will this sampling have on the results?

SOLUTION:

## Missing data and NHANES 3

Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on diabetes status (this mechanism is called "Non-Ignorable Non-Response", or NINR). What implications will this sampling have on the results?

SOLUTION:
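
## Notation for the missingness mechanisms

For reference across the three NHANES exercises, the mechanisms can be written compactly as follows. This is a sketch in notation introduced here, not taken from the exercises: $R = 1$ indicates that a subject is observed, $Y$ is diabetes status, $X$ collects the predictors (age and BMI), and $\pi$ is a hypothetical name for the probability of being observed.

$$
\begin{aligned}
\text{MCAR:} \quad & P(R = 1 \mid Y, X) = \pi \\
\text{CDM:} \quad & P(R = 1 \mid Y, X) = \pi(\text{age}) \\
\text{NINR:} \quad & P(R = 1 \mid Y, X) = \pi(Y)
\end{aligned}
$$

In the MCAR exercise the constant $\pi$ is 1/10. In the other two, the probability of being observed varies with age or with diabetes status, so that about 1/10 of the data are available overall.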