This document contains R versions of the boxed examples from Chapter 9 of the “Analysis and Interpretation of Freshwater Fisheries Data” book. Some sections build on descriptions from previous sections, so each section may not stand completely on its own. More thorough discussions of the following items are available in linked vignettes:

The following additional packages are required to complete all of the examples (with the required functions noted as a comment and also noted in the specific examples below).

> library(FSA)           # Summarize, fitPlot, addSigLetters, residPlot
> library(NCStats)       # chisqPostHoc
> library(lattice)       # bwplot, xyplot
> library(multcomp)      # glht, mcp
> library(nlme)          # lme
> library(pgirmess)      # kruskalmc
> library(TeachingDemos) # chisq.detail

In addition, external tab-delimited text files are used to hold the data required for each example. These data are loaded into R in each example with read.table(). Before using read.table() the working directory of R must be set to where these files are located on your computer. The working directory for all data files on my computer is set with

> setwd("C:/aaaWork/Web/fishR/BookVignettes/aiffd2007")

In addition, I prefer to not show significance stars for hypothesis test output, reduce the margins on plots, alter the axis label positions, and reduce the axis tick length. In addition, contrasts are set in such a manner as to force R output to match SAS output for linear model summaries. All of these options are set with

> options(width=90,continue=" ",show.signif.stars=FALSE,
contrasts=c("contr.sum","contr.poly"))
> par(mar=c(3.5,3.5,1,1),mgp=c(2.1,0.4,0),tcl=-0.2)

## 9.1 Testing for Differences in Mean Length by Means of Analysis of Variance (ANOVA)

### 9.1.1 Preparing Data

The Box9_1.txt data file is read and the structure is observed below.

> d1 <- read.table("data/Box9_1.txt",header=TRUE)
> str(d1)
'data.frame':   360 obs. of  2 variables:
$lake : Factor w/ 3 levels "Island","Mitchell",..: 1 1 1 1 1 1 1 1 1 1 ...$ length: int  122 126 129 130 130 132 132 135 135 136 ...

### 9.1.2 ANOVA Results

The one-way ANOVA model is fit in R with lm() where the first argument is a formula of the form response~factor and the data= argument set equal to the data frame containing the variables. The ANOVA table with type I SS is then extracted from the lm object with anova().

> lm1 <- lm(length~lake,data=d1)
> anova(lm1)
           Df Sum Sq Mean Sq F value    Pr(>F)
lake        2  49324 24662.1   81.73 < 2.2e-16
Residuals 357 107725   301.7
Total     359 157049                          

### 9.1.3 Multiple Comparisons – Tukey Method

The Tukey multiple comparisons results are obtained by submitting the lm object as the first argument to glht(), from the multcomp package. This function requires a second argument that indicates which type of multiple comparison procedure to use. This second argument uses mcp() which requires the factor variable set equal to the word “Tukey” to perform the Tukey multiple comparison procedure. The saved glht object is submitted to summary() to get the difference in means with a corresponding hypothesis test p-value among each pair of groups and to confint() to get the corresponding confidence intervals for the difference in means. In addition, submitting the saved glht object to cld() will produce “significance letters” to indicate which means are different (different letters mean different means).

> mc1 <- glht(lm1,mcp(lake="Tukey"))
> summary(mc1)
                         Estimate Std. Error t value Pr(>|t|)
Mitchell - Island == 0     28.315      2.371  11.944   <1e-04
Thompson - Island == 0      6.105      2.232   2.735   0.0179
Thompson - Mitchell == 0  -22.210      2.191 -10.138   <1e-04
> confint(mc1)
                         Estimate lwr      upr
Mitchell - Island == 0    28.3151  22.7375  33.8926
Thompson - Island == 0     6.1052   0.8536  11.3569
Thompson - Mitchell == 0 -22.2098 -27.3642 -17.0555
> cld(mc1)
  Island Mitchell Thompson
"a"      "c"      "b" 

A graphic of the model results is obtained with fitPlot(), from the FSA package, and the significance letters are placed on the means plot with addSigLetters(). (addSigLetters() is from the NCStats package. You should examine the help for this function to see what each of the arguments is used for).

> fitPlot(lm1,ylab="Total Length (mm)",xlab="Lake",main="")
> addSigLetters(lm1,lets=c("c","a","b"),pos=c(2,2,4))

### 9.1.4 Summary Table

The summary table shown at the bottom of thebox (I would have preferred doing this at the beginning) is obtained with Summarize() from the FSA package.

> Summarize(length~lake,data=d1,digits=2)
      lake   n   mean    sd min  Q1 median    Q3 max percZero
1   Island 104 159.43 15.89 122 149    160 172.2 195        0
2 Mitchell 111 187.75 13.71 145 177    192 197.5 218        0
3 Thompson 145 165.54 20.59 123 150    165 180.0 216        0

## 9.2 Testing for Differences among Length-Frequency Distributions by Means of the Kolmogorov-Smirnov Two-Sample Test

### 9.2.1 Preparing Data

The Box9_1.txt data file used here is the same file used in Box 9.1 and is not re-read here. However, as the Kolmogorov-Smirnov method described in thebox is a two-sample method. Thus, three new data frames, each of which contains only one of the lakes, must be constructed. This is most easily accomplished with Subset() (from the FSA package) which requires the original data frame as the first argument and a conditioning statement as the second argument.

> d1I <- Subset(d1,lake=="Island")     # only Island
> d1M <- Subset(d1,lake=="Mitchell")   # only Mitchell
> d1T <- Subset(d1,lake=="Thompson")   # only Thompson

### 9.2.2 Kolmogorov-Smirnov Tests

The Kolmogorov-Smirnov Test is performed in R with ks.test(). This function requires the quantitative variable from one “group” (i.e., lake) as the first argument and the quantitative variable from the second “group” as the second argument. The Komogorov-Smirnov results in the same order as presented in thebox are shown below. You will notice that R gives a warning about computing p-values because the Kolmogorov-Smirnov Test is used to compare two continuous distributions in which it would theoretically be impossible to have tied values. The discrete nature of length measurements violates this assumption.

> ks.test(d1M$length,d1T$length)
Warning in ks.test(d1M$length, d1T$length): p-value will be approximate in the
presence of ties
Two-sample Kolmogorov-Smirnov test with d1M$length and d1T$length
D = 0.5306, p-value = 8.882e-16
alternative hypothesis: two-sided 
> ks.test(d1I$length,d1T$length)
Warning in ks.test(d1I$length, d1T$length): p-value will be approximate in the
presence of ties
Two-sample Kolmogorov-Smirnov test with d1I$length and d1T$length
D = 0.1565, p-value = 0.1029
alternative hypothesis: two-sided 
> ks.test(d1I$length,d1M$length)
Warning in ks.test(d1I$length, d1M$length): p-value will be approximate in the
presence of ties
Two-sample Kolmogorov-Smirnov test with d1I$length and d1M$length
D = 0.6498, p-value < 2.2e-16
alternative hypothesis: two-sided 

### 9.2.3 Length Frequency Histograms

Given that the test above is attempting to compare the distribution of lengths among the three lakes it would be a good idea to look at these distributions. Histograms for each lake can be easily constructed with a formula in hist() as illustrated below.

> hist(length~lake,data=d1,xlab="Total Length (mm)")

Alternatively, one can look at the empirical cumulative distribution functions for each lake superimposed upon each other. The ecdf() function is used to find the empirical cumulative distribution function and add=TRUE is used to superimpose a subsequent plot on a previous plot.

> plot(ecdf(d1I$length),xlim=c(100,240),verticals=TRUE,pch=".",main="", xlab="Total Length (mm)",lwd=2) > plot(ecdf(d1M$length),col="blue",verticals=TRUE,pch=".",lwd=2,add=TRUE)
> plot(ecdf(d1T$length),col="red",verticals=TRUE,pch=".",lwd=2,add=TRUE) > legend(100,0.99,legend=c("Island","Mitchell","Thopmson"),col=c("black","blue","red"), pch=".",lwd=2,lty=1,cex=0.75) ## 9.3 Testing for Differences among Length-Frequency Distributions by Means of the Kruskal-Wallis Test ### 9.3.1 Preparing Data The Box9_1.txt data file used here is the same file used in Box 9.1 and is not re-read here. ### 9.3.2 Kruskal-Wallis Test The Kruskal-Wallis Test is performed in R with kruskal.test(). This function requires the response variable as the first argument and the grouping factor variable as the second argument. > kruskal.test(d1$length,d1$lake) Kruskal-Wallis rank sum test with d1$length and d1$lake Kruskal-Wallis chi-squared = 118.5671, df = 2, p-value < 2.2e-16 Histograms for each lake should be constructed as shown in Box 9.2. ## 9.4 Performing Multiple Comparisons of Length-Frequency Data As a continuation of Box 9.3, multiple comparisons following the significant Kruskal-Wallis test can be easily computed with kruskalmc(), from the pgirmess package. This function requires the response variable as the first argument and the factor variable as the second argument. > kruskalmc(d1$length,d1$lake) Multiple comparison test after Kruskal-Wallis p.value: 0.05 Comparisons obs.dif critical.dif difference Island-Mitchell 143.47375 33.99975 TRUE Island-Thompson 28.70153 32.01354 FALSE Mitchell-Thompson 114.77223 31.42022 TRUE ## 9.5 Using Contingency Tables to Test for Differences in Length-Frequency Distributions ### 9.5.1 Preparing Data As the data are presented in summarized form, it is easiest to just enter them directly into an R matrix. Doing this requires matrix() with a list of the values as the first argument, the nrow= argument to indicate the number of rows in the matrix, and the byrow=TRUE argument to tell R that the values should be placed in the matrix by rows and then by columns. The columns and rows can be named with colnames() and rownames() respectively. > d5 <- matrix(c(85,77,44,124,34,251),nrow=3,byrow=TRUE) > colnames(d5) <- c("Q","S-Q") > rownames(d5) <- c("1996","1997","1998") > d5  Q S-Q 1996 85 77 1997 44 124 1998 34 251 ### 9.5.2 Chi-Square Test I Chi-square tests are performed in R with chisq.test(). For this analysis, this function requires a matrix of the data as the only argument. The expected values and residuals ($$\frac{observed-expected}{\sqrt{expected}}$$) are obtained by appending $expected and $residuals to the saved chisq.test object. > ( chi1 <- chisq.test(d5) ) Pearson's Chi-squared test with d5 X-squared = 87.154, df = 2, p-value < 2.2e-16 > chi1$expected
            Q      S-Q
1996 42.93659 119.0634
1997 44.52683 123.4732
1998 75.53659 209.4634
> chi1$residuals  Q S-Q 1996 6.41934584 -3.85491990 1997 -0.07895125 0.04741149 1998 -4.77916601 2.86996567 The final test statistic and p-value, cell contributions to the chi-square test statistic, and a combined table of observed and expected values can also be constructed with chisq.detail(), from the TeachingDemos package. > chisq.detail(d5)  observed expected Q S-Q Total 1996 85 77 162 42.94 119.06 1997 44 124 168 44.53 123.47 1998 34 251 285 75.54 209.46 Total 163 452 615 Cell Contributions Q S-Q 1996 41.21 + 14.86 + 1997 0.01 + 0.00 + 1998 22.84 + 8.24 = 87.15 df = 2 P-value = 0  ### 9.5.3 Chi-Square Tests II The authors of the box discuss but do not show the 2x2 chi-square tests used to identify differences between pairs of years. These three tests can be constructed with chisqPostHoc(), from the NCStats package. This function requires the saved chisq.test() object as the first argument and a method to use for adjusting p-values for inflation due to multiple comparisons in the control= argument (see ?p.adjust for more discussion on the different methods for controlling the error rate with multiple comparisons). Finally, if the populations or groups to be compared were not in the rows of the original observed table (the groups in this example, i.e., the years, do form the rows so this argument is not required) then use the popsInRows=FALSE argument. The results below indicate that the PSD differs significantly among all years in the study. > chisqPostHoc(chi1,digits=6) Adjusted p-values used the fdr method.  comparison raw.p adj.p 1 1996 vs. 1997 0.000002 0.000003 2 1996 vs. 1998 0.000000 0.000000 3 1997 vs. 1998 0.000174 0.000174 ## 9.6 Testing for Differences in Size Structure by Treating Groups of Fish Caught in Each Unit of Effort as Samples ### 9.6.1 Preparing Data The Box9_6.txt data file is read and the structure is observed below. The authors create a new variable, LOGIT, that is the log of the ratio of PREF to QUAL after 0.5 had been added to each value to account for zeroes in the data. > d6 <- read.table("data/Box9_6.txt",header=TRUE) > str(d6) 'data.frame': 24 obs. of 3 variables:$ METHOD: Factor w/ 2 levels "ELEC","TOURN": 2 2 2 2 2 2 2 2 2 2 ...
$QUAL : int 3 28 13 8 61 76 38 49 62 43 ...$ PREF  : int  2 4 1 0 16 12 10 12 24 10 ...
> d6$LOGIT <- log((d6$PREF+0.5)/(d6$QUAL+0.5)) > d6  METHOD QUAL PREF LOGIT 1 TOURN 3 2 -0.3364722 2 TOURN 28 4 -1.8458267 3 TOURN 13 1 -2.1972246 4 TOURN 8 0 -2.8332133 5 TOURN 61 16 -1.3156768 6 TOURN 76 12 -1.8115621 7 TOURN 38 10 -1.2992830 8 TOURN 49 12 -1.3762440 9 TOURN 62 24 -0.9364934 10 TOURN 43 10 -1.4213857 11 TOURN 59 18 -1.1682056 12 TOURN 24 5 -1.4939250 13 ELEC 23 12 -0.6312718 14 ELEC 22 2 -2.1972246 15 ELEC 35 8 -1.4294665 16 ELEC 6 2 -0.9555114 17 ELEC 11 1 -2.0368819 18 ELEC 15 7 -0.7259370 19 ELEC 12 3 -1.2729657 20 ELEC 9 6 -0.3794896 21 ELEC 25 5 -1.5339304 22 ELEC 25 8 -1.0986123 23 ELEC 9 1 -1.8458267 24 ELEC 7 1 -1.6094379 ### 9.6.2 Summary Statistics The summary statistics of the LOGIT values for each collection type is computed with Summarize() with the first argument containing a formula of the form response~factor and the argument data= set to the data frame containing the variables (these simple statistics are different from what is presented in the box. I do not know why, as the raw data show above and the results of the linear model shown below perfectly match the results in the box). > Summarize(LOGIT~METHOD,data=d6,digits=4)  METHOD n mean sd min Q1 median Q3 max percZero 1 ELEC 12 -1.3097 0.5718 -2.197 -1.669 -1.351 -0.8981 -0.3795 0 2 TOURN 12 -1.5030 0.6294 -2.833 -1.820 -1.399 -1.2670 -0.3365 0 ### 9.6.3 Model Fitting The model fit in the box can be fit in R with lm() using the same formula and data= arguments used in Summarize() as above. The authors used a regression weighted on the number of quality fish collected. These weights are used in lm() by setting the weights= argument to QUAL. The ANOVA table is extracted from the saved lm object with anova(). > lm1 <- lm(LOGIT~METHOD,data=d6,weights=QUAL) > anova(lm1)  Df Sum Sq Mean Sq F value Pr(>F) METHOD 1 1.391 1.3908 0.2612 0.6144 Residuals 22 117.151 5.3250 Total 23 118.542  ## 9.7 Using Repeated-Measures ANOVA to Test for Size Structure Differences with Time-Dependent Data ### 9.7.1 Preparing Data The Box9_7.txt data file is read and the structure is observed below. The level names in sizegrp are observed with levels() and the site and year variables were converted to factors with factor() > d7 <- read.table("data/Box9_7.txt",header=TRUE) > str(d7) 'data.frame': 303 obs. of 4 variables:$ year   : int  1988 1988 1988 1988 1988 1988 1988 1988 1988 1988 ...
$site : int 1 1 1 2 2 2 3 3 3 4 ...$ sizegrp: Factor w/ 3 levels "age0","slot",..: 1 3 2 1 3 2 1 3 2 1 ...
$count : int 2 4 1 13 9 2 5 11 1 2 ... > levels(d7$sizegrp)
[1] "age0" "slot" "und" 
> d7$fsite <- factor(d7$site)
> d7$fyear <- factor(d7$year)

The authors then created a new variable, period, that indicates whether the data came from a year prior to implementation of the management regulation (i.e,. prior to 1990) or after the implementation. This variable is created by first filling the variable with “APRE” and then replacing this name with “BPOST” for all years after 1990. The new variable is then converted to a factor and the new data frame is viewed to see what was accomplished. Finally, as noted by the authors, all age-0 fish were removed from the analysis (using Subset() and noting that != means “not equals”).

> d7$period <- "APRE" # initially fill completely with "APRE" > d7$period[d7$year>1990] <- "BPOST" # then replace post-1990 with "BPOST" > d7$period <- factor(d7$period) # explicitly make a factor > view(d7)  year site sizegrp count fsite fyear period 14 1988 6 und 15 6 1988 APRE 50 1989 8 und 8 8 1989 APRE 63 1990 1 slot 3 1 1990 APRE 72 1990 4 slot 3 4 1990 APRE 235 1995 2 age0 5 2 1995 BPOST 298 1996 11 age0 0 11 1996 BPOST > d7a <- Subset(d7,sizegrp!="age0") This data frame, which has the count “stacked” for both the slot and und group must now be unstacked so that the logit of the ratio of fish in the slot to fish in the undersized category can be computed. The reshape() function is a handy, if not cumbersome, method for converting between the stacked (what reshape() calls “long”) format to unstacked (what reshape() calls “wide”) format. For our purposes, reshape() requires four arguments: • data: the data frame to be converted from (note that this is the first argument) • direction: this is the format to be converted to (i.e., we are converting from “long” to “wide”) • timevar: this is the variable that contains the information on how the “long” data should be split into “wide” data. In this case, we want to have count separated into two columns, one for undersized fish and one for slot length fish. • idvar: this is the variable or variables that will are repeated in the “long” format and should occur only once in the “wide” format. Thus, the appropriate reshape() command for this example is shown below with the resulting data frame viewed. Notice that the counts of under- and slot-sized fish are contained in the count.und and count.slot variables for each year, site, and period combination. > d7b <- reshape(d7a,direction="wide",timevar="sizegrp",idvar=c("site","year","fyear","period")) > view(d7b)  year site fyear period count.und fsite.und count.slot fsite.slot 19 1989 1 1989 APRE 7 1 1 1 35 1989 9 1989 APRE 10 9 3 9 91 1992 3 1992 BPOST 8 3 1 3 111 1993 2 1993 BPOST 8 2 4 2 147 1994 8 1994 BPOST 2 8 1 8 189 1996 6 1996 BPOST 3 6 3 6 Finally, the undert, slott, total, and LOGIT variables, as described in the box, are constructed. > d7b$undert <- d7b$count.und+0.5 > d7b$slott <- d7b$count.slot+0.5 > d7b$total <- d7b$undert + d7b$slott

### 9.7.3 Repeated Measures ANOVA

THIS SECTION HAS NOT YET BEEN CONVERTED

Reproducibility Information
Compiled Date: Thu May 14 2015
Compiled Time: 2:41:55 PM
Code Execution Time: 6.86 s

R Version: R version 3.2.0 (2015-04-16)
System: Windows, i386-w64-mingw32/i386 (32-bit)
Base Packages: base, datasets, graphics, grDevices, methods, stats, utils
Required Packages: FSA, NCStats, lattice, multcomp, nlme, pgirmess,
TeachingDemos and their dependencies (boot, car, codetools, dplyr,
FSAdata, gdata, gplots, graphics, grDevices, grid, Hmisc, knitr,
lmtest, mvtnorm, plotrix, relax, rgdal, sandwich, sciplot, sp, spdep,
splancs, stats, survival, TH.data, utils)
Other Packages: car_2.0-25, FSA_0.6.13, FSAdata_0.1.9, knitr_1.10.5,
lattice_0.20-31, multcomp_1.4-0, mvtnorm_1.0-2, NCStats_0.4.3,
nlme_3.1-120, pgirmess_1.6.0, rmarkdown_0.6.1, survival_2.38-1,
TeachingDemos_2.9, TH.data_1.0-6
Loaded-Only Packages: acepack_1.3-3.3, assertthat_0.1, bitops_1.0-6,
boot_1.3-16, caTools_1.17.1, cluster_2.0.1, coda_0.17-1,
codetools_0.2-11, colorspace_1.2-6, DBI_0.3.1, deldir_0.1-9,
digest_0.6.8, dplyr_0.4.1, evaluate_0.7, foreign_0.8-63, formatR_1.2,
Formula_1.2-1, gdata_2.16.1, ggplot2_1.0.1, gplots_2.17.0, grid_3.2.0,
gridExtra_0.9.1, gtable_0.1.2, gtools_3.4.2, highr_0.5, Hmisc_3.16-0,
htmltools_0.2.6, KernSmooth_2.23-14, latticeExtra_0.6-26,
LearnBayes_2.15, lme4_1.1-7, lmtest_0.9-33, magrittr_1.5, MASS_7.3-40,
Matrix_1.2-0, mgcv_1.8-6, minqa_1.2.4, munsell_0.4.2, nloptr_1.0.4,
nnet_7.3-9, parallel_3.2.0, pbkrtest_0.4-2, plotrix_3.5-11, plyr_1.8.2,
proto_0.3-10, quantreg_5.11, RColorBrewer_1.1-2, Rcpp_0.11.6,
relax_1.3.15, reshape2_1.4.1, rgdal_0.9-2, rpart_4.1-9, sandwich_2.3-3,
scales_0.2.4, sciplot_1.1-0, sp_1.1-0, SparseM_1.6, spdep_0.5-88,
splancs_2.01-37, splines_3.2.0, stringi_0.4-1, stringr_1.0.0,
tools_3.2.0, yaml_2.1.13, zoo_1.7-12