\[ \require{mhchem} \newcommand{\E}{\operatorname{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\erf}{\operatorname{erf}} \]
In this lesson:
To answer these questions, we turn to inferential statistics. We will pose a hypothesis, and let the data tell us whether it may be true or not.
The objective of inferential statistics is to determine underlying mechanisms that generated the data, or draw conclusions regarding the value of metrics or estimates observed given the variability in the data.
Topics:
Two special processes lead to statistical distributions we presented in an earlier lesson.
A normal [or Gaussian] process results when a number of unrelated, continuous random variables are added together. (Ott 1994)
Particles suspended in a fluid are continuously bombarded by the surrounding fluid molecules \(\Rightarrow\) results in a random, irregular motion (“random walk”) of the particles known as Brownian motion. (Hinds 1999)
The spread of particles with time can be determined by solving the one-dimensional equation of diffusion (Fick’s second law) for \(n(x=0,t=0) = n_0\), \[ \frac{\partial n}{\partial t} = D \frac{\partial^2n}{\partial x^2} \] The solution for the concentration distribution is given by the Gaussian form, \[ n(x,t) = \frac{n_0}{2\sqrt{\pi Dt}}\exp\left(\frac{-x^2}{4Dt}\right) \]
The mean square displacement of the particles from \(x=0\) at time \(t\) is given by \[ \langle{x^2}\rangle = \frac{1}{n_0} \int_{-\infty}^\infty x^2\, n(x,t)\, dx = 2 Dt \]
The fractional distribution of particles can be expressed as a function of position \(x\) at time \(t\), \[ f(x,t) = \frac{n(x,t)}{n_0} = \frac{1}{\sqrt{2\pi\langle{x^2}\rangle}} \exp\left(-\frac{x^2}{2\langle{x^2}\rangle}\right) \]
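As a numerical check on the result \(\langle{x^2}\rangle = 2Dt\), we can integrate \(x^2 f(x,t)\) directly; a minimal sketch, where the values of \(D\) and \(t\) are arbitrary assumptions:

```r
## numerical check that <x^2> = 2*D*t for the Gaussian solution
D <- 0.1   # diffusion coefficient (arbitrary units)
t <- 5     # elapsed time (arbitrary units)

## fractional distribution f(x, t) = n(x, t) / n0
f <- function(x) 1 / (2 * sqrt(pi * D * t)) * exp(-x^2 / (4 * D * t))

## mean square displacement by numerical integration
msd <- integrate(function(x) x^2 * f(x), -Inf, Inf)$value
c(msd = msd, theory = 2 * D * t)
```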
Other normal processes
This leads to the normal distribution that we discussed in our previous lesson:
A lognormal process is one in which the random variable of interest results from the product of many independent random variables (Ott 1994).
Theory of successive dilutions (Ott 1990):
\(c_0\) is the initial concentration, \(D_m\) is the \(m\)th dilution factor, and \(C_m\) is the concentration in the \(m\)th cell (increasing dilution with increasing \(m\)). \[ \begin{aligned} C_1 &= c_0 D_1 \\ C_2 &= C_1 D_2 = c_0 D_1 D_2 \\ C_m &= c_0 D_1 D_2 \ldots D_m = c_0 \prod_{i=1}^m D_i \\ \log C_m &= \log c_0 + \log D_1 + \log D_2 + \ldots + \log D_m = \log c_0 + \sum_{i=1}^m \log D_i \end{aligned} \] If \(D_i\) is a random variable (e.g., uniformly distributed), the r.h.s. of the last equation is a sum of \(m\) independent random variables \(\Rightarrow\) \(\log C_m\) is normally distributed \(\Rightarrow\) \(C_m\), the concentration at the point of observation, is lognormally distributed.
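A small simulation illustrates this mechanism; the number of dilutions, trials, and \(c_0\) below are arbitrary assumptions:

```r
set.seed(1)
m  <- 20    # number of successive dilutions (arbitrary)
N  <- 1e4   # number of simulated realizations
c0 <- 100   # initial concentration (arbitrary units)

## each realization: c0 times a product of m uniform dilution factors
D  <- matrix(runif(N * m), N, m)
Cm <- c0 * apply(D, 1, prod)

## log(Cm) is a sum of i.i.d. terms => approximately normal;
## E[log U(0,1)] = -1, so E[log Cm] = log(c0) - m
c(mean.logC = mean(log(Cm)), theory = log(c0) - m)
hist(log(Cm))  # symmetric, bell-shaped
```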
This leads to the lognormal distribution that we discussed in our previous lesson:
In this lesson, we will use this knowledge to
To answer these questions, we turn to a hypothesis testing framework.
Hypothesis testing is a method of statistical inference by which we determine whether a proposed statement is favored over another. We designate these statements as either the null hypothesis \(H_0\) or the alternative hypothesis \(H_a\). The selection and application of statistical tests can lead to several types of errors.
| Decision | \(H_0\) true | \(H_a\) true |
|---|---|---|
| reject \(H_0\) | Type I error (\(\alpha\)) | correct decision |
| do not reject \(H_0\) | correct decision | Type II error (\(\beta\)) |
Random variables
Let \(X\) represent the concentration of an atmospheric pollutant. In the statistical view,
The sample mean, \(\bar{X}\), is an estimator of the mean of \(X\), \(\mu_X\), and also has a probability distribution.
We define the expected value and variance for each distribution with the following symbols:
\[ \begin{aligned} \mu_X &= \E(X) & \mu_{\bar{X}} &= \E(\bar{X})\\ \sigma_X^2 &= \Var(X) & \sigma_{\bar{X}}^2 &= \Var(\bar{X}) \end{aligned} \]
Hypothesis: \[ \begin{array}{l} H_0: \mu_{X} = \mu_0\\ H_a: \mu_{X} \neq \mu_0 \end{array} \] The estimator by which we evaluate this hypothesis is the sample mean: \[ \bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n} \]
Example: let \(X\) be uniformly distributed between \([a,b]\) where \(a=0\) and \(b=1\).
If we draw 10\(^6\) samples from this distribution, we can construct a histogram for \(x\):
For a uniform distribution \(X\sim\mathbf{U}(a,b)\), \[ \begin{aligned} \E(X) &= \frac{1}{2}(a+b)\\ \Var(X) &= \frac{1}{12}(b-a)^2 \end{aligned} \]
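These moments are easy to verify by simulation:

```r
set.seed(42)
a <- 0; b <- 1
x <- runif(1e6, a, b)   # 10^6 draws from U(a, b)
c(mean = mean(x), theory = (a + b) / 2)       # ~0.5
c(variance = var(x), theory = (b - a)^2 / 12) # ~0.0833
```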
From a given (single) estimate of \(\bar{x}\), let us test the null hypothesis \(\mu_X = 0.5\) against the alternative \(\mu_X \neq 0.5\) (so \(\mu_0 = 0.5\) in our hypothesis testing framework). First, we will investigate the properties (distribution) of \(\bar{X}\) so that we can place a single realization, \(\bar{x}\), in context.
Distribution of the sample mean \(\bar{x}\) obtained from repeated sampling
Let \(\bar{X}\) be calculated from \(n=50\) randomly selected samples. If we conduct an experiment where we draw 50 samples from this distribution for \(N=30\) trials, the values would look as shown below. In each trial (shown along rows), a red point marks the mean value \(\bar{x}\) of the 50 samples (gray points).
If we draw a histogram of the 30 means \(\bar{x}\), we see that the distribution is already approximately normal—even though the distribution of \(X\) is most definitely not normal. An empirical density is overlaid in red.
Now let us repeat this experiment for \(N \to \infty\) trials (actually, 10\(^6\) in this case as \(\infty\) is unachievable in reality) rather than just 30 times:
We obtain the distribution for \(\bar{X}\), also called the sampling distribution. It is normally distributed, with parameters \(\mu_{\bar{X}} = \mu_{X}\) and \(\sigma_{\bar{X}}^2 = \sigma_X^2/n\), as derived in a previous lecture. The theoretical normal distribution \(\mathbf{N}(\mu_{\bar{X}},\sigma_{\bar{X}})\) is overlaid in blue.
Parameters of the sampling distribution (example: sample mean)
\(\bar{X}\) is a random variable which characterizes the mean of \(n\) random variables \(\{X_1, X_2, \dots, X_n\}\) with finite variance. \[ \bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n} = \frac{X_1}{n} + \frac{X_2}{n} + \ldots + \frac{X_n}{n} \]
Since \(\bar{X}\) is the sum of a number of other random variables, we can expect (by the central limit theorem) that its distribution is approximately normal. The expected value \(\E(\bar{X})\) and variance \(\Var(\bar{X})\) of this distribution are \(\mu_{\bar{X}}\) and \(\sigma_{\bar{X}}^2\), respectively.
There are properties of expected values and variances which are useful. For any random variable \(X\) multiplied by a scalar \(a\), \[ \begin{aligned} \E(aX) &= a\E(X)\\ \Var(aX) &= a^2 \Var(X) \end{aligned} \] and for independent random variables \(X_1, \ldots, X_n\), expected values and variances add: \[ \begin{aligned} \E(X_1 + \ldots + X_n) &= \E(X_1) + \ldots + \E(X_n)\\ \Var(X_1 + \ldots + X_n) &= \Var(X_1) + \ldots + \Var(X_n) \end{aligned} \]
Therefore, \[ \begin{aligned} \mu_{\bar{X}} &= \frac{1}{n}(n \mu_X) = \mu_X \\ \sigma_{\bar{X}}^2 &= \frac{1}{n^2}(n\sigma_X^2) = \frac{\sigma_X^2}{n} \end{aligned} \]
The standard deviation of the sampling distribution, also called the standard error, is \(\sigma_{\bar{X}} = \sigma_X/\sqrt{n}\).
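We can verify the standard-error formula by simulation for the uniform example above, where \(\sigma_X = \sqrt{1/12}\):

```r
set.seed(1)
n <- 50    # samples per trial
N <- 1e4   # number of trials

## sampling distribution of the mean: one xbar per trial
xbar <- replicate(N, mean(runif(n)))

## empirical sd of the sample means vs sigma_X / sqrt(n)
c(empirical = sd(xbar), theory = sqrt(1 / 12) / sqrt(n))
```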
In reality, we do not have the ability to calculate \(N\) means from \(N\to\infty\) trials (10\(^6\) for what is shown here). BUT, if we know \(\mu_X\) and \(\sigma_X\), then we know \(\mu_{\bar{X}}\) and \(\sigma_{\bar{X}}\) of the sampling distribution for \(\bar{X}\) as shown above.
We expand the scale along the \(x\)-axis for better visibility in the figure below.
If we calculate a mean value after sampling \(n\) values only once from this distribution (what usually happens in environmental sampling), we get a point estimate of \(\bar{X}\), shown by a red vertical line (\(\bar{x}=0.53\)).
Example: sampling means from a lognormal distribution
Below, an example lognormal distribution [Lausanne 2013 carbon monoxide, \(\E(X) = 0.40\)] is shown at top. The three plots below show the distribution of mean values (the sampling distribution) when the mean is calculated from \(n\)=10, 30, and 100 samples randomly selected from this distribution (the process of random selection is repeated 10,000 times for each scenario).
We will see why confidence intervals are useful in the context of hypothesis testing.
Confidence intervals on the mean can be constructed using the standard error. Let \(X\) be a random variable. \[ P\left( \mu_X - z \frac{\sigma_X}{\sqrt{n}} \leq \bar{X} \leq \mu_X + z\frac{\sigma_X}{\sqrt{n}}\right) = p \] Here \(z\) is the \(z\)-score and \(p\) is the confidence level associated with \(z\) (usually expressed as a percentage, \(100p\); the corresponding significance level is \(\alpha = 1-p\)). However, we seldom know \(\mu_X\) and \(\sigma_X\); instead we have a set of \(n\) observations from which we calculate the sample mean \(\bar{x}\). The statement above says that, with probability \(p\), the sample mean lies within \(z=z(p)\) standard errors of the true mean; it can be inverted to state that the true mean is bounded by the sample mean plus or minus the same interval: \[ P\left( \bar{X} - z \frac{\sigma_X}{\sqrt{n}} \leq \mu_X \leq \bar{X} + z\frac{\sigma_X}{\sqrt{n}}\right) = p \] Additionally, we often do not know \(\sigma_X\) either, but estimate \(s\) from the same set of observations, so we use a different expression for the bounds (explained in the next section): \[ P\left( \bar{X} - t \frac{s}{\sqrt{n}} \leq \mu_X \leq \bar{X} + t\frac{s}{\sqrt{n}}\right) = p \] We can write these variations in a general form, \[ P\left( \bar{X}-d \leq \mu_X \leq \bar{X}+d \right) = p \] The range \((\bar{X}-d,\bar{X}+d)\) associated with a particular confidence level \(100p\) is called the confidence interval. Commonly used confidence levels include 90%, 95%, and 99%, which correspond to significance levels \(\alpha\) of 0.10, 0.05, and 0.01, respectively.
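The \(t\)-based interval above can be computed directly and compared against R's t.test(); a minimal sketch with a hypothetical normal sample (the mean and sd values are arbitrary assumptions):

```r
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)  # hypothetical sample
n <- length(x)
xbar <- mean(x)
s <- sd(x)

## 95% confidence interval by hand (two-sided, p = 0.95)
t95 <- qt(0.975, df = n - 1)
ci <- xbar + c(-1, 1) * t95 * s / sqrt(n)
ci

t.test(x)$conf.int  # matches the interval above
```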
Conceived by William S. Gosset under the pseudonym "Student" while employed by the Guinness brewery. Applied to the statistics of small samples encountered in the chemical characterization of barley.
Test statistic: \[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \] where \(s\) is the sample standard deviation; under \(H_0\), \(t\) follows a \(t\)-distribution with \(\nu = n-1\) degrees of freedom.
Difference of two means:
Test statistic: \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X-\mu_Y)}{s_{\bar{X}-\bar{Y}}} \] where \[ \begin{aligned} s_{\bar{X}-\bar{Y}} &= \sqrt{s_X^2/n_X + s_Y^2/n_Y}\\ \nu &= \frac{\left(s_X^2/n_X + s_Y^2/n_Y\right)^2}{\left(s_X^2/n_X\right)^2/(n_X-1) + \left(s_Y^2/n_Y\right)^2/(n_Y-1)} \end{aligned} \] \(\nu\) is rounded to the nearest integer
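As a check on these formulas, the statistic and \(\nu\) can be computed by hand and compared with R's t.test(), which performs the Welch test by default (the two samples below are hypothetical):

```r
set.seed(2)
x <- rnorm(20, mean = 5.0, sd = 1)  # hypothetical group X
y <- rnorm(25, mean = 5.5, sd = 2)  # hypothetical group Y

vx <- var(x) / length(x)  # s_X^2 / n_X
vy <- var(y) / length(y)  # s_Y^2 / n_Y

## Welch t statistic and degrees of freedom by hand
t.stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
nu <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

out <- t.test(x, y)  # Welch two-sample t-test by default
c(t.by.hand = t.stat, t.from.t.test = unname(out$statistic))
c(nu.by.hand = nu, nu.from.t.test = unname(out$parameter))
```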
Formalizing comparisons within the hypothesis testing framework
Let \[ \begin{aligned} \alpha & = P(\text{type I error}) \\ & = P(\text{reject $H_0$ when $H_0$ is true}) \end{aligned} \]
\(\alpha\) is called the "significance level", and the corresponding confidence level is \((1-\alpha)\times 100\%\). E.g., defining an acceptance region for \(H_0\) at the \(\alpha=0.05\) significance level corresponds to a confidence level of 95%.
To determine whether to reject \(H_0\) or not, compare the computed value of the test statistic \(t\) to fixed values \(t_{\alpha,\nu}\) (one-sided) or \(t_{\alpha/2,\nu}\) (two-sided) defined for \(\nu\) (given by the data set) and \(\alpha\) (determined by the practitioner).
Let \(T\) be a random variable representing the possible values of the \(t\)-statistic.
For the two-sided case, \(t_{\alpha/2,\nu}\) (a positive value) is defined such that \[ \alpha = P(T > t_{\alpha/2,\nu}) + P(T < -t_{\alpha/2,\nu}) \]
For the two-sided case, we would reject the null hypothesis at the \(\alpha\) significance level if \(|t| > t_{\alpha/2,\nu}\).
Alternate method of presentation. With a given set of data and test statistic, we can define the “\(p\)-value” as the smallest level of significance that would lead to rejection of the null hypothesis \(H_0\). For the two-sided case, \(p\)-value corresponds to \[ \text{$p$-value} = P(T > |t|) + P(T < -|t|) \]
We can then state that we would reject the null hypothesis at any significance level \(\alpha\) greater than or equal to the \(p\)-value. In particular, if the \(p\)-value is less than our specified value of \(\alpha\), we can state that we reject the null hypothesis at the \(\alpha\) significance level.
Statistical vs. practical significance
Note that even when there is a statistically significant difference between two mean values, engineers must consider whether the difference is of practical significance that would influence decision making or interpretation.
The i.i.d. assumption
Always note the assumptions required to estimate confidence intervals for each parameter.
Calculation of estimators often assumes samples are i.i.d. (independent and identically distributed random variables): they have the same probability distribution and are independently selected. As we saw in a previous lecture, air pollution concentrations often exhibit serial correlation (autocorrelation), so successive samples are not truly independent. The implication is that convergence to the mean will be more gradual than predicted by the previously derived expressions.
However, Ott and Mage (1981) have shown that the extent of autocorrelation in air pollution measurements does not greatly affect estimated confidence intervals.
Testing for goodness of fit
Note there are many other (possibly better) goodness-of-fit tests, but the \(\chi^2\) statistic is commonly used for this purpose and draws directly upon our previous discussion of hypothesis testing and \(p\)-values.
Essentially, the \(\chi^2\) test compares sets of observed and expected frequencies: the number of observations measured (\(O_j\)) and predicted (\(E_j\)), relative to the number predicted, over \(j=1,2,\ldots,m\) intervals. \[ \chi^2 = \sum_{j=1}^m \frac{(O_j-E_j)^2}{E_j} \]
Fitting probability distributions. For comparing against continuous distributions, we bin the observations as for a histogram. However, we choose bin boundaries so that bins contain approximately equal counts of observations, rather than using evenly spaced bins.
This test statistic will have a probability distribution that is approximately \(\chi^2\) (regardless of the underlying distribution of the observations) and the associated \(p\)-value can be estimated for \(m-r-1\) degrees of freedom (\(r\) is the number of parameters of the distribution).
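A minimal sketch of this procedure, using synthetic lognormal "concentrations" and equal-probability bins (all parameter values here are arbitrary assumptions):

```r
set.seed(3)
x <- rlnorm(500, meanlog = 1, sdlog = 0.5)  # synthetic "concentrations"

## estimate the r = 2 lognormal parameters from the data
## (sd() divides by n-1, which is close to the ML estimate for large n)
meanlog <- mean(log(x))
sdlog   <- sd(log(x))

## m bins with equal expected counts (equal-probability boundaries)
m <- 10
breaks <- qlnorm(seq(0, 1, length.out = m + 1), meanlog, sdlog)
O <- as.vector(table(cut(x, breaks)))  # observed counts
E <- rep(length(x) / m, m)             # expected counts

chisq <- sum((O - E)^2 / E)
p.value <- pchisq(chisq, df = m - 2 - 1, lower.tail = FALSE)
c(chisq = chisq, p.value = p.value)
```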
We will use this statistic to test our hypothesis that the observations are lognormally distributed:
It is important to note that we assume the distribution is lognormal unless we have enough evidence to reject it. A \(p\)-value \(>\) 0.05 means the observations do not deviate enough from lognormally distributed values to suggest that the true distribution is otherwise.
Interpretation
Why concentrations may be lognormally distributed:
Why concentrations may depart from lognormal distribution:
Further discussion can be found in the academic literature – e.g., Kao and Friedlander (1995), Hadley and Toumi (2003), Morrison and Murphy (2010).
Wider application of hypothesis testing
In this exercise, we examined an example of hypothesis testing of mean values and appropriateness for a particular distribution to describe a random variable.
Such tests can be performed on other parameters: standard deviations, regression coefficients (slope, intercept, etc.), correlation coefficients, and others. Examples:
However, always note the required assumptions before performing tests for statistical inference.
As another example, it is often requested that tests of significance be provided for the correlation coefficient between two analyte concentrations (i.e., whether it is different from zero).
Hypothesis testing and \(p\)-values are used extensively in the medical and psychological disciplines and are currently under intense scrutiny.
P values are commonly used to test (and dismiss) a 'null hypothesis', which generally states that there is no difference between two groups, or that there is no correlation between a pair of characteristics. The smaller the \(P\) value, the less likely an observed set of values would occur by chance, assuming that the null hypothesis is true.
A P value of 0.05 does not mean that there is a 95% chance that a given hypothesis is correct. Instead, it signifies that if the null hypothesis is true, and all other assumptions made are valid, there is a 5% chance of obtaining a result at least as extreme as the one observed. And a P value cannot indicate the importance of a finding; for instance, a drug can have a statistically significant effect on patients’ blood glucose levels without having a therapeutic effect.
P values are widely used in science to test null hypotheses. For example, in a medical study looking at smoking and cancer, the null hypothesis could be that there is no link between the two. Many researchers interpret a lower P value as stronger evidence that the null hypothesis is false. Many also accept findings as ‘significant’ if the P value comes in at less than 0.05. But P values are slippery, and sometimes, significant P values vanish when experiments and statistical analyses are repeated (see Nature 506, 150-152; 2014).
A more general problem is that “[a]rbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled.”
Bottom line: the P value is not an absolute value that stands on its own. When reporting P values, it is important to thoroughly document the procedure of your analysis, including effect size and confidence intervals (which inherently include sample size), to communicate the "magnitude and relative importance of an effect."
library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
df <- readRDS("data/2013/lau-zue.rds")
Pivot to long format. When providing a character vector of column names, wrap it in all_of().
id.vars <- c("site", "datetime", "year", "month", "day", "hour", "season", "dayofwk", "daytype")
lf <- gather(df, variable, value, -all_of(id.vars))
Create a function to view some metrics:
ComputeStats <- function(x) {
x <- na.omit(x)
m <- mean(x)
s <- sd(x)
n <- length(x)
t <- qt(.975, n - 1)
data.frame(mean=m,
conf.lower=m-t*s/sqrt(n),
conf.upper=m+t*s/sqrt(n),
sd.lower=m-s,
sd.upper=m+s)
}
Apply to ozone concentrations in January and July in Lausanne:
table <- lf %>%
filter(site=="LAU" &
variable=="O3" &
month %in% c("Jan","Jul"))
mystats <- table %>%
group_by(month, daytype) %>%
do(ComputeStats(.[["value"]]))
head(mystats)
## # A tibble: 4 x 7
## # Groups: month, daytype [4]
## month daytype mean conf.lower conf.upper sd.lower sd.upper
## <ord> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jan Weekday 23.7 22.3 25.0 7.84 39.5
## 2 Jan Weekend 19.9 17.8 21.9 5.49 34.2
## 3 Jul Weekday 72.2 69.5 74.8 40.6 104.
## 4 Jul Weekend 87.5 83.2 91.8 57.3 118.
Summarize variability:
ggplot(table)+
geom_boxplot(aes(daytype, value))+
facet_wrap(~month)+
labs(x=expression(O[3]))
ggplot(mystats)+
facet_wrap(~month)+
geom_bar(aes(x=daytype, y=mean), stat="identity", fill="gray")+
geom_errorbar(aes(x=daytype,
ymin=sd.lower,
ymax=sd.upper),
width=0.1)+
scale_y_continuous(expression(O[3]), expand=expansion(mult=c(0, 0.1)))
Uncertainty in calculated means:
ggplot(mystats)+
facet_wrap(~month)+
geom_bar(aes(x=daytype, y=mean),
stat="identity", fill="gray")+
geom_errorbar(aes(x=daytype,
ymin=conf.lower,
ymax=conf.upper),
width=0.1)+
scale_y_continuous(expression("mean"~O[3]), expand=expansion(mult=c(0, 0.1)))
We can also compute the corresponding \(t\)-statistic.
Just taking the month of July for example,
(out <- t.test(filter(table, month=="Jul" & daytype=="Weekend")[["value"]],
filter(table, month=="Jul" & daytype=="Weekday")[["value"]],
alternative="greater"))
##
## Welch Two Sample t-test
##
## data: filter(table, month == "Jul" & daytype == "Weekend")[["value"]] and filter(table, month == "Jul" & daytype == "Weekday")[["value"]]
## t = 5.9985, df = 346.88, p-value = 2.504e-09
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 11.15074 Inf
## sample estimates:
## mean of x mean of y
## 87.54063 72.16145
its \(p\)-value can be retrieved with this syntax:
out[["p.value"]]
## [1] 2.503557e-09
Since the \(p\)-value is less than \(\alpha=0.05\), we can say that we would reject the null hypothesis (that on average, the weekend concentrations are the same as weekday concentrations) at the \(\alpha=0.05\) significance level.
We generally ascribe a weekend increase in \(\ce{O3}\) to reduction in \(\ce{NO_x}\), either through the diminution of the termination reaction (\(\ce{NO2 + OH -> HNO3}\)) or the \(\ce{NO}\) titration reaction (\(\ce{NO + O3 -> NO2 + O2}\)). We know that Lausanne is a near-road site where the \(\ce{NO}\) titration reaction is active; the weekend increase is not observed in the sum of oxidants (\(\ce{O_x}\)).
mystats2 <- lf %>%
filter(site=="LAU" &
variable %in% c("O3", "NO2") &
month %in% c("Jan","Jul")) %>%
spread(variable, value) %>%
mutate(value = O3 + NO2) %>%
group_by(month, daytype) %>%
do(ComputeStats(.[["value"]]))
ggplot(mystats2)+
facet_wrap(~month)+
geom_bar(aes(x=daytype, y=mean),
stat="identity", fill="gray")+
geom_errorbar(aes(x=daytype,
ymin=conf.lower,
ymax=conf.upper),
width=0.1)+
scale_y_continuous(expression("mean"~O[x]), expand=expansion(mult=c(0, 0.1)))
We can extend the \(p\)-value calculation over other seasons and pollutants:
TTest <- function(value, daytype) {
## if all values are missing,
## the t-test will raise an error
wkend <- value[daytype=="Weekend"]
wkday <- value[daytype=="Weekday"]
if(length(na.omit(wkend)) > 0 & length(na.omit(wkday)) > 0) {
out <- t.test(wkend, wkday, alternative="two.sided")
pval <- out[["p.value"]]
} else {
pval <- NA
}
pval
}
pvals <- lf %>%
filter(!variable %in% c("TEMP", "PREC", "RAD")) %>%
## add Ox
spread(variable, value) %>%
mutate(Ox = O3 + NO2) %>%
gather(variable, value, -all_of(id.vars)) %>%
## calculate p-value
group_by(site, season, variable) %>%
summarize(p.value = TTest(value, daytype))
ggplot(pvals) +
facet_grid(site ~ season) +
geom_point(aes(variable, p.value), shape=3, color=4)+
geom_hline(yintercept = .05, linetype = 2) +
scale_y_log10() +
theme(axis.text.x = element_text(angle=45, hjust=1))+
labs(x="", y="p-value")
We can use the fitdistrplus library available in R:
library(fitdistrplus)
The fitdist function fits a distribution by maximum likelihood by default (see ?fitdist for more information). Let us fit a lognormal distribution ("lnorm"), which also produces a plot:
hourly <- lf %>% filter(site=="LAU" & month=="Jul" & variable=="NO2")
concvec <- c(na.omit(hourly[["value"]]))
fit <- fitdist(concvec, "lnorm")
Print output:
print(fit)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters:
## estimate Std. Error
## meanlog 3.4993578 0.02169593
## sdlog 0.5909902 0.01534114
Print goodness-of-fit statistics (see ?gofstat for more information):
print(gofstat(fit))
## Goodness-of-fit statistics
## 1-mle-lnorm
## Kolmogorov-Smirnov statistic 0.09833373
## Cramer-von Mises statistic 2.21727034
## Anderson-Darling statistic 13.68921148
##
## Goodness-of-fit criteria
## 1-mle-lnorm
## Akaike's Information Criterion 6522.233
## Bayesian Information Criterion 6531.452
Extracting \(p\)-value of chi-square test:
(pval.h <- gofstat(fit)[["chisqpvalue"]])
## [1] 9.482199e-24
Plot:
plot(fit)
The P-P plot compares the theoretical and empirical cumulative distribution functions (CDFs).
If the proposed distribution is a “good fit” and the parameters have been estimated correctly, the points generally fall on the 1:1 line. See this page for illustrations of various departures from this condition.
The Q-Q (quantile-quantile) plot compares quantiles (values of \(x\)) from the theoretical and empirical CDFs.
If the proposed distribution is a “good fit” and the parameters have been estimated correctly, the points generally fall on the 1:1 line. See this page for illustrations of various departures from this condition.
The Q-Q plot is generally more popular than the P-P plot, though they can indicate complementary information. For instance, the Q-Q plot is more sensitive to departure from the distribution at the extremes, while the P-P plot highlights departures that occur near the center of the distribution (note that the P-P plot limits are bounded by 0 and 1, while they are not bounded for the Q-Q plot).
If you have negative concentrations
This may prevent you from testing whether your data are lognormally distributed (i.e., the function call to fitdist() fails). The following function will replace values less than or equal to zero with the minimum positive value otherwise observed for each data set:
ReplaceNonpositive <- function(x) {
x <- na.omit(x)
min.positive.value <- min(x[x>0])
replace(x, x < min.positive.value, min.positive.value)
}
Example usage:
fit <- fitdist(ReplaceNonpositive(concvec), "lnorm")
Here we have introduced an approximation to the observed measurements. Alternatively, you may consider replacing values \(< 0\) with one-half the minimum detection limit, if this information is available. While there are studies on the effect of such substitutions on a variety of estimates and statistical tests, the effect in this specific instance is not known, as the sensitivity to the various approximations has not been explicitly explored. In any case, you should adopt the practice of reporting such "pretreatment" of the data in your analysis.
Devore, Jay L. 2012. Probability and Statistics for Engineering and the Sciences. 8th ed. Cengage Learning.
Hadley, Adam, and Ralf Toumi. 2003. "Assessing Changes to the Probability Distribution of Sulphur Dioxide in the UK Using a Lognormal Model." Atmospheric Environment 37 (11): 1461–74. https://doi.org/10.1016/S1352-2310(02)01003-8.
Hinds, W. C. 1999. Aerosol Technology: Properties, Behavior, and Measurement of Airborne Particles. Wiley-Interscience. Wiley.
Hoel, P. G. 1984. Introduction to Mathematical Statistics. Wiley Publication in Mathematical Statistics. John Wiley & Sons Canada, Limited.
Kao, Alan S., and Sheldon K. Friedlander. 1995. "Frequency Distributions of PM10 Chemical Components and Their Sources." Environmental Science & Technology 29 (1): 19–28. https://doi.org/10.1021/es00001a003.
Morrison, R. D., and B. L. Murphy. 2010. Environmental Forensics: Contaminant Specific Guide. Elsevier Science.
Ott, Wayne R. 1990. “A Physical Explanation of the Lognormality of Pollutant Concentrations.” Journal of the Air & Waste Management Association 40 (10): 1378–83. https://doi.org/10.1080/10473289.1990.10466789.
———. 1994. Environmental Statistics and Data Analysis. Taylor & Francis.
Ott, Wayne R., and David T. Mage. 1981. “Measuring Air Quality Levels Inexpensively at Multiple Locations by Random Sampling.” Journal of the Air Pollution Control Association 31 (4): 365–69. https://doi.org/10.1080/00022470.1981.10465230.
Student. 1908. “The Probable Error of a Mean.” Biometrika 6 (1): 1–25.