Computing the coefficient of determination for the 1:1 line (intercept = 0 and slope = 1).
set.seed(2018)
y <- rnorm(n = 100, mean = 0, sd = 1)
x <- rnorm(n = 100, mean = 0, sd = 1)
plot(y ~ x)
# Add least squares regression line
abline(lm(y ~ x), col = "blue")
The simple linear regression is of the form y = a + b*x + e. For a 1:1 line the intercept is a = 0 and the slope is b = 1, so that y = x + e, where e are the errors (residuals, i.e. the deviations from the 1:1 line). In R, a 1:1 line can simply be plotted with abline(0, 1).
plot(x, y, xlim = c(-2, 3), ylim = c(-2, 3))
# add grid
abline(h = seq(-2, 3, 1),
v = seq(-2, 3, 1),
lty = "dashed",
col = "gray70")
abline(lm(y ~ x), col = "blue") # least squares regression line
abline(0, 1, col = "red", lwd = 2) # 1:1 line
Note that, for the 1:1 line, the errors (residuals) are simply the differences e = y - x, which can be visualized as vertical segments:
plot(x, y, xlim = c(-2, 3), ylim = c(-2, 3))
abline(0, 1, col = "red") # 1:1 line
segments(x0 = x,
y0 = y,
x1 = x,
y1 = x, # y1 = y - e = y - y + x = x
col = "red",
lty = "dashed")
The coefficient of determination “is the proportion of the variance in the dependent variable that is predictable from the independent variable”.
Checking the Wikipedia page, we can see that “the most general definition of the coefficient of determination” is given in terms of the unexplained variance, namely the fraction of variance unexplained (FVU):
R² = 1 - FVU = 1 - SS_res / SS_tot

where FVU is the sum of squares of residuals (SS_res) divided by the total sum of squares (SS_tot):

SS_res = ∑(yᵢ - ŷᵢ)² = ∑eᵢ²

SS_tot = ∑(yᵢ - ȳ)²
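As a quick sanity check of this general definition, we can recompute R² by hand for the unconstrained regression of y on x and compare it with the value reported by summary.lm() (a small sketch using the simulated data from above):
# Verify the general definition against R's built-in R squared
fit <- lm(y ~ x)
SS_res <- sum(residuals(fit) ^ 2)  # sum of squares of residuals
SS_tot <- sum((y - mean(y)) ^ 2)   # total sum of squares
all.equal(1 - SS_res / SS_tot, summary(fit)$r.squared) # expect TRUE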
As already pointed out above, for the 1:1 line the errors (residuals) are the differences e = y - x. Therefore, SS_res can be written as:

SS_res = ∑eᵢ² = ∑(yᵢ - xᵢ)²
The R implementation for the 1:1 line is:
SS_res <- sum((y - x) ^ 2)
SS_tot <- sum((y - mean(y)) ^ 2)
1 - SS_res / SS_tot
## [1] -1.032755
Here we get a value outside the usual range of 0 to 1 because the 1:1 line fits the data worse than the horizontal line through ȳ (the line with intercept = ȳ and slope = 0). For that horizontal line, ŷᵢ = ȳ for every point, so SS_res = SS_tot and R² = 0.
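As a minimal check, we can confirm that the horizontal ȳ line yields an R² of exactly 0:
# For the horizontal line at mean(y), every prediction is mean(y),
# so SS_res coincides with SS_tot and R squared is exactly 0
SS_res_mean <- sum((y - mean(y)) ^ 2)
SS_tot_mean <- sum((y - mean(y)) ^ 2)
1 - SS_res_mean / SS_tot_mean # returns 0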
We can put the above in a simple R function:
r2_1to1 <- function(xx, yy) {
SS_res <- sum((yy - xx) ^ 2)
SS_tot <- sum((yy - mean(yy)) ^ 2)
r2 <- 1 - SS_res / SS_tot
return(r2)
}
r2_1to1(x, y)
## [1] -1.032755
r2_1to1(y, x) # swapping the roles of x and y
## [1] -1.014109
Note that, if we swap x and y, we get a different coefficient of determination (R squared) for the 1:1 line. However, when fitting the usual simple linear regression (no constraints on the intercept or slope), the coefficient of determination stays the same when we switch x with y.
# R squared doesn't change for the least squares line when switching x with y
summary(lm(y ~ x))$r.squared
## [1] 4.091335e-05
summary(lm(x ~ y))$r.squared
## [1] 4.091335e-05
all.equal(summary(lm(y ~ x))$r.squared,
          summary(lm(x ~ y))$r.squared)
## [1] TRUE
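This symmetry is expected: for a simple linear regression with an intercept, R² equals the squared Pearson correlation between x and y, and the correlation does not care which variable is the response. A quick check:
# R squared of the simple regression equals the squared correlation
all.equal(summary(lm(y ~ x))$r.squared, cor(x, y) ^ 2) # expect TRUE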
lm(y ~ 0 + offset(x))
Another way of getting the R squared is to fit a linear model with a fixed slope, as explained here:
lm_1to1 <- lm(y ~ 0 + offset(1*x))
# 1* - indicates that the slope is constrained to 1; can simply be offset(x)
# 0 - indicates that there is no intercept (-1 has the same effect)
# Note, if you need a certain value for the intercept, check https://stackoverflow.com/a/7333292/5193830
summary(lm_1to1) # it is OK to see "No Coefficients" since they are constrained to 0 and 1
##
## Call:
## lm(formula = y ~ 0 + offset(1 * x))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6403 -1.0285 -0.0701 0.8686 3.5844
##
## No Coefficients
##
## Residual standard error: 1.438 on 100 degrees of freedom
summary(lm_1to1)$r.squared # not informative here: the model has no estimated coefficients
## [1] 0
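As an aside to the Stack Overflow link in the comments above, a specific nonzero intercept can be fixed in the same way by folding it into the offset; a minimal sketch, with 0.5 as a purely illustrative value:
# Constrain the intercept to 0.5 and the slope to 1 (0.5 is illustrative only)
lm_fixed <- lm(y ~ 0 + offset(0.5 + 1 * x))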
# Another function to compute R squared
r2 <- function(yy, model) {
SS_res <- sum(model$residuals ^ 2) # residuals are taken from the model
SS_tot <- sum((yy - mean(yy)) ^ 2)
r2 <- 1 - SS_res / SS_tot
return(r2)
}
# This tests that function r2() is working properly
all.equal(summary(lm(y ~ x))$r.squared,
r2(y, lm(y ~ x)))
## [1] TRUE
# This is the R squared for the 1:1 fit.
# The result is identical to what we obtained previously with the r2_1to1() function.
r2(y, lm_1to1)
## [1] -1.032755
all.equal(r2_1to1(x, y), r2(y, lm_1to1))
## [1] TRUE
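The same offset trick reproduces the swapped-axes value obtained earlier with r2_1to1(y, x); a quick check:
# Constrained 1:1 fit with x as the response and y as the offset
lm_1to1_swap <- lm(x ~ 0 + offset(y))
all.equal(r2_1to1(y, x), r2(x, lm_1to1_swap)) # expect TRUE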