R: To use R, navigate your browser to cran.r-project.org.¹ Download and install the version for your operating system. You’re ready to go.
RStudio: Most R users interact with R through an (amazing) IDE called “RStudio”. Navigate to https://www.rstudio.com/products/rstudio/ and download the desktop IDE. Now you’re really ready.
Relative to Stata, R introduces a few new dimensions:
Let’s review these differences in more depth.
Free to use. Free to update and upgrade. Free to disseminate. Free for your students to install on their own laptops.
You know the old joke about an economist being told there is a $100 bill lying on the sidewalk? (“Impossible! Someone would have picked it up already.”) Now think about the crazy license fees for proprietary econometrics and modelling software. You can see where this is going…
You might have heard or read something along the lines of: “In R, everything has a name and everything is an object”. This probably sounds very abstract if you’re coming from a language like Stata. However, the key practical implications of this so-called object-oriented (OO) approach are as follows:
- Assignment: you create an object by assigning a value to a name, e.g. `a <- 3` (i.e. the object `a` has been assigned as a scalar, or single-length vector, equal to 3) and `b <- matrix(1:4, nrow = 2)` (i.e. the object `b` has been assigned as a 2x2 matrix).
- The `<-` assignment operator is read aloud as “gets”. You can also use a regular old equal sign if you prefer, e.g. `a = 3`.
- Object types: a `matrix` is a bit different from a `data.frame` or a vector.
All of this might sound simple (and it is!), but one aspect of the OO approach that can trip up new R users (especially those coming from Stata) is that you have to be specific about which object you are referring to.
In practice, you point to a variable inside a data frame with the `$` index operator: `dataframe1$wage`.
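To make the "be specific about which object" point concrete, here is a minimal sketch. The data frames `dataframe1` and `dataframe2` and their values are invented for illustration. Both objects hold a column called `wage`, so R needs to be told which one you mean.

```r
# Hypothetical data frames, invented purely for illustration
dataframe1 <- data.frame(wage = c(12.5, 18.0, 22.3), education = c(10, 12, 16))
dataframe2 <- data.frame(wage = c(30.1, 41.7), education = c(16, 18))

# Both objects contain a column named "wage", so we name the data frame
# and use the $ index operator to pick the column we want
mean_wage1 <- mean(dataframe1$wage)
mean_wage2 <- mean(dataframe2$wage)

# Every object has a class, which determines how functions treat it
class_df  <- class(dataframe1)   # "data.frame"
class_num <- class(3)            # "numeric"
is_mat    <- is.matrix(matrix(1:4, nrow = 2))
```

Unlike Stata, where one dataset is usually in memory at a time, any number of data frames can coexist in an R session, which is why the explicit reference matters.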
Similar to LaTeX (i.e., `\usepackage{foo}`), R also draws upon non-default packages (i.e., `library(foo)`).
R comes with a `base` installation, which includes the most commonly used packages and functions across all use cases (core probability and statistical operations, linear regression functions, etc.).
- Install a package: `install.packages("package.name")`. Use the `update.packages()` command to update all of your installed packages at once (see below).
- Load a package: `library(package.name)`
- Update packages: `update.packages(ask=FALSE)` (a single package can also be updated by reinstalling it with `install.packages("package.name")`)
If you don’t feel like typing in these commands manually, one of the many advantages of the RStudio IDE is that it makes installing and updating packages very easy (autocompletion, package search, etc.). Just click on the “Packages” tab of the bottom-right panel.
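If you would rather script the install-and-load step, one common pattern is to install a package only when it is missing. The sketch below is an assumption about how you might arrange this, not a required workflow. It uses `stats` (which ships with R) as a stand-in package name so that nothing is actually downloaded.

```r
# "stats" ships with every R installation; it stands in for any package name
pkg <- "stats"

# Install only if the package is not already available on this machine
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# library() normally takes an unquoted name; character.only = TRUE
# tells it to treat its argument as a string holding the name
library(pkg, character.only = TRUE)

# Confirm that the package is now attached to the search path
pkg_loaded <- paste0("package:", pkg) %in% search()
```

Swapping in another package name on the first line is all that is needed to reuse the pattern.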
R is friendly and tries to help if you weren’t specific enough. Consider the following hypothetical OLS regression, where `lm()` is just the workhorse function for linear models in R:
lm(wage ~ education + gender, data = dataframe1)
Here, we could use a string variable like `gender` (which takes values like "female" and "male") directly in our regression call. R knows what you mean: you want indicator variables for the levels of the variable.⁴
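A small sketch of this behaviour, using made-up data (the values in `dataframe1` below are invented for illustration). The string column enters the formula directly, and R builds the indicator variable itself, using the alphabetically first level as the reference category.

```r
# Hypothetical data, invented for illustration only
dataframe1 <- data.frame(
  wage      = c(15, 18, 22, 25, 17, 30),
  education = c(10, 12, 14, 16, 11, 18),
  gender    = c("female", "male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# The string variable goes straight into the formula
fit <- lm(wage ~ education + gender, data = dataframe1)
coef_names <- names(coef(fit))   # "(Intercept)" "education" "gendermale"

# Doing the conversion explicitly yields the same estimates
dataframe1$gender_f <- factor(dataframe1$gender, levels = c("female", "male"))
fit_explicit <- lm(wage ~ education + gender_f, data = dataframe1)
```

Setting the factor levels yourself is the way to choose a different reference category.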
Mostly, this is a good thing, but sometimes R’s desire to help can hide programming mistakes and idiosyncrasies. So it’s best to be aware, e.g.:
## [1] 2
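One well-known instance of this helpfulness is implicit type coercion. The sketch below is an illustration chosen for that purpose, and the object names are hypothetical. R silently converts values to a common type, which is convenient for counting but can also let a mistake pass without any warning.

```r
# Logical values are coerced to numbers in arithmetic (TRUE is 1, FALSE is 0)
two <- TRUE + TRUE

# This is handy: summing a logical vector counts the TRUE values
n_above <- sum(c(3, 8, 12) > 5)

# But the same eagerness can hide a slip: one stray string turns
# the whole vector into character, with no warning at all
mixed <- c(1, 2, "3")
mixed_is_character <- is.character(mixed)
```

Checking `class()` or `str()` on an object before using it is a cheap way to catch the last case.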
R easily (and infinitely) parallelizes
Parallelization in R is easily done thanks to various packages like `parallel`, `pbapply`, `future`, and `foreach`.
Let’s illustrate by way of a simulation. First we’ll create some data (`our_data`) and a function (`our_reg`), which draws a sample of 10,000 observations and runs a regression.
# Set our seed
set.seed(12345)
# Set sample size
n <- 1e6
# Generate 'x' and 'e'
our_data <- data.frame(x = rnorm(n), e = rnorm(n))
# Calculate 'y'
our_data$y <- 3 + 2 * our_data$x + our_data$e
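# Function that draws a sample of 10,000 observations and runs a regression
# (the argument i is only an iteration index; it is not used in the body)
our_reg <- function(i) {
  # Sample the data (with replacement)
  sample_data <- our_data[sample.int(n = n, size = 1e4, replace = TRUE), ]
  # Run the regression and return the slope coefficient on x
  coef(lm(y ~ x, data = sample_data))[2]
}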
With our data and function created, let’s run the simulation without parallelization:
library(tictoc) ## For convenient timing
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
sim1 <- lapply(X = 1:1e4, FUN = our_reg)
toc()
## 73.576 sec elapsed
Now run the simulation with parallelization (12 cores):
library(pbapply) ## Adds progress bar and parallel options
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
sim2 <- pblapply(X = 1:1e4, FUN = our_reg, cl = 12)
toc()
## 18.125 sec elapsed
Not only was this about four times faster⁵, but notice how little the syntax changed to run the parallel version. The only differences in the call are the `pb` prefix and the `cl = 12` argument: `pblapply(X = 1:1e4, FUN = our_reg, cl = 12)`.
Here’s another parallel option just to drive home the point. (In R, there are almost always multiple ways to get a particular job done.)
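library(future.apply) ## Another option.
plan(multisession) ## plan(multiprocess) is deprecated in newer releases of future
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
## future.seed = TRUE gives each iteration a reproducible random-number stream
sim3 <- future_lapply(X = 1:1e4, FUN = our_reg, future.seed = TRUE)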
toc()
## 17.942 sec elapsed
Further, many packages in R default (or have options) to work in parallel. E.g., the regression package `lfe` uses the available processing power to estimate fixed-effect models.
Again, all of this extra parallelization functionality comes for free. In contrast, have you looked up the cost of a Stata/MP license recently? (Never mind that you effectively pay per core!)
Note: This parallelization often means that you move away from `for` loops and toward parallelized replacements (e.g., `lapply` has many parallelized implementations).⁶
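The move from a `for` loop to `lapply` to a parallel replacement can be sketched as follows. The function `square` is a hypothetical stand-in for a slow task, and the two-worker cluster is an arbitrary choice. The `parallel` package ships with R.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical stand-in for a slow task
square <- function(i) i^2

# 1. A for loop, filling a pre-allocated list
res_loop <- vector("list", 8)
for (i in 1:8) res_loop[[i]] <- square(i)

# 2. The same work written with lapply
res_lapply <- lapply(1:8, square)

# 3. A parallel replacement: same arguments, plus a cluster of workers
cl <- makeCluster(2)
res_par <- parLapply(cl, 1:8, square)
stopCluster(cl)

# All three approaches return the same list
same_results <- identical(res_loop, res_lapply) && identical(res_lapply, res_par)
```

For a task this small the cluster start-up cost dominates, so the parallel version only pays off when each iteration does real work.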
Because R began its life as a statistical language/environment, it plays very nicely with matrices.
Create a matrix:
## The "c()" stands for "concatenate" and is used to bind together a sequence of numbers or strings.
matrix(data = c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)
## [,1] [,2] [,3]
## [1,] 3 5 3
## [2,] 2 9 2
## [3,] 3 4 7
Assign (store) a matrix:
Invert a matrix:
## [,1] [,2] [,3]
## [1,] 0.8088235 -0.33823529 -0.25
## [2,] -0.1176471 0.17647059 0.00
## [3,] -0.2794118 0.04411765 0.25
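The output above is the inverse of the matrix created earlier. A minimal sketch of the assign and invert steps follows. The names `A` and `A_inv` are hypothetical, and `solve()` is base R's function for inverting a square matrix.

```r
# Assign (store) the matrix; 'A' is a hypothetical name
A <- matrix(data = c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)

# Invert it
A_inv <- solve(A)

# Check: a matrix times its inverse is the identity (up to rounding error).
# Note that %*% is matrix multiplication, while * multiplies element by element.
check <- A %*% A_inv
is_identity <- isTRUE(all.equal(check, diag(3)))
```

Transposition works similarly with `t(A)`.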
Notebooks, websites, presentations, etc. can all easily include:
code chunks,
## [1] 32
evaluated code,
## [1] TRUE
normal or mathematical text,
\[\left(\text{e.g., }\dfrac{x^2}{3}\right)\]
and even interactive content like `leaflet` maps.
library(leaflet)
leaflet() %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addMarkers(lng = -123.075, lat = 44.045, popup = "The University of Oregon")
Yes, Stata 15 has some Markdown support, but the difference in functionality is pretty stark.
Now that you (hopefully) have a better sense of R, let’s head over to the regression intro section to try some hands-on examples.
¹ CRAN stands for Comprehensive R Archive Network. It is the central repository for downloading R itself and (vetted) packages.
² If you want to get really meta: the `pacman` package helps you… manage packages.
³ R uses both single (`'word'`) and double quotes (`"word"`) to reference characters (strings).
⁴ Variables in R that have different qualitative levels are known as “factors”. Behind the scenes, R is converting `gender` from a string to a factor for you, although you can also do this explicitly yourself.
⁵ It’s not a full 12 times faster because of the overhead needed to run this code in parallel (among other things). Since this overhead is largely a sunk cost, the relative speed-up will improve as we increase the number of iterations.