R Tutorial #1 – Econ 103

Part 1: Try R Tutorial
Part 2: More Commands
Part 3: Exercises

Part 1: Try R Tutorial

In this section, you’ll be completing levels 1-6 on Try R, an online tutorial for learning R.

Go to http://www.codeschool.com/courses/try-r
Complete levels 1-6. After you complete each level, go back to RStudio and solve the corresponding exercises given in Part 3 below to test your understanding. Save your work in an R script.

For your reference, I have posted a complete transcript of the commands introduced in Try R on the course webpage.

Part 2: More Commands

Try R gives a good foundation for starting out with R, but there are some important commands it doesn’t mention. Once you’ve finished Try R, work through this material. Corresponding exercises are given in Part 3 below. You may also want to read this part before attempting the exercises for Try R Level 6.

Coding Style

Just as in writing, certain style rules improve readability, such as consistent punctuation, spacing, and capitalization (looking at you, E E Cummings!). While coding style differs greatly across languages and across groups, the key is to remain consistent so that your code is easily readable. If you have not developed a coding style yet, I would recommend following Google’s R Style Guide (https://google.github.io/styleguide/Rguide.xml) and adapting it as necessary. Remember: BE CONSISTENT.

Variable Names

You can use nearly anything as a variable name in R. The only rules are:

Must not start with a number or an underscore
Must not include any characters that have a “special meaning” such as * + and so on

In practical terms, this means you should only use letters, numbers, the underscore _ and periods . in your variable names. It’s also good practice not to give variables the same names as existing R commands. For example mean would not be a good choice for a variable name but sampleMean is fine.

[A note for those of you who have programming experience: while R supports object-oriented programming, periods . do not have a special meaning in the language. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.]
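Here is a short sketch of the naming rules above. The variable names are made up for illustration, and make.names is a base R helper (not something Try R covers) that shows how R would repair a name it does not accept.

```r
# Valid names: letters, numbers, underscores and periods, starting with a letter
sampleMean  <- 4.2
sample_mean <- 4.2
sample.mean <- 4.2

# make.names() returns a valid name unchanged and repairs an invalid one
make.names("sampleMean")  # already valid, returned as-is
make.names("2ndSample")   # starts with a number, so R puts an "X" in front
make.names("total*2")     # "*" has a special meaning, so R swaps it for "."
```

Pick one of the three styles in the first block and stick with it.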

Documentation

Google will be your best friend in learning how to code. If you know a command, but don’t remember how it works, you can either use help(command) or ?command to pull up the documentation for the specified command.
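For instance, here is how you might look up the round command. I picked round only as an example; any command name works the same way. The example function is part of base R and runs the sample code at the bottom of a help page.

```r
# Both of these open the help page for round
help(round)
?round

# Run the examples listed at the end of the help page
example(round)
```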

Data Frames and Tables

Try R introduced the concept of data frames to store data. While R’s built-in data frames (“data.frame” is part of base R, not a separate package) can accomplish everything we are going to do in this course, the “data.table” package is optimized for larger data sets (see https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping for some benchmarks comparing “data.table” to “data.frame”). Below is a short intro to the “data.table” syntax so that you will be equipped to handle larger datasets in the future (for a more in-depth tutorial, check out https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html).

We’ll be using the salary data for Chicago city employees to introduce us to the “data.table” syntax.

Install `data.table`

install.packages("data.table")

Reading Data

Instead of using the read.csv or read.table commands, use the fread command (it stands for “fast read”) to load any structured data set. It doesn’t matter whether the file is a .csv, tab-delimited, or uses another common delimiter: fread works out the delimiter on its own.

library(data.table)
salaryData <- fread("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv?accessType=DOWNLOAD")
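If you want to see the delimiter detection without downloading anything, here is a minimal sketch. The two tiny data sets are made up, and I am using fread’s text argument to read from a character vector instead of a file or URL.

```r
library(data.table)

# The same two records, once comma-delimited and once tab-delimited
commaLines <- c("name,score", "Ann,90", "Bob,85")
tabLines   <- c("name\tscore", "Ann\t90", "Bob\t85")

# fread detects the delimiter on its own in both cases
fromComma <- fread(text = commaLines)
fromTab   <- fread(text = tabLines)

# Both calls give the same table with two rows and two columns
all.equal(fromComma, fromTab)
```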

We can rename the columns as we desire (also, shorter names are easier to type).

names(salaryData) <- c("Name", "Position", "Dept", "Salary")
head(salaryData)

##                   Name                 Position             Dept
## 1:     AARON,  ELVIA J         WATER RATE TAKER      WATER MGMNT
## 2:   AARON,  JEFFERY M           POLICE OFFICER           POLICE
## 3:      AARON,  KARINA           POLICE OFFICER           POLICE
## 4: AARON,  KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES
## 5: ABAD JR,  VICENTE M        CIVIL ENGINEER IV      WATER MGMNT
## 6:     ABARCA,  ANABEL     ASST TO THE ALDERMAN     CITY COUNCIL
##        Salary
## 1:  $90744.00
## 2:  $84450.00
## 3:  $84450.00
## 4:  $89880.00
## 5: $106836.00
## 6:  $70764.00
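If you only want to rename one column rather than all of them, data.table’s setnames command is one option. This is a sketch on a made-up table (toyData and its column names are hypothetical), not something we need for the salary data.

```r
library(data.table)

# A small made-up table with unhelpful column names
toyData <- data.table(V1 = c("Ann", "Bob"), V2 = c(90, 85))

# setnames(table, old name, new name) renames the columns in place
setnames(toyData, "V1", "Name")
setnames(toyData, "V2", "Score")

names(toyData)
```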

Unfortunately, because each salary contains a “$”, R reads the column as character strings instead of numbers. To fix that, we need to strip out the dollar signs and convert the column to numeric data (slightly advanced, so it’s okay if you don’t completely follow what’s happening here).

# "$" has a special meaning in search patterns, so it is escaped with \\
salaryData$Salary <- gsub('\\$', '', salaryData$Salary)
# Convert the cleaned-up strings to numbers
salaryData$Salary <- as.numeric(salaryData$Salary)
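To see each step on its own, here is the same clean-up on a three-element vector. The vector name rawSalary is made up; the values are copied from the output above.

```r
rawSalary <- c("$90744.00", "$84450.00", "$106836.00")
class(rawSalary)  # "character"

# Step 1: remove the dollar signs; "\\$" means a literal "$"
noDollar <- gsub("\\$", "", rawSalary)

# Step 2: convert the remaining text to numbers
salary <- as.numeric(noDollar)
class(salary)     # "numeric"

# Now arithmetic works
mean(salary)
```

An alternative to escaping is gsub("$", "", rawSalary, fixed = TRUE), which tells R to treat the pattern as plain text.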

Subsetting Data

Let’s say that we’re only interested in looking at the police department. I’ll make a subset of only the data related to the police department. To get the corresponding subset, we need to figure out which column is the department (it’s “Dept”, since I changed it). The command below returns only the rows for which the condition is true (“==” is the comparison operator that tests for equality).

copSalaries <- salaryData[Dept == "POLICE"]
head(copSalaries)

##                     Name       Position   Dept Salary
## 1:     AARON,  JEFFERY M POLICE OFFICER POLICE  84450
## 2:        AARON,  KARINA POLICE OFFICER POLICE  84450
## 3:      ABBATE,  TERRY M POLICE OFFICER POLICE  90618
## 4:     ABBOTT,  LYNISE M      CLERK III POLICE  46896
## 5:       ABDALLAH,  ZAID POLICE OFFICER POLICE  74028
## 6: ABDELHADI,  ABDALMAHD POLICE OFFICER POLICE  81588
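Here is a small sketch of what the condition inside the square brackets is doing. The table toyStaff and its contents are made up for illustration.

```r
library(data.table)

toyStaff <- data.table(
  Name   = c("Ann", "Bob", "Cal", "Dee"),
  Dept   = c("POLICE", "FIRE", "POLICE", "WATER MGMNT"),
  Salary = c(84450, 91000, 46896, 106836)
)

# The comparison on its own gives one TRUE or FALSE per row
toyStaff[, Dept == "POLICE"]

# Used as a row filter, it keeps only the rows where the result is TRUE
toyPolice <- toyStaff[Dept == "POLICE"]
toyPolice

# Conditions can be combined with & (and) or | (or)
toyStaff[Dept == "POLICE" & Salary > 50000]
```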

Exploring Data

Let’s sort the data by salary, since that’s usually the first thing anyone would do. We have to insert the “-” to sort from highest to lowest; without it, order sorts from lowest to highest.

copSalaries <- copSalaries[order(-Salary)]

The top 5 highest paid people in the CPD are as follows:

head(copSalaries, n = 5)

##                   Name                    Position   Dept Salary
## 1:  ESCALANTE,  JOHN J FIRST DEPUTY SUPERINTENDENT POLICE 197724
## 2:   JOHNSON,  EDDIE T                       CHIEF POLICE 185364
## 3:  RICCIO,  ANTHONY J                       CHIEF POLICE 185364
## 4:      ROY,  EUGENE J                       CHIEF POLICE 185364
## 5: WELCH III,  EDDIE L                       CHIEF POLICE 185364

No surprise there: the first deputy superintendent and the chiefs are the highest paid in the CPD.
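To compare the two sort directions side by side, here is a sketch on a made-up three-row table (toyPay is hypothetical).

```r
library(data.table)

toyPay <- data.table(
  Name   = c("Ann", "Bob", "Cal"),
  Salary = c(84450, 185364, 46896)
)

# Default: lowest salary first
lowToHigh <- toyPay[order(Salary)]

# With the minus sign: highest salary first
highToLow <- toyPay[order(-Salary)]

# The two highest paid people in the toy table
head(highToLow, n = 2)
```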

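How many people make over $100K? To find out, we tell R to count the salary observations that are greater than $100K. The comparison Salary > 100000 gives TRUE or FALSE for each row, and since R treats TRUE as 1 and FALSE as 0, taking the sum counts the TRUE values.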

over100K <- copSalaries[, sum(Salary > 100000)]
print(over100K)

## [1] 1719

Wow! Over 1700 people in the CPD make over $100K!
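The counting trick works on any vector, not just on a data table. Here is a sketch with four made-up salaries.

```r
toySalaries <- c(84450, 106836, 70764, 185364)

# One TRUE or FALSE per salary
isOver100K <- toySalaries > 100000
isOver100K        # FALSE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(isOver100K)   # 2

# mean() gives the share of salaries above the cutoff
mean(isOver100K)  # 0.5
```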

Making Data Tables and Matrices

While you will generally be loading data into R for analysis instead of creating it yourself, there are times when you will want to create small datasets to test your code before you run it on a 1 GB (or larger) data file.

To make a data table, it’s easiest to create the columns that will populate your data table first and then combine them together.

person <- c("Linus", "Snoopy", "Lucy", "Woodstock")
age <- c(5, 8, 6, 2)
weight <- c(40, 25, 50, 1)
myData <- data.table(person, age, weight)
myData

##       person age weight
## 1:     Linus   5     40
## 2:    Snoopy   8     25
## 3:      Lucy   6     50
## 4: Woodstock   2      1
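Once the table exists, it is worth checking that it came out the way you expected. The sketch below rebuilds myData so it runs on its own, then uses dim and sapply (both base R) to check its size and column types before filtering it like the salary data.

```r
library(data.table)

person <- c("Linus", "Snoopy", "Lucy", "Woodstock")
age <- c(5, 8, 6, 2)
weight <- c(40, 25, 50, 1)
myData <- data.table(person, age, weight)

# Number of rows and columns
dim(myData)

# The type of each column: text for person, numbers for age and weight
sapply(myData, class)

# The small table behaves just like the big one
myData[age > 5]
```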

Part 3: Exercises

Basics

Calculate how many minutes there are in a week.
Add up the numbers 1, 8, 4, 2, 9, 4, 8, 5 without using any plus signs.
You’ve forgotten what the function rep does. Load the corresponding help file.
Suppose I ran the following R commands in order. What result would I get? Do not use R to answer this: think it through and then check your answer.

x <- 5
y <- 7
z <- x + y
z + 3 == 15

How can I get R to print out “Go Penn!” thirty times without repeatedly typing this by hand?
Create a vector called x containing the sequence -1, -0.9, …, 0, 0.1, …, 0.9, 1 and then display the result.

Vectors (Harry Potter Style)

Create two vectors: wizards and ranking. The vector wizards should contain the following names: Harry, Ron, Fred, George, Sirius. The vector ranking should contain the following numbers: 4, 2, 5, 1, 3.
Extract the second element of the vector wizards.
Replace the names Fred, George and Sirius in the vector wizards with Hermione, Ginny, and Malfoy, respectively.
Someone who hasn’t read Harry Potter needs labels to determine who these characters are. Assign names to the elements of the vector wizards: Lead, Friend, Friend, Wife, Rival. Display the result.
An avid reader of Harry Potter argues that Malfoy is not Harry’s rival by the end of the series. Change Rival to Ex-Rival.
Make a barplot of the vector ranking. Why can’t you make a barplot of the vector wizards?
Assign the elements of the vector wizards to be the names of the vector ranking. Then re-do the barplot so we can see who’s who.

More Vectors and Charts (Steve’s Personal Finances)

In 2009 Steve’s income was $50,000 and his total expenses were $35,000. In 2010 his income was $52,000 and his expenses were $34,000. In 2011, his income was $52,500 and his expenses were $38,000. Finally, in 2012 Steve’s earnings were $48,000 and his expenses were $40,000. Create three vectors to store this information in parallel: years, income and expenses.
Following on from the previous question, calculate Steve’s annual savings and store this in a vector called savings. Make a scatterplot of years against savings.

Assuming zero interest on bank deposits (roughly accurate at the moment), calculate the total amount that Steve has saved over all the years for which we have data.
Redefine the vector years so that it runs from 2009 to 2013. Redefine income to match but record income for 2013 as NA (this is R’s code for missing data). How can we compute the sum of the elements of this new income vector (ignoring the NA)?

Summary Stats

Twenty-six students took the midterm. Here are their scores: 18, 95, 76, 90, 84, 83, 80, 79, 63, 76, 55, 78, 90, 81, 88, 89, 92, 73, 83, 72, 85, 66, 77, 82, 99, 87. Assign these values to a vector called scores.
Calculate the mean, median and standard deviation of the scores.

Data Tables (Trojan War)

Create three vectors. First, store the numeric values 21, 26, 51, 22, 160, 160, 160 in a vector called age. Next, store the names Achilles, Hector, Priam, Paris, Apollo, Athena, Aphrodite in a character vector called person. Finally, store the words Aggressive, Loyal, Regal, Cowardly, Proud, Wise, Conniving in a vector called description.
Create a data table called trojanWar whose columns contain the vectors from the previous question.
Suppose you wanted to display only the column of trojanWar that contains each person’s description. What command would you use?
What command would you use to show information for Achilles and Hector only?
What command would you use to display the person and description columns for Apollo, Athena and Aphrodite only?
It turns out there was a mistake in the data: Priam’s age should be 72 rather than 51. Make the appropriate change to trojanWar.