In this section, you’ll complete levels 1-6 of Try R, an online tutorial for learning R.
After you complete each level, go back to RStudio and solve the corresponding exercises in Part 4 below to test your understanding. Save your work in an R script.
For your reference, I have posted a complete transcript of the commands introduced in Try R on the course webpage.
Try R gives a good foundation for starting out with R, but there are some important commands it doesn’t mention. Once you’ve finished Try R, work through this material. Corresponding exercises are given in Part 4 below. You may also want to read this part before attempting the exercises for Try R Section 6.
Just like in writing, certain style rules help improve readability, such as using punctuation, spaces, and capitalization (looking at you, E E Cummings!). While coding style differs greatly across languages and across groups, the key is to remain consistent so that your code is easily readable. If you have not developed a coding style yet, I recommend following Google’s R Style Guide (https://google.github.io/styleguide/Rguide.xml) and adapting it as necessary. Remember: BE CONSISTENT.
You can use nearly anything as a variable name in R. The main rule is that a name can’t contain operator symbols such as *, +, and so on. In practical terms, this means you should only use letters, numbers, the underscore (_), and periods (.) in your variable names. It’s also good practice not to give variables the same names as existing R commands. For example, mean would not be a good choice for a variable name, but sampleMean is fine.
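Here is a quick sketch of these naming rules in action. The names in candidates are made up for illustration, and the check relies on base R’s make.names(), which rewrites a string into a syntactically valid name. A string that comes back unchanged was already valid. Note that a name starting with a number also fails, even though it contains only letters and numbers.

```r
# Three valid ways to name the same thing: camel case, underscore, period
sampleMean  <- mean(c(2, 4, 6))
sample_mean <- sampleMean
sample.mean <- sampleMean

# Hypothetical candidate names, some valid and some not
candidates <- c("sampleMean", "sample_mean", "sample.mean",
                "sample-mean", "2ndSample")

# make.names() leaves a valid name unchanged and repairs an invalid one
is_valid <- make.names(candidates) == candidates
print(data.frame(candidates, is_valid))
```

Whichever of the three valid styles you pick, stick with it throughout your script.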
[A note for those of you who have programming experience: although R supports object-oriented programming, the period (.) is not a member-access operator as it is in many other languages. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.]
Google will be your best friend in learning how to code. If you know a command but don’t remember how it works, you can use either help(command) or ?command to pull up the documentation for the specified command.
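As a small illustration, the sketch below looks up the documentation for mean. The last two commands, apropos() and args(), are base R helpers that the text above does not mention. I include them as optional extras for when you only half-remember a command.

```r
# Two equivalent ways to open the help page for mean()
help(mean)
?mean

# If you only remember the start of a name, apropos() lists matching objects
matches <- apropos("^mean")
print(matches)

# args() shows a function's arguments without opening the full help page
args(sd)
```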
Try R introduced the concept of data frames to store data. While base R’s data.frame can accomplish everything we are going to do in this course, the “data.table” package is optimized for larger data sets (see https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping for some benchmarks comparing “data.table” to “data.frame”). Below is a short intro to the “data.table” syntax so that you will be equipped to handle larger data sets in the future (for a more in-depth tutorial, check out https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html).
We’ll be using the salary data for Chicago city employees to introduce us to the “data.table” syntax.
First, install the “data.table” package:
install.packages("data.table")
Instead of using the read.csv or read.table commands, use the fread command (it stands for “fast read”) to load any structured data set. It doesn’t matter whether the file is a .csv, tab-delimited, or uses another common delimiter.
library(data.table)
salaryData <- fread("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv?accessType=DOWNLOAD")
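If you want to try fread without downloading anything, here is a minimal sketch that reads a tiny tab-delimited sample supplied through fread’s text argument. The sample rows are made up for illustration and are not taken from the Chicago data.

```r
library(data.table)

# A made-up, tab-delimited sample passed as text instead of a file path
tabText <- "Name\tDept\tSalary\nLinus\tPOLICE\t$50000.00\nLucy\tFIRE\t$60000.00"

# fread() works out the separator and the header row on its own
sampleData <- fread(text = tabText)
print(sampleData)
```

Notice that nothing in the call says the data is tab-delimited. The same call would work for comma-separated text.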
We can rename the columns as we desire (also, shorter names are easier to type).
names(salaryData) <- c("Name", "Position", "Dept", "Salary")
head(salaryData)
##                  Name                 Position             Dept     Salary
## 1:     AARON, ELVIA J         WATER RATE TAKER      WATER MGMNT  $90744.00
## 2:   AARON, JEFFERY M           POLICE OFFICER           POLICE  $84450.00
## 3:      AARON, KARINA           POLICE OFFICER           POLICE  $84450.00
## 4: AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES  $89880.00
## 5: ABAD JR, VICENTE M        CIVIL ENGINEER IV      WATER MGMNT $106836.00
## 6:     ABARCA, ANABEL     ASST TO THE ALDERMAN     CITY COUNCIL  $70764.00
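Assigning to names() works fine. As an aside, “data.table” also provides setnames(), which renames columns in place and lets you rename a single column by its old name. The sketch below uses a made-up two-column table, not the salary data.

```r
library(data.table)

# Hypothetical table with long column names
toy <- data.table(`Employee Name` = c("Linus", "Lucy"),
                  `Annual Salary` = c(50000, 60000))

# Rename every column at once, in order
setnames(toy, c("Name", "Salary"))

# Or rename just one column by referring to its old name
setnames(toy, old = "Salary", new = "Pay")
print(names(toy))
```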
Unfortunately, since the salary data has “$”, R thinks it is a character string instead of numbers. To get rid of that, we need to strip out the dollar signs and convert it to numeric data (slightly advanced, so it’s okay if you don’t completely follow what’s happening here).
salaryData$Salary <- gsub('\\$', '', salaryData$Salary)
salaryData$Salary <- as.numeric(salaryData$Salary)
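In case the '\\$' looks mysterious: in a pattern, a bare $ means “end of the string”, so it has to be escaped to match a literal dollar sign. The sketch below shows an alternative that avoids the escaping with fixed = TRUE, and does the cleanup in one step using the “data.table” := operator. The toy table is made up for illustration.

```r
library(data.table)

# Made-up salaries stored as text, dollar signs included
toy <- data.table(Name   = c("Linus", "Lucy"),
                  Salary = c("$50000.00", "$106836.00"))

# fixed = TRUE treats "$" as a literal character, so no escaping is needed;
# := overwrites the Salary column in place
toy[, Salary := as.numeric(gsub("$", "", Salary, fixed = TRUE))]
print(toy)
```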
Let’s say that we’re only interested in the police department, so I’ll make a subset containing only those rows. To do this, we need to know which column holds the department (it’s “Dept”, since I renamed it). The command below returns only the rows for which the condition is true (“==” is the logical operator that tests for equality).
copSalaries <- salaryData[Dept == "POLICE"]
head(copSalaries)
##                    Name       Position   Dept Salary
## 1:     AARON, JEFFERY M POLICE OFFICER POLICE  84450
## 2:        AARON, KARINA POLICE OFFICER POLICE  84450
## 3:      ABBATE, TERRY M POLICE OFFICER POLICE  90618
## 4:     ABBOTT, LYNISE M      CLERK III POLICE  46896
## 5:       ABDALLAH, ZAID POLICE OFFICER POLICE  74028
## 6: ABDELHADI, ABDALMAHD POLICE OFFICER POLICE  81588
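The same row-filtering idea extends to more than one condition. Here is a small sketch on a made-up table showing & (and) and %in% (matches any of several values). Neither operator appears in the example above, so treat this as an optional extension.

```r
library(data.table)

# Made-up employees for illustration
toy <- data.table(Name   = c("Linus", "Snoopy", "Lucy", "Woodstock"),
                  Dept   = c("POLICE", "POLICE", "FIRE", "WATER MGMNT"),
                  Salary = c(84450, 46896, 90000, 70000))

# One condition: rows where Dept equals "POLICE"
policeOnly <- toy[Dept == "POLICE"]

# Two conditions that must both hold
wellPaidPolice <- toy[Dept == "POLICE" & Salary > 80000]

# Rows whose Dept is any one of several values
policeOrFire <- toy[Dept %in% c("POLICE", "FIRE")]

print(wellPaidPolice)
```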
Let’s sort the data by salary since that’s usually the first thing anyone would do. We have to insert the “-” to sort from highest to lowest; otherwise it defaults to sorting from lowest to highest.
copSalaries <- copSalaries[order(-Salary)]
The top 5 highest paid people in the CPD are as follows:
head(copSalaries, n = 5)
##                  Name                    Position   Dept Salary
## 1:  ESCALANTE, JOHN J FIRST DEPUTY SUPERINTENDENT POLICE 197724
## 2:   JOHNSON, EDDIE T                       CHIEF POLICE 185364
## 3:  RICCIO, ANTHONY J                       CHIEF POLICE 185364
## 4:      ROY, EUGENE J                       CHIEF POLICE 185364
## 5: WELCH III, EDDIE L                       CHIEF POLICE 185364
No surprise there: the first deputy superintendent and the police chiefs are the highest paid in the CPD.
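For reference, here is a sketch of a few sorting variations on a made-up table: sorting by more than one column, and setorder(), a “data.table” function that sorts the table in place rather than making a sorted copy.

```r
library(data.table)

# Made-up employees for illustration
toy <- data.table(Name   = c("Linus", "Snoopy", "Lucy", "Woodstock"),
                  Dept   = c("POLICE", "POLICE", "FIRE", "WATER MGMNT"),
                  Salary = c(84450, 46896, 90000, 70000))

# Highest salary first (the "-" reverses the default low-to-high order)
byPay <- toy[order(-Salary)]

# Alphabetical by department, then highest salary first within each one
byDeptThenPay <- toy[order(Dept, -Salary)]

# setorder() rearranges toy itself, here from lowest to highest salary
setorder(toy, Salary)
print(toy)
```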
How many people make over $100K? The comparison Salary > 100000 is either TRUE or FALSE for each row, and since R counts TRUE as 1 and FALSE as 0, taking the sum gives the number of salaries greater than $100K.
over100K <- copSalaries[, sum(Salary > 100000)]
print(over100K)
## [1] 1719
Wow! Over 1700 people in the CPD make over $100K!
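To see why summing a comparison counts things, here is a minimal sketch with made-up salaries. It also shows two related tricks that are not used above: .N, which is the “data.table” way of counting rows, and mean(), which turns the same comparison into a proportion.

```r
library(data.table)

# Made-up salaries for illustration
toy <- data.table(Salary = c(84450, 46896, 120000, 185364))

# The comparison gives one TRUE or FALSE per row
print(toy$Salary > 100000)

# TRUE counts as 1 and FALSE as 0, so the sum is a count
countOver <- toy[, sum(Salary > 100000)]

# Same count: keep the matching rows, then ask how many there are
countOverN <- toy[Salary > 100000, .N]

# The mean of the comparison is the share of rows that match
shareOver <- toy[, mean(Salary > 100000)]

print(c(countOver, countOverN, shareOver))
```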
While you will generally be loading data into R for analysis instead of creating it yourself, there are times when you will want to create small data sets to test your code before you run it on a 1GB (or larger) data file.
To make a data table, it’s easiest to create the columns that will populate your data table first and then combine them together.
person <- c("Linus", "Snoopy", "Lucy", "Woodstock")
age <- c(5, 8, 6, 2)
weight <- c(40, 25, 50, 1)
myData <- data.table(person, age, weight)
myData
##       person age weight
## 1:     Linus   5     40
## 2:    Snoopy   8     25
## 3:      Lucy   6     50
## 4: Woodstock   2      1
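If you prefer, you can also name the columns directly inside the call to data.table instead of creating the vectors first. The sketch below builds the same table that way and then uses str() to check that each column has the type you expect.

```r
library(data.table)

# Same data as above, with the columns named inside the call
myData <- data.table(person = c("Linus", "Snoopy", "Lucy", "Woodstock"),
                     age    = c(5, 8, 6, 2),
                     weight = c(40, 25, 50, 1))

# str() gives a compact summary of each column and its type
str(myData)

# A quick test calculation on the small table
averageAge <- myData[, mean(age)]
print(averageAge)
```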
Calculate how many minutes there are in a week.
Add up the numbers 1, 8, 4, 2, 9, 4, 8, 5 without using any plus signs.
You’ve forgotten what the function rep does. Load the corresponding help file.
x <- 5
y <- 7
z <- x + y
z + 3 == 15
How can I get R to print out “Go Penn!” thirty times without repeatedly typing this by hand?
Create a vector called x containing the sequence -1, -0.9, …, 0, 0.1, …, 0.9, 1 and then display the result.
Create two vectors: wizards and ranking. The vector wizards should contain the following names: Harry, Ron, Fred, George, Sirius. The vector ranking should contain the following numbers: 4, 2, 5, 1, 3.
Extract the second element of the vector wizards.
Replace the names Fred, George and Sirius in the vector wizards with Hermione, Ginny, and Malfoy, respectively.
Someone who hasn’t read Harry Potter needs labels to determine who these characters are. Assign names to the elements of the vector wizards: Lead, Friend, Friend, Wife, Rival. Display the result.
An avid reader of Harry Potter argues that Malfoy is not Harry’s rival by the end of the series. Change Rival to Ex-Rival.
Make a barplot of the vector ranking. Why can’t you make a barplot of the vector wizards?
Assign the elements of the vector wizards to be the names of the vector ranking. Then re-do the barplot so we can see who’s who.
In 2009 Steve’s income was $50,000 and his total expenses were $35,000. In 2010 his income was $52,000 and his expenses were $34,000. In 2011, his income was $52,500 and his expenses were $38,000. Finally, in 2012 Steve’s earnings were $48,000 and his expenses were $40,000. Create three vectors to store this information in parallel: years, income and expenses.
Following on from the previous question, calculate Steve’s annual savings and store this in a vector called savings. Make a scatterplot of years against savings.
Assuming zero interest on bank deposits (roughly accurate at the moment), calculate the total amount that Steve has saved over all the years for which we have data.
Redefine the vector years so that it runs from 2009-2013. Redefine income to match but record income for 2013 as NA (this is R’s code for missing data). How can we compute the sum of the elements of this new income vector (ignoring the NA)?
Twenty-six students took the midterm. Here are their scores: 18, 95, 76, 90, 84, 83, 80, 79, 63, 76, 55, 78, 90, 81, 88, 89, 92, 73, 83, 72, 85, 66, 77, 82, 99, 87. Assign these values to a vector called scores.
Calculate the mean, median and standard deviation of the scores.
Create three vectors. First store the numeric values 21, 26, 51, 22, 160, 160, 160 in a vector called age. Next, store the names Achilles, Hector, Priam, Paris, Apollo, Athena, Aphrodite in a character vector called person. Finally store the words Aggressive, Loyal, Regal, Cowardly, Proud, Wise, Conniving in a vector called description.
Create a data table called trojanWar whose columns contain the vectors from the previous question.
Suppose you wanted to display only the column of trojanWar that contains each person’s description. What command would you use?
What command would you use to show information for Achilles and Hector only?
What command would you use to display the person and description columns for Apollo, Athena and Aphrodite only?
It turns out there was a mistake in the data: Priam’s age should be 72 rather than 51. Make the appropriate change to trojanWar.