Many data sets contain character strings. For example, in a file of tweets from Twitter (which are essentially just strings of characters), you might want to search for occurrences of a certain word or Twitter handle. Or a character variable in a data set might be a location with a city and state abbreviation, and you want to extract those observations whose location contains “NY.”
In this tutorial, you will learn how to manipulate text data using the package stringr and how to match patterns using regular expressions. Some of the commands include:
Command | Description |
---|---|
str_sub | Extract substring from a given start to end position |
str_detect | Detect presence/absence of a pattern in a string |
str_locate | Give position (start, end) of first occurrence of substring |
str_locate_all | Give positions of all occurrences of a substring |
str_replace | Replace the first occurrence of one substring with another |
We introduce some basic commands from stringr.
The str_sub command extracts substrings from a string (that is, a sequence of characters) given the starting and ending positions. For instance, to extract the characters in the second through fourth positions of each string in fruits:
library(stringr)
fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")
str_sub(fruits, 2, 4)
## [1] "ppl" "ine" "ear" "ran" "eac" "ana"
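As a small aside, str_sub also accepts negative positions, which count backward from the end of each string. The sketch below reuses the fruits vector from above; the variable name last_three is only illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# Negative positions count from the end: -1 is the last character,
# so (-3, -1) selects the final three characters of each string
last_three <- str_sub(fruits, -3, -1)
last_three
## [1] "ple" "ple" "ear" "nge" "ach" "ana"
```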
Question 1 What are the characters in the first through third position of each string in fruits?
The str_detect command checks whether any instance of a pattern occurs in a string.
str_detect(fruits, "p") #any occurrence of 'p'?
## [1] TRUE TRUE FALSE FALSE TRUE FALSE
Note that pattern matching is case-sensitive.
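If case should not matter, one option is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE. This is a minimal sketch on the same fruits vector; the name any_p is illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# regex() lets us attach matching options to a pattern;
# ignore_case = TRUE makes "p" match both "p" and "P"
any_p <- str_detect(fruits, regex("p", ignore_case = TRUE))
any_p
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
```

Compared with the case-sensitive result, only "Pear" changes from FALSE to TRUE.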
To locate the position of a pattern within a string, use str_locate:
str_locate(fruits, "an")
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] 3 4
## [5,] NA NA
## [6,] 2 3
Only the fourth and sixth fruits contain “an.” In the case of “banana,” note that only the first occurrence of “an” is returned.
To find all instances of “an” within each string:
str_locate_all(fruits,"an")
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
##
## [[4]]
## start end
## [1,] 3 4
##
## [[5]]
## start end
##
## [[6]]
## start end
## [1,] 2 3
## [2,] 4 5
Remark
The command str_locate_all returns a list, with one matrix of positions for each string.
out <- str_locate_all(fruits, "an")
data.class(out)
## [1] "list"
out[[6]]
## start end
## [1,] 2 3
## [2,] 4 5
unlist(out)
## [1] 3 4 2 4 3 5
length(unlist(out))/2 #total number of times "an" occurs in vector fruits
## [1] 3
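The same total can probably be reached more directly with str_count, which returns the number of matches in each string. This is an alternative sketch rather than the approach used above; the name counts is illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# str_count gives the number of (non-overlapping) matches per string
counts <- str_count(fruits, "an")
counts
## [1] 0 0 0 1 0 2

# Summing the per-string counts gives the total over the whole vector
sum(counts)
## [1] 3
```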
Now suppose we want to detect or locate words that begin with “p” or end in “e,” or that match more complex criteria. A regular expression is a sequence of characters that defines a pattern.
Let’s detect strings that begin with either “p” or “P”. The metacharacter “^” is used to indicate the beginning of the string, and “[Pp]” is used to indicate “P” or “p”.
str_detect(fruits, "^[Pp]")
## [1] FALSE TRUE TRUE FALSE TRUE FALSE
Similarly, the metacharacter “$” is used to signify the end of a string.
str_detect(fruits, "e$" ) #end in 'e'
## [1] TRUE TRUE FALSE TRUE FALSE FALSE
The following metacharacters have special meanings and so are reserved:
* \ + $ { } [ ] ( ) | ^ ? .
For instance, a period matches any single character:
gr.y matches gray, grey, gr9y, grEy, etc.
and * indicates 0 or more instances of the preceding character:
xy*z matches xz, xyz, xyyz, xyyyz, xyyyyz, etc.
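These two claims can be checked with str_detect. The test words below (including the non-matching "gry" and "xaz") are made up for illustration.

```r
library(stringr)

# "." must match exactly one character, so "gry" (nothing between r and y) fails
dot_words <- c("gray", "grey", "gr9y", "grEy", "gry")
dot_match <- str_detect(dot_words, "gr.y")
dot_match
## [1] TRUE TRUE TRUE TRUE FALSE

# "y*" allows zero or more y's between x and z, but no other character
star_words <- c("xz", "xyz", "xyyz", "xaz")
star_match <- str_detect(star_words, "xy*z")
star_match
## [1] TRUE TRUE TRUE FALSE
```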
To detect the letter “a” followed by 0 or more occurrences of “p”, type:
str_detect(fruits, "ap*")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Compare this to
str_detect(fruits, "ap+")
## [1] TRUE TRUE FALSE FALSE FALSE FALSE
The “+” after the “p” indicates that we want one or more occurrences of “p.”
Here is a more complex pattern:
str_detect(fruits, "^a(.*)e$")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
The anchors ^ and $ indicate that we want strings that begin with the letter a and end with e. The (.*) indicates that we want to match 0 or more occurrences of any character. The parentheses group parts of the pattern, which also helps readability.
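Parentheses also capture the text they match. As a sketch, stringr's str_match returns a matrix whose first column is the full match and whose remaining columns hold each parenthesized group; the name m is illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# Column 1: the whole match; column 2: what (.*) captured
m <- str_match(fruits, "^a(.*)e$")

# Only "apple" matches, and the group holds the characters between "a" and "e"
m[1, ]
## [1] "apple" "ppl"
```

Rows for strings that do not match are filled with NA.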
Suppose we want to extract 10-digit United States phone numbers from a text file.
a1 <- "Home: 507-645-5489"
a2 <- "Cell: 219.917.9871"
a3 <- "My work phone is 507-202-2332"
a4 <- "I don't have a phone"
info <- c(a1, a2, a3, a4)
info
## [1] "Home: 507-645-5489" "Cell: 219.917.9871"
## [3] "My work phone is 507-202-2332" "I don't have a phone"
We will now extract just the phone numbers from these strings.
The area code must start with a digit from 2 to 9, so we use brackets again to indicate a range: [2-9]. The next two digits can each be between 0 and 9, so we write [0-9]{2}, where {2} requests exactly two repetitions. For the separator, we use [-.] to indicate either a dash or a period. The remaining groups of three and four digits are written [0-9]{3} and [0-9]{4}. The complete regular expression is given below:
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"
out <- str_detect(info, phone)
out
## [1] TRUE TRUE TRUE FALSE
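Again, str_detect just indicates the presence or absence of the pattern in question. To pull out the matching text itself, use str_extract, which returns the first match in each string (or NA if there is none):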
str_extract(info, phone)
## [1] "507-645-5489" "219.917.9871" "507-202-2332" NA
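Because the phone pattern puts each part of the number in parentheses, the individual parts can also be retrieved. The following sketch uses str_match to pull out only the area codes; the names parts and area are illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# Columns: full match, then the three parenthesized groups in order
parts <- str_match(info, phone)

# The first group (column 2) is the area code
area <- parts[, 2]
area
## [1] "507" "219" "507" NA
```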
Let’s anonymize the phone numbers!
str_replace(info, phone, "XXX-XXX-XXXX")
## [1] "Home: XXX-XXX-XXXX" "Cell: XXX-XXX-XXXX"
## [3] "My work phone is XXX-XXX-XXXX" "I don't have a phone"
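A partial mask is also possible. In a stringr replacement string, \\1, \\2, and \\3 refer back to the text captured by the first, second, and third parenthesized groups. The sketch below keeps only the last four digits; the name masked is illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# \\3 inserts whatever the third group ([0-9]{4}) matched
masked <- str_replace(info, phone, "XXX-XXX-\\3")
masked
## [1] "Home: XXX-XXX-5489" "Cell: XXX-XXX-9871"
## [3] "My work phone is XXX-XXX-2332" "I don't have a phone"
```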
Remarks
str_locate(info, "[.]") #find first instance of period
## start end
## [1,] NA NA
## [2,] 10 10
## [3,] NA NA
## [4,] NA NA
str_locate(info, "\\.") #same
## start end
## [1,] NA NA
## [2,] 10 10
## [3,] NA NA
## [4,] NA NA
str_locate(info, ".") #first instance of any character
## start end
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 1 1
str_detect(fruits, "^[Pp]") #starts with 'P' or 'p'
## [1] FALSE TRUE TRUE FALSE TRUE FALSE
str_detect(fruits, "[^Pp]") #any character except 'P' or 'p'
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "^[^Pp]") #start with any character except 'P' or 'p'
## [1] TRUE FALSE FALSE TRUE FALSE TRUE
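Returning to the period examples in the remarks above: besides "[.]" and "\\.", a third way to match a literal period is stringr's fixed(), which treats the pattern as plain text rather than as a regular expression. This is a minimal sketch; the name loc is illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")

# fixed() switches off regular-expression interpretation,
# so "." means a literal period here
loc <- str_locate(info, fixed("."))
loc
## start end
## [1,] NA NA
## [2,] 10 10
## [3,] NA NA
## [4,] NA NA
```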
Create a vector veggies containing “carrot”, “bean”, “peas”, “cabbage”, “scallion”, and “asparagus”.
Find those strings that contain the pattern “ea”.
Find those strings that end in “s”.
Find those strings that contain at least two “a”’s.
Find those strings that begin with any letter except “c”.
Find the starting and ending position of the pattern “ca” in each string.
The regular expression "^[Ss](.*)(t+)(.+)(t+)" matches “scuttlebutt”, “Stetson”, and “Scattter”, but not “Scatter.” Why?