OUHSC Statistical Computing User Group
Will Beasley, Dept of Pediatrics,
Biomedical and Behavioral Methodology Core (BBMC)
(Based of the presentation from May 3, 2016)
~
and in MySQL with REGEXP
. And in Oracle with REGEXP_SUBSTR
and REGEXP_LIKE
, and even REGEXP_REPLACE
.LIKE
SQL operator might do everything you need anyway, if you have only a simple comparison.A 'regex' is typically a carefully crafted string that describes a pattern of text. It can:
It's like the big brother of wildcards you match filenames with
(eg, "*.R"
).
Pattern | Matches |
---|---|
mike |
“mike”, “smike”, “miked”, etc. |
mike4 |
“mike4”, “smike4”, etc. |
mike\d |
“mike” followed by any single digit (eg “mike8”, “smike8”) |
mike\d+ |
“mike” followed by one or more digits (eg “mike1234”, “smike8”) |
^mike$ |
only “mike” |
\b19(?=(1|2))(\d{2})\b
and 20\2
converts years in the 1910s and 1920s to the 2010s and 2020s
(but leaves later years as they are).
Today's as language agnostic as possible. SAS, R, and Python examples in Part 2.
Later, consider RegexBuddy for $40.
"\d\w"
) instead of"[:digit:][:alnum:]"
)Start with
`import re`
search()
–the first match.
m = re.search('(?<=a)d', 'ad')
m.group(0)
findall()
–all non-overlapping matches.
match()
–careful, needs to be at the start.
sub()
1916-1918 subscales for a subject
1998-1914 subscales for a subject
subscales for a subject 1998-1920
grep()
and grep(..., value=T)
grepl()
sub()
and gsub()
regexpr()
, gregexpr()
, regexec()
rematch2 is fairly new and become my preference in some scenarios.
My advice:
grep()
, grepl()
, sub()
or gsub()
.regexpr()
, gregexpr()
, & regexec()
)cols <- c("id", "dx1", "dx2", "dx3", "dx4dx")
pattern <- "^(dx)(\\d)$"
grep( pattern, cols)
[1] 2 3 4
grep( pattern, cols, value=T) # equivalent: cols[grep(pattern, cols)]
[1] "dx1" "dx2" "dx3"
grepl(pattern, cols)
[1] FALSE TRUE TRUE TRUE FALSE
sub("dx", "pz", cols)
[1] "id" "pz1" "pz2" "pz3" "pz4dx"
gsub("dx", "pz", cols)
[1] "id" "pz1" "pz2" "pz3" "pz4pz"
rematch2::re_match(cols, pattern)
# A tibble: 5 x 4
`` `` .text .match
<chr> <chr> <chr> <chr>
1 <NA> <NA> id <NA>
2 dx 1 dx1 dx1
3 dx 2 dx2 dx2
4 dx 3 dx3 dx3
5 <NA> <NA> dx4dx <NA>
Named capture group produces column names (e.g., ?<index>
).
pattern_named <- "^(?<type>dx)(?<index>\\d)$"
rematch2::re_match(cols, pattern_named)
# A tibble: 5 x 4
type index .text .match
<chr> <chr> <chr> <chr>
1 <NA> <NA> id <NA>
2 dx 1 dx1 dx1
3 dx 2 dx2 dx2
4 dx 3 dx3 dx3
5 <NA> <NA> dx4dx <NA>
rematch2::re_match(cols, pattern_named)$index
[1] NA "1" "2" "3" NA
library(magrittr)
Named capture group produces column names (e.g., ?<index>
).
pattern_2 <- "^(?<gender>m|f)(?<age>\\d)$"
tibble::tibble(
raw = c("m3", "f4", "f5", "m2")
) %>%
rematch2::bind_re_match(raw, pattern_2)
raw gender age
1 m3 m 3
2 f4 f 4
3 f5 f 5
4 m2 m 2
s <- c("1916-1918 subscales for a subject", "1898-2003 subscales for a subject", "subscales for a subject 1998-1920")
g <- sub("19(1|2)(\\d)", "20\\1\\2", s)
cat(g, sep="\n")
2016-1918 subscales for a subject
1898-2003 subscales for a subject
subscales for a subject 1998-2020
grep("19(1|2)(\\d)", s)
[1] 1 3
grep("19(1|2)(\\d)", s, value=T)
[1] "1916-1918 subscales for a subject" "subscales for a subject 1998-1920"
grepl("19(1|2)(\\d)", s)
[1] TRUE FALSE TRUE
Use the simplest function for the job:
sub()
instead of gsub()
re_match()
instead of re_match_all()
(in rematch2 package)Detailed advice for asking regex questions to people outside of your specific project:
Here's an example of one of my regex questions: https://stackoverflow.com/questions/55270833/regex-to-qualify-n
1234
23
14a
1a3
234
1.39
In the right panel of bottom right panel of regex101.com. Especially these first:
. versus \.
\w and \d and \s (versus \W and \D and \S)
^ and \A
$ and \Z
? and * and + and things like {3,6}
Capturing
Character classes
"CL_ID" = "ClientID"
, "RMSEQ" = "RemovedSequence"
, "RMVL_BEGDT" = "RemovalBeginDate"
, "RMVL_ENDDT" = "RemovalEndDate"
, "END_TYP_CDE" = "EndTypeID"
, "REMOVED_FROM" = "RemovedFromTypeID"
, "CURR_RMVL_TYP" = "RemovalTypeCurrentID"
, "ORIG_RMVL_TYP" = "RemovalTypeOriginalID"
, "FMLY_STRUCTURE" = "FamilyStructureTypeID"
requireNamespace("dplyr", quietly=TRUE) #hadley/dplyr
requireNamespace("lubridate")
requireNamespace("OuhscMunge", quietly=T) #OuhscBbmc/OuhscMunge
9
4
34
3
62
43
1
Time,Gender,Genetype,Treatment,MouseID,OR-Recognition Index,FC-t-F %,FC-b-F %,FC-a-F %
4M,Male,WILD,Control,c9-1,0.32,11.9,0,25.7
4M,Male,WILD,Control,c13-2,0.47,23.7,0,11.
4M,Male,WILD,Prozac,c10-2,0.62,40.7,11.4,51.4
4M,Male,WILD,Prozac,c14-3,0.63,10.2,0,28.6
4M,Male,YFP,Control,c9-2,0.42,42.4,11.4,22.9
4M,Male,YFP,Control,c13-1,0.5,15.3,0,54.1
4M,Male,YFP,Control,c13-nm,1,27.1,0,31.4
4M,Male,YFP,Prozac,c10-1,0.65,20.3,17.1,54.3
(In some cases, you'd have to parse only the cell, not the entire line. But this is good practice.)
Example 3
(,*\s*)"(\w+)"\s+=\s+"(\w+)" and
$1"$3" = "$2"
Example 4
library\((\w+),\s*quietly=(T|TRUE)\) and
library($1)
Example 5
\b(\d)\b and
0$1
Example 6
,c(\d{1,2})-(\d|nm),