OUHSC Statistical Computing User Group
Will Beasley, Dept of Pediatrics,
Biomedical and Behavioral Methodology Core (BBMC)
~
and in MySQL with REGEXP
. And in Oracle with REGEXP_SUBSTR
and REGEXP_LIKE
, and even REGEXP_REPLACE
.LIKE
SQL operator might do everything you need anyway.A 'regex' is typically a carefully crafted string that describes a pattern of text. It can:
It's like the big brother of wildcards you match filenames with
(eg, "*.R"
).
Pattern | Matches |
---|---|
mike |
“mike”, “smike”, “miked”, etc. |
mike4 |
“mike4”, “smike4”, etc. |
mike\d |
“mike” followed by any single digit (eg “mike8”, “smike8”) |
mike\d+ |
“mike” followed by one or more digits (eg “mike1234”, “smike8”) |
^mike$ |
only “mike” |
\b19(?=(1|2))(\d{2})\b
and 20\2
converts years in the 1910s and 1920s to the 2010s and 2020s
(but leaves later years as they are).
Today's as language agnostic as possible. SAS, R, and Python examples in Part 2.
Later, consider RegexBuddy for $40.
"\d\w"
) instead of "[:digit:][:alnum:]"
)grep()
and grep(..., value=T)
grepl()
sub()
and gsub()
regexpr()
, gregexpr()
, regexec()
Start with
`import re`
search()
–the first match.
m = re.search('(?<=a)d', 'ad')
m.group(0)
findall()
–all non-overlapping matches.
match()
–careful, needs to be at the start.
sub()
1916-1918 subscales for a subject
1998-1914 subscales for a subject
subscales for a subject 1998-1920
s <- c("1916-1918 subscales for a subject", "1898-2003 subscales for a subject", "subscales for a subject 1998-1920")
g <- sub("19(1|2)(\\d)", "20\\1\\2", s)
cat(g, sep="\n")
2016-1918 subscales for a subject
1898-2003 subscales for a subject
subscales for a subject 1998-2020
grep("19(1|2)(\\d)", s)
[1] 1 3
grep("19(1|2)(\\d)", s, value=T)
[1] "1916-1918 subscales for a subject" "subscales for a subject 1998-1920"
grepl("19(1|2)(\\d)", s)
[1] TRUE FALSE TRUE
1234
23
14a
1a3
234
1.39
In the right panel of bottom right panel of regex101.com. Especially these first:
. versus \.
\w and \d and \s (versus \W and \D and \S)
^ and \A
$ and \Z
? and * and + and things like {3,6}
Capturing
Character classes
"CL_ID" = "ClientID"
, "RMSEQ" = "RemovedSequence"
, "RMVL_BEGDT" = "RemovalBeginDate"
, "RMVL_ENDDT" = "RemovalEndDate"
, "END_TYP_CDE" = "EndTypeID"
, "REMOVED_FROM" = "RemovedFromTypeID"
, "CURR_RMVL_TYP" = "RemovalTypeCurrentID"
, "ORIG_RMVL_TYP" = "RemovalTypeOriginalID"
, "FMLY_STRUCTURE" = "FamilyStructureTypeID"
requireNamespace("dplyr", quietly=TRUE) #hadley/dplyr
requireNamespace("lubridate")
requireNamespace("OuhscMunge",quietly=T) #OuhscBbmc/OuhscMunge
9
4
34
3
62
43
1
Time,Gender,Genetype,Treatment,MouseID,OR-Recognition Index,FC-t-F %,FC-b-F %,FC-a-F %
4M,Male,WILD,Control,c9-1,0.32,11.9,0,25.7
4M,Male,WILD,Control,c13-2,0.47,23.7,0,11.
4M,Male,WILD,Prozac,c10-2,0.62,40.7,11.4,51.4
4M,Male,WILD,Prozac,c14-3,0.63,10.2,0,28.6
4M,Male,YFP,Control,c9-2,0.42,42.4,11.4,22.9
4M,Male,YFP,Control,c13-1,0.5,15.3,0,54.1
4M,Male,YFP,Control,c13-nm,1,27.1,0,31.4
4M,Male,YFP,Prozac,c10-1,0.65,20.3,17.1,54.3
(In some cases, you'd have to parse only the cell, not the entire line. But this is good practice.)
Look for an email invitation to a REDCap survey.
Example 3
(,*\s*)"(\w+)"\s+=\s+"(\w+)" and
$1"$3" = "$2"
Example 4
library\((\w+),\s*quietly=(T|TRUE)\) and
library($1)
Example 5
\b(\d)\b and
0$1
Example 6
,c(\d{1,2})-(\d|nm),