grep
starts_with
ifelse, str_detect, and toupper
Function: str_detect, RegEx: ^, .*, $
Note: the two lines of code below select different rows based on the column RAM$stocklong. The first doesn't require stock names to end in the string "Japan"; the second does. .* matches any sequence of characters (including none).
Context: Searching for species and region string patterns in a column describing fisheries stock in order to manually assign OHI and FAO region ids.
RAM <- read.csv("https://rawgit.com/OHI-Science/ohiprep_v2018/master/globalprep/fis/v2018/int/RAM_fao_ohi_rgns.csv")
> RAM[which(str_detect(RAM$stocklong, "^Walleye pollock.*Japan")),]
     rgn_id fao_id assessid                              RAM_area_m2 stockid   stocklong
3552 NA     NA     FAFRFJ-APOLLNSJ-1970-2013-JPNIMP2016  NA          APOLLNSJ  Walleye pollock Sea of Japan North
3553 NA     NA     FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016 NA          APOLLPJPN Walleye pollock Pacific Coast of Japan
> RAM[which(str_detect(RAM$stocklong, paste(c("^Walleye pollock.*Japan$"), collapse = "|"))),]
     rgn_id fao_id assessid                              RAM_area_m2 stockid   stocklong
3553 NA     NA     FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016 NA          APOLLPJPN Walleye pollock Pacific Coast of Japan
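A minimal sketch of how the two patterns differ, using a small made-up vector in place of RAM$stocklong (the third value is hypothetical) and assuming the stringr package is installed:

```r
library(stringr)

# Stand-in values for RAM$stocklong; "Pacific cod Sea of Japan" is made up
stocks <- c("Walleye pollock Sea of Japan North",
            "Walleye pollock Pacific Coast of Japan",
            "Pacific cod Sea of Japan")

# Starts with "Walleye pollock" and has "Japan" anywhere after it
anywhere <- str_detect(stocks, "^Walleye pollock.*Japan")

# Starts with "Walleye pollock" and ends in "Japan"
at_end <- str_detect(stocks, "^Walleye pollock.*Japan$")

stocks[anywhere]
stocks[at_end]
```

Only the second stock name survives the `$` version, because the first one continues with " North" after "Japan".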
Function: str_extract, RegEx: (\\d)+
Note: The + sign indicates one or more digits. Without the +, you would extract just the first digit in the string. basename() takes the base file name of a file path. In this case, it's 'annual_catch_2003.tif'.
Context: Extract year from a file path (Catch raster) to find the corresponding year in another file path (NPP raster)
file <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/annual_catch_2003.tif"
> basename(file)
[1] "annual_catch_2003.tif"
> yr <- str_extract(basename(file), "(\\d)+") # extracts the first run of digits
> npp_files_gf[str_detect(npp_files_gf, yr)]
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/VGPM_primary_productivity/int/annual_npp/annual_mean_npp_moll_2003_gf.tif"
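The effect of the + can be checked with a short sketch. The `npp_files` vector below is a hypothetical stand-in for npp_files_gf, and stringr is assumed to be installed:

```r
library(stringr)

# File path from the slide; basename() drops the directories
file <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/annual_catch_2003.tif"
fname <- basename(file)

one_digit <- str_extract(fname, "(\\d)")    # first digit only
yr        <- str_extract(fname, "(\\d)+")   # first full run of digits
from_path <- str_extract(file, "(\\d)+")    # without basename(), the run in "v2018" comes first

# Hypothetical NPP file names to search with the extracted year
npp_files <- c("annual_mean_npp_moll_2002_gf.tif",
               "annual_mean_npp_moll_2003_gf.tif")
npp_match <- npp_files[str_detect(npp_files, yr)]
```

Calling basename() first matters here: the full path contains "v2018", so extracting from the whole path would pick up 2018 instead of the file's year.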
Function: str_extract, RegEx: (\\d)+$
Combining the digit regex with the $ anchor, which ties the match to the end of the string.
Example fish taxon data
> head(mean_catch$stock_id_taxonkey[1:10])
[1] Marine_fishes_not_identified-57_100039
[2] Miscellaneous_marine_crustaceans-57_100047
[3] Miscellaneous_marine_molluscs-57_100058
[4] Marine_fishes_not_identified-57_100139
[5] Cephalopoda-57_290002
[6] Rajiformes-57_300014
I want to extract the 6-digit taxon key appended at the end of each value, not the FAO id, the two-digit number that precedes it.
Applying str_extract and regex (\\d)+$
> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "(\\d)+$")
[1] "100039"
> mean_catch$stock_id_taxonkey[560467]
[1] Serranidae-31_NA
> str_extract(mean_catch$stock_id_taxonkey[560467], "(\\d)+$")
[1] NA
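As a small self-contained sketch (values copied from the output above, stringr assumed installed), comparing the pattern with and without the $ anchor:

```r
library(stringr)

# Values copied from the slide output
ids <- c("Marine_fishes_not_identified-57_100039",
         "Cephalopoda-57_290002",
         "Serranidae-31_NA")

first_run <- str_extract(ids, "(\\d)+")    # first run of digits: the FAO id
taxon_key <- str_extract(ids, "(\\d)+$")   # run of digits that touches the end of the string
```

The third value ends in "NA" rather than digits, so the anchored pattern returns a missing value for it instead of falling back to the FAO id.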
Function: str_extract, RegEx: "^(\\w+).(\\d){2}"
Extract a string that starts with one or more word characters (letters, digits, or underscores), followed by any single character indicated by . (here, the hyphen), and ending with the two digits that come next. To allow 1 or 2 digits, adjust to ^(\\w+).(\\d){1,2}.
> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "^(\\w+).(\\d){1,2}")
[1] "Marine_fishes_not_identified-57"
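A short sketch of the difference between {2} and {1,2}, assuming stringr is installed. The first two values come from the slides; "Gadidae-7_123456" is a made-up value with a one-digit id, added only to show the contrast:

```r
library(stringr)

ids <- c("Marine_fishes_not_identified-57_100039",
         "Serranidae-31_NA",
         "Gadidae-7_123456")   # hypothetical one-digit id

two_digits   <- str_extract(ids, "^(\\w+).(\\d){2}")     # exactly two digits after the separator
one_or_two   <- str_extract(ids, "^(\\w+).(\\d){1,2}")   # one or two digits after the separator
```

With {2}, the made-up one-digit value does not match at all; with {1,2} it is extracted as "Gadidae-7".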
grep
Search for the string "NY.GDP.PCAP.PP.KD" in the data frame indicators, which has a column named indicator, and print the rows that match.
> class(indicators)
[1] "data.frame"
> names(indicators)
[1] "indicator" "name" "description" "sourceDatabase" "sourceOrganization"
> indicators[grep("NY.GDP.PCAP.PP.KD", indicators$indicator), ]
     indicator            name
4460 NY.GDP.PCAP.PP.KD    GDP per capita, PPP (constant 2005 international $)
4461 NY.GDP.PCAP.PP.KD.ZG GDP per capita, PPP annual growth (%)
     description
4460 GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2005 international dollars.
4461
     sourceDatabase                sourceOrganization
4460 World Development Indicators  World Bank, International Comparison Program database.
4461 Africa Development Indicators
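One detail worth keeping in mind: grep treats its pattern as a regex, so each "." in "NY.GDP.PCAP.PP.KD" matches any character. The sketch below uses a hypothetical stand-in for the indicators data frame (the third and fourth codes are illustrative only) to compare the regex, literal, and exact-match variants in base R:

```r
# Hypothetical stand-in for the `indicators` data frame
indicators <- data.frame(
  indicator = c("NY.GDP.PCAP.PP.KD", "NY.GDP.PCAP.PP.KD.ZG",
                "NYXGDPXPCAPXPPXKD", "SP.POP.TOTL"),
  stringsAsFactors = FALSE
)

# As a regex, "." matches any character, so the made-up third code matches too
regex_rows <- grep("NY.GDP.PCAP.PP.KD", indicators$indicator)

# fixed = TRUE treats the pattern as a literal string
fixed_rows <- grep("NY.GDP.PCAP.PP.KD", indicators$indicator, fixed = TRUE)

# Escaping the dots and anchoring both ends returns only the exact code
exact_rows <- grep("^NY\\.GDP\\.PCAP\\.PP\\.KD$", indicators$indicator)

indicators[fixed_rows, , drop = FALSE]
```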
Here's an example cluster of file paths
> count
[1] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"
[2] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size2_180.tif"
[3] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size3_180.tif"
[4] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size4_180.tif"
Use the base function strsplit to split the file path into a list of its sections
> strsplit(count[1], '/')
[[1]]
 [1] ""                                "home"                            "shares"
 [4] "ohi"                             "git-annex"                       "globalprep"
 [7] "cw_pressure_trash"               "v2015"                           "globalplastic_wd_cd_rasters_180"
[10] "count_density_size1_180.tif"
I want the tenth string section
> unlist(strsplit(count[1], '/'))[10]
[1] "count_density_size1_180.tif"
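A minimal base R sketch of the same idea, using the first path from the slide. It also shows two ways to get the final section without hard-coding position 10:

```r
path <- "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"

# strsplit() returns a list with one element per input string; [[1]] is the character vector
parts <- strsplit(path, "/")[[1]]

tenth <- parts[10]
last  <- parts[length(parts)]   # same element, independent of how deep the path is
base  <- basename(path)         # base R shortcut for the final path section
```

The leading "/" in an absolute path produces an empty first element, which is why the file name sits at position 10 rather than 9.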
You can write a function or loop to grab a given string section from each file path in a group of file paths.
In this example, the function takes a file path, grabs the string, reads the entire raster file, does some manipulation, then saves the raster to a new file named with the string that was selected.
Example:
library(raster)

# data_wd (the output root directory) must be defined before calling unlog()
unlog = function(file){
  name = unlist(strsplit(file, '/'))[3]  # split the path on '/', grab the third element to use in naming the tif
  r = raster(file)
  out = 10^r
  writeRaster(out, filename = paste0(data_wd, 'v2015/tmp/unlog/unlog_', name), overwrite = T, format = 'GTiff')
}
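To show only the naming step applied across several files, here is a sketch in base R that leaves out the raster work. The paths, `out_dir`, and the helper name `make_out_name` are all hypothetical:

```r
# Hypothetical input paths and output directory (not from the original project)
files <- c("/data/v2015/in/count_density_size1_180.tif",
           "/data/v2015/in/count_density_size2_180.tif")
out_dir <- "/data/v2015/tmp/unlog"

# Take the last "/"-separated section of the path and prefix it with "unlog_"
make_out_name <- function(file, dir) {
  parts <- unlist(strsplit(file, "/"))
  name <- parts[length(parts)]
  file.path(dir, paste0("unlog_", name))
}

# Apply the helper to every path; vapply() guarantees one string per input
out_files <- vapply(files, make_out_name, character(1), dir = out_dir, USE.NAMES = FALSE)
```

Taking the last element rather than a fixed index keeps the helper working when the input paths have different depths.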
data_files <- list.files(file.path(path_data, "annual_catch"), full.names = T) # 132 files
yr = 2015
yr <- as.character(yr)
## Select all catch data files with "2015" in the file name
datanames <- data_files[which(str_detect(data_files, yr))]
voila
> datanames
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchInd_2015.rds"
[2] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchNInd_2015.rds"
Now I can manipulate and combine them
## read in the two data tables
list_data <- map(datanames, readRDS)
## combine the two data tables in your list
combined <- bind_rows(list_data)
## save to fis folder in mazu
saveRDS(combined, paste0(fis_path, "annual_catch/", sprintf("Catch_%s.rds",yr)))
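The select-read-combine steps can be tried end to end with a self-contained sketch. It writes three tiny made-up tables to a temporary folder, and it swaps in base R equivalents (grepl, lapply, do.call(rbind, ...)) for str_detect, map, and bind_rows so that no packages are needed:

```r
# Write small, made-up tables to a temporary folder (stand-ins for the catch files)
tmp <- file.path(tempdir(), "annual_catch_demo")
dir.create(tmp, showWarnings = FALSE)
saveRDS(data.frame(year = 2015, catch = 1), file.path(tmp, "CatchInd_2015.rds"))
saveRDS(data.frame(year = 2015, catch = 2), file.path(tmp, "CatchNInd_2015.rds"))
saveRDS(data.frame(year = 2014, catch = 3), file.path(tmp, "CatchInd_2014.rds"))

yr <- "2015"
data_files <- list.files(tmp, full.names = TRUE)

# Match on the file name only, so digits in the directory path cannot cause false hits
datanames <- data_files[grepl(yr, basename(data_files), fixed = TRUE)]

# Read each file, then stack the tables row-wise
list_data <- lapply(datanames, readRDS)
combined <- do.call(rbind, list_data)
```

Matching against basename() is a defensive choice: in the slide's paths, a search for "2018" would match every file because of the v2018 folder.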
starts_with
Function: dplyr::starts_with
Data table
> head(data)
  rgn_id  avg3yr_2007 avg3yr_2008  avg3yr_2009 avg3yr_2010 avg3yr_2011 avg3yr_2012
1     21    92.666667   102.00000   103.000000    110.6667   104.66667    110.3333
2     69    45.000000    42.66667    39.000000     42.0000    51.33333     51.0000
3     70   102.000000    85.66667    74.666667     91.0000   115.00000    119.0000
4     73 14749.000000 14465.66667 14502.666667  14629.3333 14770.33333  15059.6667
5    143   262.000000   183.00000   133.666667    130.6667   132.00000    146.0000
6    144     3.333333     0.00000     3.333333      5.0000     5.00000      4.0000
  avg3yr_2013 avg3yr_2014 avg3yr_2015 avg3yr_2016 avg3yr_2017
1    113.3333   115.33333   114.66667   116.00000   119.00000
2     49.0000    44.00000    42.33333    42.33333    37.66667
3    112.3333   100.66667    91.00000    75.66667    74.33333
4  14730.0000 14892.00000 13994.66667 14357.00000 14145.66667
5    111.0000    82.33333    79.66667    67.00000    70.00000
6     17.0000    17.00000    14.66667     0.00000     4.00000
  Reference_avg1979to2000monthlypixels pctdevR_2007 pctdevR_2008 pctdevR_2009 pctdevR_2010
1                            101.77273  0.910525532    1.0022331  1.012058955   1.08739020
2                             61.09091  0.736607143    0.6984127  0.638392857   0.68750000
3                            108.09091  0.943650126    0.7925428  0.690776563   0.84188394
4                          13281.95455  1.110454034    1.0891218  1.091907567   1.10144432
5                            332.86364  0.787109108    0.5497747  0.401565843   0.39255314
6                            355.68182  0.009371672    0.0000000  0.009371672   0.01405751
## health data
health <- data %>%
  dplyr::filter(rgn_typ == "eez") %>%
  dplyr::select(rgn_id, habitat, dplyr::starts_with("pctdevR")) %>%
  tidyr::gather("year", "health", -(1:2)) %>%
  dplyr::mutate(year = substring(year, 9, 12)) %>%
  dplyr::mutate(year = as.numeric(year)) %>%
  dplyr::mutate(health = ifelse(health > 1, 1, health))
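A cut-down sketch of the same pipeline on a made-up two-row table (values rounded from the output above), assuming dplyr and tidyr are installed. It uses tidyr::pivot_longer in place of gather, which is a substitution on my part rather than what the original code does:

```r
library(dplyr)
library(tidyr)

# Made-up subset of the table above: one avg3yr column and two pctdevR columns
data <- data.frame(rgn_id = c(21, 69),
                   avg3yr_2007 = c(92.7, 45.0),
                   pctdevR_2007 = c(0.91, 0.74),
                   pctdevR_2008 = c(1.002, 0.70))

health <- data %>%
  select(rgn_id, starts_with("pctdevR")) %>%              # keep only columns whose names begin with "pctdevR"
  pivot_longer(cols = starts_with("pctdevR"),
               names_to = "year", values_to = "health") %>%
  mutate(year = as.numeric(substring(year, 9, 12)),       # characters 9-12 of "pctdevR_2007" are the year
         health = ifelse(health > 1, 1, health))          # cap values at 1
```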
Remove the suffix "name" from multiple column names in data
Example 1:
  comname                          sciname               ico_gl ico_rgn_id
  <chr>                            <chr>                 <lgl>       <int>
1 Blue Shark                       Prionace glauca       T              NA
2 Whale Shark                      Rhincodon typus       T              NA
3 Shortfin Mako                    Isurus oxyrinchus     T              NA
4 Olive Ridley Turtle              Lepidochelys olivacea T              NA
5 Irrawaddy Dolphin                Orcaella brevirostris T              NA
6 Humphead wrasse, Napoleon Wrasse Cheilinus undulatus   T              NA
data <- data %>% setNames(names(.) %>% str_replace('name', ''))
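As a hedged variation on the line above, the sketch below anchors the pattern with $ so that only a trailing "name" is removed. The one-row table is made up, and stringr is assumed to be installed:

```r
library(stringr)

# Hypothetical one-row table with the same column names as the slide
data <- data.frame(comname = "Blue Shark",
                   sciname = "Prionace glauca",
                   ico_gl = TRUE)

# "name$" only strips "name" at the end of a column name
names(data) <- str_replace(names(data), "name$", "")
names(data)
```

Without the anchor, str_replace removes the first "name" wherever it occurs, which would also alter a column such as "name_id".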
Example 2:
lsp_new_old <- status_3nm_new %>%
  full_join(status_3nm_old, by = c('rgn_id')) %>%
  full_join(status_1km_new, by = c('rgn_id')) %>%
  full_join(status_1km_old, by = c('rgn_id')) %>%
  mutate(status_old = (status_3nm_old + status_1km_old) / 2,
         status_new = (status_3nm_new + status_1km_new) / 2) %>%
  gather(rgn, score_new, contains('new')) %>%
  gather(rgn_old, score_old, contains('old')) %>%
  mutate(rgn = str_replace(rgn, '_new', ''),
         rgn_old = str_replace(rgn_old, '_old', ''),
         score_new = round(score_new, 3),
         score_old = round(score_old, 3))
ifelse and strings
Example 1: Mutate a column with ifelse, str_detect, and toupper. Create column cat based on specifications or matches in cat_txt.
ico_assess <- ico_assess_raw %>%
  rename(cat = code, cat_txt = category) %>%
  mutate(cat = toupper(cat),
         cat = str_replace(cat, 'LR/', ''),
         cat = ifelse(cat %in% c('K', 'I'), 'DD', cat),
         cat = ifelse(cat == 'NR', 'NE', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'VERY RARE'), 'CR', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'LESS RARE'), 'T', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'STATUS INADEQUATELY KNOWN'), 'DD', cat),
         cat = ifelse(cat == 'V', 'VU', cat),
         cat = ifelse(cat == 'E', 'EN', cat))
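A trimmed-down sketch of the same recoding on four made-up rows, assuming dplyr and stringr are installed. The codes and descriptions are illustrative only, and the columns are already named cat and cat_txt so the rename step is skipped:

```r
library(dplyr)
library(stringr)

# Made-up rows; codes and descriptions are illustrative only
ico_assess_raw <- data.frame(
  cat = c("lr/nt", "k", "v", "x"),
  cat_txt = c("near threatened", "unknown", "vulnerable", "Very rare in region"),
  stringsAsFactors = FALSE
)

ico_assess <- ico_assess_raw %>%
  mutate(cat = toupper(cat),                                              # normalise case first
         cat = str_replace(cat, "LR/", ""),                               # drop the "LR/" prefix
         cat = ifelse(cat %in% c("K", "I"), "DD", cat),                   # recode by exact value
         cat = ifelse(str_detect(toupper(cat_txt), "VERY RARE"), "CR", cat),  # recode by text match
         cat = ifelse(cat == "V", "VU", cat))
```

Each step reuses the cat column produced by the step before it, so the order of the ifelse calls determines which rule wins when several apply.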
Example 2: Using if/else and string matching to choose where write.csv saves the file
if(sum(str_detect(names(mar_sp), "exclude$"))==1) {
  write.csv(maric, 'output/MAR_FP_data.csv', row.names=FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_seaweed"))==1) {
  write.csv(maric, 'test/MAR_FP_data_no_seaweed.csv', row.names=FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_nei"))==1) {
  write.csv(maric, 'test/MAR_FP_data_no_nei.csv', row.names=FALSE)
}
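The branching logic can be tested without writing any files by returning the path instead. In this sketch `choose_output` is a hypothetical helper, the fall-through NA branch is my addition, and stringr is assumed to be installed:

```r
library(stringr)

# Hypothetical helper: pick an output path from a table's column names
choose_output <- function(col_names) {
  if (sum(str_detect(col_names, "exclude$")) == 1) {
    "output/MAR_FP_data.csv"                 # a column ending exactly in "exclude"
  } else if (sum(str_detect(col_names, "exclude_no_seaweed")) == 1) {
    "test/MAR_FP_data_no_seaweed.csv"
  } else if (sum(str_detect(col_names, "exclude_no_nei")) == 1) {
    "test/MAR_FP_data_no_nei.csv"
  } else {
    NA_character_                            # no matching column (branch added for this sketch)
  }
}

choose_output(c("species", "exclude"))
choose_output(c("species", "exclude_no_nei"))
```

The $ in "exclude$" is what keeps "exclude_no_seaweed" and "exclude_no_nei" from being caught by the first branch.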
Here's my file
> f
[1] "/home/shares/ohi/git-annex/globalprep/_raw_data/FAO_commodities/d2018/FAO_raw_commodities_quant_1976_2015.csv"
Search for the string quant or value
> str_detect(f, c('quant','value'))
[1] TRUE FALSE
> c('tonnes','usd')
[1] "tonnes" "usd"
Save the appropriate units to a variable for use later
units <- c('tonnes','usd')[str_detect(f, c('quant','value'))]
> c('tonnes','usd')[str_detect(f, c('quant','value'))]
[1] "tonnes"
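The trick here is that str_detect checks one string against a vector of patterns and returns one TRUE/FALSE per pattern, which then indexes the units vector. A small sketch, assuming stringr is installed; the "value" file name and the `get_units` helper are hypothetical:

```r
library(stringr)

f_quant <- "FAO_raw_commodities_quant_1976_2015.csv"
f_value <- "FAO_raw_commodities_value_1976_2015.csv"   # hypothetical counterpart file

# One logical per pattern, used to pick the matching unit label
get_units <- function(f) {
  c("tonnes", "usd")[str_detect(f, c("quant", "value"))]
}

units_quant <- get_units(f_quant)
units_value <- get_units(f_value)
```

This relies on the two vectors being in the same order: the first pattern must correspond to the first unit, and so on.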
Example data table
> head(mar_out)
  rgn_id species     fao                      environment year value Taxon_code gap_0_fill species_code
1      5 Blue shrimp Pacific, Western Central Marine      1983    16 SH                  0            1
2      5 Blue shrimp Pacific, Western Central Marine      1984    51 SH                  0            1
3      5 Blue shrimp Pacific, Western Central Marine      1985    87 SH                  0            1
4      5 Blue shrimp Pacific, Western Central Marine      1986    59 SH                  0            1
5      5 Blue shrimp Pacific, Western Central Marine      1987    87 SH                  0            1
6      5 Blue shrimp Pacific, Western Central Marine      1988   217 SH                  0            1
Some of the species names have weird symbols: Barramundi(=Giant seaperch)
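Filter for values containing the "(=...)" pattern and remove the parenthesized string. The \\ escapes a symbol so that the regex engine treats it as a literal character rather than a special one. For example, "." is a regex wildcard that matches any character; to search for a literal "." you would write "\\.". Note that the parentheses in the str_detect pattern "(=.*)" are not escaped, so they act as a regex group and the pattern matches any "=" followed by anything; the str_replace pattern escapes them to match literal parentheses.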
> test <- mar_out[which(str_detect(mar_out$species, "(=.*)")),]
> a <- unique(test$species)
> a
[1] Barramundi(=Giant seaperch)   Chinook(=Spring=King) salmon
[3] Coho(=Silver) salmon          Northern quahog(=Hard clam)
[5] Snooks(=Robalos) nei          Blackspot(=red) seabream
[7] Silversides(=Sand smelts) nei
> str_replace(a, "\\(\\=.*\\)","")
[1] "Barramundi"      "Chinook salmon" "Coho salmon"
[4] "Northern quahog" "Snooks nei"     "Blackspot seabream"
[7] "Silversides nei"
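A short sketch of escaping in practice, assuming stringr is installed. The first two species names come from the output above; "Blue shrimp" is included as a value with nothing to remove, and the "a.b" / "axb" strings are made up to show the literal-dot case:

```r
library(stringr)

species <- c("Barramundi(=Giant seaperch)",
             "Chinook(=Spring=King) salmon",
             "Blue shrimp")

# Escaped parentheses match literal "(" and ")"
has_alias <- str_detect(species, "\\(=.*\\)")
cleaned   <- str_replace(species, "\\(=.*\\)", "")

# "\\." matches a literal dot; an unescaped "." would also match the "x"
literal_dot <- str_detect(c("a.b", "axb"), "a\\.b")
any_char    <- str_detect(c("a.b", "axb"), "a.b")
```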