OHI String Applications

iwensu0313

09-14-2018

Extract using start and end pattern

Function: str_detect, RegEx: ^, .*, $

Note: the two lines of code below select different rows based on the column RAM$stocklong. The first only requires the stock name to start with "Walleye pollock" and to contain "Japan" somewhere after it. The second also requires the name to end in "Japan". The .* matches any sequence of characters (including none). Context: searching for species and region string patterns in a column describing fisheries stocks in order to manually assign OHI and FAO region ids.

library(stringr)
RAM <- read.csv("https://rawgit.com/OHI-Science/ohiprep_v2018/master/globalprep/fis/v2018/int/RAM_fao_ohi_rgns.csv")
> RAM[which(str_detect(RAM$stocklong, "^Walleye pollock.*Japan")),]
rgn_id fao_id assessid RAM_area_m2 stockid stocklong
3552 NA NA FAFRFJ-APOLLNSJ-1970-2013-JPNIMP2016 NA APOLLNSJ Walleye pollock Sea of Japan North
3553 NA NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016 NA APOLLPJPN Walleye pollock Pacific Coast of Japan
> RAM[which(str_detect(RAM$stocklong, paste(c("^Walleye pollock.*Japan$"), collapse = "|"))),]
rgn_id fao_id assessid RAM_area_m2 stockid stocklong
3553 NA NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016 NA APOLLPJPN Walleye pollock Pacific Coast of Japan
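To see the effect of the end anchor in isolation, here is a minimal sketch on a toy vector. The first two stock names are copied from the output above; the third name and the variable names are made up for illustration.

```r
library(stringr)

# Toy vector: the first two names come from the RAM output above,
# the third is a hypothetical name added for contrast
stocks <- c("Walleye pollock Sea of Japan North",
            "Walleye pollock Pacific Coast of Japan",
            "Pacific cod Sea of Japan")

# Must start with "Walleye pollock" and contain "Japan" somewhere later
starts_only <- str_detect(stocks, "^Walleye pollock.*Japan")

# Must start with "Walleye pollock" and also end with "Japan"
starts_and_ends <- str_detect(stocks, "^Walleye pollock.*Japan$")

stocks[starts_only]      # both Walleye pollock stocks
stocks[starts_and_ends]  # only the Pacific Coast of Japan stock
```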


Extracting date or number

Function: str_extract, RegEx: (\\d)+

Note: The + sign indicates one or more digits. Without the +, you would extract just the first digit in the string. basename() returns the file name at the end of a file path; in this case, 'annual_catch_2003.tif'. Context: extract the year from a file path (catch raster) to find the corresponding year in another file path (NPP raster).

file <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/annual_catch_2003.tif"
> basename(file)
[1] "annual_catch_2003.tif"
yr <- str_extract(basename(file), "(\\d)+") # extracts the first run of digits, here "2003"
> npp_files_gf[str_detect(npp_files_gf, yr)]
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/VGPM_primary_productivity/int/annual_npp/annual_mean_npp_moll_2003_gf.tif"
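The following sketch shows why the + and the call to basename() both matter. It reuses the file path from the example above; the variable names are illustrative. Note that "\\d+" behaves the same as "(\\d)+" for extraction purposes.

```r
library(stringr)

full <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/annual_catch_2003.tif"

# Without the +, only the first digit of the file name is returned
one_digit <- str_extract(basename(full), "\\d")

# With the +, the whole first run of digits is returned
all_digits <- str_extract(basename(full), "\\d+")

# Skipping basename() picks up the version folder "v2018" instead of the year
from_full_path <- str_extract(full, "\\d+")

one_digit       # "2"
all_digits      # "2003"
from_full_path  # "2018"
```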


Extract end number string

Function: str_extract, RegEx: (\\d)+$

Combining the date regex with the $ sign.

Example fish taxon data

> head(mean_catch$stock_id_taxonkey[1:10])
[1] Marine_fishes_not_identified-57_100039
[2] Miscellaneous_marine_crustaceans-57_100047
[3] Miscellaneous_marine_molluscs-57_100058
[4] Marine_fishes_not_identified-57_100139
[5] Cephalopoda-57_290002
[6] Rajiformes-57_300014

I want to extract the 6-digit taxon key appended at the end of each value, not the FAO id, the two-digit number that precedes it.

Applying str_extract and regex (\\d)+$

> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "(\\d)+$")
[1] "100039"
> mean_catch$stock_id_taxonkey[560467]
[1] Serranidae-31_NA
> str_extract(mean_catch$stock_id_taxonkey[560467], "(\\d)+$")
[1] NA
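Because str_extract is vectorized, the same pattern can be applied to a whole column at once. The sketch below uses three of the values shown above and converts the result to integers; the names ids and taxon_key are illustrative.

```r
library(stringr)

ids <- c("Marine_fishes_not_identified-57_100039",
         "Cephalopoda-57_290002",
         "Serranidae-31_NA")

# The $ anchor forces the digits to sit at the very end of the string,
# so the FAO id in the middle is never picked up; values ending in "NA"
# have no trailing digits and come back as NA
taxon_key <- as.integer(str_extract(ids, "\\d+$"))

taxon_key  # 100039 290002 NA
```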


Extract string with words and digits

Function: str_extract, RegEx: "^(\\w+).(\\d){2}"

Extract a string that starts with one or more word characters (letters, digits, or underscores), followed by any single character (the .), and ending with the next two digits. To allow 1 or 2 digits, adjust the pattern to ^(\\w+).(\\d){1,2}, as in the code below.

> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "^(\\w+).(\\d){1,2}")
[1] "Marine_fishes_not_identified-57"
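If the two pieces are needed separately, one option is str_match, which returns the capture groups as columns of a matrix. This is a sketch under the assumption that the separator is always a hyphen followed later by an underscore, as in the values shown above; the pattern and variable names are illustrative, not the author's.

```r
library(stringr)

ids <- c("Marine_fishes_not_identified-57_100039",
         "Serranidae-31_NA")

# The pattern from the slide: \\w+ stops at the hyphen because "-" is not
# a word character, "." consumes the hyphen, then one or two digits follow
stock_fao <- str_extract(ids, "^(\\w+).(\\d){1,2}")

# Capture groups: column 2 holds the name, column 3 the FAO id
m <- str_match(ids, "^(\\w+)-(\\d{1,2})_")
stock_name <- m[, 2]
fao_id     <- m[, 3]

stock_fao   # "Marine_fishes_not_identified-57" "Serranidae-31"
stock_name  # "Marine_fishes_not_identified" "Serranidae"
fao_id      # "57" "31"
```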


Extract string with grep

Search for the string "NY.GDP.PCAP.PP.KD" in the data frame indicators, which has a column named indicator, and print the rows that match. Note that grep matches substrings, so "NY.GDP.PCAP.PP.KD.ZG" is returned as well.

> class(indicators)
[1] "data.frame"
> names(indicators)
[1] "indicator" "name" "description" "sourceDatabase" "sourceOrganization"
> indicators[grep("NY.GDP.PCAP.PP.KD", indicators$indicator), ]
indicator name
4460 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 international $)
4461 NY.GDP.PCAP.PP.KD.ZG GDP per capita, PPP annual growth (%)
description
4460 GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2005 international dollars.
4461
sourceDatabase sourceOrganization
4460 World Development Indicators World Bank, International Comparison Program database.
4461 Africa Development Indicators
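One detail worth knowing: in a regular expression each "." matches any character, so the search above is looser than it looks. The sketch below contrasts the default regex search, a fixed-string search, and an exact comparison. The third code in the toy vector is made up to show the difference.

```r
# Toy vector: the first two codes appear in the output above,
# the third is a hypothetical code with "x" in place of each "."
codes <- c("NY.GDP.PCAP.PP.KD",
           "NY.GDP.PCAP.PP.KD.ZG",
           "NYxGDPxPCAPxPPxKD")

# Default: regex search, "." matches any character
regex_hits <- grep("NY.GDP.PCAP.PP.KD", codes)

# fixed = TRUE treats the pattern as a literal string (still a substring match)
fixed_hits <- grep("NY.GDP.PCAP.PP.KD", codes, fixed = TRUE)

# Exact match on the whole value
exact_hits <- which(codes == "NY.GDP.PCAP.PP.KD")

regex_hits  # 1 2 3
fixed_hits  # 1 2
exact_hits  # 1
```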


Select string in a file path based on position

Here's an example cluster of file paths

> count
[1] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"
[2] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size2_180.tif"
[3] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size3_180.tif"
[4] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size4_180.tif"

Use base function strsplit to split up the file path into a list of the sections

> strsplit(count[1], '/')
[[1]]
[1] "" "home" "shares"
[4] "ohi" "git-annex" "globalprep"
[7] "cw_pressure_trash" "v2015" "globalplastic_wd_cd_rasters_180"
[10] "count_density_size1_180.tif"

I want the tenth string section:

> unlist(strsplit(count[1], '/'))[10]
[1] "count_density_size1_180.tif"
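Selecting by a fixed position breaks as soon as the directory depth changes. When the section you want is the file name, a position-independent alternative is to take the last section or to use basename(). A minimal sketch using the first path from above (variable names are illustrative):

```r
p <- "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"

# Split on a literal "/"; the leading "/" produces an empty first section
parts <- strsplit(p, "/", fixed = TRUE)[[1]]
n_parts <- length(parts)

# Last section, whatever the depth of the path
last_part <- parts[n_parts]

# basename() gives the same result without splitting
file_name <- basename(p)

# Drop the extension as well, if only the stem is needed
stem <- tools::file_path_sans_ext(file_name)

n_parts    # 10
last_part  # "count_density_size1_180.tif"
stem       # "count_density_size1_180"
```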


You can create a function or loop to grab the tenth string section from each file path in a group of file paths.

In this example, the function takes one file path, grabs the string, reads the entire raster file, raises 10 to the power of each cell value (undoing a log10 transform), then saves the raster in a new file named using the string that was selected.

Example:

library(raster)

unlog = function(file){
  name = unlist(strsplit(file, '/'))[10] # split the path on '/', grab the tenth section (the file name) to use in naming the tif
  r = raster(file) # read the raster
  out = 10^r # back-transform the log10 values
  writeRaster(out, filename = paste0(data_wd, 'v2015/tmp/unlog/unlog_', name), overwrite = TRUE, format = 'GTiff')
}
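The naming step can be tried on its own, without reading any rasters. The sketch below builds the output names for the four paths shown earlier. The output folder out_dir is a hypothetical stand-in for the data_wd path used above, and basename() is used in place of the position-based split.

```r
in_dir <- "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180"
files  <- file.path(in_dir, paste0("count_density_size", 1:4, "_180.tif"))

# Hypothetical output folder, standing in for paste0(data_wd, 'v2015/tmp/unlog/')
out_dir <- "tmp/unlog"

# Apply the naming rule to each path; vapply guarantees one string per file
out_names <- vapply(
  files,
  function(f) file.path(out_dir, paste0("unlog_", basename(f))),
  character(1),
  USE.NAMES = FALSE
)

out_names[1]  # "tmp/unlog/unlog_count_density_size1_180.tif"
```

To run the real function over every file, something like lapply(files, unlog) would call it once per path.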


Select files in a list of files matching a given year

data_files <- list.files(file.path(path_data, "annual_catch"), full.names = T) # 132 files
yr = 2015
yr <- as.character(yr)
## Select all catch data files with "2015" in the file name
datanames <- data_files[which(str_detect(data_files, yr))]

Voilà:

> datanames
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchInd_2015.rds"
[2] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchNInd_2015.rds"

Now I can manipulate and combine them

## read in the two data tables
list_data <- map(datanames, readRDS)
## combine the two data tables in your list
combined <- bind_rows(list_data)
## save to fis folder in mazu
saveRDS(combined, paste0(fis_path, "annual_catch/", sprintf("Catch_%s.rds",yr)))
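A caveat when matching a year against full paths: the pattern is searched in the directory names too. The sketch below uses toy paths modelled on the output above (the 2014 files are hypothetical) to show how a version folder such as v2018 can cause false matches, and how matching on basename() avoids that.

```r
library(stringr)

dir <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch"
# The 2014 file names are hypothetical; the 2015 names appear in the output above
data_files <- file.path(dir, c("CatchInd_2014.rds", "CatchNInd_2014.rds",
                               "CatchInd_2015.rds", "CatchNInd_2015.rds"))

# "2018" appears in every path because of the v2018 folder
hits_full_path <- str_detect(data_files, "2018")

# Matching against the file name only gives the intended answer
hits_2018 <- str_detect(basename(data_files), "2018")
hits_2015 <- str_detect(basename(data_files), "2015")

basename(data_files[hits_2015])  # "CatchInd_2015.rds" "CatchNInd_2015.rds"
```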


Extracting string using starts_with

Function: dplyr::starts_with

Data table

> head(data)
rgn_id avg3yr_2007 avg3yr_2008 avg3yr_2009 avg3yr_2010 avg3yr_2011 avg3yr_2012
1 21 92.666667 102.00000 103.000000 110.6667 104.66667 110.3333
2 69 45.000000 42.66667 39.000000 42.0000 51.33333 51.0000
3 70 102.000000 85.66667 74.666667 91.0000 115.00000 119.0000
4 73 14749.000000 14465.66667 14502.666667 14629.3333 14770.33333 15059.6667
5 143 262.000000 183.00000 133.666667 130.6667 132.00000 146.0000
6 144 3.333333 0.00000 3.333333 5.0000 5.00000 4.0000
avg3yr_2013 avg3yr_2014 avg3yr_2015 avg3yr_2016 avg3yr_2017
1 113.3333 115.33333 114.66667 116.00000 119.00000
2 49.0000 44.00000 42.33333 42.33333 37.66667
3 112.3333 100.66667 91.00000 75.66667 74.33333
4 14730.0000 14892.00000 13994.66667 14357.00000 14145.66667
5 111.0000 82.33333 79.66667 67.00000 70.00000
6 17.0000 17.00000 14.66667 0.00000 4.00000
Reference_avg1979to2000monthlypixels pctdevR_2007 pctdevR_2008 pctdevR_2009 pctdevR_2010
1 101.77273 0.910525532 1.0022331 1.012058955 1.08739020
2 61.09091 0.736607143 0.6984127 0.638392857 0.68750000
3 108.09091 0.943650126 0.7925428 0.690776563 0.84188394
4 13281.95455 1.110454034 1.0891218 1.091907567 1.10144432
5 332.86364 0.787109108 0.5497747 0.401565843 0.39255314
6 355.68182 0.009371672 0.0000000 0.009371672 0.01405751

## health data
health <- data %>%
  dplyr::filter(rgn_typ == "eez") %>%
  dplyr::select(rgn_id, habitat, dplyr::starts_with("pctdevR")) %>%
  tidyr::gather("year", "health", -(1:2)) %>%
  dplyr::mutate(year = substring(year, 9, 12)) %>%
  dplyr::mutate(year = as.numeric(year)) %>%
  dplyr::mutate(health = ifelse(health > 1, 1, health))
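The column selection and the year extraction can be checked on a small toy table. This is a sketch with made-up, rounded values and only a few of the columns shown above; it also shows a regex alternative to the fixed character positions 9 to 12, which would still work if the prefix length changed.

```r
library(dplyr)

# Toy table with a subset of the columns shown above (values rounded)
df <- data.frame(rgn_id       = c(21, 69),
                 avg3yr_2007  = c(92.7, 45.0),
                 pctdevR_2007 = c(0.91, 0.74),
                 pctdevR_2008 = c(1.00, 0.70))

# Keep rgn_id plus every column whose name starts with "pctdevR"
sel <- df %>% select(rgn_id, starts_with("pctdevR"))
sel_names <- names(sel)

# Year by character position, as in the code above
yr_by_position <- as.numeric(substring(sel_names[-1], 9, 12))

# Year by stripping the prefix with a regex
yr_by_regex <- as.numeric(sub("^pctdevR_", "", sel_names[-1]))

sel_names       # "rgn_id" "pctdevR_2007" "pctdevR_2008"
yr_by_position  # 2007 2008
```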


Remove string from column names

Remove the string "name" from every column name in data that contains it. With the columns shown below, comname and sciname become com and sci.

Example 1:

comname sciname ico_gl ico_rgn_id
<chr> <chr> <lgl> <int>
1 Blue Shark Prionace glauca T NA
2 Whale Shark Rhincodon typus T NA
3 Shortfin Mako Isurus oxyrinchus T NA
4 Olive Ridley Turtle Lepidochelys olivacea T NA
5 Irrawaddy Dolphin Orcaella brevirostris T NA
6 Humphead wrasse, Napoleon Wrasse Cheilinus undulatus T NA
data <- data %>%
setNames(names(.) %>%
str_replace('name', ''))
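Since str_replace removes the first occurrence of the pattern wherever it sits, anchoring the pattern with $ is a safer choice when only a suffix should go. A minimal sketch on a vector of column names; the last name, name_source, is hypothetical and added to show the difference.

```r
library(stringr)

nms <- c("comname", "sciname", "ico_gl", "ico_rgn_id", "name_source")

# Unanchored: removes the first "name" anywhere in the string
unanchored <- str_replace(nms, "name", "")

# Anchored: removes "name" only when it ends the string
anchored <- str_replace(nms, "name$", "")

unanchored  # "com" "sci" "ico_gl" "ico_rgn_id" "_source"
anchored    # "com" "sci" "ico_gl" "ico_rgn_id" "name_source"
```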


Example 2:

lsp_new_old <- status_3nm_new %>%
full_join(status_3nm_old, by = c('rgn_id')) %>%
full_join(status_1km_new, by = c('rgn_id')) %>%
full_join(status_1km_old, by = c('rgn_id')) %>%
mutate(status_old = (status_3nm_old + status_1km_old) / 2,
status_new = (status_3nm_new + status_1km_new) / 2) %>%
gather(rgn, score_new, contains('new')) %>%
gather(rgn_old, score_old, contains('old')) %>%
mutate(rgn = str_replace(rgn, '_new', ''),
rgn_old = str_replace(rgn_old, '_old', ''),
score_new = round(score_new, 3),
score_old = round(score_old, 3))


Manipulate ifelse and strings

Example 1: Mutate column with ifelse, str_detect, and toupper. Create column cat based on specifications or matches in cat_txt

ico_assess <- ico_assess_raw %>%
rename(cat = code, cat_txt = category) %>%
mutate(cat = toupper(cat),
cat = str_replace(cat, 'LR/', ''),
cat = ifelse(cat %in% c('K', 'I'), 'DD', cat),
cat = ifelse(cat == 'NR', 'NE', cat),
cat = ifelse(str_detect(toupper(cat_txt), 'VERY RARE'), 'CR', cat),
cat = ifelse(str_detect(toupper(cat_txt), 'LESS RARE'), 'T', cat),
cat = ifelse(str_detect(toupper(cat_txt), 'STATUS INADEQUATELY KNOWN'), 'DD', cat),
cat = ifelse(cat == 'V', 'VU', cat),
cat = ifelse(cat == 'E', 'EN', cat))
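The same chain of recoding steps can be traced on plain vectors, which makes it easier to see that the order matters: each ifelse works on the result of the previous one. The input values below are made up for illustration, and only some of the steps from the code above are reproduced.

```r
library(stringr)

# Hypothetical raw codes and category descriptions
code    <- c("lr/nt", "K", "V", "E", "NR", "R")
cat_txt <- c("", "", "", "", "", "Very rare")

code <- toupper(code)                       # normalise case first
code <- str_replace(code, "LR/", "")        # drop the "LR/" prefix
code <- ifelse(code %in% c("K", "I"), "DD", code)
code <- ifelse(code == "NR", "NE", code)
# Text-based override: applies regardless of the current code
code <- ifelse(str_detect(toupper(cat_txt), "VERY RARE"), "CR", code)
code <- ifelse(code == "V", "VU", code)
code <- ifelse(code == "E", "EN", code)

code  # "NT" "DD" "VU" "EN" "NE" "CR"
```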


Example 2: Manipulating ifelse and strings to write.csv

if(sum(str_detect(names(mar_sp), "exclude$"))==1) {
write.csv(maric, 'output/MAR_FP_data.csv', row.names=FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_seaweed"))==1) {
write.csv(maric, 'test/MAR_FP_data_no_seaweed.csv', row.names=FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_nei"))==1) {
write.csv(maric, 'test/MAR_FP_data_no_nei.csv', row.names=FALSE)
}


Conditional Selection Based on String in File Name

Here's my file

> f
[1] "/home/shares/ohi/git-annex/globalprep/_raw_data/FAO_commodities/d2018/FAO_raw_commodities_quant_1976_2015.csv"

Search for string quant or value

> str_detect(f, c('quant','value'))
[1] TRUE FALSE
> c('tonnes','usd')
[1] "tonnes" "usd"

Save the appropriate units to a variable for use later

units <- c('tonnes','usd')[str_detect(f, c('quant','value'))]
> c('tonnes','usd')[str_detect(f, c('quant','value'))]
[1] "tonnes"
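This works because str_detect recycles the single file name against each pattern and returns one logical per pattern, which is then used to index the vector of units. A variant of the same idea, sketched below, keeps patterns and units together in a named vector and checks that exactly one pattern matched. The names unit_lookup, hit and unit are illustrative.

```r
library(stringr)

f <- "/home/shares/ohi/git-annex/globalprep/_raw_data/FAO_commodities/d2018/FAO_raw_commodities_quant_1976_2015.csv"

# Names are the patterns to look for, values are the units they imply
unit_lookup <- c(quant = "tonnes", value = "usd")

# One TRUE/FALSE per pattern
hit <- str_detect(basename(f), names(unit_lookup))

# Guard against file names that match none or both of the patterns
stopifnot(sum(hit) == 1)

unit <- unname(unit_lookup[hit])
unit  # "tonnes"
```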


Removing weird symbols in species names

Example data table

> head(mar_out)
rgn_id species fao environment year value Taxon_code gap_0_fill species_code
1 5 Blue shrimp Pacific, Western Central Marine 1983 16 SH 0 1
2 5 Blue shrimp Pacific, Western Central Marine 1984 51 SH 0 1
3 5 Blue shrimp Pacific, Western Central Marine 1985 87 SH 0 1
4 5 Blue shrimp Pacific, Western Central Marine 1986 59 SH 0 1
5 5 Blue shrimp Pacific, Western Central Marine 1987 87 SH 0 1
6 5 Blue shrimp Pacific, Western Central Marine 1988 217 SH 0 1

Some of the species names have weird symbols: Barramundi(=Giant seaperch)


Filter for values containing a "(=...)" pattern and remove the parenthetical string. The \\ escapes a character in the pattern so that it is treated literally rather than as a regex metacharacter. For example, "." matches any character in a regex, so to search for a literal "." you would write "\\.". Likewise, parentheses must be escaped as "\\(" and "\\)" to match literal parentheses.

> test <- mar_out[which(str_detect(mar_out$species, "\\(=.*\\)")),]
> a <- unique(test$species)
> a
[1] Barramundi(=Giant seaperch) Chinook(=Spring=King) salmon
[3] Coho(=Silver) salmon Northern quahog(=Hard clam)
[5] Snooks(=Robalos) nei Blackspot(=red) seabream
[7] Silversides(=Sand smelts) nei
> str_replace(a, "\\(\\=.*\\)","")
[1] "Barramundi" "Chinook salmon" "Coho salmon"
[4] "Northern quahog" "Snooks nei" "Blackspot seabream"
[7] "Silversides nei"
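The effect of escaping can be checked directly. In the sketch below, the first two strings come from the species list above, while "a=b" and "version 1.2" are made-up strings added for contrast.

```r
library(stringr)

x <- c("Barramundi(=Giant seaperch)", "Coho(=Silver) salmon", "a=b", "version 1.2")

# Unescaped parentheses form a regex group, so this matches any "=" followed by anything
unescaped <- str_detect(x, "(=.*)")

# Escaped parentheses must appear literally in the string
escaped <- str_detect(x, "\\(=.*\\)")

# "." matches any character; "\\." matches only a literal period
any_char    <- str_detect(x, ".")
literal_dot <- str_detect(x, "\\.")

cleaned <- str_replace(x[1:2], "\\(=.*\\)", "")

unescaped    # TRUE TRUE TRUE FALSE
escaped      # TRUE TRUE FALSE FALSE
literal_dot  # FALSE FALSE FALSE TRUE
cleaned      # "Barramundi" "Coho salmon"
```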
