class: center, middle, inverse, title-slide

# OHI String Applications
### iwensu0313
### 09-14-2018

---

name: toc

# Table of Contents

1. [Extract using start and end pattern](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#start_end) 
2. [Extracting date or number](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#date)
3. [Extract string with `grep`](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#grep)
4. [Select string in a file path based on position](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#posit)
5. [Select files in a list of files matching a given year](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#year)
6. [Extracting string using `starts_with`](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#start_with)
7. [Remove string from column names](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#column_name)
8. [Manipulate column with `ifelse`, `str_detect`, and `toupper`](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#ifelse)
9. [Conditional Selection Based on String in File Name](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#condition)
10. [Removing weird symbols in species names](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#symbols)

---
name: start_end

## Extract using start and end pattern

Function: `str_detect`, RegEx: `^`, `.*`, `$`

Note: the two patterns below match different rows of `RAM$stocklong`. The first only requires "Japan" to appear somewhere after "Walleye pollock"; the second, anchored with `$`, requires the stock name to end in "Japan". `^` anchors the match to the start of the string, and `.*` matches any sequence of characters (including none).

Context: searching for species and region patterns in a column describing fisheries stocks, in order to manually assign OHI and FAO region ids.

```r
RAM <- read.csv("https://rawgit.com/OHI-Science/ohiprep_v2018/master/globalprep/fis/v2018/int/RAM_fao_ohi_rgns.csv")
> RAM[which(str_detect(RAM$stocklong, "^Walleye pollock.*Japan")),]
     rgn_id fao_id                              assessid RAM_area_m2   stockid                              stocklong
3552     NA     NA  FAFRFJ-APOLLNSJ-1970-2013-JPNIMP2016          NA  APOLLNSJ     Walleye pollock Sea of Japan North
3553     NA     NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016          NA APOLLPJPN Walleye pollock Pacific Coast of Japan
> RAM[which(str_detect(RAM$stocklong, paste(c("^Walleye pollock.*Japan$"), collapse = "|"))),]
     rgn_id fao_id                              assessid RAM_area_m2   stockid                              stocklong
3553     NA     NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016          NA APOLLPJPN Walleye pollock Pacific Coast of Japan
```
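
To see the effect of the `$` anchor in isolation, here is a minimal sketch on a made-up vector of stock names (the vector `stocks` and the third name are assumptions for illustration, not rows from the RAM table):

```r
library(stringr)

# Toy stock names, loosely modeled on the output above (illustrative only)
stocks <- c("Walleye pollock Sea of Japan North",
            "Walleye pollock Pacific Coast of Japan",
            "Pacific cod Sea of Japan")

# "Japan" may appear anywhere after "Walleye pollock"
loose  <- str_detect(stocks, "^Walleye pollock.*Japan")

# "Japan" must be the last thing in the string
strict <- str_detect(stocks, "^Walleye pollock.*Japan$")

stocks[loose]
stocks[strict]
```
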
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]

---
name: date

## Extracting date or number
Function: `str_extract`, RegEx: `(\\d)+`

Note: the `+` means one or more digits, so the pattern matches the first run of consecutive digits. Without the `+`, you would extract only the first digit in the string. `basename()` returns the file name at the end of a file path; here it is `annual_catch_2003.tif`.

Context: extract the year from one file path (catch raster) to find the file for the same year in another set of paths (NPP rasters).

```r
file <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/
annual_catch_2003.tif"

> basename(file)
[1] "annual_catch_2003.tif"

yr <- str_extract(basename(file), "(\\d)+") # extracts the first run of digits, "2003"

npp_files_gf[str_detect(npp_files_gf, yr)]
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/VGPM_primary_productivity
/int/annual_npp/annual_mean_npp_moll_2003_gf.tif"
```
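
The same steps can be tried on short, made-up paths. In this sketch `catch_file` and `npp_files` are hypothetical stand-ins for the server paths above:

```r
library(stringr)

# Hypothetical relative paths standing in for the files on the server
catch_file <- "int/comm_landings/annual_catch_2003.tif"
npp_files  <- c("annual_npp/annual_mean_npp_moll_2002_gf.tif",
                "annual_npp/annual_mean_npp_moll_2003_gf.tif")

first_digit <- str_extract(basename(catch_file), "\\d")     # no "+": a single digit
yr          <- str_extract(basename(catch_file), "(\\d)+")  # with "+": the whole run of digits

# Keep only the NPP file whose path contains the same year
match_npp <- npp_files[str_detect(npp_files, yr)]
```
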

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---

## Extract end number string
Function: `str_extract`, RegEx: `(\\d)+$`

Combine the digit regex with `$`, which anchors the match to the end of the string.

Example fish taxon data

```r
> head(mean_catch$stock_id_taxonkey[1:10])
[1] Marine_fishes_not_identified-57_100039    
[2] Miscellaneous_marine_crustaceans-57_100047
[3] Miscellaneous_marine_molluscs-57_100058   
[4] Marine_fishes_not_identified-57_100139    
[5] Cephalopoda-57_290002                     
[6] Rajiformes-57_300014  
```

I want to extract the six-digit taxon key at the end of each value, not the two-digit FAO id that precedes it.

Applying `str_extract` and regex `(\\d)+$`

```r
> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "(\\d)+$")
[1] "100039"

> mean_catch$stock_id_taxonkey[560467]
[1] Serranidae-31_NA
> str_extract(mean_catch$stock_id_taxonkey[560467], "(\\d)+$")
[1] NA
```
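
If both numbers are needed, capture groups are one option. The sketch below uses `stringr::str_match`, which returns one column per group; the vector `keys` is copied from the values shown above, and the pattern is an assumption about their general shape (`-<FAO id>_<taxon key>` at the end):

```r
library(stringr)

keys <- c("Marine_fishes_not_identified-57_100039",
          "Cephalopoda-57_290002",
          "Serranidae-31_NA")

# Column 1 is the full match, columns 2 and 3 are the two groups
m <- str_match(keys, "-(\\d+)_(\\d+)$")

fao_id    <- m[, 2]
taxon_key <- as.integer(m[, 3])  # NA where the key is missing
```

A value such as `Serranidae-31_NA` does not match the pattern at all, so both groups come back as `NA` for that row.
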

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---

## Extract string with words and digits
Function: `str_extract`, RegEx: `"^(\\w+).(\\d){2}"`

Extract a string that starts with one or more word characters (`\\w+`, which includes underscores), followed by any single character (`.`, here the hyphen), followed by two digits. To accept one or two digits, change the quantifier: `^(\\w+).(\\d){1,2}`.

```r
> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039

> str_extract(mean_catch$stock_id_taxonkey[1], "^(\\w+).(\\d){1,2}")
[1] "Marine_fishes_not_identified-57"
```
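
The difference between `{2}` and `{1,2}` only shows up when the id has a single digit. A minimal sketch, where the one-digit value in `one_digit` is made up for illustration:

```r
library(stringr)

two_digit <- "Cephalopoda-57_290002"
one_digit <- "Cephalopoda-4_290002"   # hypothetical value with a one-digit id

exact_two  <- str_extract(c(two_digit, one_digit), "^(\\w+).(\\d){2}")
one_or_two <- str_extract(c(two_digit, one_digit), "^(\\w+).(\\d){1,2}")
```

With `{2}` the one-digit value has no match and returns `NA`; with `{1,2}` both values are extracted.
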

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: grep

## Extract string with `grep`

Search for the string "NY.GDP.PCAP.PP.KD" in the `indicator` column of the data frame `indicators` and print the matching rows. `grep` treats the pattern as a regular expression and matches substrings, so "NY.GDP.PCAP.PP.KD.ZG" is returned as well.

```r
> class(indicators)
[1] "data.frame"
> names(indicators)
[1] "indicator"          "name"               "description"        "sourceDatabase"     "sourceOrganization"
```

```r
> indicators[grep("NY.GDP.PCAP.PP.KD", indicators$indicator), ] 
                indicator                                                name
4460    NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 international $)
4461 NY.GDP.PCAP.PP.KD.ZG               GDP per capita, PPP annual growth (%)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            description
4460 GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2005 international dollars.
4461                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                    sourceDatabase                                     sourceOrganization
4460  World Development Indicators World Bank, International Comparison Program database.
4461 Africa Development Indicators     
```
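
Because `.` is a regex wildcard, the pattern above is looser than it looks. The sketch below contrasts three ways of matching on a toy vector (the third code in `indicator` is made up to show the wildcard effect):

```r
# Toy indicator codes; the last one is invented for illustration
indicator <- c("NY.GDP.PCAP.PP.KD", "NY.GDP.PCAP.PP.KD.ZG", "NYXGDP.PCAP.PP.KD")

regex_hits   <- grep("NY.GDP.PCAP.PP.KD", indicator)                # "." matches any character
literal_hits <- grep("NY.GDP.PCAP.PP.KD", indicator, fixed = TRUE)  # literal substring match
exact_hits   <- which(indicator == "NY.GDP.PCAP.PP.KD")             # whole-string equality
```

`fixed = TRUE` is a reasonable default when the search term is a code rather than a pattern; use `==` when only the exact indicator is wanted.
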
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: posit

## Select string in a file path based on position

Here's an example cluster of file paths

```r
> count
[1] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"
[2] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size2_180.tif"
[3] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size3_180.tif"
[4] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size4_180.tif"
```
Use base function `strsplit` to split up the file path into a list of the sections

```r
> strsplit(count[1], '/')
[[1]]
 [1] ""                                "home"                            "shares"                         
 [4] "ohi"                             "git-annex"                       "globalprep"                     
 [7] "cw_pressure_trash"               "v2015"                           "globalplastic_wd_cd_rasters_180"
[10] "count_density_size1_180.tif"
```
I want the tenth string section (cont. on next slide)

```r
> unlist(strsplit(count[1], '/'))[10]
[1] "count_density_size1_180.tif"
```
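
`strsplit` is vectorised over paths, so the same section can be pulled from every path at once. A minimal sketch on shortened, hypothetical paths (the position is 5 here rather than 10 because these paths are shorter):

```r
# Hypothetical shortened paths with the same file names as above
paths <- c("/data/v2015/rasters/count_density_size1_180.tif",
           "/data/v2015/rasters/count_density_size2_180.tif")

parts <- strsplit(paths, "/", fixed = TRUE)  # one character vector per path

# Grab a section by position ...
by_position <- vapply(parts, function(p) p[5], character(1))

# ... or take the last section, whatever the depth of the path
last_part <- vapply(parts, function(p) p[length(p)], character(1))
```

When the section wanted is the file name, `basename(paths)` gives the same result without counting positions.
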
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---

You can wrap this in a function to grab the 10th section of each file path in a group of file paths.

In this example, the function is applied to each file: it grabs the file name, reads the raster, back-transforms it with `10^r`, then saves the result to a new file named with the selected string.

**Example**:

```r
unlog <- function(file) {
  # split the path on "/" and grab the 10th section (the file name) to use in naming the tif
  name <- unlist(strsplit(file, '/'))[10]
  r <- raster::raster(file)   # read the raster
  out <- 10^r                 # back-transform the values

  raster::writeRaster(out, filename = paste0(data_wd, 'v2015/tmp/unlog/unlog_', name),
                      overwrite = TRUE, format = 'GTiff')
}
```
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: year

## Select files in a list of files matching a given year

```r
data_files <- list.files(file.path(path_data, "annual_catch"), full.names = T) # 132 files

yr = 2015
yr <- as.character(yr)
## Select all catch data files with "2015" in the file name
datanames <- data_files[which(str_detect(data_files, yr))]
```
Voilà:

```r
> datanames
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchInd_2015.rds" 
[2] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchNInd_2015.rds"
```
Now I can manipulate and combine them

```r
## read in the two data tables
list_data <- map(datanames, readRDS)
## combine the two data tables in your list
combined <- bind_rows(list_data)
## save to fis folder in mazu
saveRDS(combined, paste0(fis_path, "annual_catch/", sprintf("Catch_%s.rds", yr)))
```
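
One caveat: `str_detect` on the full path also matches the year in directory names such as `v2018`. A sketch of a stricter match on the file name only, using hypothetical paths and assuming the files follow the `<name>_<year>.rds` pattern seen above:

```r
library(stringr)

# Hypothetical paths; note the "v2018" directory in every one
data_files <- c("/prs_fish/v2018/int/annual_catch/CatchInd_2015.rds",
                "/prs_fish/v2018/int/annual_catch/CatchNInd_2015.rds",
                "/prs_fish/v2018/int/annual_catch/CatchInd_2018.rds")
yr <- "2018"

# Matching the whole path picks up the directory name too
naive <- data_files[str_detect(data_files, yr)]

# Matching "_<year>.rds" at the end of the file name is stricter
strict <- data_files[str_detect(basename(data_files), paste0("_", yr, "\\.rds$"))]
```
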
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: start_with

## Extracting string using `starts_with`

Function: `dplyr::starts_with`

Data table

```r
> head(data)
  rgn_id  avg3yr_2007 avg3yr_2008  avg3yr_2009 avg3yr_2010 avg3yr_2011 avg3yr_2012
1     21    92.666667   102.00000   103.000000    110.6667   104.66667    110.3333
2     69    45.000000    42.66667    39.000000     42.0000    51.33333     51.0000
3     70   102.000000    85.66667    74.666667     91.0000   115.00000    119.0000
4     73 14749.000000 14465.66667 14502.666667  14629.3333 14770.33333  15059.6667
5    143   262.000000   183.00000   133.666667    130.6667   132.00000    146.0000
6    144     3.333333     0.00000     3.333333      5.0000     5.00000      4.0000
  avg3yr_2013 avg3yr_2014 avg3yr_2015 avg3yr_2016 avg3yr_2017
1    113.3333   115.33333   114.66667   116.00000   119.00000
2     49.0000    44.00000    42.33333    42.33333    37.66667
3    112.3333   100.66667    91.00000    75.66667    74.33333
4  14730.0000 14892.00000 13994.66667 14357.00000 14145.66667
5    111.0000    82.33333    79.66667    67.00000    70.00000
6     17.0000    17.00000    14.66667     0.00000     4.00000
  Reference_avg1979to2000monthlypixels pctdevR_2007 pctdevR_2008 pctdevR_2009 pctdevR_2010
1                            101.77273  0.910525532    1.0022331  1.012058955   1.08739020
2                             61.09091  0.736607143    0.6984127  0.638392857   0.68750000
3                            108.09091  0.943650126    0.7925428  0.690776563   0.84188394
4                          13281.95455  1.110454034    1.0891218  1.091907567   1.10144432
5                            332.86364  0.787109108    0.5497747  0.401565843   0.39255314
6                            355.68182  0.009371672    0.0000000  0.009371672   0.01405751
```

---

```r
## health data
health <- data %>%
  dplyr::filter(rgn_typ == "eez") %>%
  dplyr::select(rgn_id, habitat, dplyr::starts_with("pctdevR")) %>%
  tidyr::gather("year", "health", -(1:2)) %>%
  dplyr::mutate(year = substring(year, 9, 12)) %>%
  dplyr::mutate(year = as.numeric(year)) %>%
  dplyr::mutate(health = ifelse(health > 1, 1, health))
```
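
A smaller, self-contained version of the same idea, as a sketch: the data frame `toy` is made up, and `tidyr::pivot_longer` with `names_prefix` is swapped in for `gather` plus `substring` (a different technique from the pipeline above, shown only as an alternative).

```r
library(dplyr)
library(tidyr)

# Made-up data with the same column naming scheme as the table above
toy <- data.frame(rgn_id       = c(21, 69),
                  avg3yr_2007  = c(92.7, 45.0),
                  pctdevR_2007 = c(0.91, 0.74),
                  pctdevR_2008 = c(1.002, 0.70))

health <- toy %>%
  select(rgn_id, starts_with("pctdevR")) %>%        # drops avg3yr_2007
  pivot_longer(-rgn_id, names_to = "year", values_to = "health",
               names_prefix = "pctdevR_") %>%       # strips the prefix, leaving the year
  mutate(year   = as.numeric(year),
         health = pmin(health, 1))                  # cap values at 1
```
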

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: column_name

## Remove string from column names

Remove the suffix "name" from multiple column names in `data` (here `comname` and `sciname` become `com` and `sci`)

Example 1:

```r
comname                          sciname               ico_gl ico_rgn_id
 <chr>                            <chr>                 <lgl>       <int>
1 Blue Shark                       Prionace glauca       T              NA
2 Whale Shark                      Rhincodon typus       T              NA
3 Shortfin Mako                    Isurus oxyrinchus     T              NA
4 Olive Ridley Turtle              Lepidochelys olivacea T              NA
5 Irrawaddy Dolphin                Orcaella brevirostris T              NA
6 Humphead wrasse, Napoleon Wrasse Cheilinus undulatus   T              NA
```

```r
data <- data %>%
    setNames(names(.) %>%
               str_replace('name', ''))
```
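
`str_replace` removes the first match wherever it occurs in a name. If only a trailing suffix should go, anchoring the pattern with `$` is safer. A sketch on a made-up data frame (the column names in `df` are assumptions for illustration):

```r
library(stringr)

# Made-up columns: two end in "_name", one merely starts with "name"
df <- data.frame(com_name    = "Blue Shark",
                 sci_name    = "Prionace glauca",
                 name_source = "IUCN")

unanchored <- str_replace(names(df), "name", "")    # also alters "name_source"
anchored   <- str_replace(names(df), "_name$", "")  # only touches the suffix

names(df) <- anchored
```
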

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
Example 2:

```r
lsp_new_old <- status_3nm_new %>%
  full_join(status_3nm_old, by = c('rgn_id')) %>%
  full_join(status_1km_new, by = c('rgn_id')) %>%
  full_join(status_1km_old, by = c('rgn_id')) %>%
  mutate(status_old = (status_3nm_old + status_1km_old) / 2,
         status_new = (status_3nm_new + status_1km_new) / 2) %>%
  gather(rgn, score_new, contains('new')) %>%
  gather(rgn_old, score_old, contains('old')) %>% 
  mutate(rgn = str_replace(rgn, '_new', ''),
         rgn_old = str_replace(rgn_old, '_old', ''),
         score_new = round(score_new, 3),
         score_old = round(score_old, 3))
```
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: ifelse

## Manipulate `ifelse` and strings

**Example 1**: Mutate column with `ifelse`, `str_detect`, and `toupper`. Create column `cat` based on specifications or matches in `cat_txt`

```r
ico_assess <- ico_assess_raw %>%
  rename(cat = code, cat_txt = category) %>%
  mutate(cat = toupper(cat),
         cat = str_replace(cat, 'LR/', ''),
         cat = ifelse(cat %in% c('K', 'I'), 'DD', cat),
         cat = ifelse(cat == 'NR', 'NE', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'VERY RARE'), 'CR', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'LESS RARE'), 'T', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'STATUS INADEQUATELY KNOWN'), 'DD', cat),
         cat = ifelse(cat == 'V', 'VU', cat), 
         cat = ifelse(cat == 'E', 'EN', cat))
```
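
The order of the `ifelse` steps matters, because each one sees the result of the previous one. A minimal sketch of a few of the steps on made-up vectors (`code` and `category` are invented stand-ins for the columns of `ico_assess_raw`):

```r
library(stringr)

# Invented example values
code     <- c("LR/nt", "K", "V", "E", "nr")
category <- c("Near threatened", "Insufficiently known", "Very rare",
              "Endangered", "Not recognized")

cat <- toupper(code)                                  # standardise case first
cat <- str_replace(cat, "LR/", "")                    # "LR/NT" becomes "NT"
cat <- ifelse(cat %in% c("K", "I"), "DD", cat)
cat <- ifelse(cat == "NR", "NE", cat)
cat <- ifelse(str_detect(toupper(category), "VERY RARE"), "CR", cat)  # text overrides code
cat <- ifelse(cat == "V", "VU", cat)                  # no "V" left by now in this toy data
cat <- ifelse(cat == "E", "EN", cat)
```
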
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
**Example 2**: Use `if`/`else if` with `str_detect` to decide where to `write.csv`

```r
if (sum(str_detect(names(mar_sp), "exclude$")) == 1) {
  write.csv(maric, 'output/MAR_FP_data.csv', row.names = FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_seaweed")) == 1) {
  write.csv(maric, 'test/MAR_FP_data_no_seaweed.csv', row.names = FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_nei")) == 1) {
  write.csv(maric, 'test/MAR_FP_data_no_nei.csv', row.names = FALSE)
}
```
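
One way to make this logic easier to check is to separate choosing the path from writing the file. The helper `pick_output` below is hypothetical, uses base `grepl`, and tests with `any()` (at least one match) rather than `sum(...) == 1` (exactly one match), so it is a variation on the code above, not a copy of it:

```r
# Hypothetical helper: return the output path implied by the column names
pick_output <- function(col_names) {
  if (any(grepl("exclude$", col_names))) {
    "output/MAR_FP_data.csv"
  } else if (any(grepl("exclude_no_seaweed", col_names))) {
    "test/MAR_FP_data_no_seaweed.csv"
  } else if (any(grepl("exclude_no_nei", col_names))) {
    "test/MAR_FP_data_no_nei.csv"
  } else {
    NA_character_   # no recognised column: nothing to write
  }
}

out1 <- pick_output(c("species", "exclude"))
out2 <- pick_output(c("species", "exclude_no_nei"))
out3 <- pick_output(c("species", "include"))
```

The `$` in the first pattern is what keeps `exclude_no_nei` from being caught by the first branch.
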
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: condition

## Conditional Selection Based on String in File Name

Here's my file

```r
f
[1] "/home/shares/ohi/git-annex/globalprep/_raw_data/FAO_commodities/d2018/FAO_raw_commodities_quant_1976_2015.csv"
```

Search for string `quant` or `value`

```r
> str_detect(f, c('quant','value'))
[1]  TRUE FALSE
> c('tonnes','usd')
[1] "tonnes" "usd"  
```

Save the appropriate units to a variable for use later

```r
units <- c('tonnes','usd')[str_detect(f, c('quant','value'))]

> c('tonnes','usd')[str_detect(f, c('quant','value'))]
[1] "tonnes"
```
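
The trick works because `str_detect` is vectorised over the patterns and returns one logical per pattern, which then indexes the units vector. A sketch that applies it to two file names (the helper `units_for` and the second file name are assumptions for illustration):

```r
library(stringr)

files <- c("FAO_raw_commodities_quant_1976_2015.csv",
           "FAO_raw_commodities_value_1976_2015.csv")  # second name is hypothetical

# Hypothetical helper: pick the unit whose keyword appears in the file name
units_for <- function(f) {
  c("tonnes", "usd")[str_detect(f, c("quant", "value"))]
}

file_units <- vapply(files, units_for, character(1), USE.NAMES = FALSE)
```
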
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
name: symbols

## Removing weird symbols in species names

Example data table

```r
> head(mar_out)
  rgn_id     species                      fao environment year value Taxon_code gap_0_fill species_code
1      5 Blue shrimp Pacific, Western Central      Marine 1983    16         SH          0            1
2      5 Blue shrimp Pacific, Western Central      Marine 1984    51         SH          0            1
3      5 Blue shrimp Pacific, Western Central      Marine 1985    87         SH          0            1
4      5 Blue shrimp Pacific, Western Central      Marine 1986    59         SH          0            1
5      5 Blue shrimp Pacific, Western Central      Marine 1987    87         SH          0            1
6      5 Blue shrimp Pacific, Western Central      Marine 1988   217         SH          0            1
```
Some of the species names have weird symbols: `Barramundi(=Giant seaperch)`

.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]
---
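Filter for values containing a "(=...)" alias, then remove the parentheses and everything inside them. The `\\` escapes a special character so that it is treated as a literal symbol rather than as regex syntax. For example, `.*` is a regex meaning "any characters"; to search for a literal ".*" you would write `\\.\\*`. Note that the unescaped parentheses in the `str_detect` pattern below act as a regex group, so that pattern really matches "=" followed by anything; it is the escaped pattern in `str_replace` that targets the literal parentheses.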

```r
> test <- mar_out[which(str_detect(mar_out$species, "(=.*)")),]
>  a <- unique(test$species)
> a
[1] Barramundi(=Giant seaperch)   Chinook(=Spring=King) salmon 
[3] Coho(=Silver) salmon          Northern quahog(=Hard clam)  
[5] Snooks(=Robalos) nei          Blackspot(=red) seabream     
[7] Silversides(=Sand smelts) nei
> str_replace(a, "\\(\\=.*\\)","")
[1] "Barramundi"         "Chinook salmon"     "Coho salmon"       
[4] "Northern quahog"    "Snooks nei"         "Blackspot seabream"
[7] "Silversides nei"   
```
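
The effect of escaping the parentheses is easiest to see on a value that contains "=" but no parentheses. In this sketch the second element of `x` is made up for that purpose:

```r
library(stringr)

x <- c("Coho(=Silver) salmon", "Length=10cm")  # second value is invented

unescaped <- str_detect(x, "(=.*)")      # parentheses form a group: matches any "="
escaped   <- str_detect(x, "\\(=.*\\)")  # literal "(" ... ")" required

cleaned <- str_replace(x, "\\(=.*\\)", "")  # only the first value changes
```
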
.footnote[Back to [Table of Contents](https://rawgit.com/OHI-Science/globalfellows/gh-pages/xaringan/strings_slidedeck.html#2)]

Notes for current slide

Notes for next slide

OHI String Applicationsiwensu031309-14-20181 / 19

Table of Contents

2 / 19

Extract using start and end pattern

Function: str_detect, RegEx: ^, .*, $

Note: the two lines of code below extract different strings in the column RAM$stocklong. The first doesn't require stock names to end in the string "Japan". The second one does. .* is an indicator for "any string". Context: Searching for species and region string patterns in a column describing fisheries stock in order to manually assign OHI and FAO region ids.

RAM <- read.csv("https://rawgit.com/OHI-Science/ohiprep_v2018/master/globalprep/fis/v2018/int/RAM_fao_ohi_rgns.csv")
> RAM[which(str_detect(RAM$stocklong, "^Walleye pollock.*Japan")),]
     rgn_id fao_id                              assessid RAM_area_m2   stockid                              stocklong
3552     NA     NA  FAFRFJ-APOLLNSJ-1970-2013-JPNIMP2016          NA  APOLLNSJ     Walleye pollock Sea of Japan North
3553     NA     NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016          NA APOLLPJPN Walleye pollock Pacific Coast of Japan
> RAM[which(str_detect(RAM$stocklong, paste(c("^Walleye pollock.*Japan$"), collapse = "|"))),]
     rgn_id fao_id                              assessid RAM_area_m2   stockid                              stocklong
3553     NA     NA FAFRFJ-APOLLPJPN-1975-2013-JPNIMP2016          NA APOLLPJPN Walleye pollock Pacific Coast of Japan

Back to Table of Contents

3 / 19

Extracting date or number

Function: str_extract, RegEx: (\\d)+

Note: The + sign indicates one or more digits. Without the +, you would extract just the first digit in the string. basename( ) takes the base file name of a file path. In this case, it's 'annual_catch_2003.tif' Context: Extract year from a file path (Catch raster) to find the corresponding year in another file path (NPP raster)

file <- "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/comm_landings/
annual_catch_2003.tif"
> basename(file)
[1] "annual_catch_2003.tif"
year <- str_extract(basename(file),"(\\d)+") # extracts any digits
npp_files_gf[str_detect(npp_files_gf, yr)]
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/VGPM_primary_productivity
/int/annual_npp/annual_mean_npp_moll_2003_gf.tif"

Back to Table of Contents

4 / 19

Extract end number string

Function: str_extract, RegEx: (\\d)+$

Combining the date regex with the $ sign.

Example fish taxon data

> head(mean_catch$stock_id_taxonkey[1:10])
[1] Marine_fishes_not_identified-57_100039    
[2] Miscellaneous_marine_crustaceans-57_100047
[3] Miscellaneous_marine_molluscs-57_100058   
[4] Marine_fishes_not_identified-57_100139    
[5] Cephalopoda-57_290002                     
[6] Rajiformes-57_300014

I want to extract the 6-digit taxon key appended at the end of each value, not the FAO id, the two-digit number that precedes it.

Applying str_extract and regex (\\d)+$

> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "(\\d)+$")
[1] "100039"
> mean_catch$stock_id_taxonkey[560467]
[1] Serranidae-31_NA
> str_extract(mean_catch$stock_id_taxonkey[560467], "(\\d)+$")
[1] NA

Back to Table of Contents

5 / 19

Extract string with words and digits

Function: str_extract, RegEx: "^(\\w+).(\\d){2}"

Extract string that starts with one or more word characters, followed by some characters indicated by ., and ending with the first two digits that come up. To specify 1 or 2 digits, can adjust to ^(\\w+).(\\d{1,2}.

> mean_catch$stock_id_taxonkey[1]
[1] Marine_fishes_not_identified-57_100039
> str_extract(mean_catch$stock_id_taxonkey[1], "^(\\w+).(\\d){1,2}")
[1] "Marine_fishes_not_identified-57"

Back to Table of Contents

6 / 19

Extract string with `grep`

Search for the string "NY.GDP.PCAP.PP.KD" in data frame indicators which has a column named indicator and print rows that match that qualification.

> class(indicators)
[1] "data.frame"
names(indicators)
[1] "indicator"          "name"               "description"        "sourceDatabase"     "sourceOrganization"

> indicators[grep("NY.GDP.PCAP.PP.KD", indicators$indicator), ] 
                indicator                                                name
4460    NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 international $)
4461 NY.GDP.PCAP.PP.KD.ZG               GDP per capita, PPP annual growth (%)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            description
4460 GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2005 international dollars.
4461                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                    sourceDatabase                                     sourceOrganization
4460  World Development Indicators World Bank, International Comparison Program database.
4461 Africa Development Indicators

Back to Table of Contents

7 / 19

Select string in a file path based on position

Here's an example cluster of file paths

> count
[1] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size1_180.tif"
[2] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size2_180.tif"
[3] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size3_180.tif"
[4] "/home/shares/ohi/git-annex/globalprep/cw_pressure_trash/v2015/globalplastic_wd_cd_rasters_180/count_density_size4_180.tif"

Use base function strsplit to split up the file path into a list of the sections

> strsplit(count[1],'/','.')
[[1]]
 [1] ""                                "home"                            "shares"                         
 [4] "ohi"                             "git-annex"                       "globalprep"                     
 [7] "cw_pressure_trash"               "v2015"                           "globalplastic_wd_cd_rasters_180"
[10] "count_density_size1_180.tif"

I want the tenth string section (cont. on next slide)

>  unlist(strsplit(count[1],'/','.'))[10]
[1] "count_density_size1_180.tif"

Back to Table of Contents

8 / 19

Can create a function/loop to grab the 10th string section in each of the file paths in a group of file paths.

In this example, the function loops through each file to grab the string, reads the entire raster file, does some manipulation, then saves the raster in a new file using the string that was selected.

Example:

  unlog = function(file){
  name = unlist(strsplit(file,'/','.'))[3] #split filename, grab second string to use in naming tif
  r = raster(file)
  out = 10^r
  writeRaster(out,filename=paste0(data_wd,'v2015/tmp/unlog/unlog_',name,sep=''),overwrite=T,format='GTiff')
}

Back to Table of Contents

9 / 19

Select files in a list of files matching a given year

data_files <- list.files(file.path(path_data, "annual_catch"), full.names = T) # 132 files
yr = 2015
yr <- as.character(yr)
## Select all catch data files with "2015" in the file name
datanames <- data_files[which(str_detect(data_files, yr))]

voila

> datanames
[1] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchInd_2015.rds" 
[2] "/home/shares/ohi/git-annex/globalprep/prs_fish/v2018/int/annual_catch/CatchNInd_2015.rds"

Now I can manipulate and combine them

## read in the two data tables
  list_data <- map(datanames, readRDS)
  ## combine the two data tables in your list
  combined <- bind_rows(list_data)
  ## save to fis folder in mazu
  saveRDS(combined, paste0(fis_path, "annual_catch/", sprintf("Catch_%s.rds",yr)))

Back to Table of Contents

10 / 19

Extracting string using `starts_with`

Function: dplyr::starts_with

Data table

> head(data)
  rgn_id  avg3yr_2007 avg3yr_2008  avg3yr_2009 avg3yr_2010 avg3yr_2011 avg3yr_2012
1     21    92.666667   102.00000   103.000000    110.6667   104.66667    110.3333
2     69    45.000000    42.66667    39.000000     42.0000    51.33333     51.0000
3     70   102.000000    85.66667    74.666667     91.0000   115.00000    119.0000
4     73 14749.000000 14465.66667 14502.666667  14629.3333 14770.33333  15059.6667
5    143   262.000000   183.00000   133.666667    130.6667   132.00000    146.0000
6    144     3.333333     0.00000     3.333333      5.0000     5.00000      4.0000
  avg3yr_2013 avg3yr_2014 avg3yr_2015 avg3yr_2016 avg3yr_2017
1    113.3333   115.33333   114.66667   116.00000   119.00000
2     49.0000    44.00000    42.33333    42.33333    37.66667
3    112.3333   100.66667    91.00000    75.66667    74.33333
4  14730.0000 14892.00000 13994.66667 14357.00000 14145.66667
5    111.0000    82.33333    79.66667    67.00000    70.00000
6     17.0000    17.00000    14.66667     0.00000     4.00000
  Reference_avg1979to2000monthlypixels pctdevR_2007 pctdevR_2008 pctdevR_2009 pctdevR_2010
1                            101.77273  0.910525532    1.0022331  1.012058955   1.08739020
2                             61.09091  0.736607143    0.6984127  0.638392857   0.68750000
3                            108.09091  0.943650126    0.7925428  0.690776563   0.84188394
4                          13281.95455  1.110454034    1.0891218  1.091907567   1.10144432
5                            332.86364  0.787109108    0.5497747  0.401565843   0.39255314
6                            355.68182  0.009371672    0.0000000  0.009371672   0.01405751

11 / 19

## health data
health <- data %>%
  dplyr::filter(rgn_typ == "eez") %>%
  dplyr::select(rgn_id, habitat, dplyr::starts_with("pctdevR")) %>%
  tidyr::gather("year", "health", -(1:2)) %>%
  dplyr::mutate(year = substring(year, 9, 12)) %>%
  dplyr::mutate(year = as.numeric(year)) %>%
  dplyr::mutate(health = ifelse(health > 1, 1, health))

Back to Table of Contents

12 / 19

Remove string from column names

Remove the suffix "_name" from multiple columns in data

Example 1:

comname                          sciname               ico_gl ico_rgn_id
 <chr>                            <chr>                 <lgl>       <int>
1 Blue Shark                       Prionace glauca       T              NA
2 Whale Shark                      Rhincodon typus       T              NA
3 Shortfin Mako                    Isurus oxyrinchus     T              NA
4 Olive Ridley Turtle              Lepidochelys olivacea T              NA
5 Irrawaddy Dolphin                Orcaella brevirostris T              NA
6 Humphead wrasse, Napoleon Wrasse Cheilinus undulatus   T              NA

data <- data %>%
    setNames(names(.) %>%
               str_replace('name', ''))
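
The same renaming can be sketched without the pipe. The `spp` table below is a hypothetical one-row stand-in, and anchoring the pattern with `$` is a suggested variation (not part of the original code) so that only a trailing "name" is removed.

```r
library(stringr)

# Hypothetical table with the same column names as the printed example
spp <- data.frame(
  comname = "Blue Shark",
  sciname = "Prionace glauca",
  ico_gl  = TRUE
)

# "name$" only matches "name" at the end of a column name
names(spp) <- str_replace(names(spp), "name$", "")

names(spp)
```

Without the `$`, `str_replace` removes the first "name" it finds anywhere in a column name, which would also alter a column such as `name_source`.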


Example 2:

lsp_new_old <- status_3nm_new %>%
  full_join(status_3nm_old, by = c('rgn_id')) %>%
  full_join(status_1km_new, by = c('rgn_id')) %>%
  full_join(status_1km_old, by = c('rgn_id')) %>%
  mutate(status_old = (status_3nm_old + status_1km_old) / 2,
         status_new = (status_3nm_new + status_1km_new) / 2) %>%
  gather(rgn, score_new, contains('new')) %>%
  gather(rgn_old, score_old, contains('old')) %>% 
  mutate(rgn = str_replace(rgn, '_new', ''),
         rgn_old = str_replace(rgn_old, '_old', ''),
         score_new = round(score_new, 3),
         score_old = round(score_old, 3))
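
A reduced sketch of the gather-then-strip step, assuming a made-up `toy` table with only the "new" columns. After `gather`, the old column names become values in `rgn`, so `str_replace` works on them like any other character column.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical scores; values are invented
toy <- data.frame(
  rgn_id         = 1:2,
  status_3nm_new = c(0.5, 0.7),
  status_1km_new = c(0.6, 0.8)
)

toy_long <- toy %>%
  # column names containing "new" move into `rgn`, their values into `score_new`
  gather(rgn, score_new, contains("new")) %>%
  # strip the "_new" suffix from the former column names
  mutate(rgn = str_replace(rgn, "_new", ""))

toy_long
```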


Manipulate `ifelse` and strings

Example 1: Mutate a column with `ifelse`, `str_detect`, and `toupper`. Recode the column `cat` based on its own values or on pattern matches in `cat_txt`.

ico_assess <- ico_assess_raw %>%
  rename(cat = code, cat_txt = category) %>%
  mutate(cat = toupper(cat),
         cat = str_replace(cat, 'LR/', ''),
         cat = ifelse(cat %in% c('K', 'I'), 'DD', cat),
         cat = ifelse(cat == 'NR', 'NE', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'VERY RARE'), 'CR', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'LESS RARE'), 'T', cat),
         cat = ifelse(str_detect(toupper(cat_txt), 'STATUS INADEQUATELY KNOWN'), 'DD', cat),
         cat = ifelse(cat == 'V', 'VU', cat), 
         cat = ifelse(cat == 'E', 'EN', cat))
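
The order of the `ifelse` steps matters, because each one sees the result of the previous one. A minimal sketch on invented codes (the `toy_assess` table is an assumption, and only a subset of the recoding rules is shown):

```r
library(dplyr)
library(stringr)

# Hypothetical raw codes and descriptions
toy_assess <- data.frame(
  cat     = c("lr/nt", "k", "v", "e"),
  cat_txt = c("near threatened", "unknown", "vulnerable", "Very rare in region"),
  stringsAsFactors = FALSE
)

toy_clean <- toy_assess %>%
  mutate(cat = toupper(cat),                        # "lr/nt" becomes "LR/NT"
         cat = str_replace(cat, "LR/", ""),         # "LR/NT" becomes "NT"
         cat = ifelse(cat %in% c("K", "I"), "DD", cat),
         # text match overrides the code: upper-casing makes it case-insensitive
         cat = ifelse(str_detect(toupper(cat_txt), "VERY RARE"), "CR", cat),
         cat = ifelse(cat == "V", "VU", cat))

toy_clean$cat
```

The fourth row starts as "e" but ends as "CR" because its description matches "VERY RARE".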


Example 2: Use `if`/`else if` with `str_detect` on column names to choose which file `write.csv` writes

if (sum(str_detect(names(mar_sp), "exclude$")) == 1) {
  write.csv(maric, 'output/MAR_FP_data.csv', row.names = FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_seaweed")) == 1) {
  write.csv(maric, 'test/MAR_FP_data_no_seaweed.csv', row.names = FALSE)
} else if (sum(str_detect(names(mar_sp), "exclude_no_nei")) == 1) {
  write.csv(maric, 'test/MAR_FP_data_no_nei.csv', row.names = FALSE)
}
```
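
The branching can be checked without writing any files by returning the path instead. `choose_output` below is a hypothetical helper written for this illustration, and the final `NA` branch is an added assumption for the case where no column matches.

```r
library(stringr)

# Hypothetical helper: returns the output path implied by the column names
choose_output <- function(col_names) {
  if (sum(str_detect(col_names, "exclude$")) == 1) {
    "output/MAR_FP_data.csv"
  } else if (sum(str_detect(col_names, "exclude_no_seaweed")) == 1) {
    "test/MAR_FP_data_no_seaweed.csv"
  } else if (sum(str_detect(col_names, "exclude_no_nei")) == 1) {
    "test/MAR_FP_data_no_nei.csv"
  } else {
    NA_character_
  }
}

choose_output(c("species", "exclude"))
choose_output(c("species", "exclude_no_seaweed"))
```

The `$` in "exclude$" is what keeps the first branch from also firing on `exclude_no_seaweed` or `exclude_no_nei`.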


Conditional Selection Based on String in File Name

Here's my file

f
[1] "/home/shares/ohi/git-annex/globalprep/_raw_data/FAO_commodities/d2018/FAO_raw_commodities_quant_1976_2015.csv"

Search for the string "quant" or "value"

> str_detect(f, c('quant','value'))
[1]  TRUE FALSE
> c('tonnes','usd')
[1] "tonnes" "usd"

Save the appropriate units to a variable for use later

units <- c('tonnes','usd')[str_detect(f, c('quant','value'))]
> c('tonnes','usd')[str_detect(f, c('quant','value'))]
[1] "tonnes"
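
This works because `str_detect` is vectorised over the pattern: one file name tested against two patterns gives a logical vector of length two, which then indexes the vector of units. A sketch applying the same idea to several files; the shortened file names are assumptions modelled on the path above.

```r
library(stringr)

# Hypothetical file names following the same naming pattern
files <- c("FAO_raw_commodities_quant_1976_2015.csv",
           "FAO_raw_commodities_value_1976_2015.csv")

# For each file, test both patterns and keep the unit whose pattern matched
units <- sapply(files, function(f) {
  c("tonnes", "usd")[str_detect(f, c("quant", "value"))]
}, USE.NAMES = FALSE)

units
```

The lookup relies on exactly one pattern matching each file name; a name containing neither string would yield an empty result.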


Removing weird symbols in species names

Example data table

> head(mar_out)
  rgn_id     species                      fao environment year value Taxon_code gap_0_fill species_code
1      5 Blue shrimp Pacific, Western Central      Marine 1983    16         SH          0            1
2      5 Blue shrimp Pacific, Western Central      Marine 1984    51         SH          0            1
3      5 Blue shrimp Pacific, Western Central      Marine 1985    87         SH          0            1
4      5 Blue shrimp Pacific, Western Central      Marine 1986    59         SH          0            1
5      5 Blue shrimp Pacific, Western Central      Marine 1987    87         SH          0            1
6      5 Blue shrimp Pacific, Western Central      Marine 1988   217         SH          0            1

Some of the species names contain an alternative name in parentheses, introduced by "=", for example: Barramundi(=Giant seaperch)


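Filter for values containing the "(=...)" pattern, then remove the parenthetical, parentheses included. In an R string, `\\` escapes a regex metacharacter so that it is matched literally rather than interpreted as regex syntax. For example, "." is a metacharacter that matches any character; to search for a literal "." you would write "\\.". Note that the parentheses in the filter pattern below are not escaped, so they act as a regex group and the filter matches any name containing "="; the `str_replace` call escapes them and therefore matches the literal "(=...)".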

> test <- mar_out[which(str_detect(mar_out$species, "(=.*)")),]
>  a <- unique(test$species)
> a
[1] Barramundi(=Giant seaperch)   Chinook(=Spring=King) salmon 
[3] Coho(=Silver) salmon          Northern quahog(=Hard clam)  
[5] Snooks(=Robalos) nei          Blackspot(=red) seabream     
[7] Silversides(=Sand smelts) nei
> str_replace(a, "\\(\\=.*\\)","")
[1] "Barramundi"         "Chinook salmon"     "Coho salmon"       
[4] "Northern quahog"    "Snooks nei"         "Blackspot seabream"
[7] "Silversides nei"
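
A small self-contained sketch of the escaping rules, using two names from the output above plus one name without a parenthetical:

```r
library(stringr)

species <- c("Barramundi(=Giant seaperch)",
             "Chinook(=Spring=King) salmon",
             "Blue shrimp")

# Escaped parentheses match a literal "(" and ")"
has_alias <- str_detect(species, "\\(=.*\\)")

# Remove the parenthetical; names without one are returned unchanged
cleaned <- str_replace(species, "\\(=.*\\)", "")

# An unescaped "." matches any character; "\\." matches only a literal dot
dot_any     <- str_detect("spp", ".")
dot_literal <- str_detect("spp", "\\.")

cleaned
```

Because `.*` is greedy, the match runs to the last ")" in the string, which is the desired behaviour here since each name holds a single parenthetical.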
