mining_check.Rmd

---
title: 'Calculate fishing stressor sensitivities from IUCN impact scores'
author: "*Compiled on `r date()` by `r Sys.info()['user']`*"
output: 
  html_document:
    code_folding: hide
    toc: true
    toc_depth: 3
    toc_float: yes
    number_sections: true
    theme: cerulean
    highlight: haddock
    includes: 
      in_header: '~/github/src/templates/ohara_hdr.html'
  pdf_document:
    toc: true
---

``` {r setup, echo = TRUE, message = FALSE, warning = FALSE}

knitr::opts_chunk$set(fig.width = 6, fig.height = 4, fig.path = 'figs/',
                      echo = TRUE, message = FALSE, warning = FALSE)

source('https://raw.githubusercontent.com/oharac/src/master/R/common.R')

source(here::here('common_fxns.R'))

library(tidytext) ### to help parse ngrams in the narratives

```


# Summary

Stressor sensitivity is a very simplified version of stressor weights (see draft1 folder for complicated version).  Here, we will simply note whether a given species is sensitive to a given stressor, using the IUCN impact weights (i.e. non-negligible impact score), ranked as 0-3 (no, low, medium, high impact) for later differentiation if necessary.  For species with NA impact scores, we can imply sensitivity based on other fields, e.g. severity.  This version of sensitivity is not related to the range of exposure, as the weight calculation was doing.

Sensitivity will denote whether a species range overlapping a stressor range constitutes an impact.

# Methods

## Set up stressor sensitivity dataframe

Using all listed threats and the threat-to-stressor lookup, and focusing on those species with valid maps (i.e. a csv file in `spp_rasts_mol_2019`), identify species with non-negligible impacts from impact score, and where impact is unknown, based on severity field.  Here we are looking at fishing threats.  To differentiate between pelagic low/high bycatch, demersal destructive, and demersal non-destructive low/high bycatch, we will mine the species narratives for gear types, and use habitat information to locate the species in the water column.

In cases where a species is impacted by multiple threats that map to a single stressor (e.g. different mentions of pelagic high bycatch gear in different threat categories) we summarize to the _maximum_ score out of all listed.  For example, a species suffers low impact from unintentional impacts from high bycatch pelagic fishing (score = 3) but medium impact from intentional use (score = 5), the high bycatch pelagic fishing stressor layer will be weighted as a 5 across the board.

In this step we will include all species including LC and DD.  We will however limit to comprehensively assessed taxa.

Threats from IUCN mapped to fishing stressors:

| Code  | Description                                   | Potential strs      |
|:-----:|:----------------------------------------------|:--------------------|
| 5.4   | Biological resource use: Fishing & harvesting aquatic resources | NA |
| 5.4.1 | Intentional use: subsistence/small scale (species being assessed is the target) | art_fish |
| 5.4.2 | Intentional use: large scale (species being assessed is the target) | dem_dest, dem_nondest_hb, dem_nondest_lb, pel_hb, pel_lb |
| 5.4.3 | Unintentional effects: subsistence/small scale (species being assessed is not the target) | art_fish |
| 5.4.4 | Unintentional effects: large scale (species being assessed is not the target) | dem_dest, dem_nondest_hb, dem_nondest_lb, pel_hb, pel_lb |
| 5.4.5 | Persecution/control | NA |
| 5.4.6 | Motivation Unknown/Unrecorded | NA |

```{r set up spp with maps and threats}
spp_marine <- read_csv(here('_data', sprintf('spp_marine_from_api_%s.csv', api_version)))

spp_maps <- read_csv(here('_data', sprintf('spp_marine_maps_%s.csv', api_version)))

spp_risk <- read_csv(here('_data', sprintf('iucn_risk_current_%s.csv', api_version))) %>%
  select(iucn_sid, cat_score)

spp_comp <- read_csv(here('_data', sprintf('iucn_comp_assessed_%s.csv', api_version)))

chi_lookup <- read_csv(here('_raw/iucn_threat_to_stressor_lookup.csv')) %>%
  mutate(stressor = str_split(stressor, ';')) %>%
  unnest(stressor) %>%
  filter(str_detect(stressor, '[a-z]')) %>%
  filter(category == 'fishing')

spp_threats <- read_csv(file.path(dir_bd_anx, 'iucn/threats',
                               sprintf('iucn_spp_threats_%s.csv', api_version))) %>%
  left_join(chi_lookup, by = 'code') %>%
  filter(!is.na(code))

### Join the maps to the threats.
### * recategorize sensitivities based on impact score
### * if no impact score, categorize sensitivities based on severity
### * rather than filtering, set NA and past scores to 0
###   * this allows non-threatened species to remain in the mix
threats_to_marine <- spp_marine %>%
  left_join(spp_risk, by = 'iucn_sid') %>%
  left_join(spp_threats, by = 'iucn_sid') %>%
  ### code low-med-high sensitivity based on impact:
  mutate(sens = case_when(score_num %in% 0:2 ~ FALSE,     ### no/negligible impact
                          score_num > 2      ~ TRUE,      ### non-negligible impact
                          TRUE               ~ NA)) %>%   ### set all else to NA
  ### for NA impact, code sensitivity based on severity:
  mutate(sev = tolower(severity),
         sens = case_when(is.na(sens) & str_detect(sev, 'negligible|no decl') ~ FALSE,
                          is.na(sens) & str_detect(sev, 'very rapid decl')    ~ TRUE,
                          is.na(sens) & str_detect(sev, 'rapid decl')         ~ TRUE,
                          is.na(sens) & str_detect(sev, 'causing|slow, significant') ~ TRUE,
                          TRUE ~ sens)) %>%
  ### set past impacts and unresolved sensitivies to 0:
  mutate(sens = ifelse(timing == 'Past, Unlikely to Return', FALSE, sens),
         sens = ifelse(is.na(sens), FALSE, sens)) %>%
  filter(!is.na(code)) %>% ### drop code mismatches
  filter(!is.na(stressor)) %>% ### drop codes that don't match stressors
  select(iucn_sid, cat_score, sens, 
         code, stressor, score_num) %>%
  distinct()

### sensitivity categories:
# [1] "Unknown"                          "Negligible declines"             
# [3] NA                                 "No decline"                      
# [5] "Rapid Declines"                   "Slow, Significant Declines"      
# [7] "Causing/Could cause fluctuations" "Very Rapid Declines"   

write_csv(threats_to_marine, here('int/spp_threats_fishing.csv'))
```

``` {r}
n_comp <- spp_marine %>% 
  filter(iucn_sid %in% spp_maps$iucn_sid) %>%
  filter(iucn_sid %in% spp_comp$iucn_sid) %>% 
  .$iucn_sid %>% unique() %>% length()
n_threatened <- spp_risk %>% 
  filter(iucn_sid %in% spp_maps$iucn_sid) %>%
  filter(iucn_sid %in% spp_comp$iucn_sid) %>% 
  filter(!is.na(cat_score) & !cat_score %in% c(0, 1)) %>%
  .$iucn_sid %>% unique() %>% length()

spp_stressed <- threats_to_marine %>%
  filter(iucn_sid %in% spp_maps$iucn_sid) %>%
  filter(iucn_sid %in% spp_comp$iucn_sid) %>% 
  filter(!is.na(cat_score) & !cat_score %in% c(0, 1)) %>%
  filter(sens) 

n_stressed <- spp_stressed %>%
  .$iucn_sid %>% unique() %>% length()
  
```

Out of the list of `r n_comp` marine species with maps in comprehensively-assessed taxa, `r n_threatened` are neither Least Concern nor Extinct nor Data Deficient.  

The resulting number of threatened spp (including Near Threatened) with sensitivity to at least one __fishing__ stressor is: `r n_stressed`.

NOTE: These numbers may include a small handful of spp with effectively zero ocean range, which will not show up in maps, will not affect calculations, etc.  These can be found from the impacted area dataframe.
<!-- 6957 comp-assessed mapped species  -->
<!-- 1357 spp with category NC to CR  -->
<!--  922 with sensitivity to fishing stressors -->

## Identify stressors impact by gear type

Differentiate sensitivity to fishing effort based on gear type: demersal destructive gear (bottom trawl) and high bycatch (trawl, longline, gillnet).

* intentional use: can be low or high bycatch
    * if no high bycatch keywords, assume low bycatch?
    * if high bycatch keywords, check also for low bycatch keywords - seine? hand line/pole line?
* unintentional use: assume high bycatch?

Fishing stressor categories by gear type in Watson data, categorized via Halpern et al 2019:

|        gear         | bycatch | habitat  | destructive |
|:--------------------|:-------:|:--------:|:-----------:|
| dredge              |   high  | demersal |     yes     |
| gillnet             |   high  | dem/pel  |     no      |
| purse seine nontuna |   low   | pelagic  |     no      |
| purse seine tuna    |   low   | pelagic  |     no      |
| lines non tuna      |   low   | dem/pel  |     no      |
| pole-line tuna      |   low   | pelagic  |     no      |
| longline nontuna    |   high  | dem/pel  |     no      |
| longline tuna       |   high  | pelagic  |     no      |
| trap                |   high  | demersal |     no      |
| trawl midwater      |   high  | pelagic  |     no      |
| trawl bottom        |   high  | demersal |     yes     |

```{r set up gear to stressor lookup}
### manually set up the gear types.
gear_to_stressor_df <- tribble(
  ~gear,      ~gear_full,           ~bycatch, ~gear_hab, ~dest, ~stressor,
  'dredge',   'dredge',              'high',   'dem',    T, 'dem_dest',
  'gillnet',  'gillnet',             'high',   'dem',    F, 'dem_nondest_hb',
  'gillnet',  'gillnet',             'high',   'pel',    F, 'pel_hb',
  'seine',    'purse seine nontuna',  'low',   'pel',    F, 'pel_lb',
  'seine',    'purse seine tuna',     'low',   'pel',    F, 'pel_lb',
  'poleline', 'lines non tuna',       'low',   'dem',    F, 'dem_nondest_lb',
  'poleline', 'lines non tuna',       'low',   'pel',    F, 'pel_lb',
  'poleline', 'pole-line tuna',       'low',   'pel',    F, 'pel_lb',
  'longline', 'longline nontuna',    'high',   'dem',    F, 'dem_nondest_hb',
  'longline', 'longline nontuna',    'high',   'pel',    F, 'pel_hb',
  'longline', 'longline tuna',       'high',   'pel',    F, 'pel_hb',
  'trap',     'trap',                'high',   'dem',    F, 'dem_nondest_hb',
  'midtrawl', 'trawl midwater',      'high',   'pel',    F, 'pel_hb',
  'btmtrawl', 'trawl bottom',        'high',   'dem',    T, 'dem_dest',
  'btmtrawl', 'trawl bottom',        'high',   'pel',    T, 'dem_dest')
### Note: bottom trawl is listed as impacting pelagic as well, to
### account for a bunch of sharks and rays that are listed only as 
### "pelagic" but are noted to be susceptible to bottom trawling pressure.

write_csv(gear_to_stressor_df, here('_raw/gear_to_stressor_lookup.csv'))
  
```

## Identify species habitat location (demersal vs pelagic)

Habitat codes 10.X are marine pelagic; 9.1 is neritic pelagic.  Other 9.X, 11.X, 12.X, 13.X are demersal or benthic.  

Note that some species are listed as unknown (18) or other (17) only, which leaves pelagic/demersal status unresolved.  When comparing to different gear types, this is problematic, so here we will assume that an unknown or other habitat indicates potential exposure to both demersal and pelagic gear types.

```{r}
spp_dem_pel <- read_csv(file.path(dir_bd_anx, 'iucn',
                                sprintf('spp_habs_from_api_%s.csv', api_version)),
                     col_types = cols('iucn_sid' = 'i',
                                      'code' = 'c')) %>%
  select(iucn_sid, code, habitat, suitability) %>%
  distinct() %>%
  filter(iucn_sid %in% spp_marine$iucn_sid) %>% ### limit to marine spp or unknown/other
  mutate(code_main = as.integer(str_replace_all(code, '\\..*', ''))) %>%
  mutate(dem = code_main %in% c(9, 11, 12, 13) & code != '9.1',
         pel = code_main == 10 | code == '9.1') %>%
  group_by(iucn_sid) %>%
  ### check for unresolved pelagic/demersal and unknown/other hab:
  mutate(na_check = (sum(pel) + sum(dem) == 0),
         pel = ifelse(na_check & code_main %in% 17:18, TRUE, pel),
         dem = ifelse(na_check & code_main %in% 17:18, TRUE, dem)) %>%
  summarize(pel = any(pel),
            dem = any(dem)) %>%
  gather(spp_hab, val, pel, dem) %>%
  filter(val  == TRUE) %>%
  select(-val)

write_csv(spp_dem_pel, here('int/spp_demersal_pelagic.csv'))

```


## Check for bycatch and gear instances in narratives

### identify keywords for mid-water and bottom trawls

Since trawl surveys are often used for research, we can avoid instances of "trawl" near "survey" as a pattern; are there any other troublesome patterns? Examine bigrams from the narratives that include 'trawl'.

```{r set up for bigram analysis, eval = TRUE}

spp_narr_file <- file.path(dir_bd_anx, 'iucn', 
                           sprintf('spp_narr_from_api_%s.csv', api_version))

spp_narrs <- read_csv(spp_narr_file,
                      col_types = cols(.default = 'c', iucn_sid = 'i')) %>%
  select(iucn_sid:usetrade) %>%
  gather(dimension, text, -iucn_sid) %>%
  ### remove HTML tags since they seem to affect the unnesting of sentences
  mutate(text = str_replace_all(text, '<.*?>', ' '))

### set up some stop words.  The stop_words df is a bit large and slow
### to search over... also doesn't drop numbers
cut_words <- data.frame(
    word = c('a', 'and', 'or', 'in', 'on', 'of', 'by',
             'to', 'but', 'the', 'from', 
             '[0-9,;\\.]+')
  ) %>%
  mutate(combo = paste0('^', word, ' | ', word, '$')) %>%
    ### e.g. not "and trawl" or "trawling and"
  summarize(to_cut = paste0(combo, collapse = '|'))

```

```{r examine trawl bigrams, eval = TRUE}

trawl_text_df <- spp_narrs %>%
  ### prefilter to dramatically speed up the unnest_tokens...
  filter(str_detect(tolower(text), 'trawl')) %>%
  # group_by(iucn_sid, dimension) %>%
  unnest_tokens(output = bigram, input = text, 
                token = 'ngrams', n = 2, drop = FALSE) %>%
  filter(str_detect(bigram, 'trawl')) %>%
  mutate(pre_gram = ifelse(str_detect(bigram, '^trawl'), TRUE, FALSE))


trawl_bigram_df <- trawl_text_df %>%
  select(-text, -iucn_sid) %>%
  filter(!str_detect(bigram, cut_words$to_cut)) %>%
  group_by(bigram, pre_gram) %>%
  summarize(instances = n()) %>%
  ungroup() %>%
  arrange(desc(instances)) %>%
  filter(instances > 1)

DT::datatable(trawl_bigram_df)

```

Combinations to exclude:

* trawl survey
* fishery-independent trawl, exploratory trawl, experimental trawl
* research trawl, scientific trawl
* trawl sample, trawl experiment, trawl research

What kinds of patterns involve non-research trawls, to differentiate between mid-water and bottom?  Look at preceding and following words in triads, again using a regex string to identify instances of `[word] trawl[suffix]` or `trawl[suffix] [word]`.

* bottom trawls: 
    * bottom trawl, deep.?water trawl, demersal trawl, beam trawl, benthic trawl,
    * crab trawl, crustacean trawl, groundfish trawl, shrimp trawl
* midwater trawls:
    * midwater trawl, pelagic trawl
* indeterminate types?:
    * otter trawl
    
These (exclusion and habitat types) are coded in a regex exception string.

### identify keywords for entanglement and nets

Some gear types, e.g. seines and gillnets, may cause entanglement issues.

```{r examine entangle instances, eval = TRUE}

entangle_text_df <- spp_narrs %>%
  ### prefilter to dramatically speed up the unnest_tokens...
  filter(str_detect(tolower(text), 'entangl')) %>%
  unnest_tokens(output = sentence, input = text, 
                token = 'sentences', drop = TRUE) %>%
  # group_by(iucn_sid, dimension) %>%
  unnest_tokens(output = word, input = sentence, 
                token = 'words', 
                drop = FALSE, collapse = NULL) %>%
  filter(str_detect(word, 'entangl')) %>%
  mutate(debris = str_detect(sentence, 'debris'),
         net = str_detect(sentence, '[^a-z]net[^a-z]|^net[^a-z]|[^a-z]net$'),
         trawl = str_detect(sentence, 'trawl'),
         seine = str_detect(sentence, 'seine'))

DT::datatable(entangle_text_df %>% select(-word))

entangle_gear_df <- entangle_text_df %>%
  gather('gear', 'val', debris:seine) %>%
  group_by(gear) %>%
  summarize(entangle_counts = sum(val))

knitr::kable(entangle_gear_df)
```

Inspecting these we see nets and anthropogenic debris responsible for entanglements.  It seems reasonable to assign entanglement as a keyword for sensitivity to mid- and bottom trawl and gillnets.  What about seines?  Few instances of that combination in a single sentence, and most of those are "incidental" entanglement rather than attributed to seines.

### identify keywords for pole & line combos

What kinds of patterns involve line, pole-and-line, etc.  Look at preceding and following words in triads.  Rather than identify instances of tuna vs. non-tuna pole and line, we can simply specify that a given species is sensitive to low bycatch, non-destructive line fishing in whichever habitats the species occurs (pelagic, demersal, both).

```{r trigram analysis for pole and line}

### exclude things like:
### "decline", "longline", "linear", "lineatus", "lineolatus"
line_exclude <- c('[a-z]line|line[ao]|long.?line')

line_text_df <- spp_narrs %>%
  ### prefilter to dramatically speed up the unnest_tokens...
  filter(str_detect(tolower(text), 'line')) %>%
  # group_by(iucn_sid, dimension) %>%
  unnest_tokens(output = trigram, input = text, 
                token = 'ngrams', n = 3, drop = FALSE) %>%
  filter(str_detect(trigram, 'line') &
           !str_detect(trigram, line_exclude))


line_trigram_df <- line_text_df %>%
  select(-text, -iucn_sid) %>%
  filter(!str_detect(trigram, cut_words$to_cut)) %>%
    ### will drop "and" (etc) but only at start or end of trigram
  group_by(trigram) %>%
  summarize(instances = n()) %>%
  ungroup() %>%
  arrange(desc(instances)) %>%
  filter(instances > 1)

DT::datatable(line_trigram_df)

```

* some examples (consider also swapping spaces with hyphens, e.g. [^a-z] or [ -])
    * hook(s) (and) line, hand line, troll(ing) line,            
    * caught by line, caught in line, caught on line, caught with line, 
    * deep water line, hook and line, pole-and-line, drop-line, 
    * line fishery, line fisheries, industrial line fishery, inshore line fishing
    * commercial line
* exclude: longline (or long line)


### identify keywords for destructive fisheries?

Some species mention threats from destructive fishing practices such as blast/dynamite fishing and cyanide fishing.  These are not in our maps of gear types so they will be excluded.

## Parse threats to count by gear type

Once parsed, examine frequency of various gear types.  Presumably should be a higher proportion of gears associated with high bycatch - i.e. indiscriminate gear types.

To help differentiate between mid-water trawl and bottom trawl, for species where narratives do not clearly differentiate, we will use habitat information to determine sensitivity.

```{r set up gear keywords}
### create search terms for midwater trawl (separate into "trawl"-specific 
### vs keywords that don't involve "trawl")
gears_midtrawl1 <- c('mid.?water trawl', 'pelagic trawl')
gears_midtrawl2 <- c('entangl')
### create search terms for bottom trawl:
gears_btmtrawl1 <- c('bottom trawl', 'deep.?water trawl', 
                    'demersal trawl', 'beam trawl', 'benthic trawl',
                    'crab trawl', 'crustacean trawl', 'groundfish trawl', 
                    'shrimp trawl', 'prawn trawl', 'scallop trawl')
gears_btmtrawl2 <- c('deep sea shrimp fisher',
                     'deep sea crustacean fisher', 
                     'deep.?water demersal fisher',
                     'danish seine', 'entangl')

### create keywords for non-fishery "trawl" instances
trawl_except <- c(
  '(explora|experim|research|scientific|fisher[a-z]+.indep)[a-z]* trawl',
  'trawl[a-z]* (surv|sample|experiment|research)')

### create search terms for line fishing (but not longline):
gears_line <- c('caught [a-z]+ line', 
                  # caught in/on/with line (won't include longline)
                'deep.?water line', 
                'hook.and.line', 
                'pole.and.line', 
                'drop.?line', 
                '(?<!long.?)line fish',
                  # line fishery excl longline/long-line, etc
                'commercial line') # line fishing/fishery/fisheries

### create search terms for gillnet fishing:
gears_net <- c('gill.?net', 'drift.?net', 'set.?net')

### create regex of gear type search terms
gears_of_interest <- c('by.?catch', 
                       'dredge',
                       'long.?line', 
                       gears_line,
                       'seine',
                       gears_net,
                       '[^a-z]trap', ### avoid "entrapment"
                       'trawl',
                       'entangl',
                       'fisher') %>%
  paste0(collapse = '|')
```


``` {r search narratives for terms}
gear_narr_df <- spp_narrs %>%
  gather(dimension, text, -iucn_sid) %>%
  ### prefilter for speedier unnesting of tokens:
  filter(str_detect(tolower(text), gears_of_interest)) %>%
  tidytext::unnest_tokens(input = text, output = sentence, token = 'sentences') %>%
  ### then filter to drop sentences without gear mentions:
  filter(str_detect(sentence, gears_of_interest))

### count all the "trawl" instances, and subtract out the non-stressor trawls
trawl_narr_class <- gear_narr_df %>%
  mutate(trawl_count = str_count(sentence, 'trawl'),
         trawl_count = trawl_count - str_count(sentence, trawl_except))

gear_narr_class <- gear_narr_df %>%
  mutate(dredge_count   = str_count(sentence, 'dredge'),
         longline_count = str_count(sentence, 'long.?line'),
         poleline_count = str_count(sentence, paste0(gears_line, collapse = '|')),
         seine_count    = str_count(sentence, 'seine'), ### note: includes danish seine
         trap_count     = str_count(sentence, 'trap'),
         gillnet_count  = str_count(sentence, paste0(gears_net, collapse = '|')),
         ### trawl counts associated with explicit mention of "trawl":
         midtrawl_count1 = str_count(sentence, paste0(gears_midtrawl1, collapse = '|')),
         btmtrawl_count1 = str_count(sentence, paste0(gears_btmtrawl1, collapse = '|')),
         ### trawl counts associated with non-"trawl" descriptions:
         midtrawl_count2 = str_count(sentence, paste0(gears_midtrawl2, collapse = '|')),
         btmtrawl_count2 = str_count(sentence, paste0(gears_btmtrawl2, collapse = '|'))) %>%
  full_join(trawl_narr_class, by = c('iucn_sid', 'dimension', 'sentence')) %>%
  ### check that the midtrawl and btmtrawl counts ("trawl"-specific) don't exceed the 
  ### total non-scientific/exploratory trawls
  mutate(midtrawl_count1 = ifelse(midtrawl_count1 > trawl_count,
                                 trawl_count, midtrawl_count1),
         btmtrawl_count1 = ifelse(btmtrawl_count1 > trawl_count,
                                 trawl_count, btmtrawl_count1)) %>%
  ### if "trawl" counts but no mid- or btm-trawl, assign trawl counts to btm_trawl
  ### (assume generic trawl means bottom)
  mutate(btmtrawl_count1 = ifelse(midtrawl_count1 == 0 & btmtrawl_count1 == 0,
                                 trawl_count, btmtrawl_count1)) %>%
  ### consolidate the mid-water trawls and bottom trawls
  mutate(btmtrawl_count = btmtrawl_count1 + btmtrawl_count2,
         midtrawl_count = midtrawl_count1 + midtrawl_count2) %>%
  ### clean up
  select(-trawl_count, -sentence,
         -btmtrawl_count1, -btmtrawl_count2, 
         -midtrawl_count1, -midtrawl_count2)

gear_narr_tidy <- gear_narr_class %>%
  gather(gear, count, ends_with('count')) %>%
  mutate(gear = str_replace_all(gear, '_count', '')) %>%
  group_by(iucn_sid, gear) %>%
  summarize(count = sum(count)) %>%
  ungroup()

write_csv(gear_narr_tidy, here('int/gear_narr_tidy.csv'))

```

## Combine gear-to-stressors with species IDs and habitats

```{r}
gear_narr_tidy <- read_csv(here('int/gear_narr_tidy.csv'))
spp_dem_pel <- read_csv(here('int/spp_demersal_pelagic.csv'))
gear_to_stressor_lookup <- read_csv(here('_raw/gear_to_stressor_lookup.csv'))
spp_threats_from_iucn <- read_csv(here('int/spp_threats_fishing.csv'))

### identify all the unique fishing stressors
stressors <- spp_threats_from_iucn$stressor %>% unique()

### separate out artisanal fishing threats as these are not gear-based
art_fish_from_iucn <- spp_threats_from_iucn %>%
  filter(stressor == 'art_fish') %>%
  select(iucn_sid, stressor, sens, code) %>%
  distinct()

### for gear-related stressors, bind spp and gear habitats per species,
### and join to the threats
gear_sens_df <- gear_narr_tidy %>%
  left_join(spp_dem_pel, by = 'iucn_sid') %>%
  left_join(gear_to_stressor_lookup, by = 'gear') %>%
  filter(spp_hab == gear_hab | is.na(spp_hab)) %>%
    ### some spp have NA for spp_hab - assume these can be either hab
  left_join(spp_threats_from_iucn, by = c('iucn_sid', 'stressor'))

### Summarize sensitivity to each stressor as any non-zero counts
### within that stressor category.  Complete the list so all stressors
### show up for all species, leaving NAs for those added via complete().
gear_sens_summary <- gear_sens_df %>%
  group_by(iucn_sid, stressor, code, score_num) %>%
  filter(!is.na(sens)) %>%
  summarize(sens = sum(count * sens) > 0) %>%
  bind_rows(art_fish_from_iucn) %>%
  group_by(iucn_sid) %>%
  complete(stressor = stressors, fill = list(sens = FALSE)) %>%
  ungroup() %>%
  mutate(score_num = ifelse(score_num >= 3 & !sens, NA, score_num))

write_csv(gear_sens_summary, 
          here('int', sprintf('spp_sens_fishing_%s.csv', api_version)))
```

## Check the number of spp sensitive to different types...

```{r}
spp_sens <- gear_sens_summary %>%
  filter(iucn_sid %in% spp_maps$iucn_sid) %>%
  filter(iucn_sid %in% spp_comp$iucn_sid) %>% 
  left_join(spp_risk, by = 'iucn_sid') %>%
  filter(!is.na(cat_score) & !cat_score %in% c(0, 1)) %>%
  filter(sens)

n_spp_sens <- spp_sens %>%
  .$iucn_sid %>%
  unique() %>%
  length()

drop_check_all <- spp_marine %>%
  filter(iucn_sid %in% spp_stressed$iucn_sid) %>%
  filter(!iucn_sid %in% spp_sens$iucn_sid) %>%
  left_join(spp_narrs, by = 'iucn_sid') %>%
  left_join(spp_maps, by = 'iucn_sid') %>%
  left_join(spp_comp %>% select(-sciname), by = 'iucn_sid') %>%
  select(iucn_sid, sciname, dimension, assess_gp, text) %>%
  distinct()

drop_check <- drop_check_all %>%
  ### prefilter to dramatically speed up the unnest_tokens...
  filter(dimension != 'habitat') %>%
  filter(str_detect(tolower(text), 'fish')) %>%
  unnest_tokens(output = sentence, input = text, 
                token = 'sentences', drop = TRUE) %>%
  filter(str_detect(sentence, 'fish'))
```

According to this analysis so far, there are `r n_spp_sens` distinct threatened species (with maps, comprehensively assessed, etc) affected by various fishing stressors.  Note that `r n_stressed` species are listed as threatened therefore a few are not being captured via gear type.  See here:

`r DT::datatable(drop_check)`

```{r}
spp_per_stressor <- gear_sens_summary %>%
  select(-code) %>%
  filter(sens) %>%
  distinct() %>%
  filter(iucn_sid %in% spp_stressed$iucn_sid) %>%
  count(stressor) %>%
  arrange(n) %>%
  mutate(stressor = fct_inorder(stressor))
ggplot(spp_per_stressor, aes(x = stressor, y = n)) +
  ggtheme_plot() +
  geom_col() + 
  coord_flip() +
  labs(title = 'sensitive spp by stressor',
       y = 'number of species affected by stressor')
```

Most species are affected by artisanal fishing.

```{r}
stressor_per_spp <- gear_sens_summary %>%
  select(-code) %>%
  filter(sens) %>%
  distinct() %>%
  filter(iucn_sid %in% spp_stressed$iucn_sid) %>%
  count(iucn_sid) %>%
  mutate(n_txt = as.character(n))

ggplot(stressor_per_spp, aes(x = n_txt)) +
  ggtheme_plot() +
  geom_bar() + 
  labs(title = 'stressor sensitivities per spp',
       x = 'number of stressors on spp',
       y = 'number of species')
```

Most species are affected by only one fishing stressor.

## Check for species listed as a threat but not otherwise captured

```{r}
threats  <- read_csv(here('int/spp_threats_fishing.csv')) %>%
  rename(thr_sens = sens) %>%
  mutate(thr_listed = TRUE)
# iucn_sid; cat_score; code; stressor; thr_sens; thr_listed

gear_sum <- read_csv(here('int', sprintf('spp_sens_fishing_%s.csv', api_version))) %>%
  rename(gear_sens = sens)
# iucn_sid; stressor; code; gear_sens

spp_comp <- read_csv(here('_data', sprintf('iucn_comp_assessed_%s.csv', api_version)))

tmp <- threats %>%
  left_join(gear_sum, by = c('iucn_sid', 'stressor', 'code')) %>%
  group_by(iucn_sid, code) %>%
  summarize(n_sens = sum(thr_sens),
            n_stressor = sum(gear_sens, na.rm = TRUE)) %>%
  ungroup()

tmp_nomatch <- tmp %>%
  filter(n_sens >= 1 & n_stressor == 0) %>%
  group_by(iucn_sid) %>%
  summarize(codes = paste(code, collapse = ', ')) %>%
  left_join(spp_comp, by = 'iucn_sid') %>%
  filter(!is.na(assess_gp)) %>%
  filter(assess_gp != 'sturgeons')
  
table(tmp_nomatch$assess_gp)

DT::datatable(tmp_nomatch)
```

Note that each of these is listed as being threatened by fishing pressures, but no categorization of gear type was found in the narratives.  Inspect each of these to better understand/inform better definition of keywords.

First round: 129 unmatched spp.  

* Fixed the `spp_hab == gear_hab` filter by adding `spp_hab == gear_hab | is.na(spp_hab)`
* Added "deep sea (shrimp|crustacean) fisher"(ies) to search term for trawl
* fixed up the "trawl" counting and method to exclude "trawl survey" etc by splitting explicit mentions of "trawl" from other terms without the word "trawl" e.g. "deep water shrimp fishery"

After these, 37 species remain that seem to have significant fishing impacts but not enough detail in the narratives to identify gear types.  To be conservative, these will simply be dropped at this point.

```{r}
x <- readxl::read_excel(here('unmatched_fishing_threats.xlsx'))

z <- tmp_nomatch %>%
  left_join(spp_narrs, by = 'iucn_sid') %>%
  filter(dimension %in% c('threats')) %>%
  # filter(dimension %in% c('threats', 'rationale', 'conservationmeasures')) %>%
  filter(!is.na(text))
  # filter(!iucn_sid %in% x$iucn_sid)

non_threats <- c('no major threats', 
                 'no apparent fishing threats', 
                 'indirect fishing impacts', 
                 'competition with fisheries - not direct',
                 'destructive',
                 'NA')

### species manually checked and not yet addressed
y <- x %>% filter(iucn_sid %in% z$iucn_sid) %>%
  filter(!is.na(notes)) %>%
  filter(!str_detect(notes, paste0(non_threats, collapse = '|')))

### species manually checked and noted as "no specific fishery" - can't infer
### what type of gear is involved
yy <- x %>% filter(iucn_sid %in% z$iucn_sid) %>%
  filter(notes == 'no specific fishery')

### non-matched not included in check excel - should be zero
yyy <- z %>% filter(!iucn_sid %in% x$iucn_sid)

DT::datatable(y)
```