-
Notifications
You must be signed in to change notification settings - Fork 4
/
01_selecting-study-species.Rmd
325 lines (273 loc) · 11.9 KB
/
01_selecting-study-species.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
editor_options:
chunk_output_type: console
output:
html_document:
df_print: paged
---
# Selecting species of interest
Prior to preparing eBird data for occupancy modeling, we selected a list of species using a simple and objective criteria. Our primary focus is to understand how terrestrial bird species occupancy (largely passerine species) varied as a function of climate and land cover across the Nilgiri and the Anamalai hills of the Western Ghats.
We derived this list from inclusion criteria adapted from the State of India’s Birds 2020. Initially, we considered all species reported on eBird that occurred within the outlines of our study area. We then added a filter to consider only terrestrial birds and removed species that are often easily confused for their congeners (eg. green/greenish warbler). In addition, we considered only those species that had a minimum of 1000 detections each between 2013 and 2021. Next, the study area was divided into 25 x 25 km cells following Viswanathan et al. (2020). We then kept only those species that occurred in at least 5% of all checklists across half of the 25 x 25 km cells from where they have been reported (there are 42 unique 25 x 25 km grid cells across our study area). We used the above criteria to ensure as much uniform sampling of a species as possible across our study area and to reduce any erroneous associations between environmental drivers and species occupancy. This resulted in a total of 79 species, prior to occupancy modeling.
This script shows the proportion of checklists that report a particular species across every 25km by 25km grid across the Nilgiris and the Anamalais. Using this analysis, we arrived at a final list of species for occupancy modeling.
## Prepare libraries
```{r setup_sup_02}
# load libraries
library(data.table)
library(readxl)
library(magrittr)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggthemes)
library(scico)
# round any function
round_any <- function(x, accuracy = 25000) {
round(x / accuracy) * accuracy
}
# set file paths for auk functions
# To use these two datasets, please download the latest versions from https://ebird.org/data/download and set the file path accordingly. Since these two datasets are extremely large, we have not uploaded the same to github.
# In this study, the version of data loaded corresponds to November 2021.
f_in_ebd <- file.path("data/ebd_IN_relNov-2021.txt")
f_in_sampling <- file.path("data/ebd_sampling_relNov-2021.txt")
```
## Subset species by geographical confines of the study area
```{r load_raw_data_supp02}
# read in shapefile of the study area to subset by bounding box
library(sf)
wg <- st_read("data/spatial/hillsShapefile/Nil_Ana_Pal.shp")
box <- st_bbox(wg)
# read in data and subset
# To access the latest dataset, please visit: https://ebird.org/data/download and set the file path accordingly.
ebd <- fread("data/ebd_IN_relNov-2021.txt")
ebd <- ebd[between(LONGITUDE, box["xmin"], box["xmax"]) &
between(LATITUDE, box["ymin"], box["ymax"]), ]
ebd <- ebd[year(`OBSERVATION DATE`) >= 2013, ]
# make new column names
newNames <- str_replace_all(colnames(ebd), " ", "_") %>%
str_to_lower()
setnames(ebd, newNames)
# keep useful columns
columnsOfInterest <- c(
"common_name", "scientific_name", "observation_count", "locality",
"locality_id", "locality_type", "latitude",
"longitude", "observation_date", "sampling_event_identifier"
)
ebd <- ebd[, ..columnsOfInterest]
```
## Subset an initial list of terrestrial birds based on a) minimum of 1000 detections between 2013-2021 and b) remove species that are often easily confused with congeners
```{r}
# Convert all presences marked 'X' as '1'
ebd <- ebd %>%
mutate(observation_count = ifelse(observation_count == "X",
"1", observation_count
))
# Convert observation count to numeric
ebd$observation_count <- as.numeric(ebd$observation_count)
totCount <- ebd %>%
dplyr::select(scientific_name, common_name, observation_count) %>%
group_by(scientific_name, common_name) %>%
summarise(tot = sum(observation_count))
# subset species with a min of 1000 detections
tot1000 <- totCount %>%
filter(tot > 1000)
species1000 <- tot1000$scientific_name
ebd1000 <- ebd[scientific_name %in% species1000, ]
# Beginning with 3.37 million observations of 684 species in eBird that occurred within the outlines of our study area (Fig. 1), over the years 2013–2021, we retained only those species that had a minimum of 1,000 detections each between 2013 and 2021 (347 species remaining; 3.33 million observations). Next, we divided the study area into 25x25 km cells following State of India’s Birds 2020 methodology. We kept only those species that occurred in at least 5% of all checklists across half of the grids (42 unique grid cells) from which they had been reported.
# export the above list as .csv to carry out initial filtering based on natural history
write.csv(totCount, "data/species-list.csv", row.names = F)
```
## Read subset of species following filtering and removal of waterbirds, raptors, and other noctural species
```{r soi_supplement,message=FALSE, warning=FALSE}
# add species of interest
# please note the below script is obtained after manual subsetting based on natural history and hence the user is asked to examine the dataset obtained in the previous time step prior to further processing
specieslist <- read.csv("data/species-list.csv")
speciesOfInterest <- specieslist$scientific_name
```
## Load raw data for locations
Add a spatial filter and assign grids of 25km x 25km.
```{r strict_filter_supp02}
# strict spatial filter and assign grid
locs <- ebd[, .(longitude, latitude)]
# transform to UTM and get 25km boxes
coords <- setDF(locs) %>%
st_as_sf(coords = c("longitude", "latitude")) %>%
`st_crs<-`(4326) %>%
bind_cols(as.data.table(st_coordinates(.))) %>%
st_transform(32643) %>%
mutate(id = 1:nrow(.))
# convert wg to UTM for filter
wg <- st_transform(wg, 32643)
coords <- coords %>%
filter(id %in% unlist(st_contains(wg, coords))) %>%
rename(longitude = X, latitude = Y) %>%
bind_cols(as.data.table(st_coordinates(.))) %>%
st_drop_geometry() %>%
as.data.table()
# remove unneeded objects
rm(locs)
gc()
coords <- coords[, .N, by = .(longitude, latitude, X, Y)]
ebd <- merge(ebd, coords, all = FALSE, by = c("longitude", "latitude"))
ebd <- ebd[(longitude %in% coords$longitude) &
(latitude %in% coords$latitude), ]
```
## Get proportional obs counts in 25km cells
```{r count_obs_cell}
# round to 25km cell in UTM coords
ebd[, `:=`(X = round_any(X), Y = round_any(Y))]
# count checklists in cell
ebd_summary <- ebd[, nchk := length(unique(sampling_event_identifier)),
by = .(X, Y)
]
# count checklists reporting each species in cell and get proportion
ebd_summary <- ebd_summary[, .(nrep = length(unique(
sampling_event_identifier
))),
by = .(X, Y, nchk, scientific_name)
]
ebd_summary[, p_rep := nrep / nchk]
# filter for soi
ebd_summary <- ebd_summary[scientific_name %in% speciesOfInterest, ]
# complete the dataframe for no reports
# keep no reports as NA --- allows filtering based on proportion reporting
ebd_summary <- setDF(ebd_summary) %>%
complete(
nesting(X, Y), scientific_name # ,
# fill = list(p_rep = 0)
) %>%
filter(!is.na(p_rep))
```
## Which species are reported sufficiently in checklists?
```{r }
# A total of 42 unique grids (of 25km by 25km) across the study area
# total number of checklists across unique grids
tot_n_chklist <- ebd_summary %>%
distinct(X, Y, nchk)
# species-specific number of grids
spp_grids <- ebd_summary %>%
group_by(scientific_name) %>%
distinct(X, Y) %>%
count(scientific_name,
name = "n_grids"
)
# Write the above two results
write_csv(tot_n_chklist, "data/01_nchk-per-grid.csv")
write_csv(spp_grids, "data/01_ngrids-per-spp.csv")
# left-join the datasets
ebd_summary <- left_join(ebd_summary, spp_grids, by = "scientific_name")
# check the proportion of grids across which this cut-off is met for each species
# Is it > 90% or 70%?
# For example, with a 3% cut-off, ~100 species are occurring in >50%
# of the grids they have been reported in
p_cutoff <- 0.05 # Proportion of checklists a species has been reported in
grid_proportions <- ebd_summary %>%
group_by(scientific_name) %>%
tally(p_rep >= p_cutoff) %>%
mutate(prop_grids_cut = n / (spp_grids$n_grids)) %>%
arrange(desc(prop_grids_cut))
grid_prop_cut <- filter(
grid_proportions,
prop_grids_cut >= p_cutoff
)
# Write the results
write_csv(grid_prop_cut, "data/01_prop-grids-per-spp.csv")
# Identifying the number of species that occur in potentially <5% of all lists
total_number_lists <- sum(tot_n_chklist$nchk)
spp_sum_chk <- ebd_summary %>%
distinct(X, Y, scientific_name, nrep) %>%
group_by(scientific_name) %>%
mutate(sum_chk = sum(nrep)) %>%
distinct(scientific_name, sum_chk)
# Approximately 90 to 100 species occur in >5% of all checklists
prop_all_lists <- spp_sum_chk %>%
mutate(prop_lists = sum_chk / total_number_lists) %>%
arrange(desc(prop_lists))
```
## Figure: Checklist distribution
```{r load_map_plot_data}
# add land
library(rnaturalearth)
land <- ne_countries(
scale = 50, type = "countries", continent = "asia",
country = "india",
returnclass = c("sf")
)
# crop land
land <- st_transform(land, 32643)
```
```{r plot_obs_distributions,echo=FALSE}
# make plot
wg <- st_transform(wg, 32643)
bbox <- st_bbox(wg)
# get a plot of number of checklists across grids
plotNchk <-
ggplot() +
geom_sf(data = land, fill = "grey90", col = NA) +
geom_tile(
data = tot_n_chklist,
aes(X, Y, fill = nchk), lwd = 0.5, col = "grey90"
) +
geom_sf(data = wg, fill = NA, col = "black", lwd = 0.3) +
scale_fill_scico(
palette = "lajolla",
direction = 1,
trans = "log10",
limits = c(1, 10000),
breaks = 10^c(1:4)
) +
coord_sf(xlim = bbox[c("xmin", "xmax")], ylim = bbox[c("ymin", "ymax")]) +
theme_few() +
theme(
legend.position = "right",
axis.title = element_blank(),
axis.text.y = element_text(angle = 90),
panel.background = element_rect(fill = "lightblue")
) +
labs(fill = "number\nof\nchecklists")
# export data
ggsave(plotNchk,
filename = "figs/fig_number_checklists_25km.png", height = 12,
width = 7, device = png(), dpi = 300
)
dev.off()
# filter list of species
ebd_filter <- semi_join(ebd_summary, grid_prop_cut, by = "scientific_name")
plotDistributions <-
ggplot() +
geom_sf(data = land, fill = "grey90", col = NA) +
geom_tile(
data = ebd_filter,
aes(X, Y, fill = p_rep), lwd = 0.5, col = "grey90"
) +
geom_sf(data = wg, fill = NA, col = "black", lwd = 0.3) +
scale_fill_scico(palette = "lajolla", direction = 1, label = scales::percent) +
facet_wrap(~scientific_name, ncol = 12) +
coord_sf(xlim = bbox[c("xmin", "xmax")], ylim = bbox[c("ymin", "ymax")]) +
ggthemes::theme_few(
base_family = "TT Arial",
base_size = 8
) +
theme(
legend.position = "right",
strip.text = element_text(face = "italic"),
axis.title = element_blank(),
axis.text.y = element_text(angle = 90),
panel.background = element_rect(fill = "lightblue")
) +
labs(fill = "prop.\nreporting\nchecklists")
# export data
ggsave(plotDistributions,
filename = "figs/fig_species_distributions.png",
height = 25, width = 25, device = png(), dpi = 300
)
dev.off()
```
![Proportion of checklists reporting a species in each grid cell (25km side) between 2013 and 2021. Checklists were filtered to be within the boundaries of the Nilgiris and the Anamalai hills (black outline), but rounding to 25km cells may place cells outside the boundary. Deeper shades of red indicate a higher proportion of checklists reporting a species.](figs/fig_species_prop_checklists_25kmgrids.png)
## Prepare the species list
```{r}
# write the new list of species that occur in at least 5% of checklists across a minimum of 50% of the grids they have been reported in
new_sp_list <- semi_join(specieslist, grid_prop_cut, by = "scientific_name")
write_csv(new_sp_list, "results/01_list-of-species-cutoff.csv")
```