---
editor_options:
  chunk_output_type: console
---
# Preparing eBird Data
## Prepare libraries and data sources
Here, we load the libraries required for preparing the eBird data. Please download the latest versions of the eBird Basic Dataset (for India) and the eBird Sampling Dataset from https://ebird.org/data/download.
```{r load_libs, message=FALSE, warning=FALSE}
# load libraries
library(tidyverse)
library(readr)
library(sf)
library(auk)
library(readxl)
library(lubridate)
# custom sum function that ignores NAs
sum.no.na <- function(x) {
  sum(x, na.rm = TRUE)
}

# set file paths for auk functions
# To use these two datasets, please download the latest versions from
# https://ebird.org/data/download and set the file paths accordingly.
# Since these datasets are extremely large, we have not uploaded them to GitHub.
# In this study, the data version corresponds to the November 2021 release.
f_in_ebd <- file.path("data/ebd_IN_relNov-2021.txt")
f_in_sampling <- file.path("data/ebd_sampling_relNov-2021.txt")
```
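Before running the `auk` filters below, it can help to confirm that the input files are actually in place; a minimal check, assuming the file paths set above:
```{r check_input_files}
# confirm both eBird datasets are on disk before running auk
stopifnot(
  file.exists(f_in_ebd),
  file.exists(f_in_sampling)
)
```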
## Filter data
Next, we load the list of species analyzed in this study. We initially chose species that occurred in at least 5% of all checklists across 50% of the 25 x 25 km cells from which they have been reported, resulting in a total of 79 species. The pre-processing used to arrive at this final list can be found in the previous script.
For further details regarding the list of species, please refer to the main text of the manuscript.
```{r soi, message=FALSE, warning=FALSE}
# add species of interest
specieslist <- read.csv("data/species_list.csv")
speciesOfInterest <- as.character(specieslist$scientific_name)
```
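As a quick, optional sanity check, the list read above should contain the 79 species mentioned earlier, with no duplicates:
```{r check_soi}
# the manuscript reports 79 species of interest
length(speciesOfInterest) # expect 79
anyDuplicated(speciesOfInterest) # expect 0
```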
Here, we set broad spatial filters for the states of Kerala, Tamil Nadu, and Karnataka, and keep only those checklists for our list of species that were reported between January 1, 2013 and May 31, 2021.
```{r prep_ebd_filters}
# run filters using the auk package
ebd_filters <- auk_ebd(f_in_ebd, f_in_sampling) %>%
  auk_species(speciesOfInterest) %>%
  auk_country(country = "IN") %>%
  # restrict geography to Tamil Nadu, Kerala, and Karnataka
  auk_state(c("IN-KL", "IN-TN", "IN-KA")) %>%
  auk_date(c("2013-01-01", "2021-05-31")) %>%
  auk_complete()

# check filters
ebd_filters
```
The code below need not be re-run if the data have already been filtered once and the paths above point to the correct dataset. NB: this is a computationally heavy process; run with caution.
```{r output_loc}
# specify output locations for the filtered data
f_out_ebd <- "data/01_ebird-filtered-EBD-westernGhats.txt"
f_out_sampling <- "data/01_ebird-filtered-sampling-westernGhats.txt"
```
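To avoid accidentally repeating the expensive filtering step, the `auk_filter()` call in the next chunk can be wrapped in a simple guard; a minimal sketch, assuming the output paths defined above:
```{r filter_guard, eval=FALSE}
# run the expensive auk_filter() only if the filtered
# files are not already on disk
if (!all(file.exists(c(f_out_ebd, f_out_sampling)))) {
  auk_filter(ebd_filters,
    file = f_out_ebd,
    file_sampling = f_out_sampling
  )
}
```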
```{r filter_data}
ebd_filtered <- auk_filter(ebd_filters,
  file = f_out_ebd,
  file_sampling = f_out_sampling,
  overwrite = TRUE
)
```
## Process filtered data
The data have been filtered above using the auk functions. We now work with the filtered checklist observations. (Note that we have not yet spatially filtered the checklists to the confines of our study area, the Nilgiris and the Anamalai hills; that step is carried out further below.)
```{r read_data}
# read in the data
ebd <- read_ebd(f_out_ebd)
```
eBird checklists record only whether a species was reported at a particular location. To arrive at absence data, we use a process known as zero-filling [@johnston2019a], wherein a new dataframe is created with a 0 recorded for each complete checklist on which the species was not observed.
```{r fill_zeroes}
# fill zeroes
zf <- auk_zerofill(f_out_ebd, f_out_sampling)
new_zf <- collapse_zerofill(zf)
```
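To make the zero-filling logic concrete, here is a toy illustration with hypothetical data (not part of the pipeline, and not auk's internal implementation): a species missing from a complete checklist receives an explicit zero-count row.
```{r zerofill_toy, eval=FALSE}
# toy example: two complete checklists, the species seen only on S1
checklists <- tibble(checklist_id = c("S1", "S2"))
observations <- tibble(
  checklist_id = "S1",
  observation_count = "2"
)
# joining and filling the gaps with zeroes yields an explicit
# absence row for checklist S2
left_join(checklists, observations, by = "checklist_id") %>%
  mutate(
    scientific_name = "Species A",
    observation_count = replace_na(observation_count, "0"),
    species_observed = observation_count != "0"
  )
```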
Let us now choose specific columns necessary for further analysis.
```{r choose_cols}
# choose columns of interest
columnsOfInterest <- c(
  "checklist_id", "scientific_name", "common_name",
  "observation_count", "locality", "locality_id",
  "locality_type", "latitude", "longitude",
  "observation_date", "time_observations_started",
  "observer_id", "sampling_event_identifier",
  "protocol_type", "duration_minutes",
  "effort_distance_km", "effort_area_ha",
  "number_observers", "species_observed",
  "reviewed"
)

# make a list of presence and absence data and choose columns of interest
data <- list(ebd, new_zf) %>%
  map(function(x) {
    x %>% select(one_of(columnsOfInterest))
  })

# remove zero-fill objects to save working memory
rm(zf, new_zf)
gc()

# the zero-filled data contain both presences and absences;
# keep only the true absences here, since the presences are already
# in the first data frame and duplicating them would bias the analysis
data[[2]] <- data[[2]] %>% filter(species_observed == FALSE)
```
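At this point, `data[[1]]` holds the presence records and `data[[2]]` only the true absences; a quick, optional check:
```{r check_split, eval=FALSE}
# data[[1]] = presences, data[[2]] = zero-filled absences
map_int(data, nrow)
all(data[[2]]$species_observed == FALSE) # expect TRUE
```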
## Spatial filter
A spatial filter is now supplied to further restrict our list of observations to the confines of the Nilgiris and the Anamalai hills of the Western Ghats biodiversity hotspot.
```{r spatial_filters}
# load shapefile of the study area
hills <- st_read("data/spatial/hillsShapefile/Nil_Ana_Pal.shp")

# apply a preliminary filter by bounding box
box <- st_bbox(hills)

# get spatial coordinates of the data
dataLocs <- data %>%
  map(function(x) {
    select(x, longitude, latitude) %>%
      filter(between(longitude, box["xmin"], box["xmax"]) &
        between(latitude, box["ymin"], box["ymax"]))
  }) %>%
  bind_rows() %>%
  distinct() %>%
  st_as_sf(coords = c("longitude", "latitude")) %>%
  st_set_crs(4326) %>%
  st_intersection(hills)

# get simplified data and drop geometry
dataLocs <- mutate(dataLocs, spatialKeep = TRUE) %>%
  bind_cols(., as_tibble(st_coordinates(dataLocs))) %>%
  st_drop_geometry()

# bind to data and then filter
data <- data %>%
  map(function(x) {
    left_join(x, dataLocs, by = c("longitude" = "X", "latitude" = "Y")) %>%
      filter(spatialKeep == TRUE) %>%
      select(-Id, -spatialKeep)
  })
```
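As an optional visual check that the spatial filter behaved as expected, the retained checklist locations can be plotted over the hills polygon:
```{r check_spatial_filter, eval=FALSE}
# retained locations should fall within the study area polygon
checkLocs <- data[[1]] %>%
  distinct(longitude, latitude) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
plot(st_geometry(hills), border = "grey40")
plot(st_geometry(checkLocs), add = TRUE, pch = 16, cex = 0.3)
```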
Save temporary data created so far.
```{r save_temp_data}
# save a temp data file
save(data, file = "data/01_data_temp.rdata")
```
## Handle presence data
Further pre-processing is required: in many checklists the species count is unknown, and an 'X' is recorded instead. Here, we convert all 'X' notations to a 1, indicating a presence (we are not concerned with abundance in this analysis). We also remove checklists where the duration in minutes is either not recorded or listed as zero. Following [@johnston2019a], we then apply sampling-effort filters, keeping only checklists with a duration of at most 300 minutes, a distance traveled of at most 5 km, and an area covered of at most 500 ha. We also exclude group checklists with more than 10 observers. Finally, to obtain appropriate detection and occupancy covariates for occupancy modeling, we restrict checklists to those recorded between December 1st and May 31st (the non-rainy months) and between 5 AM and 7 PM.
```{r proc_presence_data}
# in the first data frame, replace X (presence with unknown count) with 1
data[[1]] <- data[[1]] %>%
  mutate(observation_count = ifelse(observation_count == "X",
    "1", observation_count
  ))

# remove records where duration is 0
data <- map(data, function(x) filter(x, duration_minutes > 0))

# group data by sampling event identifier,
# then summarise the effort variables as their sums
dataGrouped <- map(data, function(x) {
  x %>%
    group_by(sampling_event_identifier) %>%
    summarise_at(
      vars(
        duration_minutes, effort_distance_km,
        effort_area_ha
      ),
      list(sum.no.na)
    )
})

# bind rows combining data frames, and apply the effort filters
dataGrouped <- bind_rows(dataGrouped) %>%
  filter(
    duration_minutes <= 300,
    effort_distance_km <= 5,
    effort_area_ha <= 500
  )

# get data identifiers, such as the sampling event identifier
dataConstants <- data %>%
  bind_rows() %>%
  select(
    sampling_event_identifier, time_observations_started,
    locality, locality_type, locality_id,
    observer_id, observation_date, scientific_name,
    observation_count, protocol_type, number_observers,
    longitude, latitude
  )

# join the summarised data with the identifiers,
# using sampling_event_identifier as the key
dataGrouped <- left_join(dataGrouped, dataConstants,
  by = "sampling_event_identifier"
)

# remove checklists (sampling events) with more than 10 observers
count(dataGrouped, number_observers > 10) # count how many have > 10 observers
dataGrouped <- filter(dataGrouped, number_observers <= 10)

# keep only checklists started between 5 AM and 7 PM
dataGrouped <- filter(
  dataGrouped,
  time_observations_started >= "05:00:00" &
    time_observations_started <= "19:00:00"
)

# keep only checklists between December 1st and May 31st
dataGrouped <- filter(dataGrouped, month(observation_date) %in% c(1, 2, 3, 4, 5, 12))
```
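A quick, optional check that the filters above took effect: the effort summaries should respect the 300-minute and 5-km limits, and only the non-rainy months should remain.
```{r check_effort_filters, eval=FALSE}
# maxima should respect the effort filters applied above
summary(dataGrouped$duration_minutes) # max <= 300
summary(dataGrouped$effort_distance_km) # max <= 5
table(month(dataGrouped$observation_date)) # months 1-5 and 12 only
```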
## Add decimal time
We add a column giving the checklist start time in decimal hours since midnight.
```{r decimal_time}
# get time in decimal hours since midnight, and assign presence/absence
time_to_decimal <- function(x) {
  x <- hms(x, quiet = TRUE)
  hour(x) + minute(x) / 60 + second(x) / 3600
}

# note: decimal time may cause issues in models if used as a
# linear rather than a quadratic effect
dataGrouped <- mutate(dataGrouped,
  # convert the character counts to numeric before comparing
  pres_abs = as.numeric(observation_count) >= 1,
  decimalTime = time_to_decimal(time_observations_started)
)

# check the class of dataGrouped; make sure it is not sf
assertthat::assert_that(!"sf" %in% class(dataGrouped))
```
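For reference, a couple of example conversions with the helper defined above:
```{r decimal_time_example, eval=FALSE}
time_to_decimal("05:00:00") # 5
time_to_decimal("18:45:00") # 18.75
```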
The data processed above are saved to a file.
```{r write_clean_data}
# save the preliminarily processed data
save(dataGrouped, file = "data/01_data_prelim_processing.Rdata")
```