-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
additional functionality for MOL internal uses #236
base: master
Are you sure you want to change the base?
Conversation
identify records outside a region of interest (mol_roi.R), identify records flagged according to a quality control attribute provided by the source (mol_flagged_by_source.R), and identify spatiotemporal duplicates (mol_spatiotemporal_duplicate.R).
Hi Matt, My apologies for the delay in getting back to you. bdc team will check the functions carefully, but I believe that they be incorporated into bdc seamlessly. It will be our pleasure to invite you to be a collaborator of bdc here on GitHub and on future publications. Cheers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @matthewsrogan,
We are glad of your interest in the bdc and its usefulness for your research group. I think that your functions are a great contribution to bdc, and we can work on that to meet the package standards. First, we cannot include more dependencies directly in our package (for example, the terra
package in your functions), but it can be solved using our internal function (see comments in the review). Also, can you include some tests to make sure your functions are returning what they must return? Although tests are not mandatory for CRAN, they ensure that our functions work as expected. Let us know if you need some help with this.
data, | ||
flagStrings, | ||
flagCols | ||
){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be something like this: mol_flagged_by_source <- function(data, flagStrings, flagCols) {...}
.
R/mol_flagged_by_source.R
Outdated
|
||
check_col(data, flagCols) | ||
|
||
terms <- str_c(flagStrings, collapse = "|") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All function calls need prepended package name, for example: stringr::str_c(...)
. Same holds for if_any
and str_detect
below.
R/mol_roi.R
Outdated
if(!file.exists(roi)) stop("The ROI filepath is invalid.") | ||
|
||
# check proper format | ||
if(!any(str_detect(roi, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing prepended package name in function calls here.
R/mol_roi.R
Outdated
|
||
### Convert to sf/terra | ||
if("SpatialPolygons" %in% .class2(roi)) roi <- st_as_sfc(roi) | ||
if(class(roi) == "RasterLayer") roi <- rast(roi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing prepended package name in function calls here.
R/mol_roi.R
Outdated
|
||
|
||
unflgd <- crpd %>% | ||
dplyr::mutate(within = st_within(., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing prepended package name in function calls here.
R/mol_roi.R
Outdated
maskValue){ | ||
|
||
### Reproject to raster CRS | ||
dataCoords[, c("tempX", "tempY")] <- geom(project(vect(dataCoords, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing prepended package name in function calls here.
R/mol_roi.R
Outdated
|
||
if(!is.na(maskValue)){ | ||
smpld <- smpld %>% | ||
filter(value != maskValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing prepended package name in function calls here. filter
here and in 238.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi
I tested the mol_roi() function, and there are several errors. Also some inconsistencies with class of R object that this function is limited to, but they are widely used in bdc, for example, mol_roi is written to handle data.frame, however it must work with tibble too and return tibble objects :)
R/mol_roi.R
Outdated
#' @param data data.frame. Containing geographical coordinates. | ||
#' @param roi spatial region of interest. Can be provided as an sf, sfc, | ||
#' SpatialPolygons, SpatialPolygonsDataFrame, rasterLayer, spatRaster, | ||
#' or as a file path with extension #' ".shp", ".gpkg", or ".tif". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove #' before ".shp", ".gpkg", or ".tif".
(#' or as a file path with extension ".shp", ".gpkg", or ".tif".)
R/mol_roi.R
Outdated
#' @details This test identifies records outside a particular region of interest | ||
#' defined by a raster or vector layer | ||
#' | ||
#' @return A data.frame containing the column ".outside_roi". Compliant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bdc functions return tibble objects
check_col(data, c(lat, lon)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace by
all(c(lat, lon)%in%colnames(data))
So you can remove this check_col function that belongs to a package not used in bdc
if (!is.data.frame(data)) { | ||
stop(deparse(substitute(data)), " is not a data.frame", call. = FALSE) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function must be flexible to handle tibble or data.frame object
R/mol_roi.R
Outdated
if(!any(c("character", | ||
"sf", | ||
"sfc", | ||
"SpatialPolygons", | ||
"RasterLayer", | ||
"SpatRaster")) %in% class(roi)){ | ||
stop("The region of interest is not in a valid format.") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parentheses are wrong here replace by
if(!any(c("character",
"sf",
"sfc",
"SpatialPolygons",
"RasterLayer",
"SpatRaster") %in% class(roi))){
stop("The region of interest is not in a valid format.")
}
R/mol_roi.R
Outdated
### converts coordinates columns to numeric | ||
data <- | ||
data %>% | ||
dplyr::rowid_to_column("id_temp") %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rowied_to_column() is from tibble package, fix it
R/mol_roi.R
Outdated
dataCoords %>% | ||
dplyr::select(dplyr::all_of(c("id_temp", lon, lat))) %>% | ||
dplyr::filter(!is.na(.data[[lat]]), | ||
!is.na(.data[[lon]])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where did dataCoords come from? this object was not created before in the function, please check it.
Also, the output of these codes is not being saved, fix it .
dplyr::filter(!is.na(.data[[lat]]), | ||
!is.na(.data[[lon]])) | ||
|
||
if(nrow(dataCoords) == 0) stop("No valid coordinates.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No filter was applied to test this,
Maybe you want to write
dataCoords <- data %>%
dplyr::select(dplyr::all_of(c("id_temp", lon, lat))) %>%
dplyr::filter(!is.na(.data[[lat]]),
!is.na(.data[[lon]]))
if(nrow(dataCoords) == 0) stop("No valid coordinates.")
R/mol_roi.R
Outdated
|
||
### Convert to sf/terra | ||
if("SpatialPolygons" %in% .class2(roi)) roi <- st_as_sfc(roi) | ||
if(class(roi) == "RasterLayer") roi <- rast(roi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an error here
if(class(roi) == "RasterLayer") roi <- rast(roi)
Error in if (class(roi) == "RasterLayer") roi <- rast(roi) :
the condition has length > 1
Maybe can be replaced by
class(roi)[1] == "RasterLayer"
R/mol_roi.R
Outdated
if(any(c("sf", "sfc") %in% class(roi))){ | ||
|
||
if(!any(is.na(maskValue), is.null(maskValue))){ | ||
warning("maskValue is ignored when ROI is provided as a vector layer.") | ||
} | ||
|
||
unflagged <- roi_sf(dataCoords, | ||
roi, | ||
lat, | ||
lon) | ||
} else( | ||
unflagged <- roi_rast(dataCoords, | ||
roi, | ||
lat, | ||
lon) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an error here
Spherical geometry (s2) switched off
although coordinates are longitude/latitude, st_union assumes that they are planar
Spherical geometry (s2) switched on
Error in dplyr::mutate()
:
! Problem while computing within = st_within(., roi, sparse = F)[, 1]
.
Caused by error in UseMethod()
:
! no applicable method for 'st_geometry' applied to an object of class "data.frame"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for running these checks and all of your suggestions. I hope to make all the corrections this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @matthewsrogan,
I would like to work on some suggested functions to turn them able to bdc. May I make pull requests to the forked repo bdc_mol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Lucas, that would be great. I just pushed some updates I've been testing this week.
debugged raster and vector helper functions.
Hi Bruno, I am a data curator with the Map of Life and among my responsibilities is helping develop and apply cleaning workflows for global occurrence datasets. The bdc package mirrors an internal workflow we were developing, and so we are exploring using the bdc package as the core of our data prep pipeline for MOL and other projects within Yale’s Center for Biodiversity and Global Change. However, there are a few integral steps that we need that do not currently appear to be available within bdc. We are happy developing this extra functionality internally, but if you feel these steps have broader value, we would be happy to work with you to integrate them directly into the package.
I have written a few examples of functions that I wrote to work within the bdc pipeline. mol_roi.R is a spatial filtering step that lets us flag data using spatial layers such as our internal land layer derived from the Global Shoreline Vector. It’s basically a wrapper function with helpers for different formats of the spatial region of interest. mol_spatiotemporal_duplicate.R is a cleaning step we use to identify duplicate records prior to spatiotemporal filtering. It’s especially important when we integrate overlapping datasets, such as GBIF and California Consortium of Herbaria. mol_flagged_by_source.R is a function to flag records based on quality control attributes provided by the source (e.g. GBIF’s ‘issue’ attribute).
Other functions we need for our workflows include one to flag vagrant occurrences and geographic outliers based on the distance from the species’ range, and another to identify non-native occurrences such as those plants in gardens, animals in zoos, or domesticated species/variants.
Either way, we will continue to develop this functionality for our internal uses. But if you feel any of these steps are of value to the package’s user base, we are eager to work with you to make sure they work seamlessly with the package and that they pass all testing requirements. From our perspective, the more centralized everything is within bdc, the better.
The package is intuitive and meets our needs well, so we appreciate what you and your collaborators have done to develop it.
Cheers,
Matt Rogan