Tidyverse functions
Tidyverse functions are part of the tidyverse collection of R packages. These functions work well together, keep the code readable and are well suited to exploring and transforming data, which is why we try to use only these utensils in the mapping script. For mapping data to Darwin Core, we noticed you can mostly get by with just three functions: mutate(), recode() and case_when(), which are discussed below. To learn more about the other Tidyverse functions used in the mapping script, type ?function_name in your RStudio console or check the documentation of the tidyverse packages.
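Piping means chaining functions with the pipe operator %>% (or pipe), which passes the result on its left as the first argument of the function on its right. It is easy to use and greatly increases the readability of your code: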
# Take the dataframe "taxon", group the values of the column "kingdom" and show a count for each unique value
taxon %>%
group_by(kingdom) %>%
count()
This is much more readable than the classic approach of nesting functions:
# Take the dataframe "taxon", group the values of the column "kingdom" and show a count for each unique value
count(group_by(taxon, kingdom))
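To see that the piped and nested forms really do the same thing, here is a minimal sketch on a small, made-up taxon data frame (the rows are hypothetical and only serve as illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for the "taxon" data frame
taxon <- tibble(
  scientific_name = c("Asero rubra", "Vespa velutina", "Rattus norvegicus"),
  kingdom = c("Fungi", "Animalia", "Animalia")
)

# Piped version: read from top to bottom
piped <- taxon %>%
  group_by(kingdom) %>%
  count()

# Nested version: read from the inside out
nested <- count(group_by(taxon, kingdom))

# Both objects hold one row per kingdom with its count in column "n"
print(piped)
print(nested)
```

The pipe does not change what is computed, only the order in which you read the steps.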
mutate() adds a column to your data frame or updates an existing one. You use it to add a new Darwin Core term to your data frame and populate it with one or more values. To allow comparison between the source data and the Darwin Core terms, do not update existing columns.
The basic code for mutate() looks like this:
input_data %<>% mutate(new_column_name = ...)
With:
- input_data: a data frame with your input data, i.e. the source checklist data
- %<>%: a shorter way of writing input_data <- input_data %>% ...
- mutate(): a function to add or update a column
- new_column_name: the name of the column you want to add to the data frame, i.e. the Darwin Core term. If a column with that name already exists, mutate() will update that column, which you want to avoid. That is why we suggest prefixing all Darwin Core column names with dwc_, so you don't accidentally update one of the source columns. The prefix will be removed in the post-processing step.
- ...: the value(s) to populate this new column with, whether these are static, unaltered or altered
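The pieces above can be tried out on a tiny data frame. This is a sketch under assumptions: the data are made up, and the last step is only one possible way to strip the dwc_ prefix, not necessarily how the post-processing step of the mapping script does it. Note that %<>% comes from the magrittr package, which has to be loaded separately.

```r
library(dplyr)
library(magrittr) # provides %<>%; library(dplyr) only re-exports %>%

# Hypothetical source data
input_data <- tibble(scientific_name = c("Asero rubra", "Vespa velutina"))

# Long form: pipe, then assign the result back
long_form <- input_data %>% mutate(dwc_scientificName = scientific_name)

# Short form: %<>% pipes and assigns back in one step
input_data %<>% mutate(dwc_scientificName = scientific_name)

# The source column is untouched and the Darwin Core column sits next to it
names(input_data)

# Assumed post-processing: keep only the dwc_ columns and strip the prefix
dwc_only <- input_data %>%
  select(starts_with("dwc_")) %>%
  rename_with(~ sub("^dwc_", "", .x))
names(dwc_only)
```

Because the new column carries the dwc_ prefix, the source column scientific_name stays available for comparison until the very end.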
Some Darwin Core terms have the same static value for every record in the data, i.e. their content is constant for the whole dataset. This is mostly the case for record-level terms (metadata) in the taxon core, but other terms can be static as well.
To map to a static value, write that value in "double quotes":
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_kingdom = "Animalia")
To copy the unaltered value of a source column to a Darwin Core term, use the name of that column as your value:
taxon %<>% mutate(dwc_scientificName = scientific_name)
If you want to standardize, correct or combine the source data before mapping it to a Darwin Core term, you will have to write an expression in your mutate() function to do that. A simple example is concatenating the values from two columns:
taxon %<>% mutate(dwc_scientificName = paste(genus, species))
The range of possibilities and bugs (e.g. the example above will create odd values if one of the source columns is empty) is too big to cover here, but for standardizing or correcting values there are two functions we would like to introduce: recode() and case_when(). Both are used in conjunction with mutate().
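The pitfall with paste() mentioned above is easy to reproduce. The sketch below uses made-up data and shows one possible guard with if_else(); it is an illustration, not the approach prescribed by the mapping script.

```r
library(dplyr)
library(magrittr)

# Hypothetical source data: the second record has no species epithet
taxon <- tibble(
  genus = c("Vespa", "Rattus"),
  species = c("velutina", NA)
)

taxon %<>% mutate(
  # Naive concatenation: paste() turns NA into the text "NA"
  naive_name = paste(genus, species),
  # Guarded concatenation: fall back to the genus when species is missing
  dwc_scientificName = if_else(is.na(species), genus, paste(genus, species))
)

taxon$naive_name          # "Vespa velutina" "Rattus NA"
taxon$dwc_scientificName  # "Vespa velutina" "Rattus"
```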
recode() replaces specific source values with new, altered values in a one-to-one mapping. It is useful for correcting specific typos or mapping values to controlled vocabularies. The basic code is:
input_data %<>% mutate(dwc_term = recode(column,
  "value_1" = "dwc_value_1",
  "value_2" = "dwc_value_2",
  .default = "", # Option to handle other source values, drop this to leave them as is
  .missing = "" # Option to handle NA values
))
input_data %<>% mutate(scientific_name = recode(scientific_name,
"AseroÙ rubra" = "Asero rubra"
))
In the above example the typo AseroÙ rubra is corrected to Asero rubra. All other scientific_name values are left untouched (the .default parameter is not used). Here the column scientific_name is overwritten with the recoded values, as that column will be used as the basis for Taxon IDs.
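How .default and .missing change the outcome is easiest to see on a plain vector. The values below are hypothetical and only meant as a sketch of the behaviour of recode():

```r
library(dplyr)

# Hypothetical source values, including a typo and a missing value
status <- c("present", "absnt", "doubtful", NA)

# Without .default and .missing: unmatched values are kept, NA stays NA
kept <- recode(status, "absnt" = "absent")

# With .default and .missing: unmatched values and NA both become ""
emptied <- recode(status,
  "present" = "present",
  "absnt" = "absent",
  .default = "",
  .missing = ""
)

kept     # "present" "absent" "doubtful" NA
emptied  # "present" "absent" "" ""
```

Note that once .default is set, every value you want to keep has to be listed explicitly, which is why "present" is mapped to itself in the second call.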
Add comments to explain why you recoded some values:
taxon %<>% mutate(dwc_phylum = recode(phylum,
"Crustacea" = "Arthropoda" # Crustacea is not a phylum
))
taxon %<>% mutate(dwc_taxonRank = recode(rankmarker,
"infrasp." = "infraspecificname",
"sp." = "species",
"var." = "variety",
.default = ""
))
In the above example the rankmarker is mapped to the GBIF vocabulary for taxon ranks. Any source value that is not listed will be left empty (.default = "").
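As a side note, recode() is marked as superseded since dplyr 1.1.0, where case_match() is offered as its replacement. The sketch below, on made-up rank markers, shows the same mapping written both ways. It assumes dplyr 1.1.0 or later is installed, and the mapping script itself may well keep using recode().

```r
library(dplyr) # case_match() requires dplyr 1.1.0 or later
library(magrittr)

# Hypothetical source data
taxon <- tibble(rankmarker = c("sp.", "var.", "infrasp.", "forma"))

taxon %<>% mutate(
  # recode(): old value on the left, new value on the right, joined by =
  rank_recode = recode(rankmarker,
    "infrasp." = "infraspecificname",
    "sp." = "species",
    "var." = "variety",
    .default = ""
  ),
  # case_match(): same mapping, written with ~ like case_when()
  rank_case_match = case_match(rankmarker,
    "infrasp." ~ "infraspecificname",
    "sp." ~ "species",
    "var." ~ "variety",
    .default = ""
  )
)

taxon
```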
case_when() lets you assign values based on conditions, rather than on the specific values used by recode(). It is useful when the mapping of a term depends on multiple source values. The basic code is:
input_data %<>% mutate(dwc_term = case_when(
conditional_statement_1 ~ "dwc_value_1",
conditional_statement_2 ~ "dwc_value_2",
TRUE ~ "dwc_value_3" # Option to handle all other conditions
))
You can read this as: if conditional_statement_1 is true then map to dwc_value_1, if conditional_statement_2 is true then map to dwc_value_2, else map to dwc_value_3.
distribution %<>% mutate(dwc_locality = case_when(
!is.na(locality) ~ locality,
country_code == "BE" ~ "Belgium",
country_code == "GB" ~ "United Kingdom",
country_code == "MK" ~ "Macedonia",
country_code == "NL" ~ "The Netherlands",
TRUE ~ ""
))
In the above example the Darwin Core term locality is populated with the value of the source column locality if that is not empty (!is.na). Otherwise, specific country_code values are mapped to a country name. In all other cases (e.g. another country_code) the locality is left empty (TRUE ~ ""). Note how two source columns (locality and country_code) are used for this mapping.
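The conditions in case_when() are evaluated in order and the first one that is true wins, which is why the !is.na(locality) check comes first. A minimal sketch on made-up distribution records (a shortened version of the mapping above) shows this:

```r
library(dplyr)
library(magrittr)

# Hypothetical source data
distribution <- tibble(
  locality = c("Ghent", NA, NA, NA),
  country_code = c("BE", "BE", "NL", "FR")
)

distribution %<>% mutate(dwc_locality = case_when(
  !is.na(locality) ~ locality,              # use the source locality if present
  country_code == "BE" ~ "Belgium",         # otherwise derive it from the country
  country_code == "NL" ~ "The Netherlands",
  TRUE ~ ""                                 # anything else is left empty
))

distribution$dwc_locality  # "Ghent" "Belgium" "The Netherlands" ""
```

The first record has country_code BE but keeps "Ghent", because the locality condition is matched before the country conditions are reached.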