# Crowther
The Crowther dataset specifically sought published and unpublished data on near-surface soil carbon responses to field warming experiments.
## July 2020 interview
With Dr. Tom Crowther on the Crowther data model.
1) Why did you start this study?
- The more data you collect, the more patterns you can identify against the noise.
- Global data is needed to identify global patterns.
- This project specifically worked to address soil carbon sensitivity to warming.
- But, in general, the Crowther lab is interested in the living components of the soil, in both categorizing and characterizing them.
- For example, two locations with the same climate, topography, and vegetation but different sets of soil organisms would show very different functioning and carbon turnover.
2) Describe your workflow for ingesting data sets?
- The data template asked only for a brief description of the site location (latitude, longitude, etc.), the measurement of interest for the project, and how the author collected that measurement; a rough sketch of such a template appears after this list. This very restricted data model is much easier (and more satisfying) for data producers to fill out.
- Next, data producers are contacted directly and invited to co-author the resulting manuscript, which generally has between 50 and 100 co-authors and draws on 5,000 to 10,000 samples or locations across the globe.
- This simple approach is far more productive than circulating large tables that require computation and substantial time to fill out.
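As a rough illustration of the kind of minimal, single-table request described above (the column names here are hypothetical, not the Crowther lab's actual template):

```{r, eval = FALSE}
# Hypothetical sketch of a minimal data request: one flat row per sample,
# with only the site location, the single target measurement, and the method.
template <- data.frame(
  site       = "Example forest plot",  # brief site description
  latitude   = 47.37,                  # decimal degrees
  longitude  = 8.55,                   # decimal degrees
  soil_pct_C = 2.4,                    # the one target measurement (e.g. soil %C)
  method     = "dry combustion"        # how the author collected that measurement
)
```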
3) What decisions did you make to arrive at this workflow?
- For the initial soil warming study, we wanted to quantify differences in carbon stocks between warmed and ambient plots. Thus %C and bulk density estimates were required, but we expected the duration and intensity of warming to also have strong effects, so we collected those data to try to control for them. In addition, we knew that environmental variation would likely affect the magnitude of the changes, but this information was difficult to collect from all projects in a standardized way. Trying to collect all possible data would have left us with a table of 50 mostly incomplete columns. This drove us to infer the likely driving variables from global data map products. In the end, we found the data are much stronger when missing values are taken from existing maps where experts have done careful extrapolations; all the necessary metadata can then be recovered from just the latitude and longitude of the sample.
- In the end, we got better results with a small, easy-to-fill-out data request to the data providers than with a large, verbose data template. This allowed us to maximize the volume of data we were working with, which was a high priority. To get the environmental characteristics at each sampling location, we used data from existing maps of climate, soil, and topographic characteristics (a sketch of this lookup appears after this list). Those maps carry some uncertainty, but they ensure a complete dataset and limit the uncertainty associated with sampling variation between people.
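A minimal sketch of this gap-filling step, assuming the `terra` package and a placeholder raster file name; the interview does not specify which map products or tooling the lab actually used:

```{r, eval = FALSE}
# Hypothetical sketch: look up an environmental covariate for each sample
# from an existing global map product, given only latitude and longitude.
# 'mean_annual_temp.tif' is a placeholder file name, not a real product.
library(terra)

samples <- data.frame(longitude = c(8.55, -72.52),
                      latitude  = c(47.37, 42.37))

climate_map <- rast("mean_annual_temp.tif")           # global raster layer
pts <- vect(samples, geom = c("longitude", "latitude"),
            crs = "EPSG:4326")                        # WGS84 lon/lat points

# extract() returns an ID column followed by the raster values
samples$mean_annual_temp <- terra::extract(climate_map, pts)[, 2]
```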
4) How would someone get a copy of the data in this study?
- After publication, the data are made available through the university library.
5) What would you do differently if you had to start again? What would be the same?
- Making small, targeted data requests has worked very well.
- Managing the large co-author list and the people skills it requires remains challenging. Dr. Crowther spends a great deal of time managing emails and dealing with failed communication attempts, whether from people changing email addresses or messages landing in spam. He believes that managing relationships with the data authors is the hardest part, because missing one person could mean missing a large opportunity for the project. Recently, he has been using listservs, as he finds they reach more people, are less likely to be filtered out of professional inboxes, and are easier to keep organized.
## Crowther data model
```{r}
# The dplyr and datamodelr packages are assumed here; dataDescription.ls is
# defined earlier in this book.
library(dplyr)
library(datamodelr)

# Pull out the Crowther rows of the shared data description table and mark
# which columns are keys and which reference other tables.
Crowther_table <- dataDescription.ls$structure %>%
  filter(data_product == 'Crowther') %>%
  select(-data_product) %>%
  rename('table' = 'data_table', 'column' = 'data_column') %>%
  mutate(key = data_type == 'id',
         # columns named 'Study' reference the Study column of the Raw table
         ref = case_when(grepl('^Study$', column) ~ 'Raw',
                         TRUE ~ as.character(NA)),
         ref_col = case_when(grepl('^Study$', column) ~ 'Study',
                             TRUE ~ as.character(NA))) %>%
  mutate(ref = if_else(table == ref, as.character(NA), ref))  # drop self-references

# Render the resulting data model as an entity-relationship diagram.
temp_dm <- as.data_model(Crowther_table)
graph <- dm_create_graph(temp_dm, rankdir = "RL", col_attr = c('column'))
dm_render_graph(graph)
```
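For reference, a hypothetical example of the shape `dataDescription.ls$structure` is assumed to take (the real table is built earlier in this book; these rows are illustrative only, though the column names match what the chunk above expects):

```{r, eval = FALSE}
# Illustrative only: dataDescription.ls is defined earlier in this book.
dataDescription.ls <- list(structure = tibble::tribble(
  ~data_product, ~data_table, ~data_column, ~data_type,
  'Crowther',    'Raw',       'Study',      'id',
  'Crowther',    'Raw',       'percent_C',  'value',
  'Crowther',    'Notes',     'Study',      'id'
))
```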
## Acknowledgements
We would like to thank Dr. Tom Crowther (ETH Zürich) for his time and contributions to the July 2020 interview.