-
Notifications
You must be signed in to change notification settings - Fork 11
/
23-dataProduct-SIDb.Rmd
78 lines (63 loc) · 5.82 KB
/
23-dataProduct-SIDb.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# Soil Incubation Database (SIDb)
The Soil Incubation Database focuses on aggregating soil labortory incubations under different temperature and moisture conditions for model parameterization.
## June 2020 interview
These are notes from a 4 June 2020 interview with Carlos Sierra and Christina Scheadel on the Soil Incubation Database (SIDb).
1) Why did you start this study?
- There is this huge potential to use new innovations with soil incubation data, but people were using it in many different ways, making it hard to demonstrate its value. Time series incubations are important aspects of soil incubations that have been under utilized, in contrast to total respired fraction. In particular, time series incubation data can tell you about the carbon stock rate and the dynamics. The fact that the environment in these incubations is so tightly controlled, it is a feature, not a bug, removing confounding variables (in other words, sensitivity functions) from the results.
- This database is a merger of two projects by Carlos and Christina. These two projects were started roughly 5 years ago; once they both became aware of the other's project, they merged efforts. A week-long workshop in Jena, Germany during March 2019 accelerated work.
2) Describe your workflow for ingesting data sets.
- There are three main elements of a data contribution: Time series, Initial conditions, and Metadata.
- The templates evolved over time, especially in the March 2019 workshop. A complete template is essential to enable consistent information capture from the original manuscripts.
- Time series are captured from graphs in original manuscripts, and occasionally direct data from the original PIs.
- Data was generally not directly contributed from PI unless the PI was directly involved in the data product.
- SIDb is NOT a repository and instead a dynamic data product. Data should ideally be archived in a repository separately.
Follow up question: How successful were you in getting PIs to give you data?
- Within the group, there were people with data, so they got data from them instead of contacting PIs for them. Did not ask for data from PIs. The dataset is not complete at all. They were setting up the infrastructure for people to upload their own data. No outside contributions yet. Some people have expressed interest.
- Dynamic database, not a repository. Contribution to a product. You are expected to archive your own data.
- Current data products
+ There are 31 studies so far.
+ 684 total time-series.
+ The longest time series is 800 to 1,000 days.
3) What decisions did you make to arrive at this workflow?
- Driven by research questions, design is mostly trial and error.
- Reevaluate structure and workflow as they worked, or as needed.
- Going forward, when incorporating additional columns, their thoughts are that it needs to evolve based on the needs of that specific entry.
4) How would someone get a copy of the data in this study?
- `git clone` a GitHub repository and then interact with it through the associated R-package.
- R package reads the data, uses simple plotting, and models fitting via the FME R-package that can use both classical gradient and MCMC methods.
5) What would you do differently if you had to start again? What would be the same?
- Having the project be research question driven worked well even if it was a bit slow.
- Would have been good to collaborate sooner. That might have moved the project along faster.
- Excellent group dynamics
+ >“If you have a group that you feel comfortable with and where everyone is well-coordinated, ideas will flow much better in a good direction.”
+ Christina said the group talked every other week and that everyone was very responsive, making the group great.
+ Common research goals united the entire group.
+ Small workshop worked exceptionally well (less than 5 people)
* More or less a goal in mind than a structure.
* Goal: Have a paper somewhat completed
* Had a whiteboard
- There is a lot of disconnect between the informatics researchers and domain scientists. It feels like there should be more cross talk between the two then there currently is.
## Data model
```{r, fig.cap = 'SIDb data model, Please note that this is a "flattened" approximation of the SIDb tree data structure.'}
SIDb_table <- dataDescription.ls$structure %>%
filter(data_product == 'SIDb_restructured') %>%
rename("column" = "data_column", "table" = "data_table") %>%
mutate(key = column %in% c('level1', 'time', 'variable_name', 'column_name', 'siteIndex'),
ref = case_when(grepl('^level1$', column) ~ 'study',
grepl('^siteIndex$', column) ~ 'siteInfo',
grepl('timeSeriesIndex', column) ~ 'timeSeries',
grepl('^column_name$', column) ~ 'variables',
grepl('^time$', column) ~ 'timeSeries',
TRUE ~ as.character(NA)),
ref_col = case_when(grepl('^level1$', column) ~ 'level1',
grepl('^siteIndex$', column) ~ 'siteIndex',
grepl('^variable_name$', column) ~ 'variables',
grepl('^column_name$', column) ~ 'variables',
grepl('^time$', column) ~ 'time',
TRUE ~ as.character(NA))) %>%
mutate(ref = if_else(table == ref, as.character(NA), ref))
SIDb_dm <- as.data_model(SIDb_table)
dm_render_graph(dm_create_graph(SIDb_dm, rankdir = "RL", col_attr = c('column'), view_type = 'keys_only'))
```
## Acknowledgements
Special thanks to Drs Christina Scheadel (Northern Arizona University) and Carlos Sierra (Max Planck Institute for Biogeochemistry) for making the metadata and for making themselves available for interviews.