-
Notifications
You must be signed in to change notification settings - Fork 11
/
21-dataProduct-ISCN.Rmd
111 lines (94 loc) · 8.66 KB
/
21-dataProduct-ISCN.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
# International Soil Carbon Network database v3 (ISCN3)
The International Soil Carbon Network (ISCN) database v3 was pulled together to create a large database of layer level soil carbon stock information.
## June 2020 interview
These are notes from a 4 June 2020 interview with Dr. Luke Nave on the International Soil Carbon Network (ISCN).
1) Why did you start this study/project?
- The study was motivated by the desire to draw together scattered data into a common place with a common format so that users could apply it to answer outstanding questions in the carbon cycle.
- These scientific questions at the time, 10 years ago, included:
+ How much soil carbon is there and where is it?
+ Given that we have stocks of X, how are these stocks changing with land use change, climate change, and management?
- There were also operational/methodical questions to be answered:
+ Can improved coverage of carbon stock data result in better model representation of where carbon soil is?
+ Can data on soil fractions and turn over times improve process representation in models?
2) Describe your workflow for ingesting data sets?
- Researchers identified which variables in the desired source dataset cross-walked with which variables in the ISCN template as well as identified where new variables needed to be added. Data providers then filled out an Excel template that was then ingested into the ISCN database using SQL commands.
- The greatest efforts were focused on the largest datasets.
- The threshold for adding a new variable to the template was high ~90% of the final variables were in the original.
+ Tried to be expansive then so they would not have to add more variables later in order to incorporate relatively common variables
+ However, adding new variables was on a case to case basis. For example, carbon and bulk density variables were more likely to be added with new methods, versus cation exchange capacity because carbon stock variables were what the dataset focused on.
- The templates evolved with each version but changed less than the data product.
+ Started with < 40,000 profiles and ending with > 71,000
- The documentation got better with each version.
3) What decisions did you make to arrive at this workflow?
- Combined two existing data structures from the flux community (Ameriflux) and the soil community (NRCS), because these were the two largest contributors to the data infrastructure.
- There was extensive work (a series of workshops, beginning with an Alaska-focused project) with the soil, carbon cycle, and biochemistry research communities to make sure that the structure worked for single investigator driven soil datasets.
4) How would someone get a copy of the data in this study?
- Go to the ISCN website and download a static excel file.
- There is a quick-start, intermediate, and advanced user guide.
5) What would you do differently if you had to start again?
- Develop products in parallel with the database itself. The interviewee's great disappointment was overemphasizing consolidating the data for their own sake but not spending enough time turning the data into publications. This could have shown what can be done with the synthesis to spark interests and show what is possible.
- Have a small group of people (5 or 6) executing a common set of standard gap fills and decision trees for bulk density, carbon concentrations, and stock computations. They would set rules and create commonly agreed upon couple of scripts that could do some gap-filling. This would save time for future users by just getting that done first.
+ ISCN3 has complex gap-filling methods that we used to generate soil organic carbon stock estimates and flagged this under a relatively cryptic headers; these were the product of only 2 people and much of the decision making is only documented in a mess of old emails.
- Put all the code that handles the data that turns it from data into product in a common format and allow people who are interested to get it and edit it. This would make it open access which was not a part of ISCN (the Scientific Steering Group specifically desired to control and document access, in order that data contributors received credit and citations). The current codebase used to generate ISCN3 is cryptic to the point of being black boxed after the retirement of the original programmer.
- Create a functional way to harvest information about treatments and disturbances, so that the dataset can be used to study treatment effects. The disturbance table was prototyped but never satisfactorily implemented, much less refined through implementation.
- Change latitude and longitude to profile level rather than site level attributes to create high resolution data.
- Have better documentation about how to cite your data contributors and how to make sense of the data products.
6) What would be the same?
- The capability of incorporating a lot of different kinds of data.
- The templates were generally adequate for most data needs we encountered.
- In terms of interfacing with the broader science community, listening and having workshops were helpful in that they resulted in positive changes to the templates.
- Allow for map based data retrieval to increase usability by allowing the user to filter by place and by variables. The user would no longer have to deal with large excel tables where most of the columns are empty anyways.
## Data models (database and tempate)
```{r}
ISCN3_table <- dataDescription.ls$structure %>%
filter(grepl('ISCN', data_product)) %>%
filter(data_product == 'ISCN3') %>%
mutate(key = data_type == 'id',
ref = case_when(
grepl('^dataset_name$', data_column) ~ 'dataset',
grepl('^profile_name$', data_column) ~ 'profile',
grepl('^layer_name$', data_column) ~ 'layer',
grepl('^site_name$', data_column) ~ 'layer',
TRUE ~ as.character(NA)),
ref_col = case_when(grepl('^dataset_name$', data_column) ~ 'dataset_name',
grepl('^profile_name$', data_column) ~ 'profile_name',
grepl('^layer_name$', data_column) ~ 'layer_name',
grepl('^site_name$', data_column) ~ 'site_name',
TRUE ~ as.character(NA))) %>%
mutate(ref = if_else(data_table == ref, as.character(NA), ref))%>%
rename('table'='data_table', 'column'='data_column' )
ISCNTemplate_table <- dataDescription.ls$structure %>%
filter(grepl('ISCN', data_product)) %>%
filter(data_product == 'ISCNTemplate') %>%
mutate(key = data_type == 'id',
ref = case_when(grepl('^dataset_name$', data_column) ~ 'metadata',
grepl('^site_name$', data_column) ~ 'site',
grepl('^cluster_name$', data_column) ~ 'cluster',
grepl('^profile_name$', data_column) ~ 'profile',
grepl('^layer_name$', data_column) ~ 'layer',
grepl('^fraction_name$', data_column) ~ 'fraction',
grepl('^gas_name$', data_column) ~ 'gas',
grepl('^other_name$', data_column) ~ 'other',
TRUE ~ as.character(NA)),
ref_col = case_when(grepl('^dataset_name$', data_column) ~ 'dataset_name',
grepl('^site_name$', data_column) ~ 'site_name',
grepl('^cluster_name$', data_column) ~ 'cluster_name',
grepl('^profile_name$', data_column) ~ 'profile_name',
grepl('^layer_name$', data_column) ~ 'layer_name',
grepl('^fraction_name$', data_column) ~ 'fraction_name',
grepl('^gas_name$', data_column) ~ 'gas_name',
grepl('^other_name$', data_column) ~ 'other_name',
TRUE ~ as.character(NA))) %>%
mutate(ref = if_else(data_table == ref, as.character(NA), ref))%>%
rename('table'='data_table', 'column'='data_column' )
ISCNTemplate_dm <- as.data_model(ISCNTemplate_table)
ISCN3_dm <- as.data_model(ISCN3_table)
```
```{r, fig.cap='ISCN3 data model'}
dm_render_graph(dm_create_graph(ISCN3_dm, rankdir = "BT", col_attr = c('column'), view_type = 'keys_only', graph_name='ISCN3 data model'))
```
```{r, fig.cap='ISCN template data mode'}
dm_render_graph(dm_create_graph(ISCNTemplate_dm, rankdir = "RL", col_attr = c('column'), view_type = 'keys_only', graph_name='ISCN-template data model'))
```
## Acknowledgements
Special thanks to Dr. Luke Nave (University of Michigan) for his help with the interpretation of ISCN.