
expand for deeply-nested fields #47

Open
brisk022 opened this issue Oct 5, 2017 · 2 comments

@brisk022

brisk022 commented Oct 5, 2017

The package vignette provides an example of expanding first-level fields to obtain a data frame. However, that approach does not work for more deeply nested fields. For example,

files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   results()

produces a list with all children of the samples field concatenated into a comma-separated string without field names, e.g.

$cases
$cases$`3fe677f6-8329-447c-b999-5e70582624aa`
samples
1 01, 2017-03-04T16:37:25.946840-06:00, NA, true, NA, TCGA-IA-A83W-01A, NA, 2e4dfa77-839a-445d-beef-60b6396adf0c, FALSE, 10CCB12F-77E0-4100-A87A-0D36E5AF7F8B, NA, NA, Primary Tumor, live, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3607, 140, NA

This is of limited utility because the order of the fields cannot be trusted, so the values cannot be reliably mapped back to field names. The only workaround I could find was to provide a custom response handler that prevents jsonlite from simplifying the vectors (and consequently other structures).

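library(GenomicDataCommons)
library(magrittr)  # provides %>% and %$%
library(dplyr)     # provides bind_rows()

# Parse the JSON without simplification so nested fields keep their names.
respHandler <- function(txt, ...) { jsonlite::fromJSON(txt, simplifyVector = FALSE) }
files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   response(response_handler = respHandler) %$%
   lapply(results, unlist, recursive = TRUE) %>%
   lapply(as.list) %>%
   bind_rows()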

However, it would be nice if such expansion happened automatically when results are called.
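
To make the effect of simplifyVector concrete, here is a minimal sketch that runs without the GDC API. The JSON string is hand-written and hypothetical; it only loosely imitates the shape of a files/cases/samples response, and the field values are made up for illustration.

```r
library(jsonlite)

# Hypothetical JSON, loosely shaped like nested GDC hits (not real data).
txt <- '[
  {"id": "f1", "cases": [{"samples": [{"sample_type_id": "01", "is_ffpe": false}]}]},
  {"id": "f2", "cases": [{"samples": [{"sample_type_id": "10", "is_ffpe": false},
                                       {"sample_type_id": "01", "is_ffpe": true}]}]}
]'

# Default parsing simplifies arrays of records into (nested) data frames.
simplified <- fromJSON(txt)

# With simplifyVector = FALSE everything stays a plain nested list,
# so every value remains addressable by its field name.
unsimplified <- fromJSON(txt, simplifyVector = FALSE)

class(simplified)
class(unsimplified)
unsimplified[[2]]$cases[[1]]$samples[[2]]$sample_type_id
```

The unsimplified form is more verbose to work with, but nothing is collapsed, which is why the handler above can recover field names with unlist.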

@seandavi
Collaborator

seandavi commented Oct 6, 2017

These results can be quite challenging to deal with, for sure. Take a quick look at the result of:

res = files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   results()

Relying on the "print" method for complex R data structures can be misleading. I use the str function quite regularly (with switches like list.len to limit output sizes). str(res, list.len=5) shows:

List of 3
 $ cases  :List of 10
  ..$ c5c4b4a3-3224-4a72-a883-c99c7747e47b:'data.frame':	1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':	1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "03"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : logi NA
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ dd029237-d470-4b58-9cf0-11753fa60972:'data.frame':	1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':	1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "10"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "false"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 3fe677f6-8329-447c-b999-5e70582624aa:'data.frame':	1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':	1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "true"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 619c9069-53a8-4581-92a3-be1896fe7f66:'data.frame':	1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':	1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "true"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 30d8e7a6-675a-4999-a120-62add06bff3c:'data.frame':	1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':	1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "false"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  .. [list output truncated]
 $ file_id: chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
 $ id     : chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"
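
For anyone who wants to try the str switches on something small, here is a sketch using a hypothetical nested list (the names and values are invented, not GDC output). It shows list.len truncating long lists and max.level limiting the nesting depth that is displayed.

```r
# Hypothetical nested list, only to illustrate the str() arguments.
nested <- list(
  cases   = list(f1 = data.frame(a = 1, b = 2, c = 3)),
  file_id = "f1",
  id      = "f1"
)

# Show at most 2 elements of each list; the rest is marked as truncated.
str(nested, list.len = 2)

# Show only the top level of nesting.
str(nested, max.level = 1)
```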

Note that each element of $cases has a "samples" data.frame embedded in it. One possible approach (of several) is to use the purrr package to do some further manipulation.

library(purrr)  # provides flatten(), set_names(), flatten_df() and re-exports %>%
purrr::flatten(res$cases) %>% set_names(res$id) %>% flatten_df(.id="file_id")

This flattens the cases list, sets the names of the resulting flattened list back to the file_id so that we don't lose track of which sample goes with which file, and then flattens the "rows" of the samples into one big data frame, assigning the names of the list (the file_ids) to the column specified by the .id argument. Note that a file with more than one sample contributes more than one row (see rows 6 and 7), so 10 files yield 11 rows here. The result is:

                                file_id sample_type_id
1  c5c4b4a3-3224-4a72-a883-c99c7747e47b             03
2  dd029237-d470-4b58-9cf0-11753fa60972             10
3  3fe677f6-8329-447c-b999-5e70582624aa             01
4  619c9069-53a8-4581-92a3-be1896fe7f66             01
5  30d8e7a6-675a-4999-a120-62add06bff3c             01
6  17bac11f-78c2-4921-bb3a-03c5c1afbd37             01
7  17bac11f-78c2-4921-bb3a-03c5c1afbd37             10
8  725f5ede-f22b-4422-a9e0-66646538121d             01
9  c85a6f34-7b6b-4677-beac-44f06bcc5c32             01
10 acd76d89-a4d7-47ea-a1c2-480a3d200634             01
11 9c51ff3a-6c88-4f17-9c33-c630a6d10ea3             01
                   updated_datetime time_between_excision_and_freezing
1  2017-03-04T16:37:25.946840-06:00                                 NA
2  2017-03-04T16:37:25.946840-06:00                                 NA
3  2017-03-04T16:37:25.946840-06:00                                 NA
4  2017-03-04T16:37:25.946840-06:00                                 NA
5  2017-03-04T16:37:25.946840-06:00                                 NA
6  2017-03-04T16:37:25.946840-06:00                                 NA
7  2017-03-04T16:37:25.946840-06:00                                 NA
8  2017-03-04T16:37:25.946840-06:00                                 NA
9  2017-03-04T16:37:25.946840-06:00                                 NA
10 2017-03-04T16:37:25.946840-06:00                                 NA
11 2017-03-04T16:37:25.946840-06:00                                 NA
   oct_embedded tumor_code_id     submitter_id intermediate_dimension
1          <NA>            NA TCGA-AB-2904-03A                   <NA>
2         false            NA TCGA-EL-A3H5-10A                   <NA>
3          true            NA TCGA-IA-A83W-01A                   <NA>
4          true            NA TCGA-ZG-A9ND-01A                   <NA>
5         false            NA TCGA-QH-A870-01A                   <NA>
6          <NA>            NA TCGA-77-6842-01A                    0.8
7          <NA>            NA TCGA-77-6842-10A                   <NA>
8         false            NA TCGA-A8-A06Z-01A                   <NA>
9          true            NA TCGA-AN-A0XW-01A                   <NA>
10         <NA>            NA TCGA-DU-7014-01A                      1
11         true            NA TCGA-DD-A4NR-01A                   <NA>
                              sample_id is_ffpe
1  44992adb-cabf-4c2f-9f3b-45cf97531319   FALSE
2  fc5aa545-07cc-4ad8-9eba-5b5d4ea186fb   FALSE
3  2e4dfa77-839a-445d-beef-60b6396adf0c   FALSE
4  949f85dd-0d5d-4b3f-a7a9-a2ddc5becf3b   FALSE
5  c1c87e01-efc3-433f-ba79-4a79b292870b   FALSE
6  c6d0652b-d41a-4706-a5b2-5d86b10fae96   FALSE
7  1e94ef05-bdba-4811-b1da-d2380d0d5fbe   FALSE
8  993d2cba-b4f8-4a46-994b-b97bb9f10d34   FALSE
9  38bf35cd-b2f7-4532-9bff-d95cfe2cafd5   FALSE
10 050f26d9-b105-412a-9e5b-36840a1843e3   FALSE
11 71bc4fd0-374f-423e-a8b4-ae1bceceda83   FALSE
                  pathology_report_uuid created_datetime tumor_descriptor
1                                  <NA>               NA               NA
2                                  <NA>               NA               NA
3  10CCB12F-77E0-4100-A87A-0D36E5AF7F8B               NA               NA
4  89122E71-1246-44A5-9D44-4F95284EB02E               NA               NA
5  C84F6D0C-3879-4D63-8CE8-10D03D3C71A3               NA               NA
6  4f5574a0-e1d7-427c-8f03-d08cb0b264a4               NA               NA
7                                  <NA>               NA               NA
8  956F45E5-A8C6-4A4A-9D1F-D31912180584               NA               NA
9  5CBC6417-4E3D-4E9C-AE93-A56B777EF2F4               NA               NA
10 a848a22c-c92d-42cc-8e0b-8e260b1f5622               NA               NA
11 93133C58-B1BF-4A5C-8EFC-AE17EC6A0B64               NA               NA
                                       sample_type state current_weight
1  Primary Blood Derived Cancer - Peripheral Blood  live             NA
2                             Blood Derived Normal  live             NA
3                                    Primary Tumor  live             NA
4                                    Primary Tumor  live             NA
5                                    Primary Tumor  live             NA
6                                    Primary Tumor  live             NA
7                             Blood Derived Normal  live             NA
8                                    Primary Tumor  live             NA
9                                    Primary Tumor  live             NA
10                                   Primary Tumor  live             NA
11                                   Primary Tumor  live             NA
   composition time_between_clamping_and_freezing shortest_dimension tumor_code
1           NA                                 NA               <NA>         NA
2           NA                                 NA               <NA>         NA
3           NA                                 NA               <NA>         NA
4           NA                                 NA               <NA>         NA
5           NA                                 NA               <NA>         NA
6           NA                                 NA                0.5         NA
7           NA                                 NA               <NA>         NA
8           NA                                 NA               <NA>         NA
9           NA                                 NA               <NA>         NA
10          NA                                 NA                0.6         NA
11          NA                                 NA               <NA>         NA
   tissue_type days_to_sample_procurement freezing_method preservation_method
1           NA                         NA              NA                  NA
2           NA                         NA              NA                  NA
3           NA                         NA              NA                  NA
4           NA                         NA              NA                  NA
5           NA                         NA              NA                  NA
6           NA                         NA              NA                  NA
7           NA                         NA              NA                  NA
8           NA                         NA              NA                  NA
9           NA                         NA              NA                  NA
10          NA                         NA              NA                  NA
11          NA                         NA              NA                  NA
   days_to_collection initial_weight longest_dimension
1                  NA             NA              <NA>
2                1508             NA              <NA>
3                3607            140              <NA>
4                 283             80              <NA>
5                   1            450              <NA>
6                  NA             NA               1.2
7                  NA             NA              <NA>
8                1003            330              <NA>
9                 121            270              <NA>
10                 NA             NA                 1
11               2652            240              <NA>
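
The same idea, keeping the file identifier next to every sample row, can be sketched on a small hand-made structure without calling the GDC API. Everything in this example is hypothetical: the names f1 and f2, the two columns, and the use of plain lists in place of the one-row data frames that results() returns. It also swaps in lapply plus dplyr::bind_rows rather than the purrr flatten functions, so treat it as an illustration of the pattern and not as the package's method.

```r
library(dplyr)

# Hypothetical stand-in for res$cases: one entry per file, each holding a
# "samples" list whose single element is a data frame of samples.
cases <- list(
  f1 = list(samples = list(data.frame(
    sample_type_id = "01", submitter_id = "S1",
    stringsAsFactors = FALSE))),
  f2 = list(samples = list(data.frame(
    sample_type_id = c("01", "10"), submitter_id = c("S2", "S3"),
    stringsAsFactors = FALSE)))
)

# Pull the samples data frame out of each file entry (names are preserved).
per_file <- lapply(cases, function(case) case$samples[[1]])

# Stack the per-file data frames; the list names become the file_id column.
flat <- bind_rows(per_file, .id = "file_id")
flat
```

Because f2 has two samples, it appears on two rows of flat, which mirrors the repeated file_id in the output above.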

@brisk022
Author

Thanks for the explanation! Could you add a similar example to the vignette? I think it is general enough and it would be very useful because a lot of data is nested quite deeply.
