Making GenomicDataCommons less daunting to use for the average user #94

Open
hermidalc opened this issue Feb 22, 2022 · 0 comments

hermidalc commented Feb 22, 2022

GenomicDataCommons is a very powerful library and can query pretty much anything at the GDC. However, many users prefer other libraries, e.g., TCGAbiolinks, because the GenomicDataCommons query results data structure, a deeply nested list of recursive lists of data frames of lists (an R representation of JSON), can be quite daunting for the average user to work with.

Users generally want to get and look at data frames. While not all query results can easily be transformed into a single data frame, many can. It would help users a lot to show how to do this.

I use rrapply, a nice library made for dealing with recursive data structures like the ones GenomicDataCommons returns, to recursively alter anything I need inside the query results data structure, and then make a data frame, e.g.:

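library(GenomicDataCommons)
library(rrapply)

# project_ids, sample_types, and workflow_type are assumed to be defined beforehand.
# Make sure the GDC API is reachable before querying.
stopifnot(GenomicDataCommons::status()$status == "OK")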
gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id %in% project_ids
        & cases.samples.sample_type %in% sample_types
        & analysis.workflow_type == workflow_type
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "analysis.workflow_type",
        "cases.project.project_id",
        "cases.case_id",
        "cases.submitter_id",
        "cases.samples.sample_id",
        "cases.samples.submitter_id",
        "cases.samples.sample_type",
        "cases.samples.is_ffpe",
        "cases.samples.portions.is_ffpe",
        "cases.samples.portions.analytes.aliquots.aliquot_id",
        "cases.samples.portions.analytes.aliquots.submitter_id"
    ))
gdc_results <- results_all(gdc_query)

# Replace every NULL in the nested results with NA so that each field has a value.
gdc_results <- rrapply(
    gdc_results, f=function(x) NA, condition=is.null, how="replace"
)

# Pull each selected field out of the nested results into its own column.
gdc_df <- data.frame(
    file_uuid=gdc_results$file_id,
    file_name=gdc_results$file_name,
    workflow_type=gdc_results$analysis$workflow_type,
    project_id=vapply(
        sapply(gdc_results$cases, `[[`, "project"), `[`, "project_id"
    ),
    case_uuid=sapply(gdc_results$cases, `[[`, "case_id"),
    case_submitter_id=sapply(gdc_results$cases, `[[`, "submitter_id"),
    sample_uuid=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "sample_id"
    ),
    sample_submitter_id=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "submitter_id"
    ),
    sample_type=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "sample_type"
    ),
    sample_is_ffpe=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "is_ffpe"
    ),
    portion_is_ffpe=sapply(
        sapply(
            sapply(
                gdc_results$cases, `[[`, "samples"
            ), `[[`, "portions"
        ), `[[`, "is_ffpe"
    ),
    aliquot_uuid=sapply(
        sapply(
            sapply(
                sapply(
                    sapply(
                        gdc_results$cases, `[[`, "samples"
                    ), `[[`, "portions"
                ), `[[`, "analytes"
            ), `[[`, "aliquots"
        ), `[[`, "aliquot_id"
    ),
    aliquot_submitter_id=sapply(
        sapply(
            sapply(
                sapply(
                    sapply(
                        gdc_results$cases, `[[`, "samples"
                    ), `[[`, "portions"
                ), `[[`, "analytes"
            ), `[[`, "aliquots"
        ), `[[`, "submitter_id"
    ),
    row.names=gdc_results$file_id,
    stringsAsFactors=FALSE
)
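
The repeated nested sapply calls above all follow the same pattern: walk down a path of keys for every record. Below is a minimal sketch of that pattern as a helper, shown on a toy nested list rather than real GDC results. The name pluck_path and the toy data are hypothetical and not part of GenomicDataCommons or rrapply, and real query results (which also contain data frames, sometimes with several rows) may need extra handling.

```r
# Hypothetical helper: for every record in `x`, follow `path` one key at a time.
# lapply() is used instead of sapply() so the result is always a list.
pluck_path <- function(x, path) {
    for (key in path) {
        x <- lapply(x, `[[`, key)
    }
    x
}

# Toy stand-in for nested query results (made-up values).
toy_cases <- list(
    list(
        case_id = "c1",
        samples = list(sample_id = "s1", portions = list(is_ffpe = FALSE))
    ),
    list(
        case_id = "c2",
        samples = list(sample_id = "s2", portions = list(is_ffpe = TRUE))
    )
)

# One column per path, one row per record.
toy_df <- data.frame(
    case_uuid = unlist(pluck_path(toy_cases, "case_id")),
    sample_uuid = unlist(pluck_path(toy_cases, c("samples", "sample_id"))),
    portion_is_ffpe = unlist(
        pluck_path(toy_cases, c("samples", "portions", "is_ffpe"))
    ),
    stringsAsFactors = FALSE
)
```

Each column is then described by its path instead of by a hand-written tower of nested calls, which may be easier to read and extend.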

For query results that cannot easily be transformed into a single data frame using the above method (due to the GDC DB key relationship structure), I make multiple queries, transform each into its own data frame as above, and then join the data frames to get a single one structured the way I need it.
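
As a rough illustration of that joining step, here is a minimal sketch using base R merge() on two made-up data frames. The data frame names, column names, and values are hypothetical stand-ins for per-query results, not actual GDC output.

```r
# Hypothetical data frame built from a files query (made-up values).
files_df <- data.frame(
    file_uuid = c("f1", "f2", "f3"),
    case_uuid = c("c1", "c2", "c2"),
    stringsAsFactors = FALSE
)

# Hypothetical data frame built from a separate cases query (made-up values).
cases_df <- data.frame(
    case_uuid = c("c1", "c2"),
    primary_site = c("Lung", "Breast"),
    stringsAsFactors = FALSE
)

# Left join on the shared key: keep every file row and attach the case fields.
merged_df <- merge(files_df, cases_df, by = "case_uuid", all.x = TRUE)
```

Setting all.x = TRUE keeps files whose case is missing from the second data frame, with NA in the added columns.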

Anyway, this is probably a bit challenging for the average user. Maybe there is an easier way to work with GenomicDataCommons that I've totally missed. But if not, and what I'm doing is generally a good approach, maybe it's worth adding some examples to the vignette showing how to make data frames from query results.
