Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow engines fail to run workflows referenced by certain TRS URLs #247

Open
svonworl opened this issue Apr 25, 2024 · 5 comments
Open

Comments

@svonworl
Copy link

svonworl commented Apr 25, 2024

As detailed in dockstore/dockstore#5594, some workflow engines, including miniwdl and Cromwell, fail to run a valid Dockstore workflow when:

  • the workflow is referenced via the TRS URL of its primary descriptor.
  • the workflow imports files with paths that contain parent directory references (..).

For example, the following invocation fails:

miniwdl run 'https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor'
(https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor Ln 31 Col 1) Failed to import ../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl
HTTP Error 404: Not Found

Turns out the problem is bigger - the workflow engines also fail to run workflows that import files with relative paths.

The root cause is that, when the engines calculate the URL of an import, they interpret the specified TRS URL as a file path. However, a TRS URL doesn't represent a file path, so the engines miscalculate the import URLs and fail when they attempt to load them.

For example, given the TRS URL

https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor

and an import referenced by a relative path

../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl

The engines calculate the import URL, by applying typical file resolution semantics, as:

https://dockstore.org/api/ga4gh/tasks/broad/UnmappedBamToAlignedBam.wdl

The above URL is a corrupt TRS URL, because parts of the original TRS URL have been deleted. During the import URL calculation, the engine drops the trailing descriptor portion of the TRS URL because it looks like a filename, and when the engine normalizes the URL prior to the request, it collapses the parent directory references and more of the original TRS URL is deleted.

Per the TRS spec, a relative path can be appended to the TRS primary descriptor URL, and it will resolve the file relative to the primary descriptor and return its contents. So, the correct URL is:

https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor/../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl

Note that when miniwdl is run with a URL that references the raw github files, it works as expected:

miniwdl run 'https://raw.githubusercontent.com/broadinstitute/warp/tw_GL-2036_create_rtools_docker/pipelines/broad/dna_seq/germline/single_sample/wgs/WholeGenomeGermlineSingleSample.wdl'

Why does it work? The github URL ends with the absolute path of the workflow file, allowing the import urls to be resolved using typical file resolution semantics.

This issue should probably be addressed in the next major TRS revision.

In lieu of that, here are some possible solutions that would help the engines to correctly run a workflow referenced by a "bare" TRS primary descriptor URL:

  • Change the engines' import resolution code so it doesn't modify the specified workflow URL during file resolution calculations, but instead appends the calculated path to it. This could be something the engine does conditionally, perhaps either when it detects a TRS URL or when instructed via a flag.
  • Modify the TRS endpoint to redirect to a URL that ends with the absolute path of the primary descriptor (and responds with the raw file content like the original TRS endpoint). At startup, an engine can check if the specified workflow URL redirects, and if so, use the "target" URL for all subsequent import URL calculations.

┆Issue is synchronized with this Jira Story
┆Project Name: Zzz-ARCHIVE GA4GH tool-registry-service
┆Issue Number: TRS-70

@uniqueg
Copy link
Collaborator

uniqueg commented May 18, 2024

I agree that the way the TRS spec references the relative location of accessory workflow files with respect to the primary descriptor is anywhere from confusing to limiting (see below). However, the issue you describe is probably mostly a server implementation issue (at Dockstore) and should probably be raised there (see TRS URLs section below).

On another note, I would suggest that clients (Cromwell, miniwdl etc.) support TRS URIs for user convenience (see TRS URIs section below).

TRS URLs

I agree with you that servers should properly resolve relative paths in TRS URLs of the form /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}, as per the specs.

Limitation

However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:

A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor

So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).

Either way, I guess Dockstore should probably respond with a 400 error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.

@denis-yuen

Summary

Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.

I don't know or remember what speaks against supporting parent directories (perhaps a security "feature"?), but it seems to me that given the current specs and a fully compliant TRS implementation, you are out of luck when there are secondary descriptors in a parent or sibling directory directory relative to the primary descriptor (like in your use case, @svonworl).

TRS URIs

Usage

  • As per the TRS URI documentation, TRS URIs reference a specific TRS resource, either a tool (with all its versions), or a specific tool version.
  • They come in one of the following two forms:
    • Unversioned: trs:///
    • Versioned: trs:////<version_id>
  • Clients translate TRS URIs into the corresponding, fully qualified TRS endpoints according to the rules described in the TRS spec and the TRS URI documentation.
  • If supported by clients, TRS URIs make it easy for users to point to a specific tool (or workflow) or tool (or workflow) version.

Current limitations

  • A workflow could potentially be encoded in multiple workflow languages. I never came across a real world example for this scenario, but we do indeed have one or two simple example workflows that we encoded in multiple languages, for testing purposes. Unfortunately (or not, I am not sure), due to its structure, the TRS API supports bundling multiple workflow types under the same resource (including under the same version), so that, e.g., a workflow "MyWorkflow v1.0.0" available at a given (versioned or unversioned) TRS URI could point to multiple workflows (e.g., in WDL and Nextflow). In that case, currently, it would be up to the client to choose the workflow type and the user, when using TRS URIs, may not be able to control which one is used (unless an option to influence that behavior is implemented). Of course, this could be considered a feature, rather than a limitation, because presumably the assumption would be that multiple workflow types under the same TRS resource version would behave equivalently, such that the user shouldn't care and the client would simply have more options.
  • Similarly, a tool could come with multiple container image descriptor flavors and again, the TRS URI does (currently) not allow specifying a particular one.

Summary

TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.

@denis-yuen
Copy link
Member

However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:

A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor

So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).

Either way, I guess Dockstore should probably respond with a 400 error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.

Ah interesting. We actually 400 or 404, so there's no security issue.
Good catch though, I totally forgot about that limitation in there!

Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.

So far, we've been downloading the workflows in zip format from the files endpoint.

TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.

👍

@uniqueg
Copy link
Collaborator

uniqueg commented May 21, 2024

So far, we've been downloading the workflows in zip format from the files endpoint.

Yes, you can do that. However, information about each file, including its path, should still be available via GET .../files when you request an application/json response. And according to the description of this operation

description: Get a list of objects that contain the relative path and file type.
  The descriptors are intended for use with the
  /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}
  endpoint. Returns a zip file of all files when format=zip is specified.

this information is

intended for use with the /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}

which, as described above, prohibits references to files that are not located in the primary descriptor's directory or a child directory thereof.

For reference, the GET .../files operation with application/json is supposed to return an instance of ToolFile:

ToolFile:
  type: object
  properties:
    path:
      type: string
      description: Relative path of the file.  A descriptor's path can be used with
        the GA4GH .../{type}/descriptor/{relative_path} endpoint.
    file_type:
      type: string
      enum:
        - TEST_FILE
        - PRIMARY_DESCRIPTOR
        - SECONDARY_DESCRIPTOR
        - CONTAINERFILE
        - OTHER
    checksum:
      $ref: "#/components/schemas/Checksum"

So while nothing is stopping an implementation from returning a ZIP archive containing whatever directory structure a workflow requires when clients call GET .../files with application/zip, I don't see how you can, in a spec-compliant way, allow clients to self-assemble the work directory themselves via ToolFile.path (from GET .../files with application/json) and FileWrapper (from GET .../descriptor/{relative_path}) if the workflow resource includes files that are not in the top-level directory relative to the location of the primary descriptor.

To maintain consistency with GET .../files with application/zip, implementers would have to ignore the limitation and put non-compliant values for ToolFile.path in such cases; and then support them via GET .../descriptor/{relative_path}, which is tricky, because clients often/typically resolve paths before requests are even sent out, meaning that clients would need to be instructed to URL-encode paths containing references to parent directories (i.e., those containing ..).

To address this issue without having to ever deal with parent directory referenes, perhaps we could require resources to use an implicit workflow top-level directory (e.g., a Git repository root directory) as an anchor when constructing relative paths for workflow files to be made available via ToolFile.path. Clients can then infer the required top-level directory for the file objects they are interested in and construct an appropriate directory structure for their use case.

This approach would have the added benefit of us only needing to change descriptions, and so perhaps we could get away with just sneaking in such a change in for a future minor version bump of the TRS specs.

@adamnovak
Copy link

I'm running into this problem trying to add support for TRS lookups to the Toil workflow runner.

Since Toil already knows how to read and run a CWL or WDL workflow from a plain HTTP-hosted directory tree, I want to query TRS and somehow get from it a URL to the main workflow descriptor file in its appropriate directory context.

Dockstore kind of supports this already; the TRS IDs on its workflow pages link to URLs like https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fvgteam%2Fvg_wdl%2FGiraffe/versions/giraffedv/PLAIN_WDL/descriptor//workflows/giraffe.wdl. Note the extra leading / in the TRS "relative" path. When the workflow references ../tasks/sometask.wdl, normal hypertext URL resolution rules can be applied, and the machinery to download the workflow source doesn't actually need to speak TRS specifically at all.

But this seems to be a Dockstore extension, and TRS doesn't itself provide a way to get that /workflows/giraffe.wdl absolute path from the common parent directory to the primary descriptor, since it doesn't anticipate you ever needing it.

If the API is up for redesign here, I would favor designs that look as much like plain HTTP as possible, over designs that officially support fetching files in higher-level directories but require a sort of TRS-specific virtual filesystem implementation in the client.

@denis-yuen
Copy link
Member

Link for presentation dockstore/dockstore#6005 (dockstore workaround)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants