cwlprov:relationship sketch #2

Open
mr-c opened this issue Oct 2, 2018 · 4 comments

@mr-c

mr-c commented Oct 2, 2018

Together with #1, this attempts to find a way to pre-define domain-specific provenance that would be generated at workflow run time. The idea is to define a set of relationships that will be attached to the outputs a step produces, relating them to other data values or concepts at creation time.

These can use domain-specific ontologies like the EDAM ontology or BioSchemas, or more generic ones like PROV or schema.org.

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

inputs:
  first_input: File
  second_input: long

steps: []

outputs:
  first_output:
    type: File
    outputSource: first_input
    cwlprov:relationships:
       prov:wasDerivedFrom: [ '#inputs.second_input' ]
       prov:wasInfluencedBy: [ '#inputs.second_output' ]

$namespaces:
  prov: http://www.w3.org/ns/prov#
  cwlprov: https://w3id.org/cwl/prov#

$schemas:
  - http://www.w3.org/ns/prov.owl
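
One way to read the sketch above is that each `#inputs.…` reference is swapped for the identifier of the value actually bound to that parameter during a run. The following Python is a minimal sketch of that reading, not how cwltool or CWLProv behave: the function names, the `run_values` mapping, the `outputs.first_output` key, and the UUIDs are all hypothetical placeholders.

```python
# Hypothetical sketch: function names and the run_values mapping are
# illustrative assumptions, not part of cwltool or CWLProv.

# Prefixes as declared under $namespaces in the workflow sketch.
NAMESPACES = {
    "prov": "http://www.w3.org/ns/prov#",
    "cwlprov": "https://w3id.org/cwl/prov#",
}


def expand_curie(curie, namespaces):
    """Expand a prefixed name such as 'prov:wasDerivedFrom' to a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in namespaces:
        raise KeyError(f"undefined namespace prefix: {prefix!r}")
    return namespaces[prefix] + local


def resolve_relationships(subject_id, relationships, run_values, namespaces):
    """Rewrite parameter-level references as triples between run-time values."""
    triples = []
    for prop, refs in relationships.items():
        predicate = expand_curie(prop, namespaces)
        for ref in refs:
            # '#inputs.second_input' -> 'inputs.second_input'
            key = ref[1:] if ref.startswith("#") else ref
            triples.append((subject_id, predicate, run_values[key]))
    return triples


# Placeholder identifiers; a real run would mint its own.
run_values = {
    "inputs.second_input": "urn:uuid:00000000-0000-4000-8000-000000000001",
    "outputs.first_output": "urn:uuid:00000000-0000-4000-8000-000000000002",
}
relationships = {"prov:wasDerivedFrom": ["#inputs.second_input"]}

triples = resolve_relationships(
    run_values["outputs.first_output"], relationships, run_values, NAMESPACES
)
for triple in triples:
    print(triple)
```

A reference to a parameter with no bound value raises `KeyError` here, which seems preferable to silently emitting a dangling relationship.
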
@stain

stain commented Oct 2, 2018

As this is a relationship to be generated between the values of first_output and second_output, I think we need some kind of template or expression.

For example, JSON-LD with $variable expansion:

cwlprov:relationship:
  { "@id": "$second_output",
    "prov:wasDerivedFrom": "$first_output" }

Or, if we assume the current port is the subject and you can't do arbitrary structures, you can just have property-object references (no literals in this case):

cwlprov:relationship: {
    "prov:wasDerivedFrom": "$first_output",
    "example:foo": "edam:topic_0091"
  }

Namespaces like prov and edam here must be defined in CWL $namespaces. The template is expanded based on identifiers for the produced values (e.g. urn:uuid:8c97eb7a-94d8-40bf-a932-7e888445f2ec).

If we have:

{ "first_output": { 
    "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
  },
  "second_output": {
    "@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181"
  }
}

Then with expansion of namespaces and $variables we get:

{ "first_output": {
    "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
  },
  "second_output": {
    "@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181",
    "http://www.w3.org/ns/prov#wasDerivedFrom": {
      "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
    },
    "http://example.com/foo": {
      "@id": "http://edamontology.org/topic_0091"
    }
  }
}
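
The expansion step described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: `expand_term` and `expand_template` are hypothetical names, the namespace IRIs for `edam` and `example` are inferred from the expanded result shown above, and the template is assumed to be flat (property-object pairs only, no literals).

```python
import json

# Assumed prefix map; in CWL these would come from $namespaces.
NAMESPACES = {
    "prov": "http://www.w3.org/ns/prov#",
    "edam": "http://edamontology.org/",
    "example": "http://example.com/",
}


def expand_term(term, namespaces):
    """Expand 'prefix:local' to a full IRI; leave unknown prefixes untouched."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return term


def expand_template(template, subject, values, namespaces):
    """Fill a flat property-object template for the value bound to `subject`."""
    expanded = {"@id": values[subject]["@id"]}
    for prop, obj in template.items():
        if obj.startswith("$"):
            # "$first_output" -> identifier of the value produced at that port
            target = values[obj[1:]]["@id"]
        else:
            target = expand_term(obj, namespaces)
        expanded[expand_term(prop, namespaces)] = {"@id": target}
    return expanded


template = {
    "prov:wasDerivedFrom": "$first_output",
    "example:foo": "edam:topic_0091",
}
values = {
    "first_output": {"@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"},
    "second_output": {"@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181"},
}

result = expand_template(template, "second_output", values, NAMESPACES)
print(json.dumps(result, indent=2))
```

A real implementation would more likely hand namespace expansion to a JSON-LD processor with an `@context`; the manual prefix map here just keeps the sketch dependency-free.
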

[ updated by @mr-c to add missing commas, make the UUIDs unique ]

@mr-c

mr-c commented Oct 3, 2018

@stain Thank you for the JSON-LD example.

I've updated my sketch to show that we might want to set relationships between an output and another output, and also an input.

stain added a commit to common-workflow-language/cwltool that referenced this issue May 1, 2019
@stain

stain commented May 23, 2019

OK, in 036af7c78a3e1c5125009ae05dbdb853afca6790 I try to sketch out how this can be recorded as templates in the CWL, and then add these to the PROV. There is an issue in what to call these (here cwlprov:relationships) and how to reference the variables to fill in at execution time (here using a direct reference, #inputs.first_input).

But this leads to fairly misleading information in cwlprov --print-rdf, in that it would claim the output parameter definition has a "relationship" to an anonymous object, which then "is derived from" (or whatever property is used) an input parameter definition. This is acceptable if we think of the input/output parameter as a "superobject" of every object that passes through it, so that every file object is a prov:specializationOf the parameters it is input to or output from.

(This is like saying Stian is a specialisation of CustomerOfTesco because I went shopping at Tesco once.)

See also PROV-Template, which would use a special var namespace for pre-existing variables; we could bind these directly to the input/output objects using existing CWL Expressions (e.g. $(inputs.message) -> var:inputs.message).
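
As a rough illustration of that binding, the mapping from a simple CWL parameter reference to a `var:` name could look like the Python below. This is a hedged sketch only: `to_prov_template_var` is a hypothetical helper, it accepts only the plain `$(inputs.name)` / `$(outputs.name)` form (treating `outputs` as a valid context is an assumption made for this sketch), and it does not attempt to evaluate general CWL Expressions.

```python
import re

# Matches only simple references like $(inputs.message); anything more
# complex (indexing, JavaScript) is deliberately rejected in this sketch.
PARAM_REF = re.compile(r"^\$\(((?:inputs|outputs)\.[A-Za-z_][A-Za-z0-9_]*)\)$")


def to_prov_template_var(expression):
    """Map '$(inputs.message)' to the template variable 'var:inputs.message'."""
    match = PARAM_REF.match(expression.strip())
    if match is None:
        raise ValueError(f"not a simple parameter reference: {expression!r}")
    return "var:" + match.group(1)


print(to_prov_template_var("$(inputs.message)"))
```

Rejecting anything beyond a plain reference keeps the mapping unambiguous; richer expressions would need a separate decision about what value the variable is bound to.
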

@stain

stain commented May 23, 2019

Here are some of the mappings we should be able to do: https://gist.github.com/stain/f0b0d966a103b1533d684aa6d7197364

The data concepts are often more complex expressions than pure typing from the EDAM ontology or BioSchemas, so we may need to support expressions that span more than one triple, as explored here and in #1.
