-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify Biolink Model 3 (qualifier aware) KP test data format #60
Comments
The biolink model sensitive portions of the current KP JSON test configuration file are the test edge data that look something like this: {
...
"edges": [
{
"subject_category": "biolink:SmallMolecule",
"object_category": "biolink:Disease",
"predicate": "biolink:treats",
"subject": "CHEBI:3002", # beclomethasone dipropionate
"object": "MESH:D001249" # asthma
},
...other test edges
]
} The general strategy to fix the format to align with Biolink Model 3.0++ is perhaps to ensure that test edge data closely mirror the corresponding TRAPI 1.3++ schema data model:
Qualifier ConstraintsThe TRAPI model (excerpt) for query graph (QEdge) qualifier specification is: QEdge:
type: object
...
properties:
...
qualifier_constraints:
type: array
description: >-
A list of QualifierConstraints that provide nuance to the QEdge.
If multiple QualifierConstraints are provided, there is an OR
relationship between them. If the QEdge has multiple
predicates or if the QNodes that correspond to the subject or
object of this QEdge have multiple categories or multiple
curies, then qualifier_constraints MUST NOT be specified
because these complex use cases are not supported at this time.
items:
$ref: '#/components/schemas/QualifierConstraint'
default: [] where QualifierConstraint:
additionalProperties: false
description: >-
...
properties:
qualifier_set:
type: array
description: >-
A set of Qualifiers that serves to add nuance to a query,
by constraining allowed values held by Qualifiers
on queried Edges.
items:
$ref: '#/components/schemas/Qualifier'
nullable: false and Qualifier:
additionalProperties: false
description: >-
An additional nuance attached to an assertion
type: object
properties:
qualifier_type_id:
type: string
description: >-
...
nullable: false
qualifier_value:
type: string
description: >-
...
nullable: false
required:
- qualifier_type_id
- qualifier_value This suggests one direct specification of qualifiers in the test data edge might look something like this: "qualifier_constraints": [
{
"qualifier_set": [
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "agonist"
},
{
"qualifier_type_id": "biolink:catalyst_qualifier"
"qualifier_value": "CHEBI:63951" # estrogen receptor agonist
}
]
},
...other qualifier constraint sets?
] Assuming (perhaps foolishly?) that qualifiers can only be specified once in a query, then perhaps the following alternate direct compact representation of the qualifier_type_id = qualifier_value pair could be envisioned: "qualifier_constraints": {
"biolink:causal_mechanism_qualifier": "inhibition",
...other qualifier type_id:value pairs?
}
But this doesn't allow for duplicate keys, so perhaps we use the full TRAPI syntax for a single On the other hand, we pose the question whether or not test edge input data ought (or need) to have complex multi-qualifier constraints, namely, arrays of multiple QualifierConstraint An initial implementation of test edge data format supporting qualifiers could limit itself to one qualifier, for example: "edges": [
{
"subject_category": etc...
"qualifier": {
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "inhibition"
}
} # end of test edge spec
... more test edges
] A second generation implementation of test edge data format supporting qualifiers could allow a single "edges": [
{
"subject_category": etc...
"qualifier_set": [
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "inhibition"
},
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "inhibition"
}
]
} # end of test edge spec
... more test edges
] A third iteration, maybe additional qualifier sets: "edges": [
{
"subject_category": etc...
"qualifier_constraints": [
{
"qualifier_set": [
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "agonist"
},
{
"qualifier_type_id": "biolink:catalyst_qualifier"
"qualifier_value": "CHEBI:63951" # estrogen receptor agonist
}
]
},
... more qualifier sets
]
} # end of test edge spec
... more test edges
] Attribute ConstraintsAlthough QEdge:
type: object
...
properties:
...
attribute_constraints:
type: array
description: >-
A list of attribute constraints applied to a query edge.
If there are multiple items, they must all be true (equivalent
to AND)
items:
$ref: '#/components/schemas/AttributeConstraint'
default: [] where AttributeConstraint:
type: object
description: >-
Generic query constraint for a query node or query edge
properties:
id:
allOf:
- $ref: '#/components/schemas/CURIE'
description: >-
CURIE of the concept being constrained. For properties
defined by the Biolink model this SHOULD be a biolink CURIE.
otherwise, if possible, from the EDAM ontology. If a suitable
CURIE does not exist, enter a descriptive phrase here and
submit the new type for consideration by the appropriate
authority.
example: EDAM:data_0844
name:
type: string
description: >-
Human-readable name or label for the constraint concept.
If appropriate, it SHOULD be the term name of the CURIE used
as the 'id'. This is redundant but required for human
readability.
example: molecular mass
not:
type: boolean
default: false
operator:
type: string
description: >-
... see TRAPI for full definition
value:
example: 57.0
description: >-
Value of the attribute. May be any data type, including a list.
If the value is a list and there are multiple items, at least one
comparison must be true (equivalent to OR) unless the '==='
operator is used. If 'value' is of data
type 'object', the keys of the object MAY be treated as a list.
A 'list' data type paired with the '>' or '<' operators will
encode extraneous comparisons, but this is permitted as it is in
SQL and other languages.
unit_id:
example: UO:0000222
description: >-
...
nullable: true
unit_name:
example: kilodalton
description: >-
...
nullable: true
required:
- name
- id
- operator
- value
additionalProperties: false The SRI Testing harness validation generates various permutations of TRAPI test queries based on test edge data. As such, it may need to embrace a more modest objective of supporting a small number of high value use cases (one of which the system has already has, in the focused validation of string knowledge source provenance data). One new use case may be the validation of Association Constraint below). Other explicit use cases could initially limit validation to a subset of use cases, for example, to implicitly always assume use of the This suggests an initial specification of attribute constraints in the test data edge might look something like this: "attribute_constraints": [
{
"id": "<attribute_type_id>"
"value": "<value>"
},
...other attribute constraint id/value pairs?
] Association ConstraintAlthough TRAPI querying of a biolink:Association constraint may initially need to be specified as an attribute constraint (see above), there is nothing that prevents us here from adding a specific 'association' tag to the test edge data JSON (see below). SummaryElaborating our original test data example above, the newly formatted test edge data might finally look something like this: {
...
"edges": [
{
"subject_category": "biolink:SmallMolecule",
"object_category": "biolink:Disease",
"predicate": "biolink:treats",
"subject_id": "CHEBI:3002", # beclomethasone dipropionate
"object_id": "MESH:D001249" # asthma
"association": "biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation",
"qualifier_constraints": [
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "inhibition"
},
...other qualifier constraint type_id/value pairs?
],
"attribute_constraints": [
{
"id": "<attribute_type_id>"
"value": "<value>"
},
...other qualifier constraint type_id/value pairs
]
},
...other test edges
]
} |
I think that the basic structure is correct. You really want to allow for all the qualifiers for that edge to be in the doc here. (in other words the part that says "... other qualifier constraint type_id/value pairs?" needs to exist. But that's not complicated, it just means that there are more entries in that list. The attribute_constraints are useful only if we want to write tests querying by attribute. We do have that capability in TRAPI so maybe it makes sense. But we maybe should think about what tests we might expect to run there before we have everybody implement that? |
I guess one other thing- why do we have association in here? We (ranking agent) don't model it explicitly. |
I agree with @cbizon, the "Association" stuff in biolink-model (with children like ChemicalAffectsGeneAssociation) isn't used at all by BTE / Service Provider. Also, I'm confused because I thought the test-triples were meant to be similar to Edges, not QEdges. So the test-triples would have So Service Provider has a link to a "qualifiers" test-triple file right now. |
With the test harness, we want to be able to verify that all the properties of an edge (including the nodes that are connected by it) are valid according to the model. The Association objects in the model help us constrain the properties of an edge, including the subject and object domain/range, the predicate, and the kinds of qualifiers that are expected. It would be really handy to have the association name to lookup and test against for each statement. An alternative to supplying the association in the test data could be that the testing harness tries to match the incoming test statements against all possible associations in the model, and throws an "invalid statement" error if it can't one. But I am not sure that would be very performant against a large set of test data? And the "invalid statement" error might not be as specific as we could make it if we had the association -- for example, if the association match 80%, we could report which qualifiers or changes would be needed to make it a valid association. |
We can start by making the 'association' category optional, assuming only |
In an absolute sense, @colleenXu, you are correct that input test data ought to be considered 'real' edges, not query edges. There's a messy issue hiding behind the test edges files and that is that TRAPI knowledge graphs cannot have In fact, the SRI Test Harness can insist on "strict" validation in which case, use of such As for the use of ' |
I guess one basic comment I'd have on some of the above, is that most of the new extra stuff is 'optional'. Whether or not KPs implement any form of it at a given point in time will be purely pragmatic. In such a situation, they can use the (optional) extensions to queries. I don't have a strong opinion about attribute constraints, except that our current treatment of them in the SRI Testing harness is fairly minimalist: mainly some tests for knowledge_source annotation consistency (i.e provenance). Not much else. If use of the constructs are optional, we can probably extend testing on a compelling case-by-case basis (much like the qualifiers starting with 'drug-gene' relationships...) |
@sierra-moxon the problem with including associations is that none (?) of the tools actually use them, so that the inference problem you describe would now have to be done by everybody simultaneously to create the test data. @RichardBruskiewich I think that there's some practical disagreement about whether Knodes can be mixins or abstract entities. It would be nice if we could have a rule, but I can think of a couple of cases where mixin or abstract is the best one can do (I can elaborate if need be). |
Like @sierra-moxon, I sense that the onset of statement qualifiers have made a significantly stronger case for tagging edges with their biolink:Association (child) category. We know that nobody really uses them yet, but one could say the same about qualifiers! As I hinted above, we could make the association tagging of test edge data optional to start (or perhaps, indefinitely) but if provided by a few hardy pioneering KP's, we could leverage it a bit to test ideas about how it may facilitate validation. One can imaging tagging edges in one (or both) of two ways:
Furthermore, one can imagine that the more specialized KPs may often (although not always) only have a small handful of applicable The biolink:Association |
This discussion sounds a bit above my pay grade - maybe the Architecture, TRAPI WG and/or Data Modelling meetings can vote on this. SRI Testing can encode whatever policy applies. With respect to Colleen's observation about test edges being KGraph not QGraph relates a bit to the outcome of this decision in that - as you know - the test edges are fed verbatim (in difference combinatorial ways) into the QGraph of TRAPI requests, then (partly) used to assess the TRAPI response (although the main focus there is focused more on TRAPI schema and Biolink Model compliance of request KGraph JSON contents, but not as much, the Result mappings in between QGraph and KGraph - more needs to be done there (some already on the issue tickets "TODO" list) |
I agree with Chris Bizon's comments here, but to add...
|
makes sense: please put in tickets/ PRs if possible to biolink - I am happy to make changes as needed. |
Generally speaking, abstract and mixin category (class) and predicate (slot) definitions are not meant to be instantiated. Abstract classes/slots are generally found within an 'is_a' hierarchy; mixins, only injected under a 'mixins' list in a given class/slot. As Sierra indicates, is probably better to make abstract and mixin elements 'concrete' if they are routinely embedded in knowledge graph nodes and edges. |
Hello @cbizon, @colleenXu and @sierra-moxon (cc: @putmantime),
Thus, the revised proposal for tomorrow's meeting would be the follow: {
... <other necessary test data file fields>...
"edges": [
{
"subject_category": "biolink:SmallMolecule",
"object_category": "biolink:Disease",
"predicate": "biolink:treats",
"subject_id": "CHEBI:3002", # beclomethasone dipropionate
"object_id": "MESH:D001249" # asthma
"association": "biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation",
"qualifiers": [
{
"qualifier_type_id": "biolink:causal_mechanism_qualifier"
"qualifier_value": "inhibition"
},
...other qualifier constraint type_id/value pairs?
]
},
...other test edges
]
} |
I'm fine 👍 with this. Only 1 small note: is that the format we (@sierra-moxon, biolink-model folks) want for "association" - a biolink-prefix and snake_case? I'm good with using any case, particularly if one format is easier to ingest / parse... Previously, when we've discussed Associations in comments, we've used PascalCase (ChemicalAffectsGeneAssociation) with no biolink-prefix. |
good catch @colleenXu - associations are classes in the model, so it would be good to use the PascalCase. |
Fixed. Thanks @colleenXu... I guess we're ready to present this tomorrow. I'll make a PR of the SRI Testing README out of this. |
Biolink Model 3.0 introduced a new 'qualifier' framework for knowledge representation and TRAPI query. This new framework will require additional KP test edge information for testing, thus, a new KP test data format is needed, perhaps more closely mirroring the TRAPI 1.3 support for this new semantic space. We also need to consider filtering on existing attribute content. Further ahead, constraining test edges to match a specified subclass of
biolink:Association
may facilitate validation, including qualifier (secondary statement level) validation.The text was updated successfully, but these errors were encountered: