Data interpretation field for a GenomicAnnotation #2

nsheff · 2024-12-10T17:07:32Z

One of the fields for the Annotation entity is the "data interpretation": what do the entities actually represent? For example, a GenomicAnnotation with a geometry of "Point" could represent SNPs, or it could represent CpG dinucleotides. A GenomicAnnotation with a geometry of "Region" could represent "peaks" (the results of a peak-calling algorithm), or it could represent "reads" from a sequencing experiment. (Alternative name possibility: "entity type"?)

There are probably 2 things here: one is something biological, and the other describes a process.

For example: peak calling: https://bioportal.bioontology.org/ontologies/EDAM?p=classes&conceptid=http%3A%2F%2Fedamontology.org%2Foperation_3222

But this is a process; what we really need is a term for "peaks", which represent the output of that process.

Then, for biological things, we need terms like "CpGs" or "SNPs".

I did find the term "SNP" in EDAM:

The term is here: https://bioportal.bioontology.org/ontologies/EDAM/?p=classes&lang=en&conceptid=http%3A%2F%2Fedamontology.org%2Fdata_2092&jump_to_nav=true
http://edamontology.org/data_2092
It is marked obsolete, to be replaced by "Nucleic acid features", which doesn't make sense.

So it seems like we would need to make quite a few changes here.

nsheff · 2024-12-17T15:50:37Z

Maybe there are 2 concepts here:

algorithmic viewpoint: this GenomicAnnotation is the result of a peak-calling algorithm
biological concept: this GenomicAnnotation (result of a peak-calling algorithm) represents TF binding sites, or chromatin accessibility, or CpG island annotations, or whatever.

So, these are maybe two different types of data interpretation.
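
A minimal sketch of how these two facets could sit side by side on an annotation record. The class and field names (`DataInterpretation`, `produced_by`, `represents`) are hypothetical and not part of any agreed schema, and the values are free-text placeholders rather than ontology term IDs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataInterpretation:
    """Hypothetical two-facet interpretation of a GenomicAnnotation."""

    # Algorithmic viewpoint: what kind of process output the entities are.
    produced_by: Optional[str] = None
    # Biological concept: what the entities are taken to represent.
    represents: Optional[str] = None


@dataclass(frozen=True)
class GenomicAnnotation:
    geometry: str  # "Point" or "Region", as used in this issue
    interpretation: DataInterpretation


# Regions that are the output of peak calling and represent TF binding sites.
peaks = GenomicAnnotation("Region", DataInterpretation("peaks", "TF binding sites"))

# Points with a biological meaning but no algorithmic facet recorded.
cpgs = GenomicAnnotation("Point", DataInterpretation(None, "CpG dinucleotides"))
```

Keeping both facets optional lets an annotation carry only the one that applies, which matches the "maybe two different types" framing above.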

sveinugu · 2024-12-17T20:40:46Z

> Maybe there are 2 concepts here:
>
> algorithmic viewpoint: this GenomicAnnotation is the result of a peak-calling algorithm
>
> biological concept: this GenomicAnnotation (result of a peak-calling algorithm) represents TF binding sites, or chromatin accessibility, or CpG island annotations, or whatever.
>
> So, these are maybe two different types of data interpretation.

I agree.

In the FAIRtracks schema, the biological concept was mostly captured in the technique and target fields of the Experiment schema; see e.g. this example (from here):

{
    "@schema": "https://raw.githubusercontent.com/fairtracks/fairtracks_standard/v1/current/json/schema/fairtracks_experiment.schema.json",
    "global_id": "geo:GSM945229",
    "local_id": "encode:ENCSR000DQP",
    "study_ref": "U54HG004592",
    "sample_ref": "encode:ENCBS192PUU",
    "technique": {
        "term_id": "http://purl.obolibrary.org/obo/OBI_0002017",
        "term_label": "histone modification identification by ChIP-Seq assay"
    },
    "target": {
        "sequence_feature": {
            "term_id": "http://purl.obolibrary.org/obo/SO_0001706",
            "term_label": "H3K4_trimethylation"
        },
        "summary": "H3K4_trimethylation"
    },
    "lab_protocol_description": "https://www.encodeproject.org/documents/8f459e88-6344-434f-8f9f-6375a9ff1880/@@download/attachment/CD20%2B_Stam_protocol.pdf",
    "compute_protocol_description": "https://www.encodeproject.org/documents/6f6351d4-9310-4a3b-a3c2-70ecac47b28b/@@download/attachment/ChIP-seq_Mapping_Pipeline_Overview.pdf"
}

Here, the target is a specialisation of the concept of TF binding site, a relation that I believe is expressed in the Sequence Ontology, where this term is defined. For other experimental techniques, the target could be e.g. terms like "open chromatin" or "miRNA".
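
As a rough illustration of reading the biological concept off such an Experiment record, here is a small Python sketch. It uses a trimmed copy of the example above; the helper name `biological_concept` and the shape of its return value are assumptions made for illustration, not part of FAIRtracks.

```python
import json

# Trimmed copy of the Experiment example above (only technique and target).
record = json.loads("""
{
    "technique": {
        "term_id": "http://purl.obolibrary.org/obo/OBI_0002017",
        "term_label": "histone modification identification by ChIP-Seq assay"
    },
    "target": {
        "sequence_feature": {
            "term_id": "http://purl.obolibrary.org/obo/SO_0001706",
            "term_label": "H3K4_trimethylation"
        },
        "summary": "H3K4_trimethylation"
    }
}
""")


def biological_concept(experiment: dict) -> dict:
    """Collect the technique and target terms (hypothetical helper)."""
    target = experiment.get("target", {})
    feature = target.get("sequence_feature", {})
    return {
        "technique": experiment.get("technique", {}).get("term_label"),
        "target_id": feature.get("term_id"),
        # Fall back to the free-text summary if no ontology label is present.
        "target_label": feature.get("term_label", target.get("summary")),
    }


concept = biological_concept(record)
```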

With the biological concept handled by the Experiment entity, we focused on the algorithmic viewpoint in the type_of_condensed_data field of the Track schema. The initially suggested values for this field were:

Sequence-derived regions
Experimentally-derived regions
Predicted regions
Predicted segmentation
Population-derived variants
Individual variants
Peaks
Broad peaks
Narrow peaks
Gapped peaks
Signal values (fold change)
Signal values (p-value)
Signal values (log likelihood)
Signal values (other)
Read coverage
Read counts
Mapped single-end reads
Mapped paired-end reads
Other
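
One way such a list could be treated as a closed vocabulary is sketched below in Python. The labels are copied from the list above; the enum name `CondensedDataType`, the helper `parse_condensed_data_type`, and the choice to fall back to "Other" for unknown labels are assumptions for illustration only.

```python
from enum import Enum


class CondensedDataType(Enum):
    """Hypothetical enum over the suggested type_of_condensed_data values."""

    SEQUENCE_DERIVED_REGIONS = "Sequence-derived regions"
    EXPERIMENTALLY_DERIVED_REGIONS = "Experimentally-derived regions"
    PREDICTED_REGIONS = "Predicted regions"
    PREDICTED_SEGMENTATION = "Predicted segmentation"
    POPULATION_DERIVED_VARIANTS = "Population-derived variants"
    INDIVIDUAL_VARIANTS = "Individual variants"
    PEAKS = "Peaks"
    BROAD_PEAKS = "Broad peaks"
    NARROW_PEAKS = "Narrow peaks"
    GAPPED_PEAKS = "Gapped peaks"
    SIGNAL_FOLD_CHANGE = "Signal values (fold change)"
    SIGNAL_P_VALUE = "Signal values (p-value)"
    SIGNAL_LOG_LIKELIHOOD = "Signal values (log likelihood)"
    SIGNAL_OTHER = "Signal values (other)"
    READ_COVERAGE = "Read coverage"
    READ_COUNTS = "Read counts"
    MAPPED_SINGLE_END_READS = "Mapped single-end reads"
    MAPPED_PAIRED_END_READS = "Mapped paired-end reads"
    OTHER = "Other"


def parse_condensed_data_type(label: str) -> CondensedDataType:
    """Map a label to the enum; unknown labels fall back to OTHER (an assumption)."""
    try:
        return CondensedDataType(label.strip())
    except ValueError:
        return CondensedDataType.OTHER


narrow = parse_condensed_data_type("Narrow peaks")
unknown = parse_condensed_data_type("Something else")
```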
