Data interpretation field for a GenomicAnnotation #2

Open
nsheff opened this issue Dec 10, 2024 · 2 comments

@nsheff
Collaborator

nsheff commented Dec 10, 2024

One of the fields for the Annotation entity is the "data interpretation": what do the entities actually represent? For example, a GenomicAnnotation with a geometry of "Point" could represent SNPs, or it could represent CpG dinucleotides. A GenomicAnnotation with a geometry of "Region" could represent "peaks" (the results of a peak-calling algorithm), or it could represent "reads" from a sequencing experiment. (Alternative name possibility: "entity type"?)

There are really probably 2 things here; one is something biological, and the other describes a process.

For example: peak calling: https://bioportal.bioontology.org/ontologies/EDAM?p=classes&conceptid=http%3A%2F%2Fedamontology.org%2Foperation_3222

But this is a process; what we really need are things called "peaks", which represent the output of a process.

Then for biological things, we need things like "CpGs" or SNPs.

I did find in EDAM the term SNP:

So it seems like we would need to make quite a few changes here.

@nsheff
Collaborator Author

nsheff commented Dec 17, 2024

Maybe there are 2 concepts here:

  • algorithmic viewpoint: this GenomicAnnotation is the result of a peak-calling algorithm
  • biological concept: this GenomicAnnotation (result of a peak-calling algorithm) represents TF binding sites, or chromatin accessibility, or CpG island annotations, or whatever.

So, these are maybe two different types of data interpretation.
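
To make the split concrete, here is a minimal sketch of how the two kinds of interpretation might sit side by side on an annotation. The class and field names (`OntologyTerm`, `DataInterpretation`, `algorithmic`, `biological`) are hypothetical and not part of any agreed schema, and the `example:` identifier is a placeholder rather than a real ontology term. The EDAM URL is the peak-calling operation linked above, which, as noted, names the process rather than its output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyTerm:
    """A reference to an ontology term (hypothetical structure)."""
    term_id: str
    term_label: str


@dataclass(frozen=True)
class DataInterpretation:
    """Two separate interpretations; both field names are assumptions."""
    algorithmic: Optional[OntologyTerm] = None  # how the annotation was produced
    biological: Optional[OntologyTerm] = None   # what the annotation represents


@dataclass(frozen=True)
class GenomicAnnotation:
    geometry: str  # "Point" or "Region", as in the issue text
    interpretation: DataInterpretation


# A set of regions produced by peak calling and read as TF binding sites.
peaks = GenomicAnnotation(
    geometry="Region",
    interpretation=DataInterpretation(
        # EDAM term for the peak-calling *operation* (a process, not "peaks").
        algorithmic=OntologyTerm(
            "http://edamontology.org/operation_3222", "Peak calling"
        ),
        # Placeholder identifier, for illustration only.
        biological=OntologyTerm("example:tf_binding_site", "TF binding site"),
    ),
)
```

Keeping the two as independent optional fields would let an annotation carry only one of them, for instance CpG positions that have a biological meaning but no interesting algorithmic origin.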

@sveinugu
Contributor

sveinugu commented Dec 17, 2024

> Maybe there are 2 concepts here:
>
>   • algorithmic viewpoint: this GenomicAnnotation is the result of a peak-calling algorithm
>   • biological concept: this GenomicAnnotation (result of a peak-calling algorithm) represents TF binding sites, or chromatin accessibility, or CpG island annotations, or whatever.
>
> So, these are maybe two different types of data interpretation.

I agree.

In the FAIRtracks schema, the biological concept was mostly captured in the technique and target fields of the Experiment schema, see e.g. this example (from here):

{
    "@schema": "https://raw.githubusercontent.com/fairtracks/fairtracks_standard/v1/current/json/schema/fairtracks_experiment.schema.json",
    "global_id": "geo:GSM945229",
    "local_id": "encode:ENCSR000DQP",
    "study_ref": "U54HG004592",
    "sample_ref": "encode:ENCBS192PUU",
    "technique": {
        "term_id": "http://purl.obolibrary.org/obo/OBI_0002017",
        "term_label": "histone modification identification by ChIP-Seq assay"
    },
    "target": {
        "sequence_feature": {
            "term_id": "http://purl.obolibrary.org/obo/SO_0001706",
            "term_label": "H3K4_trimethylation"
        },
        "summary": "H3K4_trimethylation"
    },
    "lab_protocol_description": "https://www.encodeproject.org/documents/8f459e88-6344-434f-8f9f-6375a9ff1880/@@download/attachment/CD20%2B_Stam_protocol.pdf",
    "compute_protocol_description": "https://www.encodeproject.org/documents/6f6351d4-9310-4a3b-a3c2-70ecac47b28b/@@download/attachment/ChIP-seq_Mapping_Pipeline_Overview.pdf"
}

Here, the target is a specialisation of the concept of TF binding site, a relation that I believe is expressed in the Sequence Ontology, where this term is defined. For other experimental techniques, the target could be e.g. terms like "open chromatin" or "miRNA".
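
As a rough illustration, the biological concept could be read off such an Experiment record along these lines. The helper name `biological_concept` and the shape of its return value are made up for this sketch; only the `technique` and `target` parts of the example above are reproduced.

```python
import json

# Subset of the FAIRtracks Experiment example above.
record = json.loads("""
{
    "technique": {
        "term_id": "http://purl.obolibrary.org/obo/OBI_0002017",
        "term_label": "histone modification identification by ChIP-Seq assay"
    },
    "target": {
        "sequence_feature": {
            "term_id": "http://purl.obolibrary.org/obo/SO_0001706",
            "term_label": "H3K4_trimethylation"
        },
        "summary": "H3K4_trimethylation"
    }
}
""")


def biological_concept(experiment: dict) -> dict:
    """Collect the technique and target terms of an Experiment record.

    Hypothetical helper: falls back to the free-text summary when the
    target has no sequence_feature term.
    """
    target = experiment.get("target", {})
    feature = target.get("sequence_feature", {})
    return {
        "technique": experiment.get("technique", {}).get("term_label"),
        "target": feature.get("term_label") or target.get("summary"),
        "target_id": feature.get("term_id"),
    }


concept = biological_concept(record)
print(concept["technique"])
print(concept["target"], concept["target_id"])
```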

With the biological concept handled by the Experiment entity, we focused on the algorithmic viewpoint in the type_of_condensed_data field in the Track schema. The initial suggested values for this field were:

  • Sequence-derived regions
  • Experimentally-derived regions
  • Predicted regions
  • Predicted segmentation
  • Population-derived variants
  • Individual variants
  • Peaks
  • Broad peaks
  • Narrow peaks
  • Gapped peaks
  • Signal values (fold change)
  • Signal values (p-value)
  • Signal values (log likelihood)
  • Signal values (other)
  • Read coverage
  • Read counts
  • Mapped single-end reads
  • Mapped paired-end reads
  • Other
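
For what it's worth, a controlled list like this is easy to enforce as a closed vocabulary. The sketch below copies the values from the list above; the constant and function names are hypothetical, and exact string matching is an assumption (the real schema may well use ontology terms or a different check).

```python
# Initial suggested values for type_of_condensed_data, copied from the list above.
CONDENSED_DATA_TYPES = frozenset({
    "Sequence-derived regions",
    "Experimentally-derived regions",
    "Predicted regions",
    "Predicted segmentation",
    "Population-derived variants",
    "Individual variants",
    "Peaks",
    "Broad peaks",
    "Narrow peaks",
    "Gapped peaks",
    "Signal values (fold change)",
    "Signal values (p-value)",
    "Signal values (log likelihood)",
    "Signal values (other)",
    "Read coverage",
    "Read counts",
    "Mapped single-end reads",
    "Mapped paired-end reads",
    "Other",
})


def check_condensed_data_type(value: str) -> str:
    """Return value unchanged if it is in the vocabulary, else raise ValueError."""
    if value not in CONDENSED_DATA_TYPES:
        raise ValueError(f"unknown type_of_condensed_data: {value!r}")
    return value


check_condensed_data_type("Narrow peaks")  # accepted
```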
