All environmental samples have an exact synonym "environmental samples" #56

Closed
cmungall opened this issue Nov 12, 2021 · 11 comments

@cmungall
Member

E.g. http://purl.obolibrary.org/obo/NCBITaxon_743727

Even though our transform generally does not alter the source, I think in this case we need to filter this.

@hrshdhgd
Contributor

Issue arises from another project

@jamesaoverton
Collaborator

That repo must be private, because I get a 404.


@cmungall
Member Author

That's a repo we use for some internal text mining projects, where we may have corpora that people don't want shared yet. There isn't much context there, other than that any time a text says "environmental sample" it produces false-positive matches against hundreds of "taxa", because that is the exact synonym that was assigned.

This would mark the first time we don't have a 100% isomorphic translation from the source, so it brings up various questions which are raised here:

In this case I would argue that an exception is justified:

  • this is outside the area of actual taxa, where ncbitaxon is an authority
  • ncbitaxon fails robot reports at the moment, as there are multiple terms with the same exact synonym
  • we already make certain judgment calls about mapping synonyms, e.g. which ncbi categories map to which obo ones; this is in the same ballpark
  • I'm struggling to think of a case where this change would be detrimental

But we need a process for making these kinds of decisions for this ontology. I will send an email to obo-taxonomy and gather other feedback.

@nataled

nataled commented Nov 17, 2021

Agree that 'environmental samples' should be an exception. That's not even a synonym for any actual taxon. If deleting it altogether is too much, then perhaps the information could be captured in a comment (or other mechanism). Another alternative is to change it from exact to related. I come across the same issue (for actual synonyms) in my automated processing of UniProtKB for PRO. For these, I examine the whole of what's going to be imported, detect synonyms that duplicate either other synonyms or other labels, and mark them accordingly (can use related or broad; not sure which works best for the problem being resolved by the change in algorithm).
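The detect-and-demote approach described above could look roughly like the following. This is only a minimal sketch, not the PRO pipeline's actual code: the function name `demote_colliding_synonyms`, the dictionary layout, and the `EX:` identifiers are all hypothetical, and it assumes case-insensitive matching and demotion to "related" (broad would work the same way).

```python
from collections import defaultdict


def demote_colliding_synonyms(terms):
    """Demote exact synonyms that collide with another term's label or exact synonym.

    `terms` maps a term id to {"label": str, "exact": [str, ...]} (hypothetical layout).
    Returns a new dict with separate "exact" and "related" lists per term.
    """
    # Record which terms use each string (case-insensitively) as a label or exact synonym.
    users = defaultdict(set)
    for term_id, term in terms.items():
        users[term["label"].casefold()].add(term_id)
        for syn in term["exact"]:
            users[syn.casefold()].add(term_id)

    result = {}
    for term_id, term in terms.items():
        exact, related = [], []
        for syn in term["exact"]:
            # A string used by more than one term is ambiguous, so it is demoted.
            if len(users[syn.casefold()]) > 1:
                related.append(syn)
            else:
                exact.append(syn)
        result[term_id] = {"label": term["label"], "exact": exact, "related": related}
    return result
```

With two terms that both carry the exact synonym "environmental samples", both copies end up in the "related" list, while a synonym used by only one term stays exact.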

@jamesaoverton
Collaborator

I'd prefer that they fix this upstream. If that's not in the cards, then I'm fine with us excluding this particular synonym in the OWL we generate.

@bpeters42

  • NCBI fixing anything upstream has a low likelihood. But we can try asking again given their recent asks from the IEDB
  • The current NCBI taxonomy is full of things that are not 'exact synonyms'. Like 'humans' for 'Homo' and 'human' for 'homo sapiens'. I would just map everything they call 'exact synonym' to something more loose.
  • If we are serious about 'exact synonym', then we should look for collisions within an ontology between those and other exact synonyms or other primary labels as part of the dashboard.

@pmidford

I support demoting this to either an annotation or better yet a comment. @bpeters42 has a point about synonyms that aren't exact, but that's a different issue - human is a sloppy synonym for H. sapiens, but it is still a synonym. "Environmental sample" isn't a synonym or even really about the taxon as a whole. It's something else, perhaps metadata about individual(s) in the taxon or collection events.

@bpeters42

bpeters42 commented Nov 18, 2021

@pmidford , I actually have zero problems with 'human' being an exact synonym of 'homo sapiens'; my problem is that, at the same time, the parent taxon 'homo' has the exact synonym 'humans'. Which conflates singular vs. plural with a class hierarchy, and leads to craziness (Homo heidelbergensis being a kind of humans, but not a kind of human?)

Completely agree with you that 'environmental sample' is even worse. I was trying to point out that the 'exact synonyms' are not only problematic for 'environmental samples' and the like which are at the edges of what the NCBI taxonomy cares about, but also for organisms at the core of classical taxonomy, like homo sapiens.

And to be a bit more precise about what I mean by 'more loose': I was thinking of 'alternative label'.

@cmungall
Member Author

Thanks for your comments

Let's start with an NCBI request - I nominate @bpeters42 or @fbastian since you both have existing relationships. @hrshdhgd can go ahead and make a PR, but we will hold off on merging it until we are sure that NCBI won't remove it.

@jamesaoverton
Collaborator

We looked more closely into this, and the problem is probably on our end.

We get our data from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip. One of the tables in there is names.dmp, and this is how it's described in the taxdump_readme.txt:

names.dmp
---------
Taxonomy names file has these fields:

	tax_id					-- the id of node associated with this name
	name_txt				-- name itself
	unique name				-- the unique variant of this name if name not unique
	name class				-- (synonym, common name, ...)

Here's an example row from names.dmp:

| tax_id | name_txt | unique name | name class |
| --- | --- | --- | --- |
| 33858 | environmental samples | environmental samples <diatoms,phylum Bacillariophyta> | scientific name |

We want our rdfs:labels to be unique, so we use the 'unique name' if it's present. But then we also create a synonym from the 'name_txt', and that might not be the right thing to do.

This is the relevant bit of code: https://github.com/obophenotype/ncbitaxon/blob/master/src/ncbitaxon.py#L258

I'm too tired right now, but I'll come back to this tomorrow.
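To make the label/synonym logic above concrete, here is a minimal sketch. It is not the code at the linked line of ncbitaxon.py: the helper names `parse_names_line` and `label_and_synonym` and the `skip` parameter are hypothetical, and it assumes the `\t|\t` field separator and `\t|` row terminator documented in taxdump_readme.txt. It also assumes a synonym is only derived from 'name_txt' when a 'unique name' replaced it as the label.

```python
def parse_names_line(line):
    """Split one names.dmp row into (tax_id, name_txt, unique_name, name_class)."""
    line = line.rstrip("\n")
    # Rows end with a trailing "\t|" terminator; drop it before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    tax_id, name_txt, unique_name, name_class = line.split("\t|\t")
    return tax_id, name_txt, unique_name, name_class


def label_and_synonym(name_txt, unique_name, skip=frozenset({"environmental samples"})):
    """For a 'scientific name' row, return (label, synonym or None).

    The label is the unique name when present, otherwise name_txt.
    name_txt becomes a synonym only if the unique name took its place as
    the label and name_txt is not in the `skip` set (an assumed filter).
    """
    label = unique_name or name_txt
    if not unique_name or name_txt in skip:
        return label, None
    return label, name_txt
```

For the example row above, this keeps the disambiguated label and emits no synonym; for a row with an empty 'unique name', the label is just 'name_txt'.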

hrshdhgd added a commit that referenced this issue Sep 13, 2022
No synonyms added if name = environmental samples #56
@cmungall
Member Author

cmungall commented Jul 3, 2024

This issue was fixed by @hrshdhgd 2 years ago, am closing

@cmungall cmungall closed this as completed Jul 3, 2024