Create custom code to parse Measles strain names #14

Open
kimandrews opened this issue Feb 7, 2024 · 0 comments


As discussed, the WHO requires measles strain names to include the sampling date and geographic location. In some cases, these strain names could be used to recover dates and/or geographic locations for samples that have empty or ambiguous values for these attributes in the NCBI Datasets output. However, WHO-formatted strain names do not always appear in that output: some GenBank submitters report strain names in the "isolate" field and others in the "strain" field, but NCBI Datasets only pulls the "isolate" field. The NCBI Datasets team plans to add the "strain" field sometime this year. Once that is done, custom code could parse the NCBI Datasets output to do the following for each sample:

  1. Determine whether the WHO-formatted strain name is in the "isolate" or "strain" field
  2. Parse the date and geographic location from the WHO-formatted strain name when these attributes are otherwise empty or ambiguous
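
A rough sketch of what these two steps might look like is below. It is only a starting point under several assumptions: that WHO-formatted names follow the pattern `MVs/City.ISO3/week.year/number [genotype]` (or `MVi/...`), and that each sample is available as a dict with the keys `isolate`, `strain`, and `location`. Those key names, and the function names `parse_who_name` and `recover_metadata`, are hypothetical and would need to match whatever the NCBI Datasets output actually looks like once the "strain" field is added. The sketch leaves the week and two-digit year as parsed values and does not convert them to a calendar date.

```python
import re
from typing import Optional

# Assumed WHO naming pattern: MVs/City.ISO3/week.year[/number] [genotype]
# (MVi = isolate, MVs = sequence from clinical material).
WHO_NAME = re.compile(
    r"^MV[si]/(?P<city>[^./]+)\.(?P<country>[A-Z]{3})"
    r"/(?P<week>\d{1,2})\.(?P<year>\d{2})(?:/(?P<number>\d*))?"
)


def parse_who_name(name: str) -> Optional[dict]:
    """Return the parts of a WHO-formatted name, or None if it does not match."""
    m = WHO_NAME.match(name.strip())
    if m is None:
        return None
    week = int(m["week"])
    if not 1 <= week <= 53:
        return None
    return {
        "city": m["city"],
        "country": m["country"],
        "week": week,
        "year2": m["year"],  # two-digit year, left unconverted
    }


def recover_metadata(record: dict) -> dict:
    """Step 1: find the WHO name; step 2: fill an empty location from it.

    The keys "isolate", "strain", and "location" are hypothetical.
    """
    out = dict(record)
    for field in ("isolate", "strain"):
        parsed = parse_who_name(record.get(field) or "")
        if parsed is None:
            continue
        out["who_name"] = record[field].strip()
        out["who_name_field"] = field
        out["who_week"] = parsed["week"]
        out["who_year2"] = parsed["year2"]
        if not record.get("location"):
            out["location"] = parsed["country"]
        break
    return out


example = recover_metadata(
    {"isolate": "", "strain": "MVs/Manchester.GBR/7.12/ [B3]", "location": ""}
)
```

Existing non-empty `location` values are left untouched, so the parsed name only fills gaps rather than overriding what NCBI Datasets reports. Handling of ambiguous (as opposed to empty) values would need its own rule.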

This custom code may have minimal impact on the current measles workflow outputs, because very few samples that meet the minimum length requirement (5000 bp) have missing dates that could be recovered by this approach. However, if we eventually create gene-specific phylogenies, more samples would be affected. In addition, this code would recover WHO-formatted strain names for many samples (because many samples have strain names in the "strain" field), and there is value in having these strain names present in the metadata retrieved for all samples.
