As discussed, the WHO requires measles strain names to include the sampling date and geographic location. In some cases, these strain names could be used to recover dates and/or geographic locations for samples whose values for these attributes are empty or ambiguous in the NCBI Datasets program outputs. However, the WHO-formatted strain names do not always appear in the NCBI Datasets output: some GenBank submitters report strain names in the "isolate" field whereas others use the "strain" field, and the NCBI Datasets program only pulls the "isolate" field. The NCBI Datasets team plans to add the "strain" field sometime this year. Once that has been completed, custom code could be written to parse the NCBI Datasets output and do the following for each sample:
Determine whether the WHO-formatted strain name is in the "isolate" or "strain" field
Parse date and geographic location from WHO-formatted strain name when these attributes are otherwise empty or ambiguous
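The two steps above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation. It assumes WHO-formatted names look like `MVs/Manchester.GBR/7.12/ [B3]` (type, city and three-letter country code, epidemiological week and two-digit year, optional number, optional genotype). The record keys `isolate`, `strain`, `date`, and `country`, the helper names, the `who_name_field` key, and the two-digit-year pivot of 50 are all hypothetical and would need to match the real NCBI Datasets output once the "strain" field is available. Only empty values are filled in here; deciding what counts as "ambiguous" is left out.

```python
import re

# Assumed shape of a WHO-formatted measles name, e.g. "MVs/Manchester.GBR/7.12/ [B3]".
WHO_NAME = re.compile(
    r"^(?P<kind>MV[is])/"                       # MVi = isolate, MVs = sequence from clinical sample
    r"(?P<city>[^/]+)\.(?P<country>[A-Z]{3})/"  # city and three-letter country code
    r"(?P<week>\d{1,2})\.(?P<year>\d{2})"       # epidemiological week and two-digit year
    r"(?:/(?P<number>\d*))?"                    # optional isolate number (may be empty)
    r"(?:\s*\[(?P<genotype>[^\]]+)\])?\s*$"     # optional genotype in brackets
)


def parse_who_name(name, pivot=50):
    """Return the parsed parts of a WHO-formatted name, or None if it does not match.

    The pivot is an assumption: two-digit years below it are read as 20xx,
    the rest as 19xx.
    """
    match = WHO_NAME.match(name.strip())
    if match is None:
        return None
    yy = int(match["year"])
    return {
        "city": match["city"],
        "country": match["country"],
        "week": int(match["week"]),
        "year": 2000 + yy if yy < pivot else 1900 + yy,
        "genotype": match["genotype"],
    }


def recover_metadata(record):
    """Fill an empty date or country from whichever field holds a WHO-formatted name."""
    out = dict(record)
    for field in ("isolate", "strain"):
        parsed = parse_who_name(record.get(field) or "")
        if parsed is None:
            continue
        out["who_name_field"] = field
        if not out.get("date"):
            # Year plus epidemiological week; no exact day can be recovered.
            out["date"] = f"{parsed['year']}-W{parsed['week']:02d}"
        if not out.get("country"):
            out["country"] = parsed["country"]
        break
    return out
```

For example, `recover_metadata({"isolate": "", "strain": "MVs/Manchester.GBR/7.12/ [B3]", "date": "", "country": ""})` would report the name as found in the "strain" field and fill in a week-level date and the country code. Epidemiological weeks do not necessarily line up with ISO weeks, so the recovered date is only a week-level approximation.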
This custom code may have minimal impact on the current measles workflow outputs, because very few samples that meet the minimum length requirement (5000 bp) have missing dates that could be recovered by this approach. However, if we eventually create gene-specific phylogenies, more samples would be affected. In addition, this code would recover WHO-formatted strain names for many samples (because many samples have their strain names in the "strain" field), and there is value in having these strain names present in the metadata retrieved for all samples.