Add gene coverage columns during ingest workflow #36
Thanks for walking through all the methods you've tried and where we've landed!
I've left some minor comments, but I think my main question is whether or not we still need the E_indicator column? It seems extra now that we can have the E_coverage column.
ingest/bin/calculate-gene-converage-from-nextclade-translation.py
#36 (comment) Since we are not using the E_indicator column, drop it. We have separate steps to calculate the E_coverage column.
Thanks @joverlee521! This PR is ready for the next round of reviews.
9622474 to 3e273e1
3e273e1 to 7b24c9f
This uses the Nextclade "coverage" as "genome_coverage" and the Nextclade "failedCdses" to check whether E_coverage is present or not.
fixup: use 1 instead of true
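As a rough illustration of the presence check described in this commit, the sketch below returns 1 when a gene is not listed among the failed CDSes and 0 otherwise. It assumes the Nextclade "failedCdses" value is a comma-separated string of CDS names; the function name `cds_indicator` is hypothetical and is not taken from this PR.

```python
def cds_indicator(failed_cdses: str, gene: str = "E") -> int:
    """Return 1 if `gene` translated successfully, 0 if it is listed as failed.

    Assumption: `failed_cdses` is a comma-separated list of CDS names,
    and an empty string means that no CDS failed.
    """
    failed = {name.strip() for name in failed_cdses.split(",") if name.strip()}
    # Use 1/0 rather than true/false, matching the fixup in the commit message.
    return 0 if gene in failed else 1
```

An exact match on the CDS name avoids false positives that a substring check would give for genes whose names contain "E".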
This can be one gene or a set of genes, which can then be used to calculate gene_coverage columns.
Move intermediate files to the "data" folder
Adds the following rules for gene coverage:
* calculate_gene_coverage: calls a Python script which takes a Nextclade CDS translation FASTA and calculates (valid AA)/(total length). The percentage is rounded to 3 significant figures.
* aggregate_gene_coverage_by_gene: combines the gene_coverage files by gene (e.g. ["E", "NS1"]) across all serotypes (e.g. denv1-4)
* appends_gene_coverage_columns: adds each gene_coverage column (e.g. "E_coverage", "NS1_coverage") to the final metadata
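The calculation and the final append step listed above can be sketched as follows. This is a minimal illustration, not the script from this PR: it assumes "valid AA" means any character other than "X" or "-" in the aligned translation, that "total length" is the length of that translation, and that metadata rows are keyed by a column named "accession". The helper names (`round_sig`, `gene_coverage`, `append_coverage_columns`) are all hypothetical.

```python
def round_sig(value: float, digits: int = 3) -> float:
    """Round to `digits` significant figures using the 'g' format spec."""
    return float(f"{value:.{digits}g}")


def gene_coverage(translation: str, invalid: frozenset = frozenset("X-")) -> float:
    """Compute (valid AA) / (total length) for one translated gene sequence.

    Assumption: 'X' (unknown) and '-' (gap) are the only invalid characters.
    """
    total = len(translation)
    if total == 0:
        return 0.0
    valid = sum(1 for aa in translation.upper() if aa not in invalid)
    return round_sig(valid / total)


def append_coverage_columns(metadata_rows, coverage_by_gene, id_column="accession"):
    """Return copies of the rows with one '{gene}_coverage' column per gene.

    `coverage_by_gene` maps gene name -> {record id -> coverage}. Records
    with no coverage value for a gene get an empty string. New keys are
    added last, so the coverage columns end up rightmost.
    """
    result = []
    for row in metadata_rows:
        new_row = dict(row)
        for gene, coverage in coverage_by_gene.items():
            new_row[f"{gene}_coverage"] = coverage.get(row[id_column], "")
        result.append(new_row)
    return result
```

For example, `gene_coverage("MRCVGXX-")` counts 5 valid residues out of 8, and passing `{"E": {...}, "NS1": {...}}` to `append_coverage_columns` adds "E_coverage" and "NS1_coverage" after the existing metadata columns.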
Co-authored-by: Jover Lee <[email protected]>
…g params so they don't get out of sync between rules
Encode serotype and gene as part of the directory structure where possible.
7b24c9f to f6a620d
As suggested by #36 (comment): Merge ID should be the first item in the map
Thank you for continuing to push on this @j23414! This looks good to merge to me 👍
Description of proposed changes
Several approaches were explored to add {gene}_coverage columns during the ingest workflow (as opposed to during the phylogenetic workflow). The different approaches were summarized by @joverlee521 and @jameshadfield and copied here for context of this PR, along with added comments from @j23414 in [comments]:
New Metadata
To view the new "E_coverage" columns, feel free to download the new metadata at:
The new {gene}_coverage columns are the rightmost columns.
Related issue(s)
Checklist