Add gene coverage columns during ingest workflow #36

j23414 · 2024-03-22T15:09:42Z

Description of proposed changes

Several approaches were explored for adding {gene}_coverage columns during the ingest workflow (as opposed to during the phylogenetic workflow). The approaches were summarized by @joverlee521 and @jameshadfield and are copied here for context, with comments from @j23414 added in [brackets]:

1. Generalize RSV's extend-metadata to take gene coordinates as input to calculate gene coverage. This will require gene coordinates to be maintained in the config YAML. Follow the current pattern of outputting gene coverage columns that can be used for filtering. [Opened an issue: Generalize the "extend-metadata.py" script for any {gene}_coverage columns rsv#57 ~ @j23414]
2. Use Nextclade's failedCdses column to determine whether the E gene has coverage. Outputs an "E gene included" True/False column that can be used for filtering. [I went ahead and appended the failedCdses column from Nextclade, so we can still use this method for other genes ~ @j23414]
3. We briefly talked about whether it would be possible for Nextclade to output {CDS}_coverage columns in addition to the full genome coverage column. This would allow the workflow to use the Nextclade columns for filtering without having to maintain the gene coordinates or parse the dataset GFF file to get the coordinates.
4. Use the output (translated) CDS alignments from Nextclade to add columns to the metadata with amino acid length or similar. This could then be used via augur filter --query .... This approach would be made obsolete by (3), but it's pretty easy to do right now. I [@jameshadfield] think it's preferable to (1) both in the case of compound CDSs and in the case where a genome alignment extends both sides of the CDS but actually has very little coverage over the CDS itself. [This PR is following approach 4 ~ @j23414]
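For comparison with approach 4, here is a minimal sketch of the True/False check described in approach 2. It assumes the failedCdses value is a comma-separated list of CDS names and that an empty value means no CDS failed. The helper name `has_cds` and the sample IDs are hypothetical and are not taken from this PR.

```python
def has_cds(failed_cdses: str, cds: str) -> bool:
    """Return True when `cds` is NOT listed in a failedCdses value.

    Assumption: the value is a comma-separated list of CDS names,
    and an empty string means no CDS failed to translate.
    """
    failed = {name.strip() for name in failed_cdses.split(",") if name.strip()}
    return cds not in failed


# Hypothetical rows, standing in for lines of a Nextclade results TSV.
rows = [
    {"seqName": "sample_a", "failedCdses": ""},
    {"seqName": "sample_b", "failedCdses": "E,NS1"},
]

indicators = {row["seqName"]: has_cds(row["failedCdses"], "E") for row in rows}
print(indicators)
```

A boolean like this says nothing about how much of the CDS is actually covered, which is the gap a numeric {gene}_coverage column fills.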

New Metadata

To view the new "E_coverage" columns, feel free to download the new metadata at:

wget https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst
zstd -d metadata_all.tsv.zst

The new {gene}_coverage columns are the rightmost columns.
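Once the file is decompressed, the new columns can be checked with a few lines of Python. The sketch below keeps rows whose coverage value meets a threshold, roughly what a numeric `augur filter --query` expression on the same column would select. The function name, the `accession` and `serotype` columns in the inline example, and the 0.9 threshold are illustrative assumptions, not values from this PR.

```python
import csv
import io


def rows_with_min_coverage(tsv_handle, column, minimum):
    """Yield metadata rows whose `column` value is at least `minimum`.

    Rows with an empty value (no coverage calculated) are skipped.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        value = row.get(column, "")
        if value == "":
            continue
        if float(value) >= minimum:
            yield row


# Small inline stand-in for metadata_all.tsv (hypothetical values).
example = io.StringIO(
    "accession\tserotype\tE_coverage\n"
    "A1\tdenv1\t0.998\n"
    "A2\tdenv2\t0.412\n"
    "A3\tdenv3\t\n"
)

kept = [row["accession"] for row in rows_with_min_coverage(example, "E_coverage", 0.9)]
print(kept)
```

To run it against the real file, pass `open("metadata_all.tsv", newline="")` instead of the inline example.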

Related issue(s)

General Issue: https://github.com/nextstrain/private/issues/102#issuecomment-2010616347
More specific Issue: Add E gene builds #17
Other attempt to solve specific issue: Add E gene trees #18
Related PR in RSV: Allows for CDS (as well as gene) features to generate a new gene reference rsv#55
Related issue in Measles: Consider building gene-specific phylogenies measles#13

Checklist

Checks pass

joverlee521

Thanks for walking through all the methods you've tried and where we've landed!

I've left some minor comments, but I think my main question is whether or not we still need the E_indicator column? It seems extra now that we can have the E_coverage column.

ingest/rules/nextclade.smk

ingest/bin/calculate-gene-converage-from-nextclade-translation.py

ingest/rules/nextclade.smk

j23414 · 2024-04-05T20:08:37Z

Thanks @joverlee521 ! This PR is ready for the next round of reviews

Adds the following rules for gene coverage:

* calculate_gene_coverage: calls a Python script that takes a Nextclade CDS translation FASTA and calculates (valid AA)/(total length). The result is rounded to 3 significant figures.
* aggregate_gene_coverage_by_gene: combines the gene_coverage files by gene (e.g. ["E", "NS1"]) across all serotypes (e.g. denv1-4)
* appends_gene_coverage_columns: adds each gene_coverage column (e.g. "E_coverage", "NS1_coverage") to the final metadata
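The (valid AA)/(total length) calculation can be sketched as follows. This is not the script added in this PR. It assumes that gap (`-`) and unknown (`X`) characters are the only ones counted as invalid, so a stop codon `*` would count as valid here. The function names and the example sequences are hypothetical.

```python
import io

# Assumption: gaps and unknown residues are the only "invalid" characters.
INVALID = set("-X")


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gene_coverage(translation):
    """Return valid residues / total length, rounded to 3 significant figures."""
    if not translation:
        return 0.0
    valid = sum(1 for aa in translation.upper() if aa not in INVALID)
    return float(f"{valid / len(translation):.3g}")


# Hypothetical translated CDS alignment for two samples.
example = io.StringIO(">A1\nMRCIG-XX\n>A2\nMR-\n")
coverage = {name: gene_coverage(seq) for name, seq in read_fasta(example)}
print(coverage)
```

Writing one such table per gene and serotype, then joining the tables onto the metadata by sequence ID, would correspond to the aggregate and append rules listed above.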

joverlee521

Thank you for continuing to push on this @j23414! This looks good to merge to me 👍

j23414 requested a review from a team March 23, 2024 16:09

joverlee521 reviewed Mar 25, 2024

j23414 added a commit that referenced this pull request Mar 27, 2024

fixup: drop the E_indicator column

5d05098

#36 (comment) Since we are not using the E_indicator column, drop it. We have separate steps to calculate the E_coverage column.

j23414 requested a review from a team April 5, 2024 20:01

j23414 added a commit that referenced this pull request Apr 10, 2024

fixup: drop the E_indicator column

468f03d

#36 (comment) Since we are not using the E_indicator column, drop it. We have separate steps to calculate the E_coverage column.

j23414 force-pushed the add-gene-coverage-columns branch from 9622474 to 3e273e1 Compare April 10, 2024 19:18

j23414 added a commit that referenced this pull request Apr 15, 2024

fixup: drop the E_indicator column

688b91d

#36 (comment) Since we are not using the E_indicator column, drop it. We have separate steps to calculate the E_coverage column.

j23414 force-pushed the add-gene-coverage-columns branch from 3e273e1 to 7b24c9f Compare April 15, 2024 18:19

j23414 and others added 14 commits April 16, 2024 15:09

Add alignmentStart,alignmentEnd from nextclade results

9839bd1

Add genome_coverage and indicator (True/blank) variable for E_coverage

2a9eee4

This is using the Nextclade "coverage" as "genome_coverage" and the Nextclade "failedCdses" to check if E_coverage is present or not. fixup: use 1 instead of true

fixup: silence slack notifications for now

12947fa

Add gene configs to allow for calculating gene coverage

d22016f

Output Nextclade gene translations to a fasta files

ddf0fb3

This can be one gene or a set of genes, can then be used to calculate gene_coverage columns.

Only have final files in the "results" directory

1a6d1ef

Move intermediate files to the "data" folder

Add script to calculate gene coverage from Nextclade translated amino acid FASTA file

35bbb83

fixup: Use tsv-append instead

685e218

Co-authored-by: Jover Lee <[email protected]>

fixup: drop the E_indicator column

c861942

#36 (comment) Since we are not using the E_indicator column, drop it. We have separate steps to calculate the E_coverage column.

fixup: gene coverage script docs

e722c76

fixup: avoid awk column number selection

7b94670

fixup: move hard-coded columns to a shared workflow variable or config params so they don't get out of sync between rules

1e7cde8

Use serotype/gene/files in directory structure

f6a620d

Encode serotype and gene as part of the directory structure where possible.

j23414 force-pushed the add-gene-coverage-columns branch from 7b24c9f to f6a620d Compare April 16, 2024 22:09

Use a one-to-one mapping of Nextclade input to output columns

db300a5

As suggested by #36 (comment) Merge ID should be the first item in the map

joverlee521 approved these changes Apr 23, 2024

j23414 merged commit 4ee7ec5 into main Apr 24, 2024
32 checks passed

j23414 deleted the add-gene-coverage-columns branch April 24, 2024 23:05

j23414 mentioned this pull request May 8, 2024

Add E gene builds #17

Closed

joverlee521 mentioned this pull request Jul 10, 2024

ingest: Standardize steps for adding gene coverage to metadata nextstrain/pathogen-repo-guide#50

Open
