ingest: properly handle TSVs with csvtk
Update the handling of TSVs with `csvtk`, following the Nextstrain data
format docs.¹ Instead of going through extra `csv2tsv`/`csvtk fix-quotes`
commands, just replace `tsv-select` with `csvtk`. Note that NCBI uses
IANA TSVs,² so the `dataformat` output must go through `csvtk fix-quotes`
to be properly quoted as CSV-like TSVs.
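
The quoting difference described above can be illustrated with a minimal sketch using Python's standard `csv` module. This is not how `csvtk fix-quotes` is implemented; it only shows why a literal double quote in an IANA-style TSV is misread by a parser that applies CSV quoting rules, and what a CSV-like quoted form looks like. The function names and the sample record are hypothetical.

```python
import csv
import io


def parse_iana_tsv(text):
    """Split on tabs only; double quotes are ordinary characters (IANA-style)."""
    return [line.split("\t") for line in text.splitlines()]


def parse_csv_like_tsv(text):
    """Parse with CSV quoting rules, using a tab as the delimiter."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


def to_csv_like_tsv(rows):
    """Write rows as CSV-like TSV; fields containing quotes get quoted and escaped."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical record: the second field begins with a literal double quote.
raw = 'accession\thost\nA1.1\t"Homo sapiens" sample\n'

# Read as IANA TSV, the quotes are kept. Read with CSV rules, they are
# consumed as quoting syntax and the field value changes.
iana_rows = parse_iana_tsv(raw)
csv_rows = parse_csv_like_tsv(raw)

# Re-quoting the IANA rows gives a form that CSV-style parsers read back intact.
fixed = to_csv_like_tsv(iana_rows)
```

Here `parse_csv_like_tsv(fixed)` returns the same rows as `parse_iana_tsv(raw)`, which is the property the pipeline relies on once the `dataformat` output has been re-quoted.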

- `tsv-select` to subset columns is directly replaced with `csvtk cut -t`.
- `tsv-select` to reorder columns is replaced with `csvtk mutate --at 1`
to insert the new `accession` column as the first column.
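
The two column operations above can be sketched in plain Python to show the intended result. This is an illustration of the behaviour, not `csvtk`'s implementation. The helper names and sample rows are hypothetical, and falling back to the original value when the pattern does not match is an assumption about `csvtk mutate`'s default behaviour (without `--na`).

```python
import re


def select_columns(rows, names):
    """Keep only the named columns, in the requested order (like `csvtk cut -t -f`)."""
    header, *body = rows
    idx = [header.index(name) for name in names]
    return [[row[i] for i in idx] for row in [header, *body]]


def mutate_first(rows, source, new, pattern):
    """Derive a new column from capture group 1 of `pattern` applied to `source`
    and insert it as the first column (like `csvtk mutate -p ... --at 1`)."""
    header, *body = rows
    src = header.index(source)
    rx = re.compile(pattern)
    out = [[new, *header]]
    for row in body:
        match = rx.search(row[src])
        # Assumption: unmatched values are copied through unchanged.
        value = match.group(1) if match else row[src]
        out.append([value, *row])
    return out


# Hypothetical rows shaped like the renamed NCBI report.
rows = [
    ["accession_version", "country"],
    ["MN908947.3", "China"],
    ["NOVERSION", "USA"],
]

# The pattern is the one used in the rule: everything before the first dot.
with_accession = mutate_first(rows, "accession_version", "accession", r"^(.+?)\.")
subset = select_columns(with_accession, ["accession", "country"])
```

Inserting the derived column at position 1 directly is what removes the need for the separate `tsv-select -H -f accession --rest last` reordering step.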

¹ <https://docs.nextstrain.org/en/latest/reference/data-formats.html#id1>
² <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/file-formats/metadata-files/about-json-and-tabular/>
joverlee521 committed Dec 6, 2024
1 parent f55c6c2 commit 241b1bb
Showing 3 changed files with 4 additions and 6 deletions.
2 changes: 1 addition & 1 deletion ingest/rules/curate.smk
@@ -126,6 +126,6 @@ rule subset_metadata:
 metadata_fields=",".join(config["curate"]["metadata_columns"]),
 shell:
 """
-tsv-select -H -f {params.metadata_fields} \
+csvtk cut -t -f {params.metadata_fields} \
 {input.metadata} > {output.subset_metadata}
 """
6 changes: 2 additions & 4 deletions ingest/rules/fetch_from_ncbi.smk
@@ -97,11 +97,9 @@ rule format_ncbi_dataset_report:
 --fields {params.ncbi_datasets_fields:q} \
 --elide-header \
 | csvtk fix-quotes -Ht \
-| csvtk add-header -t -l -n {params.ncbi_datasets_fields:q} \
+| csvtk add-header -t -n {params.ncbi_datasets_fields:q} \
 | csvtk rename -t -f accession -n accession_version \
-| csvtk -t mutate -f accession_version -n accession -p "^(.+?)\." \
-| csvtk del-quotes -t \
-| tsv-select -H -f accession --rest last \
+| csvtk -t mutate -f accession_version -n accession -p "^(.+?)\." --at 1 \
 > {output.ncbi_dataset_tsv}
 """

2 changes: 1 addition & 1 deletion ingest/rules/nextclade.smk
@@ -77,7 +77,7 @@ rule nextclade_metadata:
 --id-column {params.nextclade_id_field:q} \
 --field-map {params.nextclade_field_map:q} \
 --output-metadata - \
-| tsv-select --header --fields {params.nextclade_fields:q} \
+| csvtk cut -t --fields {params.nextclade_fields:q} \
 > {output.nextclade_metadata:q}
 """
