Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Improve speed of --output-strains and --output-metadata #1469

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented May 18, 2024

Description of proposed changes

tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).

Related issue(s)

Checklist

  • Address FIXMEs
  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@victorlin victorlin self-assigned this May 18, 2024
Copy link

codecov bot commented May 18, 2024

Codecov Report

Attention: Patch coverage is 59.52381% with 17 lines in your changes are missing coverage. Please review.

Project coverage is 68.70%. Comparing base (4923408) to head (a48025b).
Report is 1 commits behind head on master.

Files Patch % Lines
augur/filter/io.py 54.05% 14 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1469      +/-   ##
==========================================
- Coverage   68.85%   68.70%   -0.16%     
==========================================
  Files          69       69              
  Lines        7607     7624      +17     
  Branches     1861     1867       +6     
==========================================
  Hits         5238     5238              
- Misses       2086     2100      +14     
- Partials      283      286       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

output_metadata_handle.close()
if output_strains:
output_strains.close()
tsv_join = which("tsv-join")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using tsv-utils/tsv-join in Augur

@tsibley and I chatted about this yesterday. Two options:

  1. Detect tsv-join in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as current invocation of fasttree/raxml/iqtree/vcftools/etc. except those are explicitly requested by the user while tsv-join could be detected and used automatically as a faster alternative to the Python approach.
  2. We could bundle tsv-join as part of Augur to avoid the the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could bundle tsv-join as part of Augur to avoid the the downsides of (1).

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:

We could bundle tsv-join as part of Augur to avoid the the downsides of (1).

This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.

At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Copy link
Member Author

@victorlin victorlin May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[bundling] seems like the best way to provide this better experience to the most users

I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join in the managed runtimes.

[bundling] follows the pattern of bundling other third-party tools like you mention above.

Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.

What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Great point - I think this will be the easiest way to push the feature through:

  • don't bundle
  • use tsv-join if it's available
  • use Python fallback with a warning to consider downloading tsv-join in the environment if experiencing slowness

We can still consider bundling in a future version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base uses the conda-forge one seamlessly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.

@victorlin victorlin force-pushed the victorlin/update-filter-outputs branch from a48025b to afb010c Compare July 17, 2024 22:14
Copy link
Member Author

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more fragile than I initially expected.

  1. tsv-join will only be used when all of these conditions are met:

    • tsv-join is available
    • xzcat/gzcat/zstdcat is available if the input type is compressed
    • the output type is uncompressed (due to limitations)
  2. Even with uncompressed output, there are some slight differences in behavior when it comes to handling quoted columns: afb010c

Threads for each point below.

augur/filter/io.py Outdated Show resolved Hide resolved
@@ -7,6 +7,8 @@ the default quotechar, any column names with that character may be altered.

Quoted columns containing the tab delimiter are left unchanged.

# FIXME: tsv-join has different behavior here. Test both?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. --output-metadata-attempt-tsv-utils.

augur/filter/io.py Outdated Show resolved Hide resolved
augur/filter/io.py Outdated Show resolved Hide resolved
Write the strain list directly instead of going through the metadata.
This is much faster on large datasets.

The side effect is that --output-strains is sorted alphabetically
instead of retaining the order from the original metadata. That order
was noted to be retained in 24.2.0 changelog but it's not explicitly
said anywhere else.
tsv-join is much faster than the other implementation here (18x faster -
12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16
million rows).
@victorlin victorlin force-pushed the victorlin/update-filter-outputs branch from 022fcd3 to 91dafbf Compare August 3, 2024 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants