-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Improve speed of --output-strains
and --output-metadata
#1469
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1469 +/- ##
==========================================
- Coverage 68.85% 68.70% -0.16%
==========================================
Files 69 69
Lines 7607 7624 +17
Branches 1861 1867 +6
==========================================
Hits 5238 5238
- Misses 2086 2100 +14
- Partials 283 286 +3 ☔ View full report in Codecov by Sentry. |
output_metadata_handle.close() | ||
if output_strains: | ||
output_strains.close() | ||
tsv_join = which("tsv-join") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using tsv-utils/tsv-join in Augur
@tsibley and I chatted about this yesterday. Two options:
- Detect
tsv-join
in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as current invocation offasttree
/raxml
/iqtree
/vcftools
/etc. except those are explicitly requested by the user whiletsv-join
could be detected and used automatically as a faster alternative to the Python approach. - We could bundle
tsv-join
as part of Augur to avoid the the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could bundle tsv-join as part of Augur to avoid the the downsides of (1).
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:
We could bundle tsv-join as part of Augur to avoid the the downsides of (1).
This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.
At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[bundling] seems like the best way to provide this better experience to the most users
I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join
in the managed runtimes.
[bundling] follows the pattern of bundling other third-party tools like you mention above.
Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.
What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
Great point - I think this will be the easiest way to push the feature through:
- don't bundle
- use
tsv-join
if it's available - use Python fallback with a warning to consider downloading
tsv-join
in the environment if experiencing slowness
We can still consider bundling in a future version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base
uses the conda-forge one seamlessly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.
a48025b
to
afb010c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more fragile than I initially expected.
-
tsv-join
will only be used when all of these conditions are met:tsv-join
is availablexzcat
/gzcat
/zstdcat
is available if the input type is compressed- the output type is uncompressed (due to limitations)
-
Even with uncompressed output, there are some slight differences in behavior when it comes to handling quoted columns: afb010c
Threads for each point below.
@@ -7,6 +7,8 @@ the default quotechar, any column names with that character may be altered. | |||
|
|||
Quoted columns containing the tab delimiter are left unchanged. | |||
|
|||
# FIXME: tsv-join has different behavior here. Test both? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. --output-metadata-attempt-tsv-utils
.
Write the strain list directly instead of going through the metadata. This is much faster on large datasets. The side effect is that --output-strains is sorted alphabetically instead of retaining the order from the original metadata. That order was noted to be retained in 24.2.0 changelog but it's not explicitly said anywhere else.
tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).
022fcd3
to
91dafbf
Compare
Description of proposed changes
tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).
Related issue(s)
Checklist