-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Improve speed of --output-strains
and --output-metadata
#1469
Draft
victorlin
wants to merge
4
commits into
master
Choose a base branch
from
victorlin/update-filter-outputs
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
29 changes: 29 additions & 0 deletions
29
tests/functional/filter/cram/filter-output-metadata-compressed.t
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
Setup | ||
|
||
$ source "$TESTDIR"/_setup.sh | ||
|
||
Use the same options with 3 different compression methods. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
> --subsample-max-sequences 5 \ | ||
> --subsample-seed 0 \ | ||
> --output-metadata filtered_metadata.tsv.gz 2>/dev/null | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
> --subsample-max-sequences 5 \ | ||
> --subsample-seed 0 \ | ||
> --output-metadata filtered_metadata.tsv.xz 2>/dev/null | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
> --subsample-max-sequences 5 \ | ||
> --subsample-seed 0 \ | ||
> --output-metadata filtered_metadata.tsv.zst 2>/dev/null | ||
|
||
# The uncompressed outputs are identical. | ||
|
||
$ diff <(gzcat filtered_metadata.tsv.gz) <(xzcat filtered_metadata.tsv.xz) | ||
|
||
$ diff <(gzcat filtered_metadata.tsv.gz) <(zstdcat filtered_metadata.tsv.zst) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,6 +7,8 @@ the default quotechar, any column names with that character may be altered. | |
|
||
Quoted columns containing the tab delimiter are left unchanged. | ||
|
||
# FIXME: tsv-join has different behavior here. Test both? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. |
||
|
||
$ cat >metadata.tsv <<~~ | ||
> strain "col 1" | ||
> SEQ_1 a | ||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using tsv-utils/tsv-join in Augur
@tsibley and I chatted about this yesterday. Two options:
tsv-join
in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as current invocation offasttree
/raxml
/iqtree
/vcftools
/etc. except those are explicitly requested by the user whiletsv-join
could be detected and used automatically as a faster alternative to the Python approach.tsv-join
as part of Augur to avoid the the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:
This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.
At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including
tsv-join
in the managed runtimes.Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.
Great point - I think this will be the easiest way to push the feature through:
tsv-join
if it's availabletsv-join
in the environment if experiencing slownessWe can still consider bundling in a future version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue.
conda-base
uses the conda-forge one seamlessly.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.