Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Read Giraffe #4323

Open
wants to merge 1,301 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
1301 commits
Select commit Hold shift + click to select a range
11fc884
Merge commit 'dfac868' into lr-giraffe
adamnovak Jul 19, 2024
adc6476
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 19, 2024
e9aac7e
Merge commit 'b55be13' into lr-giraffe
adamnovak Jul 19, 2024
e551487
Merge remote-tracking branch 'origin/unlimited-anchors' into lr-giraffe
adamnovak Jul 19, 2024
eb1e302
Make a string identifier for a snarl tree node
xchang1 Jul 19, 2024
b177d07
Fix zipcode identifiers
xchang1 Jul 21, 2024
636abe8
Add chain component to zipcodes
xchang1 Jul 22, 2024
a4da073
Use zipcodes to get chain component for nodes
xchang1 Jul 22, 2024
6e287f5
Add get_chain_component to zipcodes
xchang1 Jul 22, 2024
6565acb
Get chain component from zipcodes for payload
xchang1 Jul 22, 2024
c476d12
Use zipcode chain component
xchang1 Jul 22, 2024
9e2153f
Add failing unit test
xchang1 Jul 22, 2024
2c49a23
Get chain component for irregular snarls
xchang1 Jul 22, 2024
ce9b0ad
Fix debug code
xchang1 Jul 22, 2024
da5f891
Add chain component count to chains zipcodes
xchang1 Jul 23, 2024
3b4855e
Add unit test for looping zipcodes and fix bugs
xchang1 Jul 23, 2024
d135aab
Use zipcodes for chain component in clustering
xchang1 Jul 23, 2024
ddef07f
Use regular snarls to skip finding distances
xchang1 Jul 24, 2024
9761748
Add external connectivity to zipcodes
xchang1 Jul 24, 2024
74d4122
Take out end_in net_handle_t
xchang1 Jul 24, 2024
804c44d
Fix payload with new values
xchang1 Jul 24, 2024
18b4e7e
Add external connectivity for root nodes
xchang1 Jul 24, 2024
56fbd52
Use zipcode for external connectivity
xchang1 Jul 24, 2024
f2bd83a
Don't use distance index for ordering children in chains
xchang1 Jul 24, 2024
07de5de
Fix orientation for simple snarls since we took them out
xchang1 Jul 24, 2024
5203413
Start taking out net_handle_t's from the payload
xchang1 Jul 24, 2024
461de64
Take out grandparent handle
xchang1 Jul 24, 2024
8655513
Add net identifier strings but don't actually use them
xchang1 Jul 25, 2024
5170df0
Use net identifiers as keys for all the lookups
xchang1 Jul 25, 2024
e541b48
Mostly take out finding net handles until they're needed
xchang1 Jul 25, 2024
985ea32
Fix identifiers but its not working
xchang1 Jul 25, 2024
9b36204
Fix parent
xchang1 Jul 26, 2024
d23653a
Fix getting the node identifier
xchang1 Jul 26, 2024
6e09743
Fix getting handle to trivial chain
xchang1 Jul 26, 2024
e433831
Include chain component in identifier
xchang1 Jul 26, 2024
fd9a049
Fix getting net handle for child of root snarl
xchang1 Jul 26, 2024
27f47cf
Fix clustering the root snarl
xchang1 Jul 27, 2024
7e74771
Get chain sorting values for snarls earlier and sort properly for two…
xchang1 Jul 29, 2024
a4eadd3
Get parent from children
xchang1 Jul 29, 2024
bc642aa
Take out distance index when not used
xchang1 Jul 29, 2024
1a32937
Take the distance index out of more functions that don't use it
xchang1 Jul 29, 2024
2a77f66
Use distance index for chain values less
xchang1 Jul 29, 2024
d2fc0a3
turn off debug
xchang1 Jul 29, 2024
82f714e
Get identifier at the same time as payload
xchang1 Jul 30, 2024
0adfea0
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jul 30, 2024
aa87f36
Undo using identifier strings as keys and fix some bugs
xchang1 Jul 31, 2024
b9a2010
Put zipcode and decoder together
xchang1 Jul 31, 2024
a18edec
Fix unit tests
xchang1 Jul 31, 2024
831f231
Fix reserving decoder length
xchang1 Jul 31, 2024
2923cde
Add an int vector that uses a minimal bit width for storing stuff
xchang1 Jul 31, 2024
2bbd62b
Merge branch 'read-quietly' into lr-giraffe
adamnovak Jul 31, 2024
595cafb
Use new int vectors for zipcodes but it doesn't work yet
xchang1 Aug 1, 2024
c5385a3
Merge remote-tracking branch 'origin/low-cplx-context' into lr-giraffe
adamnovak Aug 1, 2024
de6c76f
Fix zipcodes
xchang1 Aug 2, 2024
7aa1fe7
Revert using minint vectors
xchang1 Aug 3, 2024
39ed6c8
Make decoder use fewer bits
xchang1 Aug 3, 2024
116dc01
Only store zipcodes in a separate file
xchang1 Aug 3, 2024
eace413
Serialize zipcode and decoder
xchang1 Aug 3, 2024
a9dffbe
Revert "Serialize zipcode and decoder"
xchang1 Aug 3, 2024
05480d0
Revert "Only store zipcodes in a separate file"
xchang1 Aug 3, 2024
0207c82
Undo putting zipcode and decoder together
xchang1 Aug 3, 2024
69d48f0
Fix typo
xchang1 Aug 3, 2024
e12b23a
Use distance index less for chain values
xchang1 Aug 3, 2024
f4f10f3
Put decoder back with zipcode
xchang1 Aug 5, 2024
d72d232
Serialize decoder
xchang1 Aug 5, 2024
c4a2e48
Actually serialize the decoder
xchang1 Aug 5, 2024
7fd62d2
Put decoder into zipcode payload
xchang1 Aug 5, 2024
020cbb4
Actually serialize the decoder
xchang1 Aug 5, 2024
06a6b04
Add an unpacked zipcode but it doesn't compile yet
xchang1 Aug 6, 2024
d5a9e4d
Get everything to compile
xchang1 Aug 6, 2024
dfde735
Add unit tests and fix bug for unpacked zip codes
xchang1 Aug 6, 2024
1ecc8ae
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 7, 2024
7cc9f4c
Use the unpacked zipcode for clustering instead of the old payload
xchang1 Aug 7, 2024
2c955f9
Use the length of the last chain component as the length of a multico…
xchang1 Aug 8, 2024
cf6b5be
Use unpacked zipcode in SnarlTreeNodeProblem instead of zipcode
xchang1 Aug 8, 2024
e5fe8b6
Reserve memory for unpacked zipcode
xchang1 Aug 9, 2024
8ccb110
Reserve memory
xchang1 Aug 9, 2024
a73b80e
Take out chain component
xchang1 Aug 9, 2024
4286171
Fix bug getting best distanecs
xchang1 Aug 9, 2024
434a861
Make vg filter see empty list annotations as not there
adamnovak Aug 9, 2024
007df92
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 9, 2024
dbc77d7
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 9, 2024
5beb1be
Get the length of a cyclic chain
xchang1 Aug 10, 2024
a0cbb23
Turn off debug
xchang1 Aug 10, 2024
d7e2553
Get values from unpacked zipcode for children of problems
xchang1 Aug 10, 2024
98a5518
Take out unused chain component ints
xchang1 Aug 11, 2024
f5ad687
Reserve memory when getting zipcodes from payload
xchang1 Aug 12, 2024
d02f4f7
Remove unused ints and reserve memory for unpacked zipcode
xchang1 Aug 12, 2024
2762b05
Reserve less memory because it made it slower
xchang1 Aug 12, 2024
f5e5638
Undo unpacking zipcode to get back to 020cbb
xchang1 Aug 13, 2024
c172816
Fix bug getting minimum distances
xchang1 Aug 13, 2024
aff0cc6
Get chains last component length and get chain length from zipcode
xchang1 Aug 13, 2024
7ed8f6e
Make a decoded code type (but not for roots) and use it for building …
xchang1 Aug 13, 2024
a600bbc
Add getters and setters for unpacked codes
xchang1 Aug 13, 2024
4b994b6
Move build info headers from phony targets to top-level shell invocat…
adamnovak Aug 13, 2024
cc418dc
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 13, 2024
aaf3cb9
Remove extra tabs
adamnovak Aug 13, 2024
3cbd7be
Add functions for interpreting one level of the zipcode
xchang1 Aug 13, 2024
72025a0
Add unit tests for unpacked codes
xchang1 Aug 13, 2024
e00dd49
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 13, 2024
ae6b91a
Cut direct, broken static flag to version header dependency
adamnovak Aug 13, 2024
096000a
Use chain_minimum_length in zipcode
xchang1 Aug 14, 2024
0356878
Make unacking const
xchang1 Aug 14, 2024
dca72e7
Used unpacked zipcode for getting chain values
xchang1 Aug 14, 2024
01e7b4d
Use unpacked zipcode for getting snarl values and fix but I missed fi…
xchang1 Aug 14, 2024
62f6c38
Get parent with distance index if its faster
xchang1 Aug 14, 2024
b0c0234
Make a map from connected component number to net handle
xchang1 Aug 14, 2024
a4410a9
Make map from node id to net handle
xchang1 Aug 14, 2024
45cba14
Revert "Make map from node id to net handle"
xchang1 Aug 14, 2024
17120fb
Reserve memory for children
xchang1 Aug 14, 2024
b0f1f70
Revert "Reserve memory for children"
xchang1 Aug 14, 2024
083f2b3
Set up Docker build and CI to get the version build right
adamnovak Aug 14, 2024
1248b0f
Remove wayward tabs
adamnovak Aug 14, 2024
08218c6
Report errors better for vg inject failures
adamnovak Aug 14, 2024
35ae926
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Aug 15, 2024
0e7d66f
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 16, 2024
370f8fa
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Aug 19, 2024
748169d
Implement consistent orientation around WFA connections
adamnovak Aug 19, 2024
4e871f9
Implement consistent orientation for align_sequence_between with a wr…
adamnovak Aug 19, 2024
3a6022f
Fix build errors in implementation
adamnovak Aug 19, 2024
eef142d
Add test for non-WFA orientation fix
adamnovak Aug 19, 2024
b99200a
Add testing for connect_consistently
adamnovak Aug 19, 2024
f36ebdd
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Aug 21, 2024
ff6fa27
Use bitfields for child_info_t and add chain component
xchang1 Aug 22, 2024
247aaef
Check chain component when making zipcode trees
xchang1 Aug 22, 2024
cb1f600
Add unit test for multicomponent chain
xchang1 Aug 22, 2024
292bdd4
Merge remote-tracking branch 'upstream/master' into HEAD
xchang1 Aug 22, 2024
bd8ded6
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
adamnovak Aug 22, 2024
ad62378
Add handling for softclips off the edge of contigs in inject
adamnovak Aug 29, 2024
3bf557e
Fetch out edit
adamnovak Aug 29, 2024
d4b77fb
Add missing continue
adamnovak Aug 29, 2024
7ecf940
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 3, 2024
fab3f5e
Fix empty top-level snarl zip trees
xchang1 Sep 3, 2024
f0e8389
Turn off debug
xchang1 Sep 3, 2024
a7ba479
Remove any empty snarl
xchang1 Sep 3, 2024
1b5387f
Check if previous_index is 0
xchang1 Sep 4, 2024
1bef104
Check orientation of seeds
xchang1 Sep 9, 2024
a325238
Add stuff to go around a looping chain but I haven't tested it yet
xchang1 Sep 9, 2024
20fa2d2
Add unit test for looping chains and fix sorting nodes in multicompon…
xchang1 Sep 10, 2024
dafb61b
Don't count nodes that loop
xchang1 Sep 10, 2024
beec239
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Sep 10, 2024
cd1ffce
Take out looping chains in zipcode trees
xchang1 Sep 12, 2024
92628c9
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 12, 2024
cffccc9
Use more bits for distance index indices
xchang1 Sep 16, 2024
2b9ac09
Try bailing on banded global alignment between seeds if it's going to…
xchang1 Sep 18, 2024
3d118f3
Add dp limit to be the same as the dozeu limit
xchang1 Sep 18, 2024
c1f7cf5
Get cell count estimate right
xchang1 Sep 18, 2024
61df098
Get sequence properly
xchang1 Sep 18, 2024
068d88c
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Sep 18, 2024
c667c6a
Revert trying to stop banded global aligner when the problem is too big
xchang1 Sep 18, 2024
0850aa2
Revert "Try bailing on banded global alignment between seeds if it's …
xchang1 Sep 18, 2024
b2b2ea9
Apply --max-dp-cells to BandedGlobalAligner and set to 100M for R10 a…
adamnovak Sep 18, 2024
42f2915
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 22, 2024
5041ff4
Lower default max-dp-cells
xchang1 Sep 22, 2024
062ba7a
Actually use fragment-min-score
xchang1 Sep 24, 2024
839e934
Bump max-dp-cells back up
xchang1 Sep 24, 2024
4aee9f3
Increase max-dp-cells. Works for r10 reads on non-d9 hprc graph but I…
xchang1 Oct 18, 2024
b05d3e3
Break off most of main() from the x86-64-supporting preflight code
adamnovak Oct 18, 2024
8459d7b
Partially revert "fix build on Ubuntu 24.04.1 / GCC 13.2.0"
adamnovak Oct 18, 2024
527f4a4
Make padded window size parity match anchor parity for orientation in…
adamnovak Oct 21, 2024
ef939ee
Remove extra debugging
adamnovak Oct 21, 2024
09d2e7e
Change short tail pruning to only require being one tail and not both
adamnovak Oct 21, 2024
1ff1009
Add a test for surjection orientation independence
adamnovak Oct 21, 2024
4bb81db
Add option to sort by alignment score but break ties with chain scores
xchang1 Oct 22, 2024
bb1f31d
Fix bug
xchang1 Oct 22, 2024
875f432
Fix order
xchang1 Oct 23, 2024
c25f9ed
Add a script to cut GAM records to node ranges
adamnovak Oct 24, 2024
1221a0c
Add simple local slide check to anchor pruning
adamnovak Oct 24, 2024
04e267c
Increase search length to include gap length when finding the dagifie…
xchang1 Oct 25, 2024
1863c9e
Add the max_gap_length earlier to match what the other calls to align…
xchang1 Oct 25, 2024
de2f21c
Merge pull request #4425 from vgteam/lr-giraffe-missinsert
xchang1 Oct 25, 2024
af6ce11
Add --max-slide option to surject to control duplicate anchor sequenc…
adamnovak Oct 25, 2024
092568c
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Oct 25, 2024
afb393f
Add timing to surject and std:: to move
adamnovak Oct 25, 2024
ace49a2
Turn off debug output
xchang1 Oct 26, 2024
821623a
Merge pull request #4427 from xchang1/lr-giraffe-missinsert
xchang1 Oct 26, 2024
adeed1c
Switch to sort_shuffling_ties for sorting minimizers
xchang1 Oct 28, 2024
d70b6bd
Keep picking seeds until the agglomerations cover most of the read
xchang1 Oct 28, 2024
69fc8d4
Add a target read coverage for minimizers
xchang1 Oct 29, 2024
08257c0
Only keep seeds if there is no coverage in their agglomeratoin
xchang1 Oct 29, 2024
b3578c9
Remember the coverage of reads taken before the minimizer count cutoff
xchang1 Oct 29, 2024
998ca50
Remember to count the bases of seeds
xchang1 Oct 29, 2024
f3836d5
Use a constant flank size instead of agglomeration
xchang1 Oct 29, 2024
73ef8e8
Fix adding coverage
xchang1 Oct 29, 2024
ff0e63d
Fix counting overlapping bases
xchang1 Oct 30, 2024
44e1dc2
Take out target coverage and increase flank
xchang1 Oct 30, 2024
345163b
Make minimizer flank coverage a command line option
xchang1 Oct 30, 2024
d0e7490
Make limiting the max_middle_gap an option
xchang1 Oct 30, 2024
30a8319
Add new defaults
xchang1 Oct 31, 2024
329e434
Make sorting minimizers check key
xchang1 Nov 1, 2024
9f569ed
Limit the number of hits that a minimizer can have when adding it bey…
xchang1 Nov 4, 2024
e667000
Merge pull request #4429 from xchang1/lr-giraffe-seedsort
xchang1 Nov 4, 2024
8e0373a
Only keep extra minimizers if they're as good as the ones we kept
xchang1 Nov 5, 2024
4bcd495
Add flag for minimizer repetitiveness and an hmm repetitiveness finder
xchang1 Nov 7, 2024
aa5e525
Add skippability to anchors and skip skippable anchors when aligning …
xchang1 Nov 8, 2024
656b18e
Fix bug looking off the end of a vector
xchang1 Nov 8, 2024
1432937
Don't skip minimizers if there's no gap
xchang1 Nov 8, 2024
00c1232
Fix bug
xchang1 Nov 8, 2024
defcbad
Fix which score gets used
xchang1 Nov 8, 2024
2eb412e
Set neighbors of a repetitive minimizers as being repetitive
xchang1 Nov 8, 2024
a14a8d5
Merge remote-tracking branch 'upstream/master' into HEAD
adamnovak Nov 8, 2024
77181c8
Only skip minimizers if there is a gap
xchang1 Nov 9, 2024
6c598bb
Fix adding distances and add a parameter for how much graph distance …
xchang1 Nov 11, 2024
fba46b5
Make it the default to not take out seeds
xchang1 Nov 12, 2024
e9fa199
Merge pull request #4441 from xchang1/skip-minimizers
xchang1 Nov 12, 2024
567754e
Add a way for gamcompare to ignore truth contigs
adamnovak Nov 12, 2024
b05cbd3
Don't add reads with all positions dropped to truth table
adamnovak Nov 12, 2024
20bdae0
Add unit test for nested multicomponent chain
xchang1 Nov 14, 2024
59ca07e
Make the zip code tree maker use the prefix sum values in the orienta…
xchang1 Nov 14, 2024
39f2d62
Flip runs of a cyclic snarl to take into account the new order of the…
xchang1 Nov 15, 2024
8413d3b
Add unit tests for what I thought would fail but looks like the bug I…
xchang1 Nov 15, 2024
95dae9e
Get the prefix sum to the right side of a snarl in a chain
xchang1 Nov 19, 2024
0c2c133
Fix getting snarl offsets in children of cyclic snarls to match the o…
xchang1 Nov 19, 2024
2c63e47
Get snarl offsets as a range in the child of a cyclic snarl
xchang1 Nov 19, 2024
a3d55b0
Fix bug getting the grandchild of a snarl
xchang1 Nov 20, 2024
b25d730
Merge remote-tracking branch 'upstream/lr-giraffe' into ziptree-bug
xchang1 Nov 20, 2024
c5194dc
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Nov 22, 2024
0c40b14
Take out old scoring by chain score because it didn't help
xchang1 Nov 22, 2024
abf3e4a
Serialize the zipcodes with the minimizers in autoindex
xchang1 Nov 22, 2024
0303290
Merge pull request #4453 from xchang1/ziptree-bug
xchang1 Nov 24, 2024
ffaab17
Add lr-giraffe option to autoindex with minimizer parameters
xchang1 Nov 25, 2024
626f925
Make giraffe subcommand use the index registry's zipcodes
xchang1 Nov 25, 2024
ae25abf
Fix zipcode file in output of recipe
xchang1 Nov 25, 2024
1c97950
Make giraffe build long read minimizers for long read presets
xchang1 Nov 25, 2024
f3792b8
Make minimizer weighting parameters part of the autoindex parameters
xchang1 Nov 26, 2024
03ed290
Split up short and long read minimizer making
xchang1 Nov 26, 2024
b453f4c
Actually use long read parameters
xchang1 Nov 26, 2024
3f37c9a
Make giraffe pick which indexes to use
xchang1 Nov 26, 2024
be42c30
Add an autoindex option to start from a gbz file
xchang1 Nov 26, 2024
76c6f2c
Make autoindexer guess at and use files that weren't provided
xchang1 Nov 26, 2024
b8048df
Take out unnecesary string constructor
xchang1 Nov 26, 2024
4e19b76
Explain large alignments with summaries in surject
adamnovak Dec 4, 2024
7cc13e8
Limit surjection BGA size guesstimate
adamnovak Dec 4, 2024
5e3a55c
Use Giraffe-style exception-based BGA limit in surject
adamnovak Dec 6, 2024
aa0938e
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Dec 13, 2024
23dd7e5
Select libbdsg we build with
adamnovak Dec 13, 2024
d2dd1dc
Stop passing the same read repeatedly on max-rescue-attempts
adamnovak Dec 13, 2024
dc96947
Handle timing calculations when only 1 thread ever really works
adamnovak Dec 13, 2024
8c56599
Filter out multiple passes from being in multiple pairs
adamnovak Dec 13, 2024
c5b70d5
Add more helpful error message when zipcodes fail
xchang1 Dec 16, 2024
39cdb56
Safely re-enable multithreaded BGZF reading
adamnovak Dec 16, 2024
bf37a08
Try speeding up minimizer building by saving zipcodes by node id
xchang1 Dec 17, 2024
fcd3150
Use Xian's fork of gbwtgraph to update minimizer version
xchang1 Dec 17, 2024
c88b64a
Reserve memory for hash_map
xchang1 Dec 18, 2024
bcc6d66
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Dec 18, 2024
6d2c333
Merge remote-tracking branch 'origin/update-minimizers' into lr-giraffe
adamnovak Dec 18, 2024
508da4e
Merge remote-tracking branch 'origin/update-minimizers' into HEAD
xchang1 Dec 18, 2024
fd229e1
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Dec 18, 2024
1acaa4c
Merge pull request #4455 from xchang1/autoindex-zipcodes
xchang1 Dec 19, 2024
71428eb
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Dec 19, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion deps/dozeu
Submodule dozeu updated 2 files
+87 −8 dozeu.h
+65 −58 example.c
2 changes: 1 addition & 1 deletion deps/sublinear-Li-Stephens
4 changes: 4 additions & 0 deletions doc/publish-docs.sh
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,10 @@ COMMIT_AUTHOR_EMAIL="[email protected]"
# We expect GITLAB_SECRET_FILE_DOCS_SSH_KEY to come in from the environment,
# specifying the private deploy key we will use to get at the docs repo.

# Make sure no submodules have untracked files from caching
# See <https://gist.github.com/nicktoumpelis/11214362#file-repo-rinse-sh-L2>
git submodule foreach --recursive git clean -xfd

# Find all the submodules that Doxygen wants to look at and make sure we have
# those.
cat Doxyfile | grep "^INPUT *=" | cut -f2 -d'=' | tr ' ' '\n' | grep "^ *deps" | sed 's_ *\(deps/[^/]*\).*_\1_' | sort | uniq | xargs -n 1 git submodule update --init --recursive
Expand Down
60 changes: 52 additions & 8 deletions scripts/giraffe-facts.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,8 +98,10 @@ def parse_args(args):
parser = argparse.ArgumentParser(description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)

parser.add_argument("--input", type=argparse.FileType('r'), default=sys.stdin,
help="line-oriented JSON GAM to process")
parser.add_argument("input", type=str,
help="GAM to process")
parser.add_argument("--vg", type=str, default="vg",
help="vg binary to use")
parser.add_argument("outdir",
help="directory to place output in")

Expand Down Expand Up @@ -286,11 +288,44 @@ def add_in_stats(destination, addend):

def read_line_oriented_json(lines):
"""
For each line in the given stream, yield it as a parsed JSON object.
For each line in the given iterable of lines (such as a stream), yield it as a parsed JSON object.
"""

for line in lines:
yield json.loads(line)
line = line.strip()
if len(line) > 0:
yield json.loads(line)


def read_read_views(vg, filename):
"""
Given a vg binary and a filename, iterate over subsets of the parsed read dicts for each read in the file.

The subsets will have the annotation and time_used fields.
"""

# Extract just the annotations and times of reads as JSON, with a # header
# We don't know all the annotation field names in advance so we have to dump them all.
filter_process = subprocess.Popen([vg, "filter", "--tsv-out", "annotation;time_used", filename], stdout=subprocess.PIPE)

lines = iter(filter_process.stdout)
# Drop header line
next(lines)

for line in lines:
# Parse the TSV and reconstruct a view of the full read dict.
line = line.decode('utf-8')
line = line.strip()
if len(line) == 0:
continue
parts = line.split("\t")
assert len(parts) == 2
read = {"annotation": json.loads(parts[0]), "time_used": float(parts[1])}

yield read

return_code = filter_process.wait()
assert return_code == 0

class Table(object):
"""
Expand Down Expand Up @@ -916,11 +951,20 @@ def main(args):
# Count all the reads
read_count = 0

# Record mapping parameters from at least one read
# Record mapping parameters from special magic GAM chunk, if any, or a read
params = None

for read in read_line_oriented_json(options.input):


# Get the params from a magic chunk.
# TODO: This is a whole pass through a possibly big file!
params_json = subprocess.check_output([options.vg, "view", "--extract-tag", "PARAMS_JSON", "--first", options.input]).decode('utf-8')
lines = params_json.split("\n")
for parsed_params in read_line_oriented_json(lines):
if params is None:
params = parsed_params

for read in read_read_views(options.vg, options.input):
# For the data we need on each read

if params is None:
# Go get the mapping parameters
params = sniff_params(read)
Expand Down
2 changes: 1 addition & 1 deletion scripts/giraffe-wrangler.sh
Original file line number Diff line number Diff line change
Expand Up @@ -195,7 +195,7 @@ if [[ ! -z "${SIM_GAM}" ]] ; then

# Compute loss stages
# Let giraffe facts errors out
vg view -aj "${WORK}/mapped.gam" | scripts/giraffe-facts.py "${WORK}/facts" >"${WORK}/facts.txt"
scripts/giraffe-facts.py "${WORK}/mapped.gam" "${WORK}/facts" >"${WORK}/facts.txt"
fi

if [[ ! -z "${REAL_FASTQ}" ]] ; then
Expand Down
93 changes: 57 additions & 36 deletions scripts/make_pbsim_reads.sh
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
#!/usr/bin/env bash
# make_pbsim_reads.sh: script to simulate reads with pbsim2.
# Mostly theoretical; records commands that would have worked better than what was actually run
# Intended to run on UCSC Courtyard/Plaza systems
# Intended to run on UCSC behind-the-firewall systems
# You may also need to CFLAGS=-fPIC pip3 install --user bioconvert

set -ex
Expand All @@ -10,27 +9,34 @@ set -ex
# You can set these in the environment to override them and I don't have to write a CLI option parser.
# See https://stackoverflow.com/a/28085062

# Graph to simulate from. Can be S3 URLs or local file paths.
# Graph to simulate from. Can be S3 URLs or local file paths. If GRAPH_GBZ_URL
# is set, GRAPH_XG_URL and GRAPH_GBWT_URL are not used.
: "${GRAPH_XG_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg}"
: "${GRAPH_GBWT_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gbwt}"
: "${GRAPH_GBZ_URL:=""}"
# Name to use for graph when downloaded
: "${GRAPH_NAME:=hprc-v1.0-mc-grch38}"
# Sample to simulate from
: "${SAMPLE_NAME:=HG00741}"
# Technology name to use in output filenames
: "${TECH_NAME:=hifi}"
# FASTQ to use as a template, or "/dev/null"
: "${SAMPLE_FASTQ:=/public/groups/vg/sjhwang/data/reads/real_HiFi/tmp/HiFi_reads_100k_real.fq}"
: "${SAMPLE_FASTQ:=/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/real/hifi/HiFi_reads_100k.fq}"
# HMM model to use instead of a FASTQ, or "/dev/null"
: "${PBSIM_HMM:=/dev/null}"
# This needs to be the pbsim2 command, which isn't assumed to be in $PATH
: "${PBSIM:=/public/groups/vg/sjhwang/tools/bin/pbsim}"
# This needs to be the pbsim2 binary, which might not be in $PATH.
# It can be installed with
# git clone https://github.com/yukiteruono/pbsim2.git
# cd pbsim2
# git checkout eeb5a19420534a0f672c81db2670117e62a9ee38
# autoupdate
# automake --add-missing
# autoreconf
# ./configure --prefix=$HOME/.local && make
# The binary will be in src/pbsim
: "${PBSIM:=pbsim}"
# Parameters to use with pbsim for simulating reads for each contig. Parameters are space-separated and internal spaces must be escaped.
: "${PBSIM_PARAMS:=--depth 1 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# This needs to be a command line which can execute Stephen's script that adds qualities from a FASTQ back into a SAM that is missing them.
# Arguments are space-separated and internal spaces must be escaped.
# This script is at https://gist.github.com/adamnovak/45ae4f500a8ec63ce12ace4ca77afc21
: "${ADD_QUALITIES:=python3 /public/groups/vg/sjhwang/vg_scripts/bin/readers/sam_reader.py}"
: "${PBSIM_PARAMS:=--depth 4 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# Directory to save results in
: "${OUT_DIR:=./reads/sim/${TECH_NAME}/${SAMPLE_NAME}}"
# Number of MAFs to convert at once
Expand All @@ -49,33 +55,48 @@ fi
# Make sure scratch directory exists
mkdir -p "${WORK_DIR}"

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
if [[ -z "${GRAPH_GBZ_URL}" ]] ; then

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
fi
fi
fi
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

elif [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Fetch the GBZ
if [[ ${GRAPH_GBZ_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
aws s3 cp "${GRAPH_GBZ_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
ln -s "$(realpath "${GRAPH_GBZ_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}-${SAMPLE_NAME}-as-ref.gbz" ]] ; then
Expand Down Expand Up @@ -150,7 +171,7 @@ function do_job() {
mv "${SAM_NAME}.tmp" "${SAM_NAME}"
fi
set -o pipefail
${ADD_QUALITIES} -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
python3 "$(dirname -- "${BASH_SOURCE[0]}")/reinsert_qualities.py" -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
set +o pipefail
mv "${RENAMED_BAM_NAME}.tmp" "${RENAMED_BAM_NAME}"
else
Expand Down Expand Up @@ -207,14 +228,14 @@ fi
# Work out howe many reads there are
TOTAL_READS="$(vg stats -a "${WORK_DIR}/${SAMPLE_NAME}-reads/${SAMPLE_NAME}-sim-${TECH_NAME}.gam" | grep "^Total alignments:" | cut -f2 -d':' | tr -d ' ')"

if [[ "${TOTAL_READS}" -lt 10500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 10000 reads with buffer!"
if [[ "${TOTAL_READS}" -lt 1000500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 1000000 reads with buffer!"
exit 1
fi
echo "Simulated ${TOTAL_READS} reads overall"

SUBSAMPLE_SEED=1
for READ_COUNT in 100 1000 10000 ; do
for READ_COUNT in 100 1000 10000 100000 1000000 ; do
# Subset to manageable sizes (always)
# Get the fraction of reads to keep, overestimated, with no leading 0, to paste onto subsample seed.
FRACTION="$(echo "(${READ_COUNT} + 500)/${TOTAL_READS}" | bc -l | sed 's/^[0-9]*//g')"
Expand Down
39 changes: 39 additions & 0 deletions scripts/mark_secondaries.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#!/usr/bin/python3
# mark_secondaries.py: Mark all but the first alignment with a given name as secondary
"""
Mark duplicate alignments for a given read name as secondary. Useful for GraphAligner output which does not mark its secondaries. Assumes that the first alignment is the primary alignment, ignoring score.

vg view -a unmarked.gam mark_secondaries.py | vg view -JaG - > marked.gam

"""
import sys
import json


def filter_json_gam(infile):
"""
process gam json made with vg view -a my.gam
"""

seen_names = set()

for line in infile:
gam = json.loads(line)

if gam['name'] in seen_names:
gam['is_secondary'] = True
else:
gam['is_secondary'] = False
seen_names.add(gam['name'])

print(json.dumps(gam))

def main():
"""
Main entry point for the program.
"""
filter_json_gam(sys.stdin)

if __name__ == "__main__" :
main()

2 changes: 1 addition & 1 deletion scripts/plot-qq.R
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ print(x$ci)

# Now plot the points as different sizes, but the error bar line ranges as a consistent size
dat.plot <- ggplot(x, aes(1-mapprob+1e-7, 1-observed+1e-7, color=aligner, size=N, weight=N, label=round(mapq,2))) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=1)) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=2)) +
scale_y_log10("measured error", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_x_log10("error estimate", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_size_continuous("number", guide=guide_legend(title=NULL, ncol=4)) +
Expand Down
Loading
Loading