Proposal: Function for optimized retrieval of GT from FORMAT field #105

Merged 5 commits on Nov 30, 2024

cmdcolin (Contributor) commented Nov 23, 2024

Fully parsing the "FORMAT" field requires complex, memory-heavy structures (an Object containing Objects containing arrays of strings or numbers, and so on). This can cause out-of-memory errors and a lot of garbage collection when parsing large 1000 Genomes-style data.

This PR proposes a simplified representation called parseGenotypesOptimized.
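To illustrate the idea, here is a minimal sketch of pulling only GT out of each sample column without building nested per-sample objects. It is not the code in this PR: the function name `extractGenotypes`, its parameters, and the choice of `'.'` for a missing value are all assumptions made for the example.

```typescript
// Hypothetical helper for illustration only; not the library's implementation.
// format:       the FORMAT column, e.g. "GT:DP:GQ"
// sampleFields: the per-sample columns, in the same order as sampleNames
// Returns a flat map of sample name -> raw GT string.
function extractGenotypes(
  format: string,
  sampleFields: string[],
  sampleNames: string[],
): Record<string, string> {
  const result: Record<string, string> = {}
  const gtIndex = format.split(':').indexOf('GT')
  if (gtIndex === -1) {
    // No GT key in FORMAT, so there is nothing to extract
    return result
  }
  sampleNames.forEach((name, i) => {
    const field = sampleFields[i] ?? ''
    // Skip ahead to the gtIndex-th colon-delimited value without
    // allocating an array for every sample field
    let start = 0
    for (let k = 0; k < gtIndex && start !== -1; k++) {
      const next = field.indexOf(':', start)
      start = next === -1 ? -1 : next + 1
    }
    if (start === -1) {
      // Trailing values were dropped for this sample (assumed to mean missing)
      result[name] = '.'
      return
    }
    const end = field.indexOf(':', start)
    result[name] = end === -1 ? field.slice(start) : field.slice(start, end)
  })
  return result
}
```

The result is one flat object of strings per variant, rather than an object of objects of arrays, which is the kind of saving the description above is after.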

This PR is connected to the effort in GMOD/jbrowse-components#4511, which parses large regions of 1000 Genomes-style data.

It also makes the following changes:

  • Changes SAMPLES to a function instead of an Object.defineProperty getter. This makes it more explicit that it is a lazy operation and, to me, makes the code a bit simpler.
  • Adds a GENOTYPES function that parses only the genotypes out of the FORMAT field.
  • Improves the TypeScript typing as a result.
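The difference between the two SAMPLES styles can be sketched as follows. This is an illustration under assumptions, not the library's code: `parseSamples`, `makePropertyStyleVariant`, `makeFunctionStyleVariant`, and the caching behaviour are all hypothetical, and `parseCalls` exists only to make the laziness observable.

```typescript
type SampleMap = Record<string, Record<string, string>>

// Counter used only to show when parsing actually happens
let parseCalls = 0

// Hypothetical full parse of the sample columns into nested objects
function parseSamples(format: string, fields: string[], names: string[]): SampleMap {
  parseCalls++
  const keys = format.split(':')
  const out: SampleMap = {}
  names.forEach((name, i) => {
    const values = (fields[i] ?? '').split(':')
    const entry: Record<string, string> = {}
    keys.forEach((key, j) => {
      entry[key] = values[j] ?? '.'
    })
    out[name] = entry
  })
  return out
}

// Style 1: a lazy getter installed with Object.defineProperty.
// Reading `variant.SAMPLES` looks like a plain field but triggers a parse.
function makePropertyStyleVariant(format: string, fields: string[], names: string[]) {
  const variant = {} as { SAMPLES: SampleMap }
  let cache: SampleMap | undefined
  Object.defineProperty(variant, 'SAMPLES', {
    enumerable: true,
    get() {
      if (!cache) {
        cache = parseSamples(format, fields, names)
      }
      return cache
    },
  })
  return variant
}

// Style 2: an explicit function. The call site `variant.SAMPLES()` makes it
// visible that work is being done, and the type is an ordinary method type.
function makeFunctionStyleVariant(format: string, fields: string[], names: string[]) {
  let cache: SampleMap | undefined
  return {
    SAMPLES(): SampleMap {
      if (!cache) {
        cache = parseSamples(format, fields, names)
      }
      return cache
    },
  }
}
```

In both styles nothing is parsed until SAMPLES is first used. The function style needs no type assertion and no property descriptor, which is one way to read the "a bit simpler" remark above.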

Footnote: this probably supersedes #94. In #94, I was very committed to preserving the existing Variant class as it was, but this PR changes it. As a result, it's a major version bump.

Fixes #98

@cmdcolin cmdcolin merged commit be2006a into master Nov 30, 2024
1 check passed
@cmdcolin cmdcolin deleted the add_optimized_routine_for_genotypes branch November 30, 2024 23:06
Development

Successfully merging this pull request may close these issues.

Optimized parseGenotypes routine