VCF header error #314
Comments
BTW, as you might infer, the metadata lines in question are put there automatically by GATK.
It was clarified in VCF 4.3. See § 1.4 "Meta-information lines" (2024-10-09):
noodles uses hash maps for structured lines, taking the ID as the key, so it is not possible to represent those lines in a
I'll investigate what other implementations do, but what behavior do you expect when keys are duplicated?
Ah. I was missing something! My Google search for the specification served up v4.2, which lacks that clear definition of structured metadata. The 4.3 spec is pretty clear! Here's a tricky thing though: the VCF in question has
And given the ambiguity of the 4.2 version of the spec, the file is in fact well formed with respect to that specification! It would be nice if there was a way to allow Noodles to parse imperfect input, though I am 100% behind making sure constructed/written VCFs are as conformant as possible. If you were to add a "tolerant" option to the Builder, I think reasonable behaviour would be either to let subsequent lines overwrite earlier ones, or to keep the first one and ignore subsequent ones. Ideally there would also be a hook or method that lets a client application get the alternatives, so it can report problems and possibly apply other logic to resolve them. On further reflection, another "tolerant" approach would be to "demote" the metadata type to "unstructured".
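To make the options above concrete, here is a minimal sketch in plain Rust (standard library only). It is not the noodles API: `DuplicatePolicy`, `Collected`, `key_and_id`, and `collect` are hypothetical names invented for illustration, and the ID extraction is deliberately naive (it assumes `ID` is the first field and contains no commas or quotes). The `duplicates` list plays the role of the "hook" a client could inspect. The "demote to unstructured" idea is not covered.

```rust
use std::collections::HashMap;

/// Hypothetical policy for a second structured line with the same (key, ID).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DuplicatePolicy {
    /// Fail on the first duplicate.
    Strict,
    /// Keep the first line and set later ones aside.
    KeepFirst,
    /// Let later lines overwrite earlier ones.
    KeepLast,
}

/// Extracts (key, ID) from a structured line such as `##KEY=<ID=Foo,...>`.
/// Returns `None` for unstructured lines such as `##fileformat=VCFv4.2`.
/// Simplification: assumes `ID` is the first field and has no commas or quotes.
fn key_and_id(line: &str) -> Option<(&str, &str)> {
    let rest = line.strip_prefix("##")?;
    let (key, value) = rest.split_once('=')?;
    let inner = value.strip_prefix('<')?.strip_suffix('>')?;
    let id = inner.strip_prefix("ID=")?.split(',').next()?;
    Some((key, id))
}

/// The surviving lines plus the lines that lost to a duplicate.
#[derive(Debug, Default)]
struct Collected {
    records: HashMap<(String, String), String>,
    duplicates: Vec<String>,
}

/// Collects structured lines keyed by (key, ID), applying the given policy.
fn collect(lines: &[&str], policy: DuplicatePolicy) -> Result<Collected, String> {
    let mut out = Collected::default();
    for &line in lines {
        // Unstructured lines are skipped in this sketch.
        let Some((key, id)) = key_and_id(line) else {
            continue;
        };
        let map_key = (key.to_string(), id.to_string());
        if out.records.contains_key(&map_key) {
            match policy {
                DuplicatePolicy::Strict => {
                    return Err(format!("duplicate ID `{id}` for `{key}`"));
                }
                DuplicatePolicy::KeepFirst => out.duplicates.push(line.to_string()),
                DuplicatePolicy::KeepLast => {
                    // `insert` returns the value that was replaced.
                    if let Some(old) = out.records.insert(map_key, line.to_string()) {
                        out.duplicates.push(old);
                    }
                }
            }
        } else {
            out.records.insert(map_key, line.to_string());
        }
    }
    Ok(out)
}
```

With two made-up lines sharing `ID=ExampleTool`, `collect(&lines, DuplicatePolicy::KeepFirst)` keeps the first and lists the second in `duplicates`, while `DuplicatePolicy::Strict` returns an error.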
This is what htslib does for all VCF versions. E.g.,
Note it is a silent drop.
This is a fair workaround. noodles 0.87.0 / noodles-vcf 0.70.0 introduces a raw VCF header reader ( In this example, for inputs that are VCF <= 4.2, we follow the same behavior as htslib by dropping other header records that return a duplicate ID error. (I maintain it's a strict error for VCF >= 4.3).
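The version split described here (drop duplicates for VCF <= 4.2, strict error for VCF >= 4.3) can be sketched independently of any library. The following is plain Rust using only the standard library; `file_format_version` and `duplicates_are_errors` are hypothetical helper names, not part of noodles, and the parser only handles the simple `##fileformat=VCFvMAJOR.MINOR` form.

```rust
/// Parses `(major, minor)` from a line such as `##fileformat=VCFv4.3`.
/// Returns `None` if the line is not a file format line in that form.
fn file_format_version(line: &str) -> Option<(u32, u32)> {
    let version = line.trim_end().strip_prefix("##fileformat=VCFv")?;
    let (major, minor) = version.split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// Returns true when a duplicate ID should be a hard error.
/// Tuples compare lexicographically, so (4, 3) and anything later is strict.
fn duplicates_are_errors(version: (u32, u32)) -> bool {
    version >= (4, 3)
}
```

A caller would read the file format line first, then decide whether a duplicate ID error is fatal or whether the offending record is skipped.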
That looks like a good resolution to me.
Thanks again for developing a terrific library!
I am parsing a VCF that contains the following header lines (with the bulk of each removed for clarity):
The VCF reader complains with the following error:
I've just been reading through the VCF specification, and I can't see that it requires the ID values for the predefined structured metadata lines be unique, and it doesn't say anything about the semantics of non-predefined structured metadata lines in this respect.
I can see why it makes sense to complain given the semantics of the predefined metadata types (INFO, FILTER, etc.), because the uniqueness of the ID is a natural consequence of how those metadata types are used, even though it is not explicitly stated. However, I can't see anything that forbids duplicate ID values for non-predefined metadata types.
Have I missed something? (It's quite possible!)
Tom.