Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF header error #314

Open
drtconway opened this issue Dec 18, 2024 · 5 comments
Open

VCF header error #314

drtconway opened this issue Dec 18, 2024 · 5 comments
Assignees
Labels
enhancement New feature or request vcf

Comments

@drtconway
Copy link

drtconway commented Dec 18, 2024

Thanks again for developing a terrific library!

I am parsing a VCF what contains the following header lines (with the bulk of each removed for clarity:

...
##GATKCommandLine=<ID=CombineGVCFs,CommandLine="CombineGVCFs --output variants/24PS01319-RDN0259-00.merge.dragen.fe24071a.family.gvcf.gz ....",Version="4.2.6.1",Date="October 10, 2024 at 9:27:10 PM AEDT">
##GATKCommandLine=<ID=CombineGVCFs,CommandLine="CombineGVCFs --output variants/24PS01319-RDN0259-00.merge.dragen.g.vcf.gz ...",Version="4.2.6.1",Date="October 10, 2024 at 7:08:29 PM AEDT">
...

The VCF reader complains with the following error:

Error: Custom { kind: InvalidData, error: InvalidRecordValue(DuplicateId("CombineGVCFs")) }

I've just been reading through the VCF specification, and I can't see that it requires the ID values for the predefined structured metadata lines be unique, and it doesn't say anything about the semantics of non-predefined structured metadata lines in this respect.

I can see why it makes sense to complain given the semantics of the predefined metadata types (INFO, FILTER, etc), because the uniqueness of the ID is a natural consequence of how those metadata types are used, even though it is not explicitly stated.

However I can't see anything that forbids duplicate ID values for non-predefined metadata types.

Have I missed something? (It's quite possible!)

Tom.

@drtconway
Copy link
Author

BTW, as you might infer, the metadata lines in question are put there automatically by GATK.

@zaeleus zaeleus added the vcf label Dec 18, 2024
@zaeleus zaeleus self-assigned this Dec 18, 2024
@zaeleus
Copy link
Owner

zaeleus commented Dec 18, 2024

I've just been reading through the VCF specification, and I can't see that it requires the ID values for the predefined structured metadata lines be unique

It was clarified in VCF 4.3. See § 1.4 "Meta-information lines" (2024-10-09):

All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines
with the same "##key=" prefix.

noodles uses hash maps for structured lines, taking the ID as the key, so it is not possible to represent those lines in a vcf::Header.

I'll investigate what other implementations do, but what behavior do you expect when keys are duplicated?

@drtconway
Copy link
Author

Ah. I was missing something! My google to put my hand on the specification served me up v4.2 which lacks that clear definition of structured metadata.

The 4.3 spec is pretty clear!

Here's a tricky thing though: the VCF in question has

##fileformat=VCFv4.2

And given the ambiguity of the 4.2 version of the spec, the file is in fact well formed with respect to that specification!

It would be nice if there was a way to allow Noodles to parse imperfect input, though I am 100% behind making sure constructed/written VCFs are as conformant as possible.

I think if you were to add a "tolerant" option to the Builder, I think reasonable behaviour would be to let subsequent lines overwrite earlier ones, or to keep the first one and ignore subsequent ones. Ideally it would be good to have a hook or a method that could allow a client application to get the alternatives so it can report problems, and possibly apply other logic to resolve the problem.

On further reflection, another "tolerant" approach would be to "demote" the metadata type to "unstructured".

@zaeleus zaeleus added the enhancement New feature or request label Dec 18, 2024
@zaeleus
Copy link
Owner

zaeleus commented Dec 20, 2024

I think reasonable behaviour would be [...] to keep the first one and ignore subsequent ones.

This is what htslib does for all VCF versions. E.g.,

$ cat 314.vcf
##fileformat=VCFv4.2
##group=<ID=A,Value="1">
##group=<ID=A,Value="2">
##group=<ID=A,Value="3">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

$ bcftools view --no-version 314.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##group=<ID=A,Value="1">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

$ echo $?
0

Note it is a silent drop.

Ideally it would be good to have a hook or a method that could allow a client application to get the alternatives so it can report problems, and possibly apply other logic to resolve the problem.

This is a fair workaround. noodles 0.87.0 / noodles-vcf 0.70.0 introduces a raw VCF header reader (Reader::header_reader). This allows for more custom parsing and building of the header, either by manipulating the raw line beforehand or handling the parse error afterward.

In this example, for inputs that are VCF <= 4.2, we follow the same behavior as htslib by dropping other header records that return a duplicate ID error. (I maintain it's a strict error for VCF >= 4.3).

vcf_314.rs
use std::io::{self, BufRead};

use noodles_vcf::{self as vcf, header::FileFormat};

const DATA: &[u8] = br#"##fileformat=VCFv4.2
##group=<ID=A,Value="1">
##group=<ID=A,Value="2">
##group=<ID=A,Value="3">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
sq0	1	.	A	.	.	PASS	.
"#;

fn main() -> io::Result<()> {
    let mut reader = vcf::io::Reader::new(DATA);

    let mut header_reader = reader.header_reader();
    let header = read_header(&mut header_reader)?;

    let stdout = io::stdout().lock();
    let mut writer = vcf::io::Writer::new(stdout);

    writer.write_header(&header)?;

    for result in reader.records() {
        let record = result?;
        writer.write_record(&header, &record)?;
    }

    Ok(())
}

fn read_header<R>(reader: &mut vcf::io::reader::header::Reader<R>) -> io::Result<vcf::Header>
where
    R: BufRead,
{
    use noodles_vcf::header::{
        parser::{Entry, ParseError},
        record::value::collection::AddError,
    };

    const VCF_4_2: FileFormat = FileFormat::new(4, 2);

    let mut parser = vcf::header::Parser::default();
    let mut file_format = FileFormat::default();

    for result in reader.lines() {
        let line = result?;

        match parser.parse_partial(line.as_bytes()) {
            Ok(Entry::FileFormat(ff)) => file_format = ff,
            Ok(_) => {}
            Err(ParseError::InvalidRecordValue(AddError::DuplicateId(_)))
                if file_format <= VCF_4_2 =>
            {
                eprintln!("[WARN] dropping structured header record with duplicate ID: {line}");
            }
            Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
        }
    }

    parser
        .finish()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

@drtconway
Copy link
Author

That looks like a good resolution to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request vcf
Projects
None yet
Development

No branches or pull requests

2 participants