Skip to content

Post‐Imputation BCF processing

Toni Boltz edited this page Sep 4, 2024 · 4 revisions

The output from GLIMPSE2 imputation is a phased BCF file. We filter this BCF to keep only variants with an INFO score of at least 0.8, both to ensure high quality of variants included in downstream analyses and to reduce file sizes. The filtering was performed per chromosome with the following bcftools command:

bcftools filter -e 'INFO/INFO<0.8' neurogap_final_merged_chr{i}.bcf -Oz -o neurogap_final_merged_chr{i}_INFO0.8.vcf.gz

Then for optimal storage we decided to convert the VCF output to BGEN. Conversion from VCF to BGEN and RSID annotation (as well as downstream QC) is performed in Hail with the following python script:

import hail as hl
import pandas as pd
hl.init(spark_conf={'spark.driver.memory': '100g'})

input_path = f"gs://neurogap-bge-imputed-regional/glimpse2/INFO0.8_filtered/neurogap_final_merged_chr*_INFO0.8.vcf.gz"

#import the VCFs as a single MatrixTable
mt = hl.import_vcf(input_path, reference_genome= "GRCh38", force_bgz=True) 
#get variant QC statistics
mt = hl.variant_qc(mt) 

#get database of RSIDs from Hail Google Cloud repository
db = hl.experimental.DB(region='us-central1', cloud='gcp') 

#annotate RSIDs for each variant
mt_rsids = db.annotate_rows_db(mt, 'dbSNP_rsid') 

#remove monomorphic variants
mt_rsids = mt_rsids.filter_rows((mt_rsids.variant_qc.AC[0] > 0) & (mt_rsids.variant_qc.AC[0] < 2 * mt_rsids.count_cols()) )

autosomes = [f'chr{x}' for x in range(1, 23)]

for chrom in autosomes:
    # Filter the MatrixTable to the current chromosome
    mt_chrom = mt_rsids.filter_rows(mt_rsids.locus.contig == chrom)
    
    # Export the filtered MatrixTable as a BGEN file with RSIDs 
    output_path = f'gs://neurogap-bge-imputed-regional/glimpse2/bgen/INFO0.8_filtered_unphased/neurogap_final_merged_INFO0.8_rsIDs_{chrom}'
    hl.export_bgen(mt_chrom, output_path,rsid=(mt_chrom['dbSNP_rsid'].rsid[0]))

These are the final BGEN files that we use as a starting point for the PRS and GWAS analyses. If you are interested in using the unfiltered BCF output from GLIMPSE2, those files can also be made available on the Terra workspace.

Clone this wiki locally