-
Notifications
You must be signed in to change notification settings - Fork 0
Estimate local ancestry using LAMPLD
- A system to run bash script (linux-based, mac os or cygwin on ms windows)
- R and ggplot2 package
- Plink software (http://www.cog-genomics.org/plink/1.9)
- SHAPEIT software for phasing (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download)
- GRCH37 genetic map from HAPMAP project for SHAPEIT (ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz)
- LAMPLD software (http://lamp.icsi.berkeley.edu/lamp/)
- Demo data (https://github.com/asthmacollaboratory/Resources/local_ancestry_LAMPLD)
-- Reference population genotypes (reference data, usually from 1000 Genomes)
-- Sample population genotypes (your data/ sample data)
IMPORTANT: The demo data in this workflow only contains 340 markers and is provided for demonstrating the data formatting steps only. Higher marker density is necessary to run a more accurate estimation on local ancestry. You may also run the demo analysis distributed with the LAMPLD software.
0. Download sample data and start your analysis in the demo_data directory
1. Merge reference and sample data
See step 1 of https://github.com/asthmacollaboratory/Resources/wiki/Estimate-global-ancestry-using-admixutre
2. Remove AT/CG SNPs
Identify AT/CG SNPs to be removed
../bin/remove_ATGC.sh out.v1.merge
Remove the AT/CG SNPs using plink
fpI=out.v1.merge
plink --bfile $fpI --chr 22 --exclude $fpI.removeATCG.txt --make-bed --out $fpI.xATCG.22
3. Extract reference sample for phasing step
plink --bfile $fpI.xATCG.22 --keep demo.refNew.txt --make-bed --out $fpI.xATCG.ref.22
4. Format genetic map for phasing using shapeit. A copy of the output file of this step genetic_map_GRCh37_chr22.mod.txt
is also provided.
for chr in {1..22};
do
cat genetic_map_GRCh37_chr$i.txt | awk '{print "\t""\t"}' > genetic_map_GRCh37_chr$i.mod.txt
done
5. Phasing using SHAPEIT
shapeit -B $fpI.xATCG.ref.22 -M genetic_map_GRCh37_chr22.mod.txt -O $fpI.xATCG.ref.phased.22 -T 10
6. Create LAMPLD input files for ancestral data (one file for each ancestral population)
Rscript ../bin/r_Format_LAMPLD_Ancestrals.R $fpI.xATCG.ref.phased demo.refNew.ceu.txt demo.refNew.yri.txt demo.refNew.nam.txt
7. Make sample genotype data in numeric format using the merged data from step 1 and reference allele from step 6. Use “--keep” flag to format sample genotype data only so ancestral data are not included.
plink --bfile $fpI.xATCG.22 --keep demo.sampleNew.txt --a1-allele $fpI.xATCG.ref.phased.allele.22.txt --recode A --out $fpI.xATCG.sample.22
8. Format the file from step 7 for LAMPLD (sample.hap)
sed '1d' $fpI.xATCG.sample.22.raw > temp1.txt
cut -d' ' -f7- temp1.txt > temp2.txt
sed 's/ //g' temp2.txt > temp3.txt
sed 's/NA/?/g' temp3.txt > $fpI.xATCG.sample.22.hap
rm -rf temp1.txt
rm -rf temp2.txt
rm -rf temp3.txt
9. Extract sample ID using the file from step 7 for formatting LAMPLD output later
cut -d' ' -f1,2 $fpI.xATCG.sample.22.raw | sed '1d' > $fpI.xATCG.sample.s.22.txt
10. Run LAMPLD
nAnc=3
winSize=290
unolanc $nAnc $winSize 15 $fpI.xATCG.ref.phased.pos.22.txt $fpI.xATCG.ref.phased.CEU.22.txt $fpI.xATCG.ref.phased.YRI.22.txt $fpI.xATCG.ref.phased.NAM.22.txt $fpI.xATCG.sample.22.hap $fpI.xATCG.LAMPLD.22
11. Format LAMPLD raw output using perl script available in LAMPLD package
If convertLAMPLDout.pl is not found in your path, replace convertLAMPLDout.pl
with the full path.
In the long format file, each individual is represented by two lines. 0, 1 and 2 represents the ancestral populations used in the LAMPLD command line input. In our example, 0=ceu, 1=yri, 2=nam
perl convertLAMPLDout.pl $fpI.xATCG.LAMPLD.22 $fpI.xATCG.LAMPLD.long.22.txt
12. For 3 ancestral populations, replace the population label by numeric local ancestry
0=ceu, 1=yri, 2=nam
sed 's/0/9/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/1/0/g' temp1.txt > temp2.txt
sed 's/2/0/g' temp2.txt > temp3.txt
sed 's/9/1/g' temp3.txt > $fpI.xATCG.LAMPLD.ceu.22.txt
sed 's/1/0/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/2/1/g' temp1.txt > $fpI.xATCG.LAMPLD.nam.22.txt
sed 's/2/0/g' $fpI.xATCG.LAMPLD.long.22.txt > $fpI.xATCG.LAMPLD.yri.22.txt
rm -rf temp*.txt
13. Format the file from step 12 into dosage file for each ancestry
../bin/format_LAMPLD_dosage.sh $fpI
Credits: Bogdan Pasanius' Lab and Donglei Hu