-
Notifications
You must be signed in to change notification settings - Fork 0
Estimate local ancestry using LAMPLD
- A system to run bash script (linux-based, mac os or cygwin on ms windows)
- R and ggplot2 package
- Plink software (http://www.cog-genomics.org/plink/1.9)
- SHAPEIT software for phasing (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download)
- GRCH37 genetic map from HAPMAP project for SHAPEIT (ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz)
- LAMPLD software (http://lamp.icsi.berkeley.edu/lamp/)
- Demo data (https://github.com/asthmacollaboratory/Resources/local_ancestry_LAMPLD)
-- Reference population genotypes (reference data, usually from 1000 Genomes)
-- Sample population genotypes (your data/ sample data)
IMPORTANT: The demo data in this workflow only contains 340 markers and is provided for demonstrating the data formatting steps only. Higher marker density is necessary to run a more accurate estimation on local ancestry. You may also run the demo analysis distributed with the LAMPLD software.
0. Download sample data and start your analysis in the demo_data directory
1. Merge reference and sample data
See step 1 of https://github.com/asthmacollaboratory/Resources/wiki/Estimate-global-ancestry-using-admixutre
2. Remove AT/CG SNPs
Identify AT/CG SNPs to be removed
../bin/remove_ATGC.sh out.v1.merge
Remove the AT/CG SNPs using plink
fpI=out.v1.merge
plink --bfile $fpI --chr 22 --exclude $fpI.removeATCG.txt --make-bed --out $fpI.xATCG.22
3. Extract reference sample for phasing step
plink --bfile $fpI.xATCG.22 --keep demo.refNew.txt --make-bed --out $fpI.xATCG.ref.22
4. Format genetic map for phasing using shapeit. A copy of the output file of this step genetic_map_GRCh37_chr22.mod.txt
is also provided.
for chr in {1..22};
do
cat genetic_map_GRCh37_chr$i.txt | awk '{print "\t""\t"}' > genetic_map_GRCh37_chr$i.mod.txt
done
5. Phasing using SHAPEIT
shapeit -B $fpI.xATCG.ref.22 -M genetic_map_GRCh37_chr22.mod.txt -O $fpI.xATCG.ref.phased.22 -T 10
6. Create LAMPLD input files for ancestral data (one file for each ancestral population)
Rscript ../bin/r_Format_LAMPLD_Ancestrals.R $fpI.xATCG.ref.phased demo.refNew.ceu.txt demo.refNew.yri.txt demo.refNew.nam.txt
7. Make sample genotype data in numeric format using the merged data from step 1 and reference allele from step 6. Use “--keep” flag to format sample genotype data only so ancestral data are not included.
plink --bfile $fpI.xATCG.22 --keep demo.sampleNew.txt --a1-allele $fpI.xATCG.ref.phased.allele.22.txt --recode A --out $fpI.xATCG.sample.22
8. Format the file from step 7 for LAMPLD (sample.hap)
sed '1d' $fpI.xATCG.sample.22.raw > temp1.txt
cut -d' ' -f7- temp1.txt > temp2.txt
sed 's/ //g' temp2.txt > temp3.txt
sed 's/NA/?/g' temp3.txt > $fpI.xATCG.sample.22.hap
rm -rf temp1.txt
rm -rf temp2.txt
rm -rf temp3.txt
9. Extract sample ID using the file from step 7 for formatting LAMPLD output later
cut -d' ' -f1,2 $fpI.xATCG.sample.22.raw | sed '1d' > $fpI.xATCG.sample.s.22.txt
10. Run LAMPLD
nAnc=3
winSize=290
unolanc $nAnc $winSize 15 $fpI.xATCG.ref.phased.pos.22.txt $fpI.xATCG.ref.phased.CEU.22.txt $fpI.xATCG.ref.phased.YRI.22.txt $fpI.xATCG.ref.phased.NAM.22.txt $fpI.xATCG.sample.22.hap $fpI.xATCG.LAMPLD.22
11. Format LAMPLD raw output using perl script available in LAMPLD package
If convertLAMPLDout.pl is not found in your path, replace convertLAMPLDout.pl
with the full path.
perl convertLAMPLDout.pl $fpI.xATCG.LAMPLD.22 $fpI.xATCG.LAMPLD.long.22.txt
12. For 3 ancestral populations, replace the population label by numeric local ancestry
In the long format file, each individual is represented by two lines. 0, 1 and 2 represents the ancestral populations used in the LAMPLD command line input. In our example, 0=ceu, 1=yri, 2=nam
sed 's/0/9/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/1/0/g' temp1.txt > temp2.txt
sed 's/2/0/g' temp2.txt > temp3.txt
sed 's/9/1/g' temp3.txt > $fpI.xATCG.LAMPLD.ceu.22.txt
sed 's/1/0/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/2/1/g' temp1.txt > $fpI.xATCG.LAMPLD.nam.22.txt
sed 's/2/0/g' $fpI.xATCG.LAMPLD.long.22.txt > $fpI.xATCG.LAMPLD.yri.22.txt
rm -rf temp*.txt
13. Format the file from step 12 into dosage file for each ancestry
../bin/format_LAMPLD_dosage.sh $fpI
Credits: Bogdan Pasanius' Lab and Donglei Hu