This repository has been archived by the owner on Mar 7, 2023. It is now read-only.

Estimate local ancestry using LAMPLD

AM edited this page Aug 9, 2019 · 8 revisions

What you need:


IMPORTANT: The demo data in this workflow contains only 340 markers and is provided solely to demonstrate the data formatting steps. A higher marker density is needed to estimate local ancestry accurately. You may also run the demo analysis distributed with the LAMPLD software.

Steps


0. Download sample data and start your analysis in the demo_data directory


1. Merge reference and sample data
See step 1 of https://github.com/asthmacollaboratory/Resources/wiki/Estimate-global-ancestry-using-admixutre


2. Remove AT/CG SNPs

Identify AT/CG SNPs to be removed
../bin/remove_ATGC.sh out.v1.merge

Remove the AT/CG SNPs using plink
fpI=out.v1.merge
plink --bfile $fpI --chr 22 --exclude $fpI.removeATCG.txt --make-bed --out $fpI.xATCG.22
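
The contents of remove_ATGC.sh are not shown on this page. As a rough sketch of the idea only, strand-ambiguous SNPs (allele pairs A/T or C/G) could be listed from a PLINK .bim file as below. The function name list_atcg_snps and the toy records are hypothetical, and the real script may work differently.

```shell
# Sketch only: print the IDs of strand-ambiguous SNPs from a PLINK .bim file
# read on stdin. Assumes the standard .bim layout:
# chromosome, SNP ID, cM, bp position, allele 1, allele 2.
list_atcg_snps() {
  awk '{
    pair = toupper($5 $6)
    if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC") print $2
  }'
}

# Toy .bim records (hypothetical data, not the demo data set)
printf '%s\n' \
  '22 rs1 0 16050000 A T' \
  '22 rs2 0 16051000 A G' \
  '22 rs3 0 16052000 C G' \
  '22 rs4 0 16053000 T C' \
  '22 rs5 0 16054000 G C' | list_atcg_snps
```

On a real data set the same function could be fed `$fpI.bim`, with the output redirected to the exclusion list passed to plink --exclude.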


3. Extract reference sample for phasing step
plink --bfile $fpI.xATCG.22 --keep demo.refNew.txt --make-bed --out $fpI.xATCG.ref.22


4. Format the genetic map for phasing with SHAPEIT. A copy of the output file from this step, genetic_map_GRCh37_chr22.mod.txt, is also provided.
# Keep the position, rate and genetic-map columns
# (assumes the input columns are chromosome, position, rate, map)
for chr in {1..22};
do
awk '{print $2"\t"$3"\t"$4}' genetic_map_GRCh37_chr$chr.txt > genetic_map_GRCh37_chr$chr.mod.txt
done
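
To illustrate the reformatting on a small input, the sketch below assumes the source map has four columns (chromosome, position in bp, rate in cM/Mb, map position in cM) and keeps the last three, which is the three-column layout SHAPEIT reads. The function name reformat_map and the toy rows are hypothetical.

```shell
# Sketch only: drop the chromosome column from a four-column genetic map
# read on stdin, keeping position, rate and genetic position (tab-separated).
reformat_map() {
  awk '{print $2 "\t" $3 "\t" $4}'
}

# Toy input (hypothetical rows in the assumed four-column layout)
printf 'Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)\nchr22\t16051347\t8.096992\t0.000000\nchr22\t16052618\t8.131520\t0.010291\n' | reformat_map
```

Every output line, including the header, should then have exactly three fields.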


5. Phasing using SHAPEIT
shapeit -B $fpI.xATCG.ref.22 -M genetic_map_GRCh37_chr22.mod.txt -O $fpI.xATCG.ref.phased.22 -T 10


6. Create LAMPLD input files for ancestral data (one file for each ancestral population)
Rscript ../bin/r_Format_LAMPLD_Ancestrals.R $fpI.xATCG.ref.phased demo.refNew.ceu.txt demo.refNew.yri.txt demo.refNew.nam.txt


7. Convert the sample genotype data to numeric format using the merged data from step 1 (after AT/CG removal) and the reference alleles from step 6. Use the --keep flag to format the sample genotype data only, so that ancestral data are not included.
plink --bfile $fpI.xATCG.22 --keep demo.sampleNew.txt --a1-allele $fpI.xATCG.ref.phased.allele.22.txt --recode A --out $fpI.xATCG.sample.22


8. Format the file from step 7 for LAMPLD (sample.hap)
sed '1d' $fpI.xATCG.sample.22.raw > temp1.txt
cut -d' ' -f7- temp1.txt > temp2.txt
sed 's/ //g' temp2.txt > temp3.txt
sed 's/NA/?/g' temp3.txt > $fpI.xATCG.sample.22.hap
rm -rf temp1.txt
rm -rf temp2.txt
rm -rf temp3.txt
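
The same formatting can be written as a single pipeline without temporary files. This is a minimal sketch, assuming a space-delimited .raw file with six leading columns (FID IID PAT MAT SEX PHENOTYPE) as used by the commands above; the function name raw_to_hap and the toy records are hypothetical.

```shell
# Sketch only: turn a PLINK .raw file (read on stdin) into one genotype
# string per individual, with missing genotypes written as "?".
raw_to_hap() {
  sed '1d' |                      # drop the header line
    cut -d' ' -f7- |              # keep only the genotype columns
    sed -e 's/ //g' -e 's/NA/?/g' # join genotypes, then recode missing calls
}

# Toy .raw content (hypothetical data)
printf '%s\n' \
  'FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G rs3_C' \
  'F1 S1 0 0 1 -9 0 1 2' \
  'F2 S2 0 0 2 -9 NA 2 0' | raw_to_hap
```

For the demo files this would be called as `raw_to_hap < $fpI.xATCG.sample.22.raw > $fpI.xATCG.sample.22.hap`.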


9. Extract sample ID using the file from step 7 for formatting LAMPLD output later
cut -d' ' -f1,2 $fpI.xATCG.sample.22.raw | sed '1d' > $fpI.xATCG.sample.s.22.txt


10. Run LAMPLD
nAnc=3
winSize=290
unolanc $nAnc $winSize 15 $fpI.xATCG.ref.phased.pos.22.txt $fpI.xATCG.ref.phased.CEU.22.txt $fpI.xATCG.ref.phased.YRI.22.txt $fpI.xATCG.ref.phased.NAM.22.txt $fpI.xATCG.sample.22.hap $fpI.xATCG.LAMPLD.22
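
Before running LAMPLD it can help to confirm that all input files describe the same number of markers. The sketch below assumes the position file has one marker per line and that each line of the ancestral and sample files is a string with one character per marker; check these assumptions against the LAMPLD documentation. The function name check_marker_counts and the toy files are hypothetical.

```shell
# Sketch only: succeed if every line of each haplotype/genotype file has as
# many characters as the position file has lines.
check_marker_counts() {
  pos_file=$1
  shift
  n_markers=$(wc -l < "$pos_file" | tr -d ' ')
  for f in "$@"; do
    awk -v n="$n_markers" 'BEGIN { bad = 0 } length($0) != n { bad = 1 } END { exit bad }' "$f" || {
      echo "marker count mismatch in $f (expected $n_markers)" >&2
      return 1
    }
  done
  echo "all files have $n_markers markers"
}

# Toy files (hypothetical data)
tmp=$(mktemp -d)
printf '%s\n' 100 200 300 > "$tmp/pos.txt"
printf '%s\n' 010 110 > "$tmp/anc.txt"
printf '%s\n' '02?' 120 > "$tmp/sample.hap"
check_marker_counts "$tmp/pos.txt" "$tmp/anc.txt" "$tmp/sample.hap"
rm -rf "$tmp"
```

With the demo files, the first argument would be the position file and the remaining arguments the three ancestral files and the sample .hap file passed to unolanc.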


11. Format the LAMPLD raw output using the Perl script available in the LAMPLD package

If convertLAMPLDout.pl is not found in your path, replace convertLAMPLDout.pl with the full path.

perl convertLAMPLDout.pl $fpI.xATCG.LAMPLD.22 $fpI.xATCG.LAMPLD.long.22.txt


12. For 3 ancestral populations, recode the population labels as a numeric (0/1) local ancestry indicator for each population

In the long format file, each individual is represented by two lines. The labels 0, 1 and 2 represent the ancestral populations in the order given on the LAMPLD command line. In our example, 0=ceu, 1=yri, 2=nam.
sed 's/0/9/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/1/0/g' temp1.txt > temp2.txt
sed 's/2/0/g' temp2.txt > temp3.txt
sed 's/9/1/g' temp3.txt > $fpI.xATCG.LAMPLD.ceu.22.txt
sed 's/1/0/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/2/1/g' temp1.txt > $fpI.xATCG.LAMPLD.nam.22.txt
sed 's/2/0/g' $fpI.xATCG.LAMPLD.long.22.txt > $fpI.xATCG.LAMPLD.yri.22.txt
rm -rf temp*.txt
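
As an alternative sketch, tr can recode all three labels in one pass, which avoids the temporary files and the placeholder digit 9. Like the sed commands above, it assumes the long format file contains only the characters 0, 1 and 2. The function name recode_ancestry and the toy lines are hypothetical.

```shell
# Sketch only: recode ancestry labels read on stdin.
# $1 holds the three replacement digits for labels 0, 1 and 2, in that order.
recode_ancestry() {
  tr '012' "$1"
}

# Toy long-format lines (hypothetical data); labels are 0=ceu, 1=yri, 2=nam
printf '%s\n' '0012' '2110' | recode_ancestry 100   # ceu indicator
printf '%s\n' '0012' '2110' | recode_ancestry 010   # yri indicator
printf '%s\n' '0012' '2110' | recode_ancestry 001   # nam indicator
```

For example, `recode_ancestry 100 < $fpI.xATCG.LAMPLD.long.22.txt > $fpI.xATCG.LAMPLD.ceu.22.txt` would produce the ceu file.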


13. Format the files from step 12 into a dosage file for each ancestry
../bin/format_LAMPLD_dosage.sh $fpI
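
The contents of format_LAMPLD_dosage.sh are not shown on this page. As an illustration of the general idea only, an ancestry dosage (0, 1 or 2 copies per marker) can be obtained by summing the two 0/1 haplotype lines of each individual. The function name haps_to_dosage, the toy lines and the output layout are assumptions, and the real script may format its output differently.

```shell
# Sketch only: read 0/1 indicator lines on stdin, two consecutive lines per
# individual, and print one line per individual with the per-marker sum.
haps_to_dosage() {
  awk 'NR % 2 == 1 { first = $0; next }
       {
         out = ""
         for (i = 1; i <= length($0); i++)
           out = out (substr(first, i, 1) + substr($0, i, 1))
         print out
       }'
}

# Toy indicator lines for two individuals (hypothetical data)
printf '%s\n' '1100' '0101' '0000' '1111' | haps_to_dosage
```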

Credits: Bogdan Pasaniuc's Lab and Donglei Hu
