Skip to content
This repository has been archived by the owner on Mar 7, 2023. It is now read-only.

Estimate local ancestry using LAMPLD

AM edited this page Aug 9, 2019 · 8 revisions

What you need:


IMPORTANT: The demo data in this workflow only contains 340 markers and is provided for demonstrating the data formatting steps only. Higher marker density is necessary to run a more accurate estimation on local ancestry. You may also run the demo analysis distributed with the LAMPLD software.

Steps


0. Download sample data and start your analysis in the demo_data directory


1. Merge reference and sample data
See step 1 of https://github.com/asthmacollaboratory/Resources/wiki/Estimate-global-ancestry-using-admixutre


2. Remove AT/CG SNPs

Identify AT/CG SNPs to be removed
../bin/remove_ATGC.sh out.v1.merge

Remove the AT/CG SNPs using plink
fpI=out.v1.merge
plink --bfile $fpI --chr 22 --exclude $fpI.removeATCG.txt --make-bed --out $fpI.xATCG.22


3. Extract reference sample for phasing step
plink --bfile $fpI.xATCG.22 --keep demo.refNew.txt --make-bed --out $fpI.xATCG.ref.22


4. Format genetic map for phasing using shapeit. A copy of the output file of this step genetic_map_GRCh37_chr22.mod.txt is also provided.
for chr in {1..22};
do
cat genetic_map_GRCh37_chr$i.txt | awk '{print "\t""\t"}' > genetic_map_GRCh37_chr$i.mod.txt
done


5. Phasing using SHAPEIT
shapeit -B $fpI.xATCG.ref.22 -M genetic_map_GRCh37_chr22.mod.txt -O $fpI.xATCG.ref.phased.22 -T 10


6. Create LAMPLD input files for ancestral data (one file for each ancestral population)
Rscript ../bin/r_Format_LAMPLD_Ancestrals.R $fpI.xATCG.ref.phased demo.refNew.ceu.txt demo.refNew.yri.txt demo.refNew.nam.txt


7. Make sample genotype data in numeric format using the merged data from step 1 and reference allele from step 6. Use “--keep” flag to format sample genotype data only so ancestral data are not included.
plink --bfile $fpI.xATCG.22 --keep demo.sampleNew.txt --a1-allele $fpI.xATCG.ref.phased.allele.22.txt --recode A --out $fpI.xATCG.sample.22


8. Format the file from step 7 for LAMPLD (sample.hap)
sed '1d' $fpI.xATCG.sample.22.raw > temp1.txt
cut -d' ' -f7- temp1.txt > temp2.txt
sed 's/ //g' temp2.txt > temp3.txt
sed 's/NA/?/g' temp3.txt > $fpI.xATCG.sample.22.hap
rm -rf temp1.txt
rm -rf temp2.txt
rm -rf temp3.txt


9. Extract sample ID using the file from step 7 for formatting LAMPLD output later
cut -d' ' -f1,2 $fpI.xATCG.sample.22.raw | sed '1d' > $fpI.xATCG.sample.s.22.txt


10. Run LAMPLD
nAnc=3
winSize=290
unolanc $nAnc $winSize 15 $fpI.xATCG.ref.phased.pos.22.txt $fpI.xATCG.ref.phased.CEU.22.txt $fpI.xATCG.ref.phased.YRI.22.txt $fpI.xATCG.ref.phased.NAM.22.txt $fpI.xATCG.sample.22.hap $fpI.xATCG.LAMPLD.22


11. Format LAMPLD raw output using perl script available in LAMPLD package

If convertLAMPLDout.pl is not found in your path, replace convertLAMPLDout.pl with the full path.

In the long format file, each individual is represented by two lines. 0, 1 and 2 represents the ancestral populations used in the LAMPLD command line input. In our example, 0=ceu, 1=yri, 2=nam
perl convertLAMPLDout.pl $fpI.xATCG.LAMPLD.22 $fpI.xATCG.LAMPLD.long.22.txt


12. For 3 ancestral populations, replace the population label by numeric local ancestry

0=ceu, 1=yri, 2=nam
sed 's/0/9/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/1/0/g' temp1.txt > temp2.txt
sed 's/2/0/g' temp2.txt > temp3.txt
sed 's/9/1/g' temp3.txt > $fpI.xATCG.LAMPLD.ceu.22.txt
sed 's/1/0/g' $fpI.xATCG.LAMPLD.long.22.txt > temp1.txt
sed 's/2/1/g' temp1.txt > $fpI.xATCG.LAMPLD.nam.22.txt
sed 's/2/0/g' $fpI.xATCG.LAMPLD.long.22.txt > $fpI.xATCG.LAMPLD.yri.22.txt
rm -rf temp*.txt


13. Format the file from step 12 into dosage file for each ancestry
../bin/format_LAMPLD_dosage.sh $fpI

Credits: Bogdan Pasanius' Lab and Donglei Hu

Clone this wiki locally