Polyploidization and inheritance mode model estimation using Approximate Bayesian Computation. Developed for studying speciation in the North American gray treefrog complex Hyla chrysoscelis/versicolor
If used please cite the following work: https://academic.oup.com/mbe/article/39/2/msab316/6427635
This repository contains scripts and information on how to run an ABC analysis to estimate the mode of polyploid formation and the pattern of chromosomal inheritance using sequence data. As of now, this analysis only works for tetraploids.
This analysis was modified from methods outlined in Roux and Pannel (2015)[1] and Leroy et al. (2017)[2]. This readme is also modified from examples laid out in their repositories: https://github.com/popgenomics/ABC_WGD, https://github.com/ThibaultLeroyFr/WhiteOaksABC
The following summary statistics are calculated (modified from https://github.com/popgenomics/ABC_WGD)
Statistics | Description |
---|---|
bialsites | number of SNPs in the alignment |
sf AB | number of fixed differences between species A and B / locus length |
sx A | number of exclusively polymorphic positions in species A / locus length |
sx B | number of exclusively polymorphic positions in species B / locus length |
ss AB | number of shared biallelic positions between species A and B/ locus length |
successive ss | maximum number of successive shared biallelic positions for a locus |
pi A | Tajima’s Theta (pi) within species A |
pi B | Tajima’s Theta (pi) within species B |
theta A | Watterson’s Theta within species A |
theta B | Watterson’s Theta within species A |
pearson_r_pi_AB | correlation’s coefficient for pi over orthologs between A and B |
pearson_r_theta_AB | correlation’s coefficient for theta over orthologs between A and B |
Dtaj A | Tajima’s D for species A |
Dtaj B | Tajima’s D for species B |
divAB | raw divergence Dxy measured between A and B |
netdivAB | net divergence Da measured between A and B |
minDivAB | smallest divergence measured between one individual from A and one from B |
maxDivAB | highest divergence measured between one individual from A and one from B |
GminAB | minimum divergence between one sequence from A and one from B (minDivAB divided by the average divAB) |
GmaxAB | maximum divergence between one sequence from A and one from B (maxDivAB divided by the average divAB) |
FST | FST between A and B compute as 1-(pi_A + pi_B) / (2 * pi_AB) |
D3a | absolute value of the D3 statistic with the diploid and two polyploid subgenomes |
pearson_r_divAB_netDivAB | correlation’s coefficient for divAB and netDivAB |
pearson_r_divAB_FST | correlation’s coefficient for divAB and FST |
pearson_r_netDivAB_FST | correlation’s coefficient for newDivAB and FST |
ss_sf | proportion of loci with both shared biallelic posistions and shared fixed differences |
ss_noSf | proportion of loci with shared biallelic posistions but no shared fixed differences |
noSs_sf | proportion of loci with no shared biallelic posistions but with shared fixed differences |
noSs_noSf | proportion of loci with neither shared biallelic posistions or shared fixed differences |
Python
- pypy (be sure to change dependency location in mscalc_wgd_v3.py)
- numpy
R
- seqinr
- ape
- nnet
- abc/abcrf
msnsam[3]: https://github.com/rossibarra/msnsam
This script is used to simulate sequence data for a specific model of polyploidization mode and chromosomal inheritance pattern. On the command line, running this would be as follows:
./run_ABC_polyploid_v3.py allo disomic migA F bpFile_EC_AE.txt 25000 1 40 F
The script takes 9 arguments:
- mode of polyploid formation: allo, auto, polyP
- (allo) allopolyploid
- (auto) autopolyploid from a sister lineage of the sampled diploid
- (polyP) autopolyploid from the sampled diploid lineage
- pattern of chromosomal inheritance: disomic, heterosomic, tetrasomic
- migration pattern: noMig, migA, migAB
- (noMig) no migration between diploids and tetraploids
- (migA) migration between the diploid and the A subgenome of the tetraploid. When the polyploidization mode is allopolyploid, this means migration between the diploid and the tetraploid subgenome inhereted from the sampled diploid
- (migAB) migration between the diploid and both tetraploid subgenomes
- One-way migration from diploid to tetraploid only: T/F
- The bpfile, which contains information on the number of sampled loci, loci length, sampled individuals, and the clock-rate for the chosen loci
- The number of multilocus simulations to run per repeat.
- Beginning repeat number (Typically begins at 1, but on the chance the full number of simulations fail, it might be useful to start at a higher number so as not to start the simulation over)
- End repeat number. Total number of repeats = {8} - {7} - 1. The total number of simulations is {6} * ({8}-{7} + 1). In the above example, the total number of multilocus simulations is 25000 * (40-1+1) = 1,000,000
- Whether or not to keep or delete the joint site frequency spectrum file. This file is large, so if you aren't using it for a seperate analysis I recommend removing it
This script will generate a folder titled "allo_disomic_migA" with 40 folders, each containing the following files:
- ABCstat.txt: summary statistics for each multilocus simulation, one simulation per line
- output.txt: prior values used to run each simulation, one simulation per line
- ABCjsfs: the joint site frequency spectrum, one simulation per line (unless removed)
Within the script, it is also necessary to set the bounds for the prior distributions. You will need to set upper and lower limits for:
- n1: effective population size of the diploid
- n2: effective population size of the tetraploid
- nA: effective population size of the ancestral populations (used for both the ancestral pop. size prior to Tsplit and for the tetraploid prior to Twgd.
- tau: bounds of the possible Tsplit and Twgd times
- TauDist: Use a uniform or exponential distribution for Tau. If exponential, the lower limit for Twgd is the first argument, and the mean of the distribution is the second argument.
- M: scalar for the beta distribution for migration rate
- shape1: prior for the first shape parameter of the beta distribution for the number of migrants
- shape2: prior for the second shape parameter of the beta distribution for the number of migrants
- symModel: is migration symmetric or assymetric (asym/sym)
- MVariation: does migration vary across the genome? (homo/hetero)
The run_ABC_polyploid_v3.py script will pass arguments to this file to generate the values of each parameter for every simulation, this information will then get passed to msnsam to simulate the given model using TBS arguments
After running each simulation, mscalc will calculate the values for each summary statistic to generate the ABCstat.txt files
To generate statistics for the observed data, it is necessary to first convert your sequences into the ms format that can be read by mscalc_wgd_v3.py. This ensures all summary statistics are calculated in the same way as the simulated data and in the same format. Do not use summary statistics calculated by R packages or other software as their way of handling missing data or other aspects of genetic data may not be the same. Furthermore, it is just easier to convert to ms format.
To convert to the ms format, run the script convert_to_ms_from_fasta.R. The script only works for sequences in the fasta format, and it would be easiest to convert your sequence files to fasta. However, it should be easily modifiable if your data are in a different format.
Additionally, you will need a population input file in csv format that contains 3 columns: individual names, population they came from, and whether or not to include that individual (set to 1 to include). See example file "pops_all_MinMax.csv" in the examples folder
This script will automatically generate your bpfile as well as generate the ms file to calculate the observed data along with a table detailing which loci were used.
To generate the observed sumstat file, make a folder in the directory with the bpfile and ms file titled "Obs" (in this example) and run the following code:
cat EC_AE_msFile.txt | ./mscalc_wgd_v3.py Obs bpFile_EC_AE.txt
That will generate the ABCjsfs.txt file for your data and the ABCstat.txt file. I recommend changing the name so you can keep track of these files.
ABC_nnet_FULL_.R will run the final analysis using the ABC neural network approach. It is designed in a way such that all your simulated and observed data for two populations of interest will be in a folder popA_popB (e.g. EC_AE in the example), with the bpfile as bpFile_popA_popB.txt, and your observed data as popA_popB_ObsData.txt. This way you only have to change a single line in the R script to automatically run the analysis. It is set up to run 5 seperate analysis for 5 different models(noMig, migA, migAB, migA_1W, migAB_1W). You will need to modify it if you run different models.
You will need to set the number of replicates closest to the observed data to extract with the weighted Epanechnikov kernel (EpReplicates), the number of models for the individual analysis (nModels), and the number of models for the final analysis (nModelsFinal). The percentage of the replicate simulations closest to the observed values of the summary statistics is then (EpReplicates/nModels * the number of multilocus simulations per model). In the example EpReplicates is 4500, so for 9 models and 1,000,000 multilocus simulations this is 0.5% of the total number of replicates. This percentage is maintained for the final analysis by setting nModelsFinal.
You will also need to set the number of replicates for the neural network analysis as well as the number of folders for each dataset. In the example, there are 20 nnet replicates and 40 folders.
This analysis will generate a text file for all individual analyses and the final analyses with the model probabilities on each line in the order input into the nnet analysis. In the example, the order is: allo_disomic, allo_heterosomic, allo_tetrasomic, auto_disomic, auto_heterosomic, auto_tetrasomic, polyP_disomic, polyP_heterosomic, polyP_tetrasomic. summarize by loading the text file into R and taking the column means (colMeans()).
Finally, it will also automatically generate posterior distributions for the top 3 models in individual text files for each posterior neural network replicate. to view the posterior distribution for each parameter(example Tsplit) run something like:
setwd("/gpfs/research/scratch/wwb15/ABC_WGD/50Loci/Exponential/EC_AE/")
postSum = NULL
nPostReps <- 10
chosenModel <- "allo_disomic_migAB_1W"
for(i in 1:nPostReps){
postSum <- rbind(read.table(paste(c("Posterior_EC_AE",chosenModel,i), collapse = "")))
}
colnames(postSum) <- colnames(read.table(paste(c("Posterior_EC_AE",chosenModel,"_Header"), collapse = ""), header = T))
hist(postSum$Twgd)
To compare to the prior, import the output.txt files in each folder using a for loop for that model and generate similar histograms.
It's important to note that prior values are set with the assumption that the effective population size multiplier is 100,000. That means, for instance, if we set our upper bound of n1 to 5, that means it is actually 500,000. To change this, you will need to edit the n0 value at the top of the priorgen python script as well as in the convert_to_ms_from_fasta.R file. Furthermore, this affects interpretation of Tsplit and Twgd times, which are multiplied by 4 * n0. So, if you set the upper limit of tau to 1, that means it is actually 400,000 generations (since our clock rate is in substitutions/yr it can be approximated as 400,000 ybp).
The bpFile has 5 lines. The first is a comment line which you can change to anything you like. The second and third rows are the sampled number of individuals from A and B respectively. The third and fourth rows are information on theta and rho, which are 4 * n0 * clockRate
I have tried to make this type of analysis as user friendly as possible given the number of different steps that need to be taken and the amount of customization each dataset requires. However, I expect individual analyses will take some tweaking to work properly. If you have any questions, feel free to email me at [email protected] and I will do my best to respond to you in a timely manner.
There are also scripts included for running an ABC random forest analysis, however this was not used for our study apart from generating LDA plots. An example script for this is in the examples folder.
[1] Roux, C. and Pannell, J. R. (2015). Inferring the mode of origin of polyploid species from next-generation sequence data. Molecular Ecology 24(5), 1047–1059.
[2] Leroy, T., Roux, C., Villate, L., Bodénès, C., Romiguier, J., Paiva, J. A. P., Dossat, C., Aury, J.-M., Plomion, C. and Kremer, A. (2017). Extensive recent secondary contacts between four European white oak species. The New Phytologist 214(2), 865–878.
[3] Ross-Ibarra J, Wright SI, Foxe JP, Kawabe A, DeRose-Wilson L, et al. (2008). Patterns of Polymorphism and Demographic History in Natural Populations of Arabidopsis lyrata . PLoS ONE 3(6): e2411