The public-facing repository for exploring two-locus and haplotype-copying properties in models with temporal sampling.
If you are interested in a particular figure from the manuscript, the relevant iPython notebook is listed below:
Main Figures
Supplementary Figures
The data here are intermediate data sources used to generate the CSV files in results. They are typically tables of genetic map coordinates or sample names.
The results directory houses all of the files that are necessary to recreate the plots in both the main text and the supplementary materials. They represent the final output of snakemake rules that perform either simulations or estimate parameters from the data. If you are interested in the raw data used to generate the plots, this is where you want to take a look.
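As a quick way to see what the results directory holds, the sketch below lists every CSV file under a directory together with its header row. It uses only the Python standard library; the function name `csv_headers` is hypothetical and not part of this repository, and the directory you pass in is up to you.

```python
import csv
from pathlib import Path


def csv_headers(directory):
    """Map each CSV file under `directory` (searched recursively) to its header row."""
    headers = {}
    for csv_path in sorted(Path(directory).rglob("*.csv")):
        with csv_path.open(newline="") as handle:
            # An empty file has no header row, so fall back to an empty list.
            headers[str(csv_path)] = next(csv.reader(handle), [])
    return headers
```

Calling `csv_headers("results")` from the repository root should return a dictionary keyed by file path, which makes it easy to find the table behind a given figure before loading it in a notebook.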
The files in the snakefiles directory are not directly used in this setting, but can be used in conjunction with snakemake to rerun the entire analysis and replicate our simulation results fully.
To re-run the full analysis (rather than using the pre-generated results), you can run:
snakemake -s main.smk all_sim_results -j <number of cores>
Note that you will also want to change the tmpdir parameter in the config.yml file so that it points to a location where you can write XXX Gb of data. Be warned that re-running all of the simulation analyses takes ~4-5 hours on a computing cluster with 200 parallel jobs, so it is likely to take longer on a single laptop.
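Before launching a long run, it can help to confirm that the scratch location has enough free space. The following is a minimal sketch using the standard library; `has_free_space` is a hypothetical helper, not part of this repository, and both the path and the required size are placeholders that you need to supply yourself.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    target = Path(path)
    # Walk up to the nearest existing ancestor so the check also works
    # for a scratch directory that has not been created yet.
    while not target.exists():
        target = target.parent
    free_gb = shutil.disk_usage(target).free / 1e9
    return free_gb >= required_gb
```

For example, `has_free_space("/path/to/scratch", required_gb=50)` checks a placeholder directory against a placeholder threshold; substitute the tmpdir value you set in the configuration file and the amount of space you expect to need.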
For our results on real ancient data, we have chosen not to store the data within this repository because it exceeds GitHub's file-size limits. Instead, we provide a fast snakemake rule to download the data from Dropbox and unpack it (~6 GB of data):
snakemake -s main.smk download_data -j <number of cores>
If you are interested in re-creating the results CSV files with the newly downloaded ancient male X-chromosome data:
snakemake -s main.smk infer_jump_rates_real_data_all -j <number of cores>
This recreation of the haplotype-copying inference data will also generally take quite some time (~10 hours on a computing cluster with 200 parallel jobs).
The src directory contains implementations of:
- Coalescent simulations with serial sampling using msprime (including two-locus simulations)
- A Python-based implementation of the Li-Stephens model (using numba)
- Theoretical formulas for the correlation in tree length and tree height across two loci in scenarios with serial sampling (coal_cov.py)
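To illustrate the kind of computation a Li-Stephens haplotype-copying model involves, here is a minimal pure-Python sketch of the scaled forward algorithm. It is not the numba implementation in this repository. The function name, the per-site jump probability `jump_prob`, and the symmetric error probability `err_prob` are assumptions chosen for illustration, and the repository may parameterize the model differently.

```python
import math


def li_stephens_loglik(panel, query, jump_prob, err_prob):
    """Log-likelihood of `query` as a mosaic copy of the haplotypes in `panel`.

    panel: list of equal-length allele sequences (the reference haplotypes).
    query: allele sequence of the same length (the focal haplotype).
    jump_prob: probability, between adjacent sites, of re-drawing the copied
        haplotype uniformly from the panel (assumed constant across sites).
    err_prob: probability that the copied allele is observed incorrectly.
    """
    n_haps = len(panel)
    n_sites = len(query)

    def emission(k, site):
        # Matching alleles are emitted with probability 1 - err_prob.
        return 1.0 - err_prob if panel[k][site] == query[site] else err_prob

    # Initialise with a uniform prior over which haplotype is copied.
    forward = [emission(k, 0) / n_haps for k in range(n_haps)]
    scale = sum(forward)
    loglik = math.log(scale)
    forward = [f / scale for f in forward]

    for site in range(1, n_sites):
        # The rescaled forward vector sums to one, so the total mass that
        # jumps into any given haplotype is jump_prob / n_haps.
        jump_term = jump_prob / n_haps
        forward = [
            emission(k, site) * ((1.0 - jump_prob) * forward[k] + jump_term)
            for k in range(n_haps)
        ]
        # Rescale at every site to avoid numerical underflow on long sequences.
        scale = sum(forward)
        loglik += math.log(scale)
        forward = [f / scale for f in forward]

    return loglik
```

Because the transition structure is "stay, or jump uniformly", each site costs time linear in the number of panel haplotypes rather than quadratic. Maximizing this log-likelihood over `jump_prob` is one simple way to estimate a jump rate under these assumptions.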
- Matthias Steinrücken
- John Novembre
- Novembre, Steinrücken, Berg Labs @ UChicago
- NIH GRTG