This is the open-source software HiCMC. Through sophisticated biological modeling we enable highly efficient compression of Hi-C contact matrices.
For a smooth quick start, we provide a test file that can be downloaded and extracted.
We have tested this software on the Ubuntu operating system with `conda`.
First, clone the repository and enter the directory:
git clone https://github.com/sXperfect/hicmc
cd hicmc
Create a virtual environment using `conda` and install the necessary libraries:
conda create -y -n hicmc python=3.11
conda activate hicmc
conda install -y -c conda-forge cmake gxx_linux-64 gcc_linux-64 zlib curl
Install the Python libraries:
pip install -r requirements.txt
pip install hic2cool cooltools
pip install --pre bitstream
Note: At the time of writing, the `bitstream` library has a bug that is fixed in the pre-release. Future versions of `bitstream` may not require installation with the `--pre` option.
Run the setup script `setup.sh`:
bash setup.sh
Create the data folder and download the domain information, which is based on the insulation score:
mkdir -p data && cd data
wget https://www.tnt.uni-hannover.de/staff/adhisant/hicmc/domain_info.tar.gz
tar xzvf domain_info.tar.gz
Note: The insulation score can be computed using `cooltools`.
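To illustrate what an insulation score measures, here is a minimal, simplified sketch in plain Python. It is not the `cooltools` implementation and does not produce the file format HiCMC expects; the function name `diamond_insulation` and the toy matrix are hypothetical. For each bin boundary, it averages the contacts that cross the boundary within a small window, so domain boundaries show up as minima.

```python
def diamond_insulation(matrix, window):
    """Mean contact count crossing each bin boundary (simplified sketch).

    For boundary i, average the window x window square that links the
    `window` bins before i (rows) to the `window` bins starting at i (cols).
    Positions too close to the matrix edge are left as None.
    """
    n = len(matrix)
    scores = [None] * n
    for i in range(window, n - window + 1):
        total = 0.0
        for row in range(i - window, i):
            for col in range(i, i + window):
                total += matrix[row][col]
        scores[i] = total / (window * window)
    return scores


# Toy 8-bin matrix with two 4-bin domains: dense inside, sparse between.
toy = [[10 if (r < 4) == (c < 4) else 1 for c in range(8)] for r in range(8)]
scores = diamond_insulation(toy, window=2)

# The lowest score marks the boundary between the two toy domains.
boundary = min(
    (i for i, s in enumerate(scores) if s is not None),
    key=lambda i: scores[i],
)
```

For real data, the `cooltools` documentation describes how to compute insulation scores directly from a cooler file.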
Download the `hic` data from GEO:
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525%5FGM12878%5Finsitu%5Fprimary%2Ehic
Convert the `hic` data to `mcool`:
hic2cool convert GSE63525_GM12878_insitu_primary.hic GSE63525_GM12878_insitu_primary.mcool
Note: This step is necessary because HiCMC currently supports only the cooler format as input. Support for other formats could be added by integrating parsers or readers, in particular straw for the hic format.
Go back to the root directory
cd ..
Encode the `mcool` data at 250 kb resolution with HiCMC:
python -m hicmc ENCODE --insulation-file data/GM12828-insitu_primary/250000/insulation.tsv --insulation-window 1000000 --weights-precision 12 --domain-values-precision 18 --distance-table-precision 10 --domain-mask-threshold 45 --balancing KR data/GSE63525_GM12878_insitu_primary.mcool 250000 results/GM12878-insitu_primary-250kb
Note: The value of `--insulation-window` is a multiple of the resolution. In the command above, the window of 1000000 is 4 times the resolution of 250000. In the paper, we report the multiplier rather than the exact window size.
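The relationship between the multiplier and the window size can be sketched as follows. The helper names are hypothetical and not part of HiCMC; the numbers are taken from the example command above.

```python
def insulation_window_bp(resolution_bp, multiplier):
    """Window size in base pairs for a given resolution and multiplier."""
    if resolution_bp <= 0 or multiplier <= 0:
        raise ValueError("resolution and multiplier must be positive")
    return resolution_bp * multiplier


def window_multiplier(window_bp, resolution_bp):
    """Inverse direction: recover the multiplier from a window size."""
    quotient, remainder = divmod(window_bp, resolution_bp)
    if remainder:
        raise ValueError("window size is not a multiple of the resolution")
    return quotient


# Values from the example command: 250 kb resolution, 1 Mb window.
window = insulation_window_bp(250_000, 4)
multiplier = window_multiplier(1_000_000, 250_000)
```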
The open-source HiCMC codec is made available before scientific publication.
This pre-publication software is preliminary and may contain errors. The software is provided in good faith, but without any express or implied warranties. We refer the reader to our license.
The goal of our policy is that early release should enable the progress of science. We kindly ask users to refrain from publishing analyses that were conducted using this software while its development is in progress.
Python 3.8 or higher is required.
It is recommended that you create a virtual environment using conda.
For conda users, the `cmake`, `gcc`, `zlib`, `curl`, and `gxx` libraries are required and can be installed through:
conda install -y -c conda-forge cmake gxx_linux-64 gcc_linux-64 zlib curl
See requirements.txt for the list of required Python libraries.
Our tool accepts `mcool` data as input. For `hic` data, transcoding to `mcool` with the `hic2cool` tool is necessary:
hic2cool convert <hic_file> <mcool_file>
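When converting several files, the command above can be assembled from Python. The following is a minimal sketch; the helper name `hic2cool_command` and the example path are hypothetical, and the output name is simply the input name with an `.mcool` suffix.

```python
from pathlib import Path


def hic2cool_command(hic_file):
    """Build the `hic2cool convert` argument list for one .hic file."""
    hic_path = Path(hic_file)
    mcool_path = hic_path.with_suffix(".mcool")
    return ["hic2cool", "convert", str(hic_path), str(mcool_path)]


# Hypothetical input path, used only for illustration.
cmd = hic2cool_command("data/sample.hic")

# To actually run the conversion (requires hic2cool to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```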
Before encoding with our tool, domain information from a TAD caller (in this case, the insulation score) is required. Please refer to this link on how to generate the domain file.
To run our tool, use the following command in the repository directory:
python -m hicmc <mode>
where `mode` is either `ENCODE` or `DECODE`. Use `--help` to show help.
ENCODE Compress a cooler file with a specific resolution
usage: HiCMC ENCODE [-h] [--check-result] [--insulation-file INSULATION_FILE] [--insulation-window INSULATION_WINDOW] [--weights-precision WEIGHTS_PRECISION] [--domain-mask-statistic {average,sparsity,deviation}] [--domain-mask-threshold DOMAIN_MASK_THRESHOLD] [--domain-values-precision DOMAIN_VALUES_PRECISION] [--distance-table-precision DISTANCE_TABLE_PRECISION]
[--balancing BALANCING]
input_file resolution output_directory
positional arguments:
input_file input file path (.cool or .mcool)
resolution
output_directory
options:
-h, --help show this help message and exit
--check-result Check the decoded contact matrix equals the original matrix
--insulation-file INSULATION_FILE
--insulation-window INSULATION_WINDOW
--weights-precision WEIGHTS_PRECISION
--domain-mask-statistic {average,sparsity,deviation}
--domain-mask-threshold DOMAIN_MASK_THRESHOLD
--domain-values-precision DOMAIN_VALUES_PRECISION
Number of bits used for floating-point compression
--distance-table-precision DISTANCE_TABLE_PRECISION
Number of bits used for floating-point compression
--balancing BALANCING
Select a balancing method, default: KR
DECODE Decompress HiCMC encoded payload
usage: HiCMC DECODE [-h] input output
positional arguments:
input Path to the HiCMC encoded payload
output Output directory
options:
-h, --help show this help message and exit
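For scripted runs, the two modes can be driven from Python. This is a minimal sketch that only assembles the argument lists; the helper `hicmc_command` and all file paths are hypothetical, while the option names are those listed in the help text above.

```python
import sys


def hicmc_command(mode, *positional, **options):
    """Build an argument list for `python -m hicmc <mode> ...`.

    Keyword names are turned into long options (insulation_file becomes
    --insulation-file). A value of True emits a bare flag; False or None
    skips the option.
    """
    if mode not in ("ENCODE", "DECODE"):
        raise ValueError("mode must be ENCODE or DECODE")
    cmd = [sys.executable, "-m", "hicmc", mode]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])
    cmd.extend(str(arg) for arg in positional)
    return cmd


# Hypothetical paths, for illustration only.
encode_cmd = hicmc_command(
    "ENCODE",
    "data/sample.mcool", 250000, "results/sample-250kb",
    insulation_file="data/insulation.tsv",
    insulation_window=1000000,
    balancing="KR",
    check_result=True,
)
decode_cmd = hicmc_command("DECODE", "path/to/payload", "decoded")

# To execute (from the repository directory, inside the conda environment):
#   import subprocess
#   subprocess.run(encode_cmd, check=True)
```

Using `sys.executable` keeps the call inside the currently active environment, which matters when HiCMC is installed in a dedicated conda environment.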
Currently, HiCMC supports only the cooler format as input. Support for other formats could be added by integrating parsers or readers, in particular straw for the hic format.
The data can be downloaded from NCBI.
NCBI Accession Number | Cell line | Filename |
---|---|---|
GSE63525 | CH12 | GSE63525_CH12-LX_combined.hic |
GSE63525 | GM12878 (Insitu-DpnII) | GSE63525_GM12878_insitu_DpnII_combined.hic |
GSE63525 | GM12878 (Primary) | GSE63525_GM12878_insitu_primary.hic |
GSE63525 | GM12878 (Replicate) | GSE63525_GM12878_insitu_replicate.hic |
GSE63525 | HMEC | GSE63525_HMEC_combined.hic |
GSE63525 | HUVEC | GSE63525_HUVEC_combined.hic |
GSE63525 | IMR90 | GSE63525_IMR90_combined.hic |
GSE63525 | K562 | GSE63525_K562_combined.hic |
GSE63525 | KBM7 | GSE63525_KBM7_combined.hic |
GSE63525 | NHEK | GSE63525_NHEK_combined.hic |
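The download URL for any file in the table can be derived from its accession number and filename, following the pattern of the GEO link used in the quick start. A minimal sketch, assuming the files sit in the series' `suppl` directory; the helper name `geo_suppl_url` is hypothetical:

```python
from urllib.parse import quote, unquote

GEO_FTP = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_suppl_url(accession, filename):
    """Build the GEO supplementary-file URL for a series accession.

    GEO groups series into directories named after the accession with its
    last three digits replaced by 'nnn' (GSE63525 -> GSE63nnn).
    """
    series_dir = accession[:-3] + "nnn"
    return f"{GEO_FTP}/{series_dir}/{accession}/suppl/{quote(filename)}"


url = geo_suppl_url("GSE63525", "GSE63525_K562_combined.hic")

# The quick-start link percent-encodes '_' and '.'; decoded, it matches
# the URL produced by this helper for the same file.
quick_start_url = unquote(
    "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/"
    "GSE63525%5FGM12878%5Finsitu%5Fprimary%2Ehic"
)
```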
Yeremia Gunawan Adhisantoso <[email protected]>
Fabian Müntefering <[email protected]>
Jan Voges <[email protected]>