Altai (Allele-specific Transcript Assembly Instrument) is a reference-based allele-specific transcript assembler. It incorporates variants (e.g. SNPs from a vcf file) into transcript assembly and assembles full-length transcript sequences by assigning those variants to their correct phases, i.e. alleles. Its output transcript sequences are therefore expected to be the actual allele-specific sequences rather than subsequences of the reference genome.
Parts of Altai are based on Scallop (license).
Altai depends on two additional libraries, Boost and htslib. If they are not installed on your system, download and install them first.
If Boost is not yet available, download Boost (license) from www.boost.org and uncompress it anywhere; compiling and installing it are not necessary.
If htslib is not installed, download htslib (license), version 1.5 or higher, from www.htslib.org/. Note that htslib relies on zlib, so if zlib is not installed on your system, install it first.
Use the following commands to build htslib:
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no
make
make install
The default installation location of htslib is /usr/local/lib. To install it to a different location, replace the configure line above with the following (adding --prefix=/path/to/your/htslib to the end):
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no --prefix=/path/to/your/htslib
In this case, you also need to export the runtime library path (note the additional lib following the installation path):
export LD_LIBRARY_PATH=/path/to/your/htslib/lib:$LD_LIBRARY_PATH
Use the following to compile Altai:
./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost
make
If some of the dependencies are installed in the default system directory (for example, /usr/lib), then the corresponding --with- option might not be necessary. The executable file altai will appear at src/altai.
The usage of altai is:
./altai -i <input.bam> -j <variants.vcf> [--chr_exclude <comma,separated,chr,without,space>] [-G <genome.fa>] -o <output-prefix> [options]
The --chr_exclude option takes a comma-separated list of chromosome names, without spaces. For example, you may want to exclude at least chrY and chrX from the assembly (--chr_exclude chrY,chrX) for a male sample, and exclude chrX for a female sample.
If you would like to output the transcript sequences in fasta format, -G <genome.fa> is required; otherwise it is optional.
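As an illustration of the usage line above, the following sketch assembles one possible command from the documented options. All file names (sample.sorted.bam, sample.phased.vcf, genome.fa, sample_altai) are hypothetical placeholders, and the command is only executed when the executable and inputs actually exist in the current directory.

```shell
# Hypothetical file names; replace them with your own.
bam=sample.sorted.bam
vcf=sample.phased.vcf
genome=genome.fa
prefix=sample_altai

# Build the command line from the options documented above.
cmd="./altai -i $bam -j $vcf --chr_exclude chrX,chrY -G $genome -o $prefix"
echo "$cmd"

# Run it only if the executable and the input files are present.
if [ -x ./altai ] && [ -f "$bam" ] && [ -f "$vcf" ]; then
  $cmd
fi
```

Because -G is given, this run would be expected to produce a fasta file alongside the gtf output.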
The input.bam is the read alignment file generated by an RNA-seq aligner (for example, TopHat2, STAR, or HISAT2). We recommend using STAR with a personalized vcf. STAR can extract variant information directly from the reads.
# STAR with vcf is recommended. Other aligners are accepted.
STAR --runThreadN 8 \
--outSAMstrandField intronMotif \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--waspOutputMode SAMtag \
--outSAMattributes NH HI AS nM NM MD jM jI XS MC ch vA vG vW \
--genomeDir your_genome_dir \
--outFileNamePrefix your_output_prefix \
--varVCFfile your_sample_specific_vcf \
--readFilesIn your_readfile_R1 your_readfile_R2
Make sure that the bam file is sorted; otherwise run samtools to sort it:
samtools sort input.bam > input.sort.bam
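One way to check whether sorting is needed is to look at the @HD line of the alignment header, which declares SO:coordinate for coordinate-sorted files. The sketch below is only an illustration: the helper name is_coordinate_sorted is hypothetical, and a toy header stands in for the output of samtools view -H input.bam.

```shell
# Succeeds if a SAM header read from stdin declares coordinate sorting.
is_coordinate_sorted() {
  grep '^@HD' | grep -q 'SO:coordinate'
}

# Toy header for illustration; on real data pipe in `samtools view -H input.bam`.
toy_header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000')

if printf '%s\n' "$toy_header" | is_coordinate_sorted; then
  sort_status=sorted
else
  sort_status=unsorted
fi
echo "$sort_status"
```

Note that the header only records what the producing tool declared; a file without an @HD line is reported as unsorted by this check.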
The reconstructed allele-specific transcripts are written in gtf format to output-prefix.gtf. Their sequences are written in fasta format to output-prefix.fa.
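A quick sanity check on the gtf output is to count its features. The sketch below assumes the file follows the usual gtf layout with transcript and exon rows in the third column; this is an assumption about the output, and the toy records (source name, ids, coordinates) are made up for illustration.

```shell
# Toy gtf with one transcript and two exons (hypothetical records).
gtf=$(mktemp)
printf 'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n' > "$gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n' >> "$gtf"
printf 'chr1\ttoy\texon\t300\t500\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n' >> "$gtf"

# Count rows by feature type (column 3); tr strips padding that some wc versions add.
n_transcripts=$(awk -F'\t' '$3 == "transcript"' "$gtf" | wc -l | tr -d ' ')
n_exons=$(awk -F'\t' '$3 == "exon"' "$gtf" | wc -l | tr -d ' ')
echo "transcripts=$n_transcripts exons=$n_exons"
rm -f "$gtf"
```

On real output, replace the temporary file with output-prefix.gtf.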
The variants.vcf is a variant call format file generated by a variant caller from DNA or RNA data, or downloaded from a database. Variants should be phased. Following the standard VCF layout, the 9th column (FORMAT) must contain the GT field and the 10th column holds the sample-specific genotype, e.g. 1|0 means that allele 1 carries the alternative allele and allele 2 carries the reference allele.
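Since the variants should be phased, it can help to count how many genotypes use the phased separator (|) before running the assembler. The following is a minimal sketch that assumes the standard VCF layout, with FORMAT in column 9 and the first sample in column 10; the toy records are made up for illustration.

```shell
# Toy vcf with one phased and one unphased record (hypothetical data).
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1|0\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n' >> "$vcf"

# Locate GT within FORMAT, read the matching sample value, and classify it.
phase_summary=$(awk -F'\t' '
  /^#/ { next }
  {
    gt = ""
    n = split($9, fmt, ":"); split($10, val, ":")
    for (i = 1; i <= n; i++) if (fmt[i] == "GT") gt = val[i]
    if (index(gt, "|") > 0) phased++; else unphased++
  }
  END { printf "phased=%d unphased=%d\n", phased, unphased }' "$vcf")
echo "$phase_summary"
rm -f "$vcf"
```

Records without a GT field, or with a haploid genotype, are counted as unphased by this simple check.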
Make sure the vcf file is also sorted (with contigs in the same order as the bam file); otherwise run bcftools to sort it:
bcftools sort input.vcf > output.vcf
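To compare contig order between the two inputs, one option is to list the contigs in order of appearance on each side. The sketch below uses toy text in place of real files; on real data the first pipeline would read samtools view -H input.bam and the second would read the vcf. The strict equality test is a simplification: it reports a mismatch if the vcf simply lacks a contig that the bam header lists.

```shell
# Contig order from the @SQ lines of a SAM header (toy text).
bam_order=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:800\n' |
  awk -F'\t' '/^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' |
  paste -sd, -)

# Contig order of first appearance among the vcf records (toy text).
vcf_order=$(printf '#CHROM\tPOS\nchr1\t100\nchr1\t200\nchr2\t50\n' |
  awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }' |
  paste -sd, -)

if [ "$bam_order" = "$vcf_order" ]; then
  order_status=consistent
else
  order_status=mismatch
fi
echo "$order_status"
```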
The reconstructed allele-specific transcripts are also written in gvf format to output-prefix.gvf.