Pseudoalignment with Kallisto


Kallisto is a software tool from Lior Pachter’s lab at UC Berkeley and is described in this 2016 Nature Biotechnology paper. Kallisto and other tools like it (e.g. Salmon) have revolutionized the analysis of RNA-seq data by using extremely lightweight ‘pseudomapping’ that effectively allows analyses to be carried out on a standard laptop. If you are reluctant to try pseudomapping out of concern that it won’t produce accurate quantifications of transcripts, rest assured it will.

Installing Kallisto on a Mac or Windows OS

You can find detailed instructions for installing Kallisto (as well as other useful pieces of software) here. In general, we recommend using Conda to install Kallisto and its dependencies, but we also describe relatively straightforward alternatives (e.g. using Homebrew to install Kallisto on macOS).

Build an index from reference transcriptome .fasta file

If you are working on the PennVet CHMI Linux cluster, we have prebuilt kallisto indices for mouse, human and several other species located in /data/reference_db/kallisto.

Get reference transcriptome files from here. Search for your organism, select cDNA, then download the file that ends in “cDNA.all.fa.gz”.

Build the index as follows:

kallisto index -i inputFastaName.index inputFastaName.fasta
It’s a good idea to give your index the same name as your input fasta file, replacing only the .fasta or .fa extension with .index. This naming convention makes it clear to others (and your future self) exactly which reference transcriptome was used for mapping.
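The naming convention above can be automated with shell parameter expansion. The following is a minimal sketch: index_name is a hypothetical helper (not part of kallisto), and the fasta file name is only an example that you should replace with your own download.

```shell
# index_name: hypothetical helper that turns a reference fasta name into an
# index name by dropping .gz and .fasta/.fa, then appending .index.
index_name() {
  _base="${1%.gz}"          # drop a trailing .gz, if present
  _base="${_base%.fasta}"   # drop a trailing .fasta, if present
  _base="${_base%.fa}"      # drop a trailing .fa, if present
  printf '%s.index\n' "$_base"
}

# Example file name (an assumption; substitute your own file).
fasta="Homo_sapiens.GRCh38.cdna.all.fa.gz"
index="$(index_name "$fasta")"

# Print the command rather than running it, so it can be checked first.
echo "kallisto index -i $index $fasta"
```

Copy the printed line into your terminal to build the index once the names look right.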

Mapping single-end reads

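Run the following command to pseudoalign single-end reads to the index. With single-end data, kallisto cannot estimate the fragment length distribution from the reads themselves, so you must supply the estimated mean fragment length (-l) and its standard deviation (-s). The values shown below are examples and should be replaced with values appropriate for your library.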

kallisto quant -i inputFastaName.index -o sample1 -t 4 --single -l 275 -s 30 sample1_read1.fastq.gz

Once read mapping is complete, you will see a short report printed to your screen that indicates the number of reads kallisto saw in the fastq file and the number of these that mapped to the reference. It is often useful to store this information in a log file so that it can be parsed by other programs, such as the incredibly useful MultiQC. To do this, append &> sample1.log if you're on macOS, or > sample1.log 2>&1 if you're on Windows, to the end of the kallisto command above (the second form also works in bash and zsh on macOS and Linux). The report will be the same, but instead of being displayed on the screen it will be written to a log file.
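If you log many samples, the redirection can be wrapped in a small function. This is a sketch only: run_logged is a hypothetical helper name, and it uses the portable > file 2>&1 form so that it works in any POSIX shell.

```shell
# run_logged: hypothetical helper that runs a command and writes everything
# it prints (stdout and stderr) to the log file given as the first argument.
# The exit status of the wrapped command is passed back to the caller.
run_logged() {
  _log="$1"
  shift
  "$@" > "$_log" 2>&1
}

# Example usage (requires kallisto and the files from the command above):
# run_logged sample1.log kallisto quant -i inputFastaName.index -o sample1 \
#   -t 4 --single -l 275 -s 30 sample1_read1.fastq.gz
```

Because both output streams are captured, the report ends up in the log whichever stream the program writes it to.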

Avoid putting hyphens in the name of the Kallisto output, as this could cause problems later.
If you have multiple cores on your computer, you can speed up the alignment with the -t argument followed by the number of threads to use. If you’re on a Mac, you can find the number of virtual cores (i.e. threads) by running sysctl hw.ncpu in the terminal.
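The core count can also be looked up inside a script so the same commands work on a Mac and on Linux. The sketch below assumes that at least one of nproc, sysctl, or getconf is available. n_threads is a hypothetical helper name.

```shell
# n_threads: hypothetical helper that prints the number of logical cores.
# Tries nproc (Linux), then sysctl (macOS), then getconf, and falls back to 1.
n_threads() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif sysctl -n hw.ncpu >/dev/null 2>&1; then
    sysctl -n hw.ncpu
  else
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
  fi
}

threads="$(n_threads)"
echo "Detected $threads threads; pass this to kallisto quant with -t $threads"
```

On a shared machine you may prefer to pass a smaller number than the one detected, so that other users are not starved of cores.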

Mapping paired-end reads

kallisto quant -i inputFastaName.index -o sample1 sample1_read1.fastq.gz sample1_read2.fastq.gz
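With many paired-end samples, the command above can be generated for each sample from the name of its first read file. This sketch assumes the files follow the <sample>_read1.fastq.gz and <sample>_read2.fastq.gz naming used above, and that names contain no spaces. paired_cmd is a hypothetical helper, and the sample names in the loop are placeholders.

```shell
# paired_cmd: hypothetical helper that prints the paired-end kallisto command
# for one sample, given the index and the read 1 file name.
paired_cmd() {
  _index="$1"
  _r1="$2"
  _sample="${_r1%_read1.fastq.gz}"      # sample name = read 1 name minus suffix
  _r2="${_sample}_read2.fastq.gz"       # matching read 2 file
  printf 'kallisto quant -i %s -o %s %s %s\n' "$_index" "$_sample" "$_r1" "$_r2"
}

# Print one command per sample so they can be reviewed before running.
for r1 in sample1_read1.fastq.gz sample2_read1.fastq.gz; do
  paired_cmd inputFastaName.index "$r1"
done
```

The commands are only printed. Once they look right, they can be pasted into the terminal or piped to sh.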

Stranded alignments and bigwigs

In some cases, you may want to carry out a stranded alignment with the end goal of viewing read ‘pileups’ on a genome browser track. This can also be done using Kallisto, but requires a few other programs and steps to get from the Kallisto alignment to bigwig.

If you’ve been using your laptop thus far, now would be a good time to consider finding some more compute resources. Working with .bam files may be too much for your laptop to handle.

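Stranded alignment using pseudobam. With this option kallisto writes the pseudoalignments in SAM format to standard output, which can be piped (|) directly to samtools to create a BAM file. Note that newer kallisto releases (0.44.0 onward) instead write a pseudoalignments.bam file into the output directory, in which case the pipe to samtools is not needed.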

kallisto quant -i [yourindex] -o [outputname] --fr-stranded --pseudobam [input1] [input2] | samtools view -Sb - > [kallisto.fr.bam]
kallisto quant -i [yourindex] -o [outputname] --rf-stranded --pseudobam [input1] [input2] | samtools view -Sb - > [kallisto.rf.bam]

Sort and index using samtools

samtools sort -@ 24 -o [kallisto.fr.sorted.bam] [kallisto.fr.bam]
samtools sort -@ 24 -o [kallisto.rf.sorted.bam] [kallisto.rf.bam]
samtools index [kallisto.fr.sorted.bam]
samtools index [kallisto.rf.sorted.bam]

BAM to BigWig conversion using deeptools

bamCoverage -b [kallisto.fr.sorted.bam] -o [kallisto.fr.bw] -p max
bamCoverage -b [kallisto.rf.sorted.bam] -o [kallisto.rf.bw] -p max
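The sort, index, and conversion steps are the same for both strands, so they can be generated from the BAM file name. The following is a sketch under stated assumptions: bam_to_bigwig_cmds is a hypothetical helper, the .sorted.bam and .bw output names are illustrative, and the samtools sort call uses the -o form expected by current samtools releases.

```shell
# bam_to_bigwig_cmds: hypothetical helper that prints the samtools and
# bamCoverage commands needed to go from one unsorted BAM file to a bigWig.
# Arguments: BAM file name, number of samtools threads (defaults to 4).
bam_to_bigwig_cmds() {
  _bam="$1"
  _threads="${2:-4}"
  _stem="${_bam%.bam}"    # file name without the .bam extension
  printf 'samtools sort -@ %s -o %s.sorted.bam %s\n' "$_threads" "$_stem" "$_bam"
  printf 'samtools index %s.sorted.bam\n' "$_stem"
  printf 'bamCoverage -b %s.sorted.bam -o %s.bw -p max\n' "$_stem" "$_stem"
}

# Print the three commands for each strand-specific BAM file.
for bam in kallisto.fr.bam kallisto.rf.bam; do
  bam_to_bigwig_cmds "$bam" 24
done
```

As with the earlier sketches, the commands are printed rather than run, so you can check the file names before starting what may be a long job.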