Aligning reads using STAR

Intro
Build an index from reference genome .fasta file
align single-end reads
Summarize reads to genome features using HTSeq
OPTIONAL: automate for many fastq files

Intro

The

Build an index from reference genome .fasta file

📌

If you are working on the PennVet CHMI linux cluster, we have prebuilt STAR indicies from mouse, human and several other species located in /data/reference_db/star

Get reference genome fastq files from here. Search for your organism, select DNA 'fasta', then download the file that ends in '.primary_assembly.fa'

Build the index using STAR's 'genomeGenerate' function

# in the code below, we use the human genome as an example, and this resides in a reference directory on our linux server
STAR --runMode genomeGenerate \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--limitGenomeGenerateRAM=131160107050 \
--runThreadN 24

📌

It’s a good idea to name your index exactly the same as your input fasta file, just replacing .fasta or .fa, with .index. This naming convention makes it clear to others (and your future self) exactly which reference trancriptome was used for mapping.

align single-end reads

Run the following command for alignment of single-end reads to index.

STAR --runMode alignReads \
--genomeLoad LoadAndKeep \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \
--readFilesIn /path/to/your/fastq/file \
--runThreadN 24 \
--outFileNamePrefix NameYourOutput

‼️

avoid putting hyphens in the name of the STAR output, as this could cause problems later

Summarize reads to genome features using HTSeq

htseq-count --mode union \
--format bam \
--order name \
--idattr ID \
--a 10 \
--stranded=no \
YourBAMfileFromSTAR.bam \
CryptoDB-37_CparvumIowaII.gff > mySample.counts

OPTIONAL: automate for many fastq files

You can easily make a shell script that repeats the commands above for each fastq fill that you want to align and summarize. However, depending on the number of samples you need to align, it may be easier and less error prone if you use a loop

STAR --genomeLoad LoadAndExit \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \

for i in $(ls raw_data | sed s/_[12].fq.gz// | sort -u)
do
    STAR [...]
done

STAR --genomeLoad Remove \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \