- Intro
- Build an index from reference genome .fasta file
- align single-end reads
- Summarize reads to genome features using HTSeq
- OPTIONAL: automate for many fastq files
Intro
The
Build an index from reference genome .fasta file
If you are working on the PennVet CHMI linux cluster, we have prebuilt STAR indicies from mouse, human and several other species located in /data/reference_db/star
Get reference genome fastq files from here. Search for your organism, select DNA 'fasta', then download the file that ends in '.primary_assembly.fa'
Build the index using STAR's 'genomeGenerate' function
# in the code below, we use the human genome as an example, and this resides in a reference directory on our linux server
STAR --runMode genomeGenerate \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--limitGenomeGenerateRAM=131160107050 \
--runThreadN 24
Itβs a good idea to name your index exactly the same as your input fasta file, just replacing .fasta or .fa, with .index. This naming convention makes it clear to others (and your future self) exactly which reference trancriptome was used for mapping.
align single-end reads
Run the following command for alignment of single-end reads to index.
STAR --runMode alignReads \
--genomeLoad LoadAndKeep \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \
--readFilesIn /path/to/your/fastq/file \
--runThreadN 24 \
--outFileNamePrefix NameYourOutput
avoid putting hyphens in the name of the STAR output, as this could cause problems later
Summarize reads to genome features using HTSeq
htseq-count --mode union \
--format bam \
--order name \
--idattr ID \
--a 10 \
--stranded=no \
YourBAMfileFromSTAR.bam \
CryptoDB-37_CparvumIowaII.gff > mySample.counts
OPTIONAL: automate for many fastq files
You can easily make a shell script that repeats the commands above for each fastq fill that you want to align and summarize. However, depending on the number of samples you need to align, it may be easier and less error prone if you use a loop
STAR --genomeLoad LoadAndExit \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \
for i in $(ls raw_data | sed s/_[12].fq.gz// | sort -u)
do
STAR [...]
done
STAR --genomeLoad Remove \
--genomeDir /data/reference_db/star/Homo_sapiens.GRCh38.dna.primary_assembly.index \