This document accompanies the first lab exercise from the DIYtranscriptomics course.
- Helpful command line tips
- Understanding file permissions
- Getting to know Bash parameters
- Getting to know 'for loops'
- Using a loop to automate QC and alignment of multiple fastq files
- Some other useful loops for aligning reads using other software tools
Helpful command line tips
If you’re new to Bash, there are lots of online resources for learning, but here are a few of the commands that will help you move around and carry out basic tasks. Note that some of these commands may only work if run as sudo.
Note: on Windows, you will need to install either Windows Subsystem for Linux (WSL) or Git for Windows. I recommend WSL, but if you choose Git for Windows, you’ll need to add C:\Program Files\Git\usr\bin to your system PATH environment variable for all of the common bash commands to work.
| typing this (if you're on a Mac) | or this (if you run Windows) | does this |
|---|---|---|
| | | takes you to the root directory |
| | no direct shortcut | takes you to your home directory |
| | | takes you up one level in your file directory |
| | | takes you up two levels in your file directory |
| | | takes you to some folder on your computer |
| | | unzips a .tar file |
| | | compresses a file to be filename.gz |
| | | lists all files and folders in your working directory with info on permissions. The -a option shows hidden files |
| | | counts files in a directory |
| | | moves a file or folder to a new location. Important: if the new location doesn't exist, then mv renames your file to the destination name |
| | no direct shortcut | lists all files and folders in your working directory sorted by size |
| | | simpler version of the command above. Lists all files in a folder and shows their file size |
| | | shows free/used disk space by drive |
| | | lists all files and folders in your working directory as a tree structure |
| | no direct shortcut | lists drives and their size (as well as used/free space on each) |
| pressing up arrow | pressing up arrow | recalls previous command |
| | | edits permissions on a file. See graphic below for the appropriate numbers to use in place of ### |
| | | makes you the owner of a file |
| | | assigns you as the group for the file |
| | | removes a folder and all of its contents |
| | no direct shortcut | downloads a file from the web |
| | | opens up a text file in a text editor directly in your Bash application |
| | no direct shortcut | adds a new piece of software to the system PATH so it is executable from anywhere |
| | | add lines like this to your ~/.bash_profile to create a keyboard shortcut, in this case typing 'something' actually does 'something else' |
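
If you'd like to practice a few of these tasks safely, here is a minimal sketch that works inside a throwaway folder. The commands are standard ones that match several of the descriptions above, but they are my choice for illustration and may not be the exact commands from the table. The file and folder names (notes.txt, reads.txt, results) are made up.

```bash
# Work in a throwaway directory so nothing important is touched.
SCRATCH=$(mktemp -d)
cd "$SCRATCH"

# Create two empty files and a subfolder to practice on.
touch notes.txt reads.txt
mkdir results

# Count the entries in the current directory.
COUNT=$(ls | wc -l)
echo "entries in $SCRATCH: $COUNT"

# mv into an existing folder moves the file...
mv notes.txt results/
# ...but mv to a name that does not exist yet renames the file instead.
mv reads.txt renamed.txt

# Compress a file; gzip replaces it with renamed.txt.gz.
gzip renamed.txt

# List everything, including hidden files, with permissions.
ls -la

# Go up one level in the directory tree.
cd ..
```

Because everything happens inside the folder created by mktemp, you can delete that folder afterwards without affecting your real data.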
Understanding file permissions
If you're trying to do something to a file or folder in Bash (e.g., delete, move, edit, create) and are unable to, chances are you need to modify the permissions (chmod), the owner (chown), or the group (chgrp). See below and read more about permissions here.
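
As a minimal sketch of how the three-digit numbers passed to chmod work: each digit is the sum of read (4), write (2), and execute (1), and the three digits apply to the owner, the group, and everyone else, in that order. The script name hello.sh and the scratch folder are hypothetical.

```bash
# Make a small script in a throwaway directory.
PERM_DIR=$(mktemp -d)
SCRIPT="$PERM_DIR/hello.sh"
printf '#!/bin/bash\necho "hello"\n' > "$SCRIPT"

# 644 = owner read+write (4+2), group read (4), others read (4).
chmod 644 "$SCRIPT"
ls -l "$SCRIPT"

# 755 = owner read+write+execute (4+2+1), group and others read+execute (4+1).
chmod 755 "$SCRIPT"
ls -l "$SCRIPT"

# Keep just the permission string from the long listing.
MODE=$(ls -l "$SCRIPT" | cut -c1-10)
echo "permissions are now: $MODE"
```

The first column of the ls -l output shows the change: the x characters appear only after the second chmod.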
Getting to know Bash parameters
Create your first parameter.
DNA="ATCG"
Use the echo program to see what you created.
echo $DNA
echo $DNA are my favorite bases
Try closing and relaunching your Bash application window. Notice that your 'DNA' variable no longer exists, because it is a 'shell variable', which means it is contained exclusively within the shell in which it was set or defined. You can read more about variable types here.
Before moving on, go ahead and re-create your DNA variable (variable names in Bash are case-sensitive, so type it in capitals).
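
You can see the same 'shell variable' behavior without closing your window. In this sketch, a second Bash process is started with bash -c; it cannot see the variable until we mark it with export, which is a standard Bash builtin that I'm adding here for illustration.

```bash
# Start from a plain (unexported) shell variable.
unset DNA
DNA="ATCG"

# A child shell does not inherit plain shell variables.
CHILD_BEFORE=$(bash -c 'echo "seen by child: [$DNA]"')

# export turns it into an environment variable that child processes inherit.
export DNA
CHILD_AFTER=$(bash -c 'echo "seen by child: [$DNA]"')

echo "$CHILD_BEFORE"
echo "$CHILD_AFTER"
```

Even an exported variable disappears when you close the window, unless you define it in a startup file such as ~/.bash_profile.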
We forgot about poor uracil...what kind of RNA-seq class is this?! Let's add it to our variable now.
echo $DNAU #note that this didn't work!
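That printed an empty line because Bash looked for a variable named DNAU, which doesn't exist. We can fix this by taking advantage of parameter expansion using curly brackets {}, which mark exactly where the variable name ends.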
echo ${DNA}U
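
Curly brackets do more than separate a variable name from the text that follows. Here is a short sketch of two other standard Bash expansions; the variable name RNA_BASES is just an example.

```bash
DNA="ATCG"

# Braces mark where the variable name ends, so U is appended as plain text.
RNA_BASES="${DNA}U"
echo "$RNA_BASES"

# ${#VAR} gives the length of the value.
echo "number of bases: ${#RNA_BASES}"

# ${VAR:offset:length} extracts a substring (offsets start at 0).
echo "first two bases: ${RNA_BASES:0:2}"
```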
Getting to know 'for loops'
For loops allow you to iterate over a list of items (in our case, files in a folder) and carry out any number of actions on those files. Here's the general format of a for loop.
# don't try running this code
for item in [LIST]
do
[COMMANDS]
done
Now we'll create a simple for loop that actually runs.
for i in TACG CTAG CCTC GAAT
do
echo "$i is a DNA oligonucleotide"
done
Let's apply the concept of parameter expansion to our new loop in order to easily find 'T's and replace them with 'U's. Two things to take note of here. First, we're declaring a new parameter within the loop; it is reassigned on every pass, and once the loop finishes it still holds the last value it was given. Second, we have taken advantage of some handy syntax: in ${i//T/U}, the text after // is what to find (T) and the text after the next / is what to replace it with (U).
for i in TACG CTAG CCTC GAAT
do
FIX=${i//T/U}
echo "$FIX is an RNA oligonucleotide"
done
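
The number of slashes matters. As a quick sketch, a single / replaces only the first match, while // replaces every match. The sequence and variable names below are made up for the example.

```bash
SEQ="TTAGT"

# One slash: only the first T is replaced.
FIRST_ONLY=${SEQ/T/U}

# Two slashes: every T is replaced.
ALL_MATCHES=${SEQ//T/U}

echo "original:           $SEQ"
echo "first match only:   $FIRST_ONLY"
echo "all matches:        $ALL_MATCHES"
```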
Using a loop to automate QC and alignment of multiple fastq files
Now we put this all together to make creating a shell script really simple.
for FASTQ in *fastq*
do
OUT=${FASTQ//.fastq.gz/_mapped}
LOG=${FASTQ//.fastq.gz/_mapped.log}
echo "fastqc $FASTQ"
echo "kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o $OUT --single -l 250 -s 30 $FASTQ -t 8 &> $LOG"
echo "done mapping reads for $FASTQ"
done
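
If you want to check how the output and log names are built before pointing the loop at real data, here is a minimal sketch that uses empty placeholder files. The sample names (sampleA, sampleB) and the scratch folder are hypothetical, and nothing is actually aligned.

```bash
# Create empty placeholder files that look like fastq files.
DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR/sampleA.fastq.gz" "$DEMO_DIR/sampleB.fastq.gz"

NAMES=""
for FASTQ in "$DEMO_DIR"/*fastq*
do
  # Strip the folder so only the file name is used to build new names.
  BASE=$(basename "$FASTQ")
  OUT=${BASE//.fastq.gz/_mapped}
  LOG=${BASE//.fastq.gz/_mapped.log}
  echo "$BASE -> output folder: $OUT, log file: $LOG"
  # Collect the output names so we can look at them after the loop.
  NAMES="$NAMES $OUT"
done
```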
What is the purpose of the > operator, echo, and the " " enclosing the fastqc and kallisto commands?
A long-running loop like this can be run inside the screen program. I’ll demonstrate for you.
For loops can be combined with conditional statements (if/then) to loop over only certain files in a directory.
for FASTQ in *fastq*
do
if [[ $FASTQ == *subsample* ]]
then
echo "fastqc $FASTQ"
else
echo "$FASTQ was not processed"
fi
done
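
Here is a self-contained sketch of the same idea that you can run anywhere. It makes two empty placeholder files (the names are hypothetical), then counts how many the condition selects and how many it skips.

```bash
# Two placeholder files: only one has 'subsample' in its name.
COND_DIR=$(mktemp -d)
touch "$COND_DIR/liver_subsample.fastq.gz" "$COND_DIR/liver_full.fastq.gz"

SELECTED=0
SKIPPED=0
for FASTQ in "$COND_DIR"/*fastq*
do
  # [[ ... == pattern ]] does wildcard matching when the pattern is unquoted.
  if [[ $FASTQ == *subsample* ]]
  then
    echo "would run: fastqc $FASTQ"
    SELECTED=$((SELECTED + 1))
  else
    echo "$FASTQ was not processed"
    SKIPPED=$((SKIPPED + 1))
  fi
done

echo "selected: $SELECTED, skipped: $SKIPPED"
```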
Some other useful loops for aligning reads using other software tools
Bowtie2
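# loop over read 1 files only; the matching read 2 name is derived from each one
for READ1 in *_1.fastq.gz
do
READ2=${READ1//_1.fastq.gz/_2.fastq.gz}
SAM=${READ1//_1.fastq.gz/.sam}
BAM1=${READ1//_1.fastq.gz/_mapped_and_unmapped.bam}
BAM2=${BAM1//_mapped_and_unmapped.bam/_unmapped.bam}
BAM3=${BAM2//_unmapped.bam/_unmapped_sorted.bam}
FQ1=${BAM3//_unmapped_sorted.bam/_dehosted_R1.fastq}
FQ2=${BAM3//_unmapped_sorted.bam/_dehosted_R2.fastq}
# align read pairs to the human reference
bowtie2 -x /venice/reference_db/Homo_sapiens/ensembl_release108/human -1 "$READ1" -2 "$READ2" -S "$SAM" -p 24
# convert SAM to BAM
samtools view -@ 24 -bS "$SAM" > "$BAM1"
# keep pairs where both reads are unmapped (-f 12), dropping secondary alignments (-F 256)
samtools view -@ 24 -b -f 12 -F 256 "$BAM1" > "$BAM2"
# sort by read name so that mates sit next to each other
samtools sort -@ 24 -n "$BAM2" -o "$BAM3"
# write the unmapped (dehosted) read pairs back out as fastq
bedtools bamtofastq -i "$BAM3" -fq "$FQ1" -fq2 "$FQ2"
done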
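
Before launching a long paired-end run, it can help to confirm that every read 1 file has a matching read 2 file. This is a minimal sketch of one way to do that, assuming the _1.fastq.gz / _2.fastq.gz naming used above. The sample names s1 and s2 are placeholders, and no alignment is run.

```bash
# Placeholder files: s1 has both mates, s2 is missing its read 2 file.
PAIR_DIR=$(mktemp -d)
touch "$PAIR_DIR/s1_1.fastq.gz" "$PAIR_DIR/s1_2.fastq.gz" "$PAIR_DIR/s2_1.fastq.gz"

PAIRED=""
UNPAIRED=""
for READ1 in "$PAIR_DIR"/*_1.fastq.gz
do
  # Derive the expected mate name and a short sample name.
  READ2=${READ1//_1.fastq.gz/_2.fastq.gz}
  SAMPLE=$(basename "$READ1" _1.fastq.gz)
  # -f is true only if the file exists and is a regular file.
  if [[ -f $READ2 ]]
  then
    echo "$SAMPLE: both mates found"
    PAIRED="$PAIRED $SAMPLE"
  else
    echo "$SAMPLE: missing $(basename "$READ2")"
    UNPAIRED="$UNPAIRED $SAMPLE"
  fi
done
```

A check like this costs almost nothing and avoids finding out about a missing file hours into an alignment.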