Bash basics

This document accompanies the lab exercise from the DIYtranscriptomics course.

Helpful command line tips
Understanding file permissions
Getting to know Bash parameters
Getting to know 'for loops'
Using a loop to automate QC and alignment of multiple fastq files
Some other useful loops for aligning reads using other software tools

Helpful command line tips

If you’re new to Bash, there are lots of online resources for learning, but here are a few of the commands that will help you move around and carry out basic tasks. Note that some of these commands may only work when run with sudo.

Note: for Windows OS, you must install Windows Subsystem for Linux (WSL).

typing this (if you're on a Mac)	or this (if you run Windows)	does this
`cd /`	`cd /`	takes you to the root directory
`cd ~`	no direct shortcut	takes you to your home directory
`cd ..`	`cd ..`	takes you up one level in your file directory
`cd ../../`	`cd ../../`	takes you up two levels in your file directory
`cd path/to/some/folder`	`cd path/to/some/folder`	takes you to some folder on your computer
`tar -xvzf [fileName.tar.gz]`	`tar -xvzf [fileName.tar.gz]`	extracts (decompresses and unpacks) a .tar.gz archive
`gzip [filename]`	`gzip [filename]`	Compresses a file to be filename.gz
`ls` and `ls -l` and `ls -a`	`ls` and `ls -l` and `ls -a` or `dir`	lists all files and folders in your working directory. The -l option adds info on permissions; the -a option also shows hidden files
`ls -l | wc -l`	`ls -l | wc -l`	counts files in a directory (`ls -l` usually prints an extra 'total' line, so the count can be one too high)
`mv [fileName or folderName] [directory]`	`mv [fileName or folderName] [directory]`	moves a file or folder to a new location. Important: if the new location doesn't exist, then mv renames your file or folder to the destination name
`du -a -h | sort -hr`	no direct shortcut	lists all files and folders in your working directory sorted by size
`du -sh *`	`du -sh *`	simpler version of the command above. lists all files in a folder and shows their file size
`df -h`	`df -h`	view free/used disk space by drive
`tree -d`	`tree -d`	lists the folders (directories only, not files) in your working directory as a tree structure
`lsblk`	no direct shortcut	lists drives (block devices) and their sizes; use `df -h` to see used/free space
pressing up arrow	pressing up arrow	recalls previous command
`chmod ### [fileName]`	`chmod ### [fileName]`	edits permissions on file. See graphic below for the appropriate numbers to use in place of ###
`chown [yourUserName] [fileName]`	`chown [yourUserName] [fileName]`	makes you the owner of a file
`chgrp [groupName] [fileName]`	`chgrp [groupName] [fileName]`	changes the group assigned to the file
`rm -rf [directoryName]`	`rm -rf [directoryName]`	removes a folder and all of its contents
`wget [URLtoFile]`	no direct shortcut	downloads a file from the web
`nano [file.txt]`	`nano [file.txt]`	opens up a text file in a text editor directly in your Bash application
`export PATH="/path/to/your/software/:$PATH"`	no direct shortcut	add a new piece of software to the system PATH so it is executable from anywhere
`alias something="something else"`	`doskey something=something else`	add lines like this to your ~/.bash_profile to create a keyboard shortcut, in this case typing 'something' actually does 'something else'
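
As a small illustration of the `export PATH` row in the table above, the sketch below puts a made-up tool in a temporary folder and then runs it by name. The folder, the tool name `mytool`, and its output are all hypothetical and exist only for this demo.

```shell
# Make a throwaway folder holding a tiny script (hypothetical tool called 'mytool')
TOOLDIR=$(mktemp -d)
printf '#!/bin/bash\necho "hello from mytool"\n' > "$TOOLDIR/mytool"
chmod 755 "$TOOLDIR/mytool"

# Put that folder at the front of PATH for this shell session only
export PATH="$TOOLDIR:$PATH"

# The tool can now be found and run by name from any directory
FOUND=$(command -v mytool)
RESULT=$(mytool)
echo "$FOUND"
echo "$RESULT"
```

The change to PATH lasts only for the current shell session.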

Understanding file permissions

If you're trying to do something to a file or folder in Bash (e.g., delete, move, edit, create) and are unable to, chances are you need to modify the permissions (chmod), the owner (chown) or the group (chgrp). See below and read more about permissions here
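
Here is a minimal sketch of chmod in action, run on a throwaway file so nothing important is touched. The file name myscript.sh and the choice of 750 are just examples.

```shell
# Work in a temporary directory with an empty, made-up file
WORKDIR=$(mktemp -d)
touch "$WORKDIR/myscript.sh"

# 7 = read+write+execute for the owner, 5 = read+execute for the group,
# 0 = no access for everyone else
chmod 750 "$WORKDIR/myscript.sh"

# The first 10 characters of 'ls -l' show the resulting permissions
PERMS=$(ls -l "$WORKDIR/myscript.sh" | cut -c1-10)
echo "$PERMS"
```

Each of the three digits is the sum of read (4), write (2) and execute (1) for the owner, the group, and everyone else, in that order.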

Getting to know Bash parameters

Create your first parameter.

DNA="ATCG"

Use the echo program to see what you created.

echo $DNA 
echo $DNA are my favorite bases

Try closing and relaunching your Bash application window. Notice that your 'DNA' variable no longer exists, because it is a 'shell variable', which means it is contained exclusively within the shell in which it was set or defined. You can read more about variable types here.
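
You can see the same effect without closing your window. The sketch below starts a second (child) shell with bash -c and asks it to print DNA; the child can't see the variable until we export it. The "not set" fallback text is just a placeholder chosen for this example.

```shell
# Start clean, then create an ordinary shell variable
unset DNA
DNA="ATCG"

# A child shell does not inherit a plain shell variable
CHILD_BEFORE=$(bash -c 'echo "${DNA:-not set}"')
echo "before export, child sees: $CHILD_BEFORE"

# Exporting turns it into an environment variable that child processes inherit
export DNA
CHILD_AFTER=$(bash -c 'echo "${DNA:-not set}"')
echo "after export, child sees: $CHILD_AFTER"
```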

Before moving on, go ahead and re-create your DNA variable.

We forgot about poor uracil...what kind of RNA-seq class is this?! Let's add it to our variable now.

echo $DNAU #note that this didn't work!

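That didn't work because Bash looked for a variable named DNAU, which doesn't exist, so echo printed an empty line. We can fix this by taking advantage of parameter expansion using curly brackets {}, which mark where the variable name ends.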

echo ${DNA}U

Getting to know 'for loops'

For loops allow you to iterate over a list of items (in our case, files in a folder) and carry out any number of actions on those files. Here's the general format of a for loop.

# don't try running this code
for item in [LIST]
do
  [COMMANDS]
done

Now we'll create a simple for loop that actually runs:

for i in TACG CTAG CCTC GAAT
do 
	echo "$i is a DNA oligonucleotide"
done

Let's apply the concept of parameter expansion to our new loop in order to easily find/replace 'T's with 'U's. Two things to take note of here. First, we're declaring a new parameter within the loop; it is reassigned on every pass through the loop (Bash loops don't have their own scope, so it still holds the last value after the loop finishes). Second, we have taken advantage of some handy syntax that lets us find every 'T' (the text after //) and replace it with 'U' (the text after the next /).

for i in TACG CTAG CCTC GAAT
do 
	FIX=${i//T/U}
	echo "$FIX is an RNA oligonucleotide"
done
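
To see what the double slash is doing, here's a quick sketch comparing it with a single slash. The sequence TTAGGT is just a made-up example.

```shell
SEQ="TTAGGT"   # hypothetical example sequence

FIRST_ONLY=${SEQ/T/U}    # single slash: replace only the first T
ALL=${SEQ//T/U}          # double slash: replace every T

echo "$FIRST_ONLY"   # UTAGGT
echo "$ALL"          # UUAGGU
```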

Using a loop to automate QC and alignment of multiple fastq files

Now we put this all together to make creating a shell script really simple. Let’s test this loop out in a folder that contains only our subsampled fastq files from today’s lab and our Kallisto index.

for FASTQ in *fastq*
do
	OUT=${FASTQ//.fastq.gz/_mapped}
	LOG=${FASTQ//.fastq.gz/_mapped.log}
	echo "fastqc $FASTQ"
	echo "kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o $OUT --single -l 250 -s 30 $FASTQ -t 8 &> $LOG"
	echo "done mapping reads for $FASTQ"
done

❓

Now that’s all great, but we don’t really want the output of this loop to print to our terminal screen (a.k.a. STDOUT). Instead, we’d like to save it to a shell script so that we have a record of what was done. Let’s do that now using the > operator.
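
Here's one way this could look, as a sketch rather than the only answer. It uses a shortened version of the loop above and two empty, made-up fastq files in a throwaway folder; the file names and the script name my_commands.sh are assumptions for this example.

```shell
# Work in a throwaway folder with two empty, made-up fastq files
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1
touch sampleA.fastq.gz sampleB.fastq.gz

# Echo each command instead of running it, and send everything
# the loop prints into a file by putting '>' after 'done'
for FASTQ in *fastq*
do
	echo "fastqc $FASTQ"
done > my_commands.sh

# Look at the script that was written
cat my_commands.sh
```

Placing the > after done captures the output of the whole loop in one file, which could later be run with bash my_commands.sh. Pick a script name that doesn't contain 'fastq', or the loop's own pattern would match it.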

❓

How would you incorporate MultiQC into this loop?

❓

What do you think would happen if you removed echo and the " " enclosing the fastqc and kallisto commands?

❓

Since a single loop could kick off a very long-running compute job, you’ll want to know how to use the screen program. I’ll demonstrate for you.

⚠️

If you want to test out this code on real data, use the subsampled fastq files (only 10,000 reads in each file) described in the lab exercise associated with this lesson. Just keep in mind that you’ll need to have the human reference transcriptome index available in your working directory.

For loops can be combined with conditional statements (if/then) to loop over only certain files in a directory:

for FASTQ in *fastq*
do
  if [[ $FASTQ == *subsample* ]]
  then
    echo "fastqc $FASTQ"
  else
    echo "$FASTQ was not processed"
  fi
done
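
Another handy use of a conditional inside a loop is guarding against a folder that has no matching files. By default, when *fastq* matches nothing, Bash passes the literal text '*fastq*' to the loop as if it were a file name. The sketch below assumes default shell settings and an empty throwaway folder; the COUNT variable is just for illustration.

```shell
# Start in an empty throwaway folder, so *fastq* matches nothing
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1

COUNT=0
for FASTQ in *fastq*
do
	# With no matching files, the loop receives the literal text '*fastq*';
	# skip anything that is not an existing file
	[[ -e $FASTQ ]] || continue
	COUNT=$((COUNT + 1))
	echo "fastqc $FASTQ"
done
echo "$COUNT files found"
```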

Some other useful loops for aligning reads using other software tools

Bowtie2

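# loop over the forward-read files only, so each pair is processed once
for READ1 in *_1.fastq.gz
do
    # derive the mate file and all output file names from the forward-read file name
    READ2=${READ1//_1.fastq.gz/_2.fastq.gz}
    SAM=${READ1//_1.fastq.gz/.sam}
    BAM1=${READ1//_1.fastq.gz/_mapped_and_unmapped.bam}
    BAM2=${BAM1//_mapped_and_unmapped.bam/_unmapped.bam}
    BAM3=${BAM2//_unmapped.bam/_unmapped_sorted.bam}
    FQ1=${BAM3//_unmapped_sorted.bam/_dehosted_R1.fastq}
    FQ2=${BAM3//_unmapped_sorted.bam/_dehosted_R2.fastq}
    # align the read pairs to the human reference
    bowtie2 -x /venice/reference_db/Homo_sapiens/ensembl_release108/human -1 "$READ1" -2 "$READ2" -S "$SAM" -p 24
    # convert SAM to BAM
    samtools view -@ 24 -bS "$SAM" > "$BAM1"
    # keep pairs where both the read and its mate are unmapped (-f 12), dropping secondary alignments (-F 256)
    samtools view -@ 24 -b -f 12 -F 256 "$BAM1" > "$BAM2"
    # sort by read name so that mates sit next to each other
    samtools sort -@ 24 -n "$BAM2" -o "$BAM3"
    # write the unmapped pairs back out as paired fastq files
    bedtools bamtofastq -i "$BAM3" -fq "$FQ1" -fq2 "$FQ2"
done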
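
Before launching a long alignment job, it can help to check that the file names are being built the way you expect. The sketch below does this for the first few names, using a made-up input file (sampleA_1.fastq.gz) and assuming your paired files end in _1.fastq.gz and _2.fastq.gz.

```shell
READ1="sampleA_1.fastq.gz"   # hypothetical forward-read file name

# Same find/replace expansions used to name files in the Bowtie2 loop
READ2=${READ1//_1.fastq.gz/_2.fastq.gz}
SAM=${READ1//_1.fastq.gz/.sam}
BAM1=${READ1//_1.fastq.gz/_mapped_and_unmapped.bam}

echo "$READ2"   # sampleA_2.fastq.gz
echo "$SAM"     # sampleA.sam
echo "$BAM1"    # sampleA_mapped_and_unmapped.bam
```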