Alignment-free analysis of raw reads using Sourmash

Before starting

The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your WGS data using minHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large databases of sketches, stored as a Sequence Bloom Tree (SBT) for rapid searching, from RefSeq or Genbank in order to help assign identity to a sample. SBT databases of microbial genomes from all of Genbank and RefSeq are available directly on our server for convenience.

If you want to learn more about how Sourmash works, you could start with the original Sourmash paper, and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here. Also take a look at Adam Phillippy’s original Mash paper.

Prepare sample ‘sketches’

Here, we’ll use sourmash’s compute function to prepare a sketch of each fastq file in our directory. The --scaled option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.

sourmash compute --scaled 1000 *.gz

Compare sketches to each other

The Sourmash compare command computes the jaccard index between two or more sketches generated above.

sourmash compare *.sig -k 31 -o cmp

Visualize sample relatedness

sourmash plot --pdf --labels cmp

Search against all of RefSeq

This Sequence Bloom Tree (SBT) database contains approximately 60,000 microbial genomes (including viral and fungal) from NBCI’s RefSeq, including:

  • 53865 bacterial genomes
  • 5463 viral genomes
  • 475 archael genomes
  • 177 fungal genomes
  • 72 protist genomes
sourmash search *.sig /data/reference_db/sourmash_refs/refseq-d2-k31.sbt.json -n 20 
#shows top 20 hits

Search against all of Genbank

This Sequence Bloom Tree (SBT) database contains approximately 100,000 microbial genomes (including viral and fungal) from NBCI’s GenBank.

sourmash search *.sig /data/reference_db/sourmash_refs/genbank-d2-k31.sbt.json -n 20
#shows top 20 hits

Estimate which genomes are in your sample

Often times there are questions about whether a preparation used for WGS was contaminated with another organism. Perhaps Sanger sequencing of 16S rDNA gave murky results, or restreaking an isolate showed more than one distinct colony morphology. Alternatively, you may know that your sample has multiple genomes present (e.g. metagenomics). In either case, the Sourmash gather function allows you to ask exactly which organisms might be present in a sample based only on the sequence data. Like the search function above, gather needs a reference database to search against.

sourmash gather -k 31 *.sig /data/reference_db/sourmash_refs/refseq-d2-k31.sbt.json #using refseq
sourmash gather -k 31 *.sig /data/reference_db/sourmash_refs/genbank-d2-k31.sbt.json #using genbank

OPTIONAL: check for contamination with fastqc_screen

Often times there are questions about whether there may be reads, other than those from the intended sample source, present in a data file. If you suspect contamination of a particular kind (e.g. other host, plasmid, rRNA, or some common bacterium used in lab), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes. Although this has nothing to do with Sourmash or minHask sketches per se, it can be a useful way to confirm findings from Sourmash, or to check for organisms not well represented in the SBT reference databases provided above.

You can read more about fastq_screen in this recent paper

fastq_screen uses bowtie2 for aligning reads to the references, so we’ve provided a set of reference genomes on our server to which you can easily compare.

We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. This information is pretty clearly outlined in the fastq_screen configuration file found at /usr/local/bin/fastq_screen_v0.12.0/fastq_screen.conf

# make sure you're in the directory where your fastq files are
fastq_screen --threads 24 --outdir fastq_screen *gz  
DO NOT modify the fastq_screen.conf file. If you want to use fastq_screen against a bowtie2 reference genome that is not listed below, please contact us for help
The reference genomes listed below have already been added to the configuration file.
  • Human
  • Mouse (Mus musculus)
  • Dog (Canis familiaris)
  • Cow (Bos taurus)
  • Horse (Equus caballus)
  • Pig (Sus scrofa)
  • Chicken (Gallus gallus)
  • Fruitfly (Drosophila melanogaster)
  • Yeast (Saccharomyces cerevisiae)
  • E. coli (strain K12)
  • Staph (Staphyloccous aureus strain NCTC 8325)
  • Lambda phage (Enterobacteriophage lambda)
  • PhiX
  • Contaminants
  • plasmids/vectors

OPTIONAL: filtering reads with fastq_screen

Depending on the results you get with fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads contaminating a metagenomic sample, for example. To do this, you can use the --tag and --filter options for fastq_screen. See the documentation for fastq_screen to understand how to properly use —-tag and --filter

First, tag each read in each fastq with the genome to which it aligns (from the available references described above)

fastq_screen --tag sampleX.fastq.gz

Next, filter based on tags that were assigned above

fastq_screen --filter 1000 sampleX.fastq.gz