- Before starting
- Prepare sample ‘sketches’
- Compare sketches to each other
- Visualize sample relatedness
- Search against all of RefSeq
- Search against all of Genbank
- Estimate which genomes are in your sample
- OPTIONAL: check for contamination with fastqc_screen
- OPTIONAL: filtering reads with fastq_screen
The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your WGS data using minHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large databases of sketches, stored as a Sequence Bloom Tree (SBT) for rapid searching, from RefSeq or Genbank in order to help assign identity to a sample. SBT databases of microbial genomes from all of Genbank and RefSeq are available directly on our server for convenience.
If you want to learn more about how Sourmash works, you could start with the original Sourmash paper, and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here. Also take a look at Adam Phillippy’s original Mash paper.
Prepare sample ‘sketches’
Here, we’ll use sourmash’s
compute function to prepare a sketch of each fastq file in our directory. The
--scaled option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.
sourmash compute --scaled 1000 *.gz
Compare sketches to each other
compare command computes the jaccard index between two or more sketches generated above.
sourmash compare *.sig -k 31 -o cmp
Visualize sample relatedness
sourmash plot --pdf --labels cmp
Search against all of RefSeq
This Sequence Bloom Tree (SBT) database contains approximately 60,000 microbial genomes (including viral and fungal) from NBCI’s RefSeq, including:
- 53865 bacterial genomes
- 5463 viral genomes
- 475 archael genomes
- 177 fungal genomes
- 72 protist genomes
sourmash search *.sig /data/reference_db/sourmash_refs/refseq-d2-k31.sbt.json -n 20 #shows top 20 hits
Search against all of Genbank
This Sequence Bloom Tree (SBT) database contains approximately 100,000 microbial genomes (including viral and fungal) from NBCI’s GenBank.
sourmash search *.sig /data/reference_db/sourmash_refs/genbank-d2-k31.sbt.json -n 20 #shows top 20 hits
Estimate which genomes are in your sample
Often times there are questions about whether a preparation used for WGS was contaminated with another organism. Perhaps Sanger sequencing of 16S rDNA gave murky results, or restreaking an isolate showed more than one distinct colony morphology. Alternatively, you may know that your sample has multiple genomes present (e.g. metagenomics). In either case, the Sourmash
gather function allows you to ask exactly which organisms might be present in a sample based only on the sequence data. Like the search function above, gather needs a reference database to search against.
sourmash gather -k 31 *.sig /data/reference_db/sourmash_refs/refseq-d2-k31.sbt.json #using refseq sourmash gather -k 31 *.sig /data/reference_db/sourmash_refs/genbank-d2-k31.sbt.json #using genbank
OPTIONAL: check for contamination with fastqc_screen
Often times there are questions about whether there may be reads, other than those from the intended sample source, present in a data file. If you suspect contamination of a particular kind (e.g. other host, plasmid, rRNA, or some common bacterium used in lab), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes. Although this has nothing to do with Sourmash or minHask sketches per se, it can be a useful way to confirm findings from Sourmash, or to check for organisms not well represented in the SBT reference databases provided above.
You can read more about fastq_screen in this recent paper
fastq_screen uses bowtie2 for aligning reads to the references, so we’ve provided a set of reference genomes on our server to which you can easily compare.
We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. This information is pretty clearly outlined in the fastq_screen configuration file found at
# make sure you're in the directory where your fastq files are fastq_screen --threads 24 --outdir fastq_screen *gz
- Mouse (Mus musculus)
- Dog (Canis familiaris)
- Cow (Bos taurus)
- Horse (Equus caballus)
- Pig (Sus scrofa)
- Chicken (Gallus gallus)
- Fruitfly (Drosophila melanogaster)
- Yeast (Saccharomyces cerevisiae)
- E. coli (strain K12)
- Staph (Staphyloccous aureus strain NCTC 8325)
- Lambda phage (Enterobacteriophage lambda)
OPTIONAL: filtering reads with fastq_screen
Depending on the results you get with
fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads contaminating a metagenomic sample, for example. To do this, you can use the
--filter options for fastq_screen. See the documentation for fastq_screen to understand how to properly use
First, tag each read in each fastq with the genome to which it aligns (from the available references described above)
fastq_screen --tag sampleX.fastq.gz
Next, filter based on tags that were assigned above
fastq_screen --filter 1000 sampleX.fastq.gz