- Before starting
- Pre-process raw fastq files
- Run fastqc to check quality of trimmed sequences
- Run MetaPhlAn4
- Merge metaphlan outputs
- Collapse to species
- Make a heatmap
- Run StrainPhlAn to track individual bacterial strains
- run sample2markers
- run strainphlan
- extract markers from the metaphlan2 database.
- run StrainPhlAn
- Remove samples with a high probability of containing multiple strains
Before starting
The MetaPhlAn4 documentation is great and you should definitely familiarize yourself with their documentation, as well as their most updated manuscript. Below is our streamlined protocol.
Pre-process raw fastq files
We use Trimmomatic to filter out low quality reads and trim reads
for f in *.fastq.gz
do
java -jar /usr/local/bin/Trimmomatic/trimmomatic-0.39.jar SE -threads 104 -phred33 $f ${f%.fasta.gz}_trimmed.fastq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done
Run fastqc to check quality of trimmed sequences
fastqc *trimmed.fastq.gz -t 104 -o /trimmed
Run MetaPhlAn4
Important: MetaPhlAn4 is installed along with Humann3 and StrainPhlAn in a single Conda environment on our Linux server called ‘biobakery’. You will need to run conda activate biobakery
in order to use all the code that follows
# this loop will create a single text file output for each fastq file in a directory
for FASTQ in *trimmed.fastq.gz
do
OUT=${FASTQ//trimmed.fastq.gz}
metaphlan $FASTQ --bowtie2out ${OUT}.bowtie2.bz2 --nproc 104 --input_type fastq -o ${OUT}_profile.txt
done
Merge metaphlan outputs
In order to make the analysis of multiple samples more manageable, you will want to merge all of your profile.txt documents into a single file that contains all of the samples in your study. This won't overwrite the individual files, so you can merge different sets of samples if you are subsetting your dataset.
This command will merge all files ending in 'profile.txt' to a single abundance table.
merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt
Collapse to species
The merged abundance file contains relative abundance data for all taxa from Kingdom to strain. If you are only interested in the relative abundance of species, for example, in your sample, you can use grep
to extract only the species data from your merged table by taking only the lines that start with 's__'.
grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt
You can now use this file to perform many analyses, such as identifying differentially-abundance taxa between groups.
Make a heatmap
metaphlan comes with some other useful tools, including generating a heatmap based on the relative abundance of taxa.
# the --ftop argument determines the top n taxa that will be displayed in the heatmap
hclust2.py -i merged_abundance_table_species.txt -o merged_abundance_heatmap_species_1%.txt.png --ftop 32 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 1.5 -l --flabel_size 5 --slabel_size 4 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 1200
Run StrainPhlAn to track individual bacterial strains
Because you ran metaphlan with the —bowtie2out option, you can revisit your marker gene alignments at anytime to pull more information. We’ll take advantage of this to get strain level information with StrainPhlAn.
run sample2markers
Provide the sam output files from the prior step to the sample2markers script will generate a marker file for each sample. The marker files contain the consensus of unique marker genes for each species found in the sample, which will be used for SNP profiling.
for f in *.sam.bz2
do
sample2markers.py --nproc 5 --ifn_samples $f --input_type sam --output_dir /strainphlan
done
run strainphlan
Strainphlan will identify the clades that were detected in all samples, providing the marker files generated in the prior step, to see which clades can be SNP-profiled.
strainphlan.py --nprocs_main 4 --ifn_samples *.markers --output_dir /strainphlan --print_clades_only > clades.txt
extract markers from the metaphlan2 database.
E. coli, for example, was present in high abundance across all samples in the dataset we used.
bowtie2-inspect /usr/local/bin/metaphlan2/db_v20/mpa_v20_m200 > /strainphlan/all_markers.fasta
extract_markers.py --mpa_pkl /usr/local/bin/metaphlan2/db_v20/mpa_v20_m200.pkl --ifn_markers /strainphlan/all_markers.fasta --clade s__Escherichia_coli --ofn_markers s__Escherichia_coli.markers.fasta
run StrainPhlAn
strainphlan.py --ifn_samples *.markers --ifn_markers s__Escherichia_coli.markers.fasta --output_dir /strainphlan --clades s__Escherichia_coli --relaxed_parameters
Remove samples with a high probability of containing multiple strains
You may choose to re-generate the phylogeny only using samples that contain a single strain of your microbe-of-interest
build_tree_single_strain.py --ifn_alignments s__Escherichia_coli.fasta --nprocs 10 --log_ofn build_tree_single_strain.log