MetaPhlAn4: taxomomic analysis of shotgun metagenomic data using clade-specific marker genes

Before starting

The MetaPhlAn4 documentation is great and you should definitely familiarize yourself with their documentation, as well as their most updated manuscript. Below is our streamlined protocol.

Pre-process raw fastq files

We use Trimmomatic to filter out low quality reads and trim reads

for f in *.fastq.gz
do
		java -jar /usr/local/bin/Trimmomatic/trimmomatic-0.39.jar SE -threads 104 -phred33 $f ${f%.fasta.gz}_trimmed.fastq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done

Run fastqc to check quality of trimmed sequences

fastqc *trimmed.fastq.gz -t 104 -o /trimmed

Run MetaPhlAn4

Important: MetaPhlAn4 is installed along with Humann3 and StrainPhlAn in a single Conda environment on our Linux server called ‘biobakery’. You will need to run conda activate biobakery in order to use all the code that follows

# this loop will create a single text file output for each fastq file in a directory
for FASTQ in *trimmed.fastq.gz
do
		OUT=${FASTQ//trimmed.fastq.gz}
		metaphlan $FASTQ --bowtie2out ${OUT}.bowtie2.bz2 --nproc 104 --input_type fastq -o ${OUT}_profile.txt
done

Merge metaphlan outputs

In order to make the analysis of multiple samples more manageable, you will want to merge all of your profile.txt documents into a single file that contains all of the samples in your study. This won't overwrite the individual files, so you can merge different sets of samples if you are subsetting your dataset.

This command will merge all files ending in 'profile.txt' to a single abundance table.

merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt

Collapse to species

The merged abundance file contains relative abundance data for all taxa from Kingdom to strain. If you are only interested in the relative abundance of species, for example, in your sample, you can use grep to extract only the species data from your merged table by taking only the lines that start with 's__'.

grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt

You can now use this file to perform many analyses, such as identifying differentially-abundance taxa between groups.

Make a heatmap

metaphlan comes with some other useful tools, including generating a heatmap based on the relative abundance of taxa.

# the --ftop argument determines the top n taxa that will be displayed in the heatmap
hclust2.py -i merged_abundance_table_species.txt -o merged_abundance_heatmap_species_1%.txt.png --ftop 32 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 1.5 -l --flabel_size 5 --slabel_size 4 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 1200

Run StrainPhlAn to track individual bacterial strains

Because you ran metaphlan with the —bowtie2out option, you can revisit your marker gene alignments at anytime to pull more information. We’ll take advantage of this to get strain level information with StrainPhlAn.

run sample2markers

Provide the sam output files from the prior step to the sample2markers script will generate a marker file for each sample. The marker files contain the consensus of unique marker genes for each species found in the sample, which will be used for SNP profiling.

for f in *.sam.bz2
do
		sample2markers.py --nproc 5 --ifn_samples $f --input_type sam --output_dir /strainphlan
done

run strainphlan

Strainphlan will identify the clades that were detected in all samples, providing the marker files generated in the prior step, to see which clades can be SNP-profiled.

strainphlan.py --nprocs_main 4 --ifn_samples *.markers --output_dir /strainphlan --print_clades_only > clades.txt

extract markers from the metaphlan2 database.

E. coli, for example, was present in high abundance across all samples in the dataset we used.

bowtie2-inspect /usr/local/bin/metaphlan2/db_v20/mpa_v20_m200 > /strainphlan/all_markers.fasta
extract_markers.py --mpa_pkl /usr/local/bin/metaphlan2/db_v20/mpa_v20_m200.pkl --ifn_markers /strainphlan/all_markers.fasta --clade s__Escherichia_coli --ofn_markers s__Escherichia_coli.markers.fasta

run StrainPhlAn


strainphlan.py --ifn_samples *.markers --ifn_markers s__Escherichia_coli.markers.fasta --output_dir /strainphlan --clades s__Escherichia_coli --relaxed_parameters

Remove samples with a high probability of containing multiple strains

You may choose to re-generate the phylogeny only using samples that contain a single strain of your microbe-of-interest

build_tree_single_strain.py --ifn_alignments s__Escherichia_coli.fasta --nprocs 10 --log_ofn build_tree_single_strain.log