Genome Clustering

This interface allows the user to select a set of genomes and displays a tree that groups them by genomic similarity. The tree is constructed from the pairwise distances (see Pairwise Genome Distance and ANI) between the selected genomes using a neighbor-joining algorithm (see Tree Construction).

In addition, the genomes are grouped into “species clusters” according to their pairwise distances (see Clustering Genomes). These clusters are called MicroScope Genome Clusters (MICGC for short). The interface also displays the cluster to which each organism belongs.

Note that genomes for which CheckM detected more than 5% contamination or less than 90% completeness are not assigned to MICGC clusters. Such genomes will however appear in the organism selector and are displayed in black in the tree. You can consult CheckM results in the Genome Overview page.
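The eligibility rule above can be written as a small predicate. This is only an illustrative sketch: the function name and the assumption that both values are given in percent are ours, not part of MicroScope.

```python
def is_eligible_for_micgc(completeness: float, contamination: float) -> bool:
    """Return True when a genome passes both CheckM thresholds.

    Both arguments are assumed to be percentages (hypothetical convention).
    A genome is excluded when contamination is above 5% or completeness
    is below 90%, so values exactly on the thresholds are accepted.
    """
    return completeness >= 90.0 and contamination <= 5.0
```

For example, a genome with 98% completeness and 7% contamination would not be assigned to a MICGC cluster under this rule, although it would still be selectable and drawn in black in the tree.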

../../_images/genoclust-workflow.png

MicroScope Genome Cluster (MICGC) workflow.

Interface Overview

Below is a screenshot of the genome selection interface. It allows the user to select organisms according to the NCBI taxonomy, by strain name or by MICGC cluster. The upper list contains all the organisms available in MicroScope. The lower list contains the currently selected organisms that will be used for the tree. The user can add organisms to the lower list or remove them from it using the green and red arrows.

../../_images/organism-selector-1.png

Both lists are searchable. For each criterion, the user can choose exact or partial matching, and several criteria can be combined to refine the search. In this example, the user searched for MicroScope organisms that belong to the Actinobacteria phylum and whose strain name contains bifi.

../../_images/organism-selector-2.png

Next, clicking “Save and Run” computes the tree. Below is a screenshot of the tree obtained with those organisms. The user can navigate within the tree. Next to each organism, the name of its MICGC cluster is displayed. The user can click on a species cluster to get more information (in this example, the user selected the cluster MICGC13). Contaminated or incomplete genomes (not associated with MICGC clusters) are displayed in black in the tree.

../../_images/genoclust-modif.png

Pairwise Genome Distance and ANI

In order to quickly calculate the pairwise genome distance, we use Mash. Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and a statistical significance test. Mash distance strongly correlates with the Average Nucleotide Identity (ANI). If \(D\) denotes the Mash distance then \(D \simeq 1 - \text{ANI}\).

ANI represents the average nucleotide identity between homologous genomic regions shared by two genomes and offers robust resolution between strains of the same or closely related species (80-100% ANI). It closely reflects the traditional microbiological concept of DNA-DNA hybridization relatedness for defining species (\(94\% \text{ ANI} \simeq 70\% \text{ DNA-DNA hybridization}\)).
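As a rough illustration of the relation between Mash distance and ANI, the sketch below converts a Jaccard index estimated from two MinHash sketches into a Mash distance, then into an approximate ANI. The formula \(D = -\frac{1}{k}\ln\frac{2j}{1+j}\) is taken from the Mash publication (reference 2), not from this page, and the function names are hypothetical.

```python
import math


def mash_distance(jaccard: float, kmer_size: int) -> float:
    """Mash distance from a Jaccard index j and k-mer size k.

    Uses D = -(1/k) * ln(2j / (1 + j)), clamped to the range [0, 1].
    When the sketches share no k-mer (j = 0) the distance is set to 1.
    """
    if jaccard <= 0.0:
        return 1.0
    value = -math.log(2.0 * jaccard / (1.0 + jaccard)) / kmer_size
    return max(0.0, min(1.0, value))


def approx_ani(distance: float) -> float:
    """Approximate ANI (as a fraction) from a Mash distance: ANI ~ 1 - D."""
    return 1.0 - distance
```

With these definitions, a Mash distance of 0.05 corresponds to an approximate ANI of 0.95, i.e. 95%.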

To learn more about Mash, see here.

References:

  1. Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 102, 2567–2572 (2005).
  2. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016).

Tree Construction

A tree is constructed from the Mash distance matrix. This tree is computed dynamically directly in the browser using a rapid neighbour joining algorithm.
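To make the principle concrete, here is a minimal sketch of the classic neighbour joining procedure (Saitou and Nei) in Python. It is not the rapid variant that runs in the browser, and the function name, input layout and Newick formatting are assumptions made for illustration. It expects at least three leaves and a symmetric distance matrix.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree by classic neighbour joining.

    names: list of at least three leaf labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns the tree as a Newick string with a trifurcation at the top.
    """
    nodes = list(names)
    # Distances stored by label so that merged nodes can be added freely.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 3:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Select the pair that minimises the Q criterion.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and to b.
        la = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        lb = d[a][b] - la
        new = f"({a}:{la:.4f},{b}:{lb:.4f})"
        # Distances from the new internal node to the remaining nodes.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Join the last three nodes around a single central node.
    a, b, c = nodes
    la = 0.5 * (d[a][b] + d[a][c] - d[b][c])
    lb = 0.5 * (d[a][b] + d[b][c] - d[a][c])
    lc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
    return f"({a}:{la:.4f},{b}:{lb:.4f},{c}:{lc:.4f});"
```

Because the node labels double as Newick fragments, the function returns the final tree directly as text. Labels containing parentheses, commas or colons would need quoting, which this sketch does not handle.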

This algorithm can assign a negative length to a branch. To avoid that while keeping the total distance between an adjacent pair of terminal nodes unchanged, we set negative branch lengths to zero and transfer the difference to the adjacent branch (see here for more information).
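The correction can be sketched as follows for a pair of sibling branches. The function name is hypothetical, and the sketch assumes that at most one of the two branches is negative.

```python
def fix_negative_branches(length_a: float, length_b: float) -> tuple:
    """Clamp a negative sibling branch to zero.

    The (negative) length is added to the other sibling, so the sum
    length_a + length_b, i.e. the distance between the two terminal
    nodes, is unchanged.
    """
    if length_a < 0:
        return 0.0, length_b + length_a
    if length_b < 0:
        return length_a + length_b, 0.0
    return length_a, length_b
```

For instance, branch lengths of -0.25 and 0.75 become 0.0 and 0.5: the path between the two leaves still measures 0.5.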

Clustering Genomes

Typically, two bacteria belong to the same species when \(\text{ANI} \geq 95\%\) (i.e. \(D \leq 0.05\)).

To construct these species clusters, we build a network in which genomes are nodes and pairwise distances are edges, and we remove the edges whose distance exceeds the threshold. Then we extract communities from that network.

In our tests, a Mash distance threshold of 0.06 (i.e. \(\text{ANI} \geq 94\%\)), with k-mer size = 18 and sketch size = 5000, gave better results when reconstructing Progenome species clusters.

To detect the communities, we use the Louvain community detection algorithm.
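The thresholding step can be illustrated with a short, dependency-free sketch. Note that it groups genomes by connected components of the thresholded network, which is a deliberate simplification and not the Louvain algorithm used by MicroScope. Louvain can split a connected component into several communities, so results may differ. The function name and input layout are assumptions.

```python
def threshold_clusters(names, dist, max_distance=0.06):
    """Group genomes linked by distances at or below max_distance.

    names: list of genome labels.
    dist:  symmetric matrix (list of lists) of Mash distances.
    Returns a list of clusters (sorted lists of labels). Clusters are
    connected components, used here as a stand-in for Louvain communities.
    """
    # Keep only the edges that satisfy the distance threshold.
    neighbours = {name: set() for name in names}
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= max_distance:
                neighbours[a].add(names[j])
                neighbours[names[j]].add(a)
    # Depth-first traversal to collect each connected component.
    clusters, seen = [], set()
    for start in names:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

To apply Louvain itself, the same thresholded edges could be loaded into a graph library that implements it, for example the `louvain_communities` function in recent versions of NetworkX.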

Export

By clicking on the “Export” button:

  • the tree can be exported in SVG or Newick format
  • the distances can be exported in TSV format (as a matrix or as a pairwise list)

Reference:

  1. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).