by Christoph
In the fourth and, for the time being, last episode of this Ecolimania in STC, I take a look around: Do we know the diversity of the species E. coli? And what can we learn from analyses of its pangenome? Reminder: Roberto looked into the past of How E. coli's Rose to Prominence, plus An Addendum, and I looked ahead in Just One More Addendum.
This is not about the diversity of E. coli in all its splendor. I will limit myself to the quantitative aspects of its genomic diversity. AllTheBacteria, the publicly accessible database just published by Hunt et al. (2024) and containing 300,000+ E. coli genomes, triples the number of the previous "661k" collection. The sheer volume of these "stamp collections" should suffice to cover the genetic diversity of E. coli within the limits of the known sampling bias: most genome sequences were obtained from pathogenic and commensal isolates from humans, domestic mammals and birds.
A note in advance: the 'circular plot' images in this post are mainly illustrative and a look at the overall picture is sufficient. Clicking on the images enlarges them so that they can be read better, but the large number of data points in each image cannot really be grasped. All images represent compromises in regard to the resolution (in pixels) that can be achieved in print or the file size for webpages. Furthermore, it is not yet technically feasible to reproduce interactive images on websites – and is impossible in print – that would allow zooming in, which would be crucial for a detailed evaluation of the images.
Do we know the genomic diversity of the species E. coli?
Dunne et al. (2017) sequenced the genome of E. coli NCTC 86 and compared it to the genomes of six other strains (Figure 2.1). E. coli NCTC 86 is the strain originally described by Theodor Escherich as Bacterium coli commune, later renamed Escherichia coli, and deposited in the NCTC in 1919. Its genome is 5.144 Mbp in size, is classified as belonging to phylotype A (see below for "phylotypes"), and is related to that of K‑12 strain MG1655 (4.625 Mbp), but the strain does not sit at or close to the base of a (phylogenetic) E. coli species tree (Figure 2.1b). Hence, it is not "the wild type" researchers could or should refer to.
The circular alignment of the seven E. coli genomes in Figure 2.1 offers an instructive and exemplary overview of their similarity and likewise their diversity. The sequences are to a high degree co‑linear, resulting in a syntenic arrangement of the (colored) orthologous genes like pearls on strings. (In the paper, the authors do not disclose the significance of the two color shades in each of the 6 circular plots; I assume they indicate differences in sequence identity.) Dunne et al. (2017) count 41 regions of divergence (RODs) that coincide in position in all genomes. They also found differences in gain/loss of single genes or a handful of genes that are widely scattered throughout the genomes. Together, these RODs and the numerous single gene gain/loss events contribute to the ~0.5 Mbp differences in genome size between the seven strains compared.
If you look really closely at Figure 2.1, you see that what appear to be glitches in the graphics are actually small localized genomic inversions. Ignored in the figure for clarity but mentioned in the text is a large 2.8 Mbp inversion in the E. coli NCTC 86 genome that is not present in the other six genomes (see here). Intra‑genomic inversions of any size occur comparatively frequently in the genomes of Enterobacteriaceae, usually coinciding with the presence of multiple transposable elements in the genome, and are not restricted to distinct regions. But because later, partially overlapping inversions can overlay earlier ones at any time, it is in most cases impossible to determine when an inversion occurred during speciation of a lineage, and the phylogenetic signal of inversions is fuzzy at best. The large inversion in the genome of E. coli NCTC 86 can only be classified as a late "acquisition" because it is not found in any of the other six genomes. (As mentioned here in STC, the large intra‑genomic inversion in E. coli K-12 strain W3110, which stands out when compared to the genome of K‑12 strain MG1655, is also such a late "acquisition".)
Ghomi et al. (2024) have recently taken a different approach to tackling the genomic diversity of E. coli. The authors compared the essentiality of genes (ORFs) in three E. coli strains with that in strains of closely related species. Of particular interest is their result of a genome alignment that clearly visualizes species boundaries (Figure 2.2). Orthologs are identifiable in the genomes of other Enterobacteriaceae species for the vast majority of E. coli genes (here irrespective of their essentiality). On average, however, the orthologous genes in the three E. coli genomes (the genome of E. coli BW25113 serving as reference) are significantly more similar to each other than the corresponding orthologs in the genomes of ten strains of four other species. This conclusion would not change if the identity classes (100%, >85%, >70%) had been chosen differently and with more gradations (but the result would be less easy to visualize in a figure).
To get the most out of Figure 2.2, you need to know that although Ghomi et al. (2024) call it an "alignment" in the legend, the layout of the graphic is somewhat more complicated. It is a projection of an ortholog from each of the other 12 genomes onto the position of this ortholog in the genome of E. coli BW25113, regardless of the position of this ortholog in its own genome (I wish I could describe it in a less twisted way). This becomes immediately clear when one looks at the co-linearity of the genomes, which, although almost complete for S. Typhimurium LT2 and E. coli BW25113 except for an inversion in the terminus region, is largely in disarray in the case of Citrobacter rodentium, for example (see Figure 2.2b). And then I see nine "gaps" in the circle for the reference strain E. coli BW25113, where the authors apparently took orthologs from the other E. coli strains as the reference (without explaining this).
Taken together, the genomes of the three E. coli strains stand out as a homogeneous group, and the alignment visualizes species boundaries against the background of a high number of orthologous genes (ORFs) in the other Enterobacteriaceae. However, species boundaries cannot simply be quantified by an averaging calculation on this limited set of genome sequences, not least because individual orthologs are by no means equally conserved.
It has been known for more than a century that the species E. coli is phenotypically diverse, and that a simple division into "commensal" and "pathogenic" isolates is insufficient. Extensive attempts to distinguish and differentiate E. coli isolates based on morphological or metabolic criteria did not lead to a fully consistent differentiation/typification scheme; nor did serotyping, which led to the known serotypes O, K and H (think of E. coli O157:H7).
Over time, phylotyping became the method of choice for differentiating E. coli isolates/strains by their genotypes. The method is based on PCR analyses, today also as digital PCR (dPCR), of a few selected genes and was explored in the initial studies by Clermont et al. (2000). The number of E. coli phylotypes has grown from initially six to now eight, and these are sufficient to cover >90% of the genomic diversity of the species E. coli (see Figure 2.1b). The phylotypes contain vastly different numbers of "members", which is primarily due to the (above-mentioned) known sampling bias for E. coli rather than to real differences in their occurrence in natural habitats. And in case you wonder, it is now confirmed by numerous phylogenetic studies that Shigella, isolated by Kiyoshi Shiga in 1898 (see here in STC), belongs to the species E. coli and is therefore listed among the E. coli phylotypes. Only colleagues in medical microbiology want to keep Shigella as a (taxonomic) species because they are not inclined to abandon the terminology for established diagnostics and therapies of shigellosis.
What can we learn from analyses of the pangenome of the species E. coli?
First of all, it must be noted that to date there is no published pangenome analysis on all ~10^5 E. coli genomes in the above-mentioned "661k" collection. All published studies were performed on significantly smaller subsets of genomes. The algorithms necessary for computational pangenome analyses have been, and are still being, developed and evaluated on subsets. A major reason for the lack of comprehensive pangenome studies for E. coli is that the necessary computing time on high‑performance computers is hard to come by. The main focus of bioinformaticians is therefore to make their algorithms ever faster without compromising on quality. It's not unlike wet-lab microbiologists using all the tricks to make their fastidious pets grow with generation times of days rather than weeks or months.
Pangenome analyses (or "pan‑genome" as mentioned here in STC) have introduced a number of terms that are now in common use: "...the core genome comprises the genes that are present in (nearly) all genomes analyzed. Some authors consider a softcore genome (>95% occurrence) to avoid dismissing gene families due to sequencing/annotation artifacts. The shell genome consists of the genes shared by the majority of genomes (10–95% occurrence). The gene families present in only one genome or <10% occurrence are described as accessory or cloud genome" (adapted from Wikipedia).
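To make these categories concrete, here is a minimal Python sketch that sorts gene families into core, softcore, shell and cloud by their occurrence across genomes. The 95% and 10% boundaries follow the definition quoted above; the 99% threshold for "(nearly) all genomes", the function name and the toy data are my own assumptions for illustration, not taken from any of the cited studies.

```python
def classify_gene_families(presence, n_genomes, core=0.99, softcore=0.95, cloud=0.10):
    """Assign each gene family to a pangenome category.

    presence maps a gene family name to the set of genome ids that carry it.
    The core threshold (0.99) is an assumed stand-in for "(nearly) all genomes".
    """
    classes = {}
    for family, genomes in presence.items():
        freq = len(genomes) / n_genomes
        if freq >= core:
            classes[family] = "core"
        elif freq > softcore:
            classes[family] = "softcore"
        elif freq >= cloud:
            classes[family] = "shell"
        else:
            classes[family] = "cloud"
    return classes


# Toy data: 100 hypothetical genomes, four made-up gene families.
toy_presence = {
    "famA": set(range(100)),  # in every genome
    "famB": set(range(97)),   # in 97% of genomes
    "famC": set(range(40)),   # in 40% of genomes
    "famD": set(range(3)),    # in 3% of genomes
}
demo_classes = classify_gene_families(toy_presence, n_genomes=100)
print(demo_classes)
```

Real pangenome tools first have to cluster genes into families, which is the computationally expensive part; the thresholding itself is as simple as shown here.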
Tantoso et al. (2022) found, when analyzing 1,324 genomes, that the E. coli pangenome is an "open pangenome", meaning that with every genome added the number of cloud genes increases without reaching a plateau while the number of core genes (or softcore genes) asymptotically flattens out to a minimum (see here). All the studies I am aware of found an "open pangenome" for E. coli. And, according to current estimates, the E. coli cloud genome comprises ~80,000 genes/gene families, while the core genome comprises ~2,500–3,000 genes/gene families.
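What "open" means can be illustrated with a toy accumulation curve: add genomes one at a time and record the size of the pangenome (union of all gene families seen so far) and of the core genome (intersection). This is only a sketch on made-up data with a single fixed order of genomes; published analyses typically average over many random orders and fit a model to the curve.

```python
def accumulation_curve(genomes):
    """Return (pangenome size, core genome size) after each added genome.

    genomes is a list of sets of gene family names, processed in the given order.
    """
    pan, core = set(), None
    curve = []
    for genes in genomes:
        pan |= genes                                        # union grows or stays
        core = set(genes) if core is None else core & genes  # intersection shrinks or stays
        curve.append((len(pan), len(core)))
    return curve


# Three hypothetical genomes with invented gene family names.
toy_genomes = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "e", "f"},
]
demo_curve = accumulation_curve(toy_genomes)
print(demo_curve)  # [(3, 3), (4, 2), (6, 1)]
```

In an open pangenome the first number keeps climbing with every genome added, while the second levels off at the size of the core genome.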
With 10,667 E. coli genomes analyzed, Abram et al. (2021) come closest to a comprehensive quantitative pangenome study and the (presently) least blurry "snapshot" of the genomic diversity of the species. The authors obtained this subset by "cleaning", that is, by removing duplicates and inconsistent "complete genomes" from the 12,602 E. coli and Shigella genomes in the then‑available NCBI RefSeq collection. Since it is virtually impossible to align and compare this number of genomes one-to-one and subsequently determine the degree of relatedness by conventional cluster analysis, the authors resorted to alignment-free tools that are routinely used in metagenomics (see here for a primer in STC). Briefly: the genomes were converted to "sketches" in an approach based on k‑mers (k-mer size of 21, sampling size of 10,000). (Note that it is in principle not possible to "reverse engineer" the original sequence from such "sketches".) From these "sketches", a distance matrix for all 10,667 genomes was calculated using Mash, with the accession numbers of the genomes as columns and rows. This matrix was then clustered in Python, using hierarchical clustering to produce a heatmap of 'Mash distances', which illustrates the population structure of these genomes (Figure 2.3).
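For readers who want to see the principle behind such "sketches", here is a rough, self-contained Python illustration of a MinHash sketch and the Mash distance D = -ln(2j / (1 + j)) / k, where j is the Jaccard index estimated from the two sketches. It is a simplification under stated assumptions, not the Mash program itself: it hashes with blake2b from the standard library instead of Mash's own hash function, it ignores reverse-complement (canonical) k-mers, and the two test sequences are randomly generated.

```python
import hashlib
import math
import random


def kmers(seq, k):
    """All overlapping substrings of length k (strand is ignored here)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, size=10_000):
    """Keep the `size` smallest 64-bit hashes of the sequence's k-mers."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def mash_distance(sketch_a, sketch_b, k=21, size=10_000):
    """Mash distance from a sketch-based estimate j of the Jaccard index."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]          # smallest hashes of the union
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in a and h in b)
    j = shared / len(merged)
    if j == 0:
        return 1.0
    if j == 1:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k


# Toy example: a random 2,000-bp "genome" and a copy with five point mutations.
rng = random.Random(42)
genome_a = "".join(rng.choice("ACGT") for _ in range(2000))
mutated = list(genome_a)
for pos in (100, 500, 900, 1300, 1700):
    mutated[pos] = "A" if mutated[pos] != "A" else "C"
genome_b = "".join(mutated)

d_same = mash_distance(sketch(genome_a), sketch(genome_a))
d_close = mash_distance(sketch(genome_a), sketch(genome_b))
print(d_same, d_close)
```

The sketch only retains a sorted list of hash values, which is why it is so much smaller than the genome and why the distance between two genomes can be estimated without aligning them.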
As indicated in the legend to Figure 2.3, colors in the heatmap are based on the pairwise Mash distance between the genomes. Teal colors represent similarity between genomes, with the darkest teal corresponding to identical genomes reporting a Mash distance of 0. Brown colors represent low genetic similarity per Mash distance, with the darkest brown indicating a maximum distance of ∼0.039. Since you cannot read out numbers from Figure 2.3, here they are: 70% of the 10,667 E. coli genomes belong to four phylogroups: B1 (28%), A (21%), B2-2 (13%), and Shig2 (8%). Phylogroup C is represented by 5% and E2(O157) by 7% of the genomes. The remaining 18% are distributed unevenly over eight phylogroups and include the 9% "unassignable" cases. The methodology used in this work is thus able to reproduce known phylogroups, as well as to identify previously uncharacterized E. coli phylogroups, with an acceptable amount of "bottom sediment" (the authors use the terms 'phylotype' and 'phylogroup' interchangeably).
The authenticity of these 14 phylogroups is supported by three different lines of evidence: 1. phylogroup-specific core genes, 2. a phylogenetic tree constructed with 2613 single‑copy core genes, and 3. differences in the rates of gene gain/loss/duplication. It goes without saying that the analysis of entire genomes allows a much greater depth of analysis than the PCR-based method, which is limited to a few gene segments. Note that Abram et al. (2021) counted "core genes", i.e. genes that are present in all genomes of a phylogroup, and not, as Ghomi et al. (2024) did, "essential genes" (see above).
To classify the 95,525 unassembled genomes, the authors took an interesting approach: rather than averaging the Mash values over a phylogroup, they defined a "medoid" as the genome that has the lowest average distance to all other genomes in its phylogroup when using the Mash values obtained for the entire set of 10,667 assembled genomes. By using the medoids of the 14 phylogroups as a proxy to classify the unassembled genomes, they found that two-thirds (67%) belong to phylogroups A (23%), B1 (15%), C (15%), and E2(O157) (14%). Strains belonging to phylogroups B1, C, and E2(O157) are often pathogenic and of interest to medical research and epidemiology, whereas phylogroup A includes strains frequently used in the lab or genetically modified strains. Researchers working with these strains are often not interested in fully assembling the genomes, which could explain the uneven distribution across the phylogroups compared to that of the complete genomes. (The remaining ~31,000 unassembled genomes were distributed at lower percentages among the remaining phylogroups or did not meet the cutoff.)
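The medoid idea is easy to sketch in Python: pick the genome with the lowest summed (and hence lowest average) distance to the other members of its group, then assign a new genome to the group whose medoid is nearest, unless even that distance exceeds a cutoff. The genome names, distance values and the cutoff of 0.04 below are invented for illustration and are not the values used by Abram et al. (2021).

```python
def medoid(members, dist):
    """Return the member with the lowest average distance to all other members."""
    return min(members, key=lambda g: sum(dist(g, h) for h in members if h != g))


def classify(query, medoids, dist, cutoff):
    """Assign query to the phylogroup with the nearest medoid, or None if too distant.

    medoids maps a phylogroup name to its medoid genome.
    """
    group, best = min(
        ((name, dist(query, m)) for name, m in medoids.items()),
        key=lambda pair: pair[1],
    )
    return group if best <= cutoff else None


# Made-up pairwise distances between hypothetical genomes g1..g4 and a query q.
toy_distances = {
    frozenset(("g1", "g2")): 0.010,
    frozenset(("g1", "g3")): 0.012,
    frozenset(("g2", "g3")): 0.030,
    frozenset(("q", "g1")): 0.008,
    frozenset(("q", "g4")): 0.035,
}


def toy_dist(a, b):
    return 0.0 if a == b else toy_distances[frozenset((a, b))]


demo_medoid = medoid(["g1", "g2", "g3"], toy_dist)            # g1 has the lowest sum
demo_medoids = {"A": demo_medoid, "B1": "g4"}                 # g4: assumed medoid of a second group
demo_group = classify("q", demo_medoids, toy_dist, cutoff=0.04)
demo_unassigned = classify("q", demo_medoids, toy_dist, cutoff=0.005)
print(demo_medoid, demo_group, demo_unassigned)
```

The appeal of this shortcut is that a new genome needs to be compared with only one representative per phylogroup, here 14 comparisons, instead of with every genome in the collection.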
To come back to the initial question of whether we know the genomic diversity of the species E. coli, I should give a two-part answer:
Yes, as far as the approximate size of the core genome is concerned (note that "core gene" and "essential gene" are not synonymous). There are about 2,500–3,000 core genes/gene families across all E. coli phylotypes, but since the function of their gene products is known for only a fraction of the genes, although a major fraction by now, it is not possible to say exactly how much redundancy there is in the core genome. And in our attempts to extrapolate from genomic to genetic diversity, we still lack knowledge of many regulatory pathways and of their respective redundancies.
Also a cautious yes for the approximate size of the cloud genome. Estimates based on the number of genomes available in 2021 suggest that there are about 80,000 cloud genes/gene families, with the tendency of a steadily slowing increase with more genomes added to the collections. Since E. coli has an open pangenome, we will have to be content with estimates also in the future.
Pangenome analyses carried out in recent years have already led to compelling results, despite the fact that this type of analysis is in its infancy. Or rather, is of pre-school age, to be fair. The approach of a Mash-based analysis of multiple genomes chosen by Abram et al. (2021) can, if the parameters are set accordingly, reproduce the known E. coli phylotypes to a satisfactory extent and is suitable for identifying new phylotypes. Since genome sequencing is now a lab routine, it is not presumptuous to expect that this analytical technique will replace PCR-based "ClermonTyping" over time, especially when the "medoids" of the 14 phylogroups are employed to reduce the computational demand. A Mash-based analysis may even help with the genomes of E. coli strains that defy PCR‑based phylotyping. It is not possible to predict whether the number of 14 phylogroups proposed by Abram et al. (2021) will persist, whether they will need further differentiation, or whether new ones will be added when genomes of E. coli from previously underrepresented habitats are added to the "basket".
The sheer number of core genes/gene families plus the daunting number of cloud genes/gene families in the E. coli pangenome and our current (at best) spotty understanding of the function(s) of the latter fraction are a strong reminder of what W. Ford Doolittle said in a 1999 paper, namely "that all prokaryotic taxa are in essence imprecisely bounded and ephemeral. We might thus realistically look at all prokaryotes as one ‘global superorganism’ ... divided into subpopulations – within and between which genes are exchanged at different frequencies." (exchanged by horizontal gene transfer (HGT), that is). One of Doolittle's 'subpopulations' is Escherichia coli, which, according to Kostas Konstantinidis, is a "sequence-discrete population" (95–100% ANI) and should therefore be considered a taxonomic species. Doolittle's 'imprecisely bounded' is reflected in the huge cloud genome of E. coli, and that this species is 'ephemeral', well, yes, this is what evolution is about after all.
Through a pangenome analysis of a selected subset of 400 E. coli genome sequences, Hall et al. (2021) revealed numerous groups of genes that preferentially co‑occur in a genome (Figure 2.4) and, reciprocally, groups of genes that preferentially do not co‑occur with other gene groups in a genome (click here for Figure 2.4b). These authors have cut the first deep trails into this seemingly impenetrable jungle: there is structure in the E. coli pangenome, shaped by its evolution, and it's not simply a vast heap of core and cloud genes/gene families. Not that we already understand this structure well enough...