by Christoph Weigel
When biochemists deciphered the genetic code in the 1960s (the triplet 'alphabet' for amino acids whose defined order makes up the protein 'words') it was – and still is – the most compelling evidence for a common "descent with modifications" (Charles Darwin) of all life on Earth: the alphabet is the same in all life forms (Figure 1). This holds true despite some whimsical spelling variants (as we know them from every human language as well). Not so common are the grammar and syntax, but this should not bother you here (gene expression and its regulation, the how & when to make use of the words, differ considerably among life forms). Yet unveiling biology's "universal" alphabet came at a hefty price, namely, losing the possibility to trace, unambiguously, the evolutionary history of the 'speakers', be they E. coli or elephants (Jacques Monod, 1954). If everybody understands all words, in principle, the dependence on the parentally inherited vocabulary (vertical gene transmission) becomes less of a constraint. New words can not only "evolve" from within the inherited vocabulary (by mutation, domain swapping, and gene duplication) but can also be picked up from whatever source is within reach (lateral or horizontal gene transfer, here abbreviated HGT). And, sure enough, lots of words have been exchanged across all domains of life over 3 billion years of evolution, in whatever form the words came along: carried by naked "environmental" DNA (set free by lysed cells and disrupted viruses), by viruses and bacteriophages, by 'gene transfer agents' (GTA), or by conjugative plasmids (think of the alarming spread of antibiotic resistance).
We have mentioned HGT so often in STC that I pick just three startling examples here: the pea aphid that adopted fungal genes to color its skin either green or pink (Color Me Aphid), the stick insect that adopted bacterial genes to better cope with its vegan lifestyle (Picky Sticks), and the case of "porous" genetic barriers between the bacteria Campylobacter jejuni and Campylobacter coli (Species Come and Species Go). Molecular phylogeneticists soon realized that any attempt to derive organismal phylogenies from gene histories is flawed: there have been way too many HGT events over evolutionary time (that's particularly true for microbial species or lineages). It is still debated whether the species concept can be retained at all for the prokaryotes... (maybe yes, see our recent post on the 'species problem').
Let me turn away from these grandiose views (sensing mild hypoxia already) and gradually descend from the peaks of theory to the lower-lying valleys and meandering rivers of experimentation. To finish this "Tour d'Horizon", we will need to complete three stages. In the first stage (this post), I will show you around a lush valley populated by a most astonishing collection of trees, a plantation generated by bioinformaticians who tackle the impact of HGT on the larger scale (Figure 2). The second and third stages of the tour will come later... In the second stage, I will undertake an archaeological excursion along the Helicase River to show you two of its HGT tributaries. In the final stage, I will invite you to peek into experiments that unveil aspects of how HGT actually works on the molecular level.
The most astonishing trees
In their 2008 study, Tal Dagan and her colleagues asked how substantial the impact of HGT on the evolution of prokaryotic genomes was. They did not laboriously collect indications for HGT by comparing contradictory phylogenetic trees of individual genes or gene families. Rather, they chose an algorithmic approach to comprehensively analyze the 181 prokaryotic genomes – 159 Bacteria, 22 Archaea – available back then (today, there are >1,500 fully sequenced prokaryotic genomes). From the annotated genes (that is, 'open reading frames' coding for proteins) of these 181 genomes they obtained a total of 539,723 protein sequences (the exact number is given without an error margin; note that even well-curated genomes usually contain ~2% misannotations). They then used the 'reciprocal best BLAST hit approach' (BBH) with a threshold of E < 10⁻¹⁰ (for an explanation, see here; scroll to "What is the Expect (E) value?") in an all‑against‑all procedure to 'extract' orthologs. And, by applying the T25 amino acid identity threshold (that's roughly corresponding to the limit of what you identify as homologs when scanning sequences "by eye"), they obtained 54,349 clusters (= protein families) covering 431,492 proteins from their pool. They excluded the remaining 108,231 'singletons' from further analyses (such 'singletons' were likely to encompass most of the misannotated genes, so they were on the safe side). These protein families contained the lowest possible number of false-positive and false-negative 'hits' (find some notes of caution here). Yet, at T25, only six families had members in all genomes! Only the finding that the 181 genomes share at least one gene family, and are therefore interconnected with each other, forming a 'complete network', allowed for a valid matrix-based analysis of the edges (I stop with the bioinformatics lingo right here).
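For readers who like to see the lingo at work, here is a minimal sketch of the pipeline just described: pick reciprocal best hits, group them into families, and check whether the genomes form a 'complete network'. Everything in it is an assumption made for illustration: the toy hit table, the "genome|protein" naming, the reading of T25 as "at least 25% identity", and the single-linkage grouping, which stands in for whatever clustering procedure the authors actually used.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy BLAST hits: (query, subject, E-value, percent identity).
# Protein IDs are written "genome|protein"; no value here comes from the study.
HITS = [
    ("gA|p1", "gB|p1", 1e-50, 62.0),
    ("gB|p1", "gA|p1", 1e-48, 62.0),
    ("gB|p1", "gC|p1", 1e-30, 41.0),
    ("gC|p1", "gB|p1", 1e-31, 41.0),
    ("gA|p2", "gC|p2", 1e-12, 18.0),  # reciprocal, but below 25% identity
    ("gC|p2", "gA|p2", 1e-12, 18.0),
    ("gA|p3", "gB|p1", 1e-20, 30.0),  # one-way only: gB|p1 prefers gA|p1
]


def genome_of(protein_id):
    """Extract the genome label from a 'genome|protein' identifier."""
    return protein_id.split("|")[0]


def reciprocal_best_hits(hits, max_evalue=1e-10, min_identity=25.0):
    """Return protein pairs that are each other's best hit across two genomes."""
    best = {}  # (query, subject genome) -> (E-value, subject)
    for query, subject, evalue, identity in hits:
        if genome_of(query) == genome_of(subject):
            continue  # only compare proteins from different genomes
        if evalue >= max_evalue or identity < min_identity:
            continue  # fails the E-value or identity threshold
        key = (query, genome_of(subject))
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, subject)
    pairs = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of(query)))
        if back is not None and back[1] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def cluster_families(pairs):
    """Group linked proteins into families (single linkage via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    families = defaultdict(set)
    for protein in list(parent):
        families[find(protein)].add(protein)
    return list(families.values())


def is_complete_network(families, genomes):
    """True if every pair of genomes shares at least one protein family."""
    linked = set()
    for family in families:
        present = sorted({genome_of(p) for p in family})
        linked.update(combinations(present, 2))
    return all(pair in linked for pair in combinations(sorted(genomes), 2))


pairs = reciprocal_best_hits(HITS)
families = cluster_families(pairs)
print(sorted(pairs))
print(families)
print(is_complete_network(families, {"gA", "gB", "gC"}))
```

In this toy set, the proteins that never make it into a family (gA|p2, gC|p2, gA|p3) play the role of the 'singletons' that were set aside.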
None of their networks had more than a shady resemblance to known taxonomic boundaries, but Dagan et al. readily found three distinct groups with shared characteristics among these networks. To better visualize these groups, they projected them onto a reference tree representing the phylogeny of the concatenated three rRNAs of all the genomes (Figure 2 A) (most phylogenetic studies employ the '16S rRNA approach', which is generally considered the "gold standard" but is known to fail in resolving closely related species and the deepest branching). In group 1, which they consider to represent the most recent HGTs, links exist between extant genomes, i.e., they share genes that are not present in any of the others (Figure 2 D, red); in group 2, which likely represents 'more recent' HGTs (from an ancestor of one branch to an extant genome), links exist between slightly 'deeper' edges within a bunch of branches and single extant genomes (Figure 2 C); in group 3, links between edges of branches in distantly related groups represent ancient HGTs (Figure 2 B). Figure 2 shows qualitatively (but more impressively than any table with discrete numbers could) that the three groups are equally "dense," that is, HGT was a constantly strong contributor to genome evolution over time. The authors estimate "that, on average, at least 81±15% of the genes in each genome studied were involved in lateral gene transfer at some point in their history".
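One possible reading of the three groups is that they differ in whether the two ends of a link are extant genomes or ancestral nodes of the reference tree. The sketch below sorts links on that assumption only; the node names are hypothetical, and this is an interpretation of the description above, not the authors' procedure.

```python
# Hypothetical node names; the "anc_" prefix marks inferred ancestral nodes.
EXTANT = {"E_coli", "B_subtilis", "M_jannaschii"}


def classify_link(node_a, node_b, extant=EXTANT):
    """Assign a lateral link to a group by the kind of nodes it connects."""
    a_is_extant = node_a in extant
    b_is_extant = node_b in extant
    if a_is_extant and b_is_extant:
        return "group 1: extant genome to extant genome (most recent)"
    if a_is_extant or b_is_extant:
        return "group 2: ancestral node to extant genome"
    return "group 3: ancestral node to ancestral node (ancient)"


links = [
    ("E_coli", "B_subtilis"),
    ("anc_gamma", "M_jannaschii"),
    ("anc_gamma", "anc_archaea"),
]
for a, b in links:
    print(a, "<->", b, "=>", classify_link(a, b))
```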
One of their especially interesting findings was that there are no apparent "hubs" for HGT, that is, edges in lineages that gave rise to multiple HGTs over short times. This suggests that HGT occurs largely at random and does not depend on particularly 'HGT-prone' genomic constellations (in neither 'donor' nor 'recipient'). Also interesting is their finding that once a gene has been acquired by a genome, it tends to stick there and doesn't continue with a 'nomadic lifestyle' (newly acquired genes that were subsequently lost within a few million years would probably have slipped through their filter, though).
Figure 2 provokes a few comments. First, the vertical scale for the funnel-shaped trees is "years since the emergence of distinct rRNA lineages", thus roughly 3 billion years, compressed to a few millimeters in the figure. This is as if you looked at a map of the United States at a scale that shows San Diego, CA and Boston, MA on a letter page (distance ~4,900 km when driving by I-80 + I‑90). You wouldn't be able to put a fingertip on either of the two intersections of interstates 80 and 35 in Des Moines (~10 km apart, roughly equal to 6 million years); your fingertip would instead cover a great part of Iowa. We have to accept that time scales covering 3 billion years are beyond human perception, and also beyond the possibilities of more granular graphical representations. Second, the distances of the outer edges in the horizontal plane (the individual genomes, Fig. 2 A) correspond to the relative similarities of their rRNA sequences assuming a constant mutation rate over time (certainly an over-simplification, but we simply don't know better), with the time component stretched out vertically. This is an ingenious way to "open up" the visual space for showing the different 'layers' of HGT events over time. But we are left, as viewers, with a low-resolution 'general perspective' and not allowed to zoom in, or highlight the path(s) of individual protein families, or to interactively "climb" within the trees for closer inspection (which would be immensely tempting!).
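The map analogy can be checked on the back of an envelope. Taking the figures quoted above at face value, 10 km of road corresponds to a few million years, and to well under a millimeter on the page. The page length and the fingertip width below are assumptions of this sketch, not numbers from the paper or the figure.

```python
# Figures quoted in the text
TIME_SPAN_YEARS = 3e9   # roughly 3 billion years on the vertical scale
ROUTE_KM = 4900         # San Diego to Boston by road
SEGMENT_KM = 10         # distance between the two I-80/I-35 intersections

# Assumptions of this sketch (not from the text)
PAGE_MM = 279           # long side of a US letter page (11 inches)
FINGERTIP_MM = 15       # rough width of a fingertip

years_per_km = TIME_SPAN_YEARS / ROUTE_KM
segment_years = SEGMENT_KM * years_per_km          # time covered by 10 km
segment_mm = SEGMENT_KM / ROUTE_KM * PAGE_MM       # 10 km drawn on the page
fingertip_km = FINGERTIP_MM / PAGE_MM * ROUTE_KM   # road hidden by a fingertip

print(f"10 km corresponds to about {segment_years / 1e6:.1f} million years")
print(f"10 km spans about {segment_mm:.2f} mm on the page")
print(f"a fingertip covers about {fingertip_km:.0f} km of road")
```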
Dagan et al. conclude that "when all genes and genomes are considered, the tree paradigm fits only a small minority of the genome at best; hence, more realistic computational models for the microbial evolutionary process are needed." If we consider 'all genes and genomes' here, we're unfortunately left with the ambiguity of whether 'the tree paradigm fits only a small minority of the genome', or '…of the genomes'. Assuming that this was just a typo, I could easily follow Dagan et al. in their proposal that 'more realistic computational models' are needed 'for the microbial evolutionary process'. But only if such models could account for the undisputed fact that the vast majority of cells divide by binary fission, which adds a strong tree-like branching component to 'the microbial evolutionary process', independent of the negotiations that are conducted in the cells' genomes in between divisions. The subsequent part of their conclusion, however, will haunt taxonomists for years to come: "By accounting for all genes, including the many that are patchily distributed across broad taxonomic boundaries, networks uncover a view of microbial genome evolution (sic!) that incorporates HGT as a quantitatively important mechanism of natural variation among prokaryotic genomes. In contrast to trees, networks thus present a means of reconstructing microbial genome evolution that accommodates the incorporation of foreign genes…".