by Christoph Weigel
A still somewhat unfamiliar term is floating around: the pan-genome. In 2005, Tettelin et al. coined the term in their genome analyses of eight Streptococcus agalactiae strains, and Merry introduced it to this blog some time ago. Today, a keyword search in PubMed returns roughly 200 hits, 29 of them for 2014 alone. Is it just another fancy new word spilling from the -omics basket? No. A 'pan-genome' is the assembly of all the genes known to exist in members of a group. Or as a mnemonic: imagine the ancient Greek god Pan gathering and avidly devouring a whole bunch of grapes instead of picking, tasting, and enjoying each grape separately (Fig. 1). Replace 'grapes' with 'bacterial strains' and you know the essence of 'pan-genome'. Not far-fetched, actually: classical philologists tell us that the prefix 'pan' (παν, Greek for 'copious') and the name of the god Pan (Παν) have the same linguistic root.
The pan-genome unfurled
The pan-genome is the 'union' of the 'gene sets' – both terms are from set theory, a branch of mathematical logic – of all the strains of a species. We may expect such stilted definitions when microbiology and bioinformatics crossbreed to sort today's deluge of sequence data (Fig. 2). Determining the pan-genome of a species under scrutiny depends just as much on the microbiologist's expert choice of strains as on the bioinformatician's choice of appropriate algorithms for detecting homologs, paralogs, and gene duplications.
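As a minimal sketch of this set-theoretic definition, assume each strain's genome has already been reduced to a set of gene-family identifiers. The strain names and gene contents below are invented for illustration, and the hard part in practice (clustering homologs into families) is not shown.

```python
# Toy gene sets; strain names and gene contents are hypothetical.
gene_sets = {
    "strain_A": {"dnaA", "dnaN", "recF", "gyrB", "capA"},
    "strain_B": {"dnaA", "dnaN", "recF", "gyrB", "tetM"},
    "strain_C": {"dnaA", "dnaN", "recF", "gyrB", "capA", "phiX1"},
}


def pan_genome(gene_sets):
    """Return the union of all per-strain gene sets."""
    return set().union(*gene_sets.values())


pan = pan_genome(gene_sets)
print(len(pan), sorted(pan))
```

Every gene counts once, no matter how many strains carry it, which is why the pan-genome can only stay the same or grow as strains are added.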
The gene annotation step during sequencing is the Achilles' heel here: due to the vast number of genes in bacterial genomes, curating them manually one by one is (almost) impossible. Want an example? The dnaA gene encoding the replication initiator protein DnaA is not annotated in the genome of E. coli CFT073 (GenBank entry AE014075) because a sequencing error disrupts the reading frame, although both halves are easily detectable by manual homology searches (my observation).
Another example? The rpmH gene encoding the large-subunit ribosomal protein L34 (~45 residues) is annotated in only 95 of the 106 published genomes of the actinobacteria but is clearly detectable in the remaining 11 by manual homology searches (my observation). Also well known are the difficulties that current annotation software packages have in reliably detecting rare translational start codons. Thus, every genome has a 'twilight zone' of un-annotated or wrongly annotated genes. Published integer values for gene numbers per genome should probably be listed with the assumed error rate. But if you accept a certain error rate and if you are into bioinformatics, you find tool kits for pan-genome analyses here, here, and here.
The core genome
The core genome is the set of genes shared by all the strains of the same bacterial species. Tettelin et al. found that for the eight S. agalactiae strains the majority of genes making up the core genome belong to the groups of housekeeping functions, e.g., cell envelope, regulatory functions, and transport and binding proteins. About one-third of the shared genes fell into the class encoding proteins of hypothetical or unknown function. Housekeeping genes are known to be less prone to replacement by horizontal gene transfer than genes from the rest of the genome, but there are known exceptions. Want an example? As in the genomes of most other deltaproteobacteria, the dnaN and recF genes are embedded in a conserved gene context (dnaA · dnaN · recF · gyrB) in the Hippea maritima DSM 10411 genome (GenBank entry CP002606). However, these dnaN and recF genes have their closest homologs in the gammaproteobacteria, suggesting horizontal gene transfer (my observation). Tettelin et al. conclude that the "essence of a species" is linked to the core genome – it is at least that part of the genome which best represents vertical inheritance.
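In set terms, the core genome is the intersection of the per-strain gene sets. Here is a minimal sketch under the same assumption as before, namely that genes have already been grouped into families with shared identifiers. The strain names and gene contents are hypothetical.

```python
# Toy gene sets; strain names and gene contents are hypothetical.
gene_sets = {
    "strain_A": {"dnaA", "dnaN", "recF", "gyrB", "capA"},
    "strain_B": {"dnaA", "dnaN", "recF", "gyrB", "tetM"},
    "strain_C": {"dnaA", "dnaN", "recF", "gyrB", "capA", "phiX1"},
}


def core_genome(gene_sets):
    """Return the genes present in every strain (set intersection)."""
    sets = list(gene_sets.values())
    if not sets:
        return set()
    return set.intersection(*sets)


core = core_genome(gene_sets)
print(sorted(core))
```

Note that a single missed annotation in one strain is enough to drop a gene from the intersection, so annotation errors tend to make the computed core genome too small.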
In typical pan-genome analyses, the number of shared genes initially decreases with the addition of each new genome sequence. Extrapolating this 'decay curve' then indicates that the core genome approaches an asymptotic minimum that remains relatively constant even as many more genomes are added (Fig. 3).
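The shape of such a decay curve can be illustrated by recomputing the intersection each time a genome is added. This is only a sketch with invented toy genomes: it follows a single, fixed order of addition and does no curve fitting or extrapolation, whereas a real analysis would have to deal with the fact that intermediate values depend on the order in which genomes are added.

```python
# Toy genomes in a fixed order of addition; gene names are hypothetical.
genomes = [
    {"a", "b", "c", "d", "e", "f"},
    {"a", "b", "c", "d", "e", "x"},
    {"a", "b", "c", "d", "y"},
    {"a", "b", "c", "d", "z"},
]


def core_decay(genomes):
    """Return the core-genome size after adding each genome in order."""
    sizes = []
    core = None
    for genes in genomes:
        # The first genome defines the starting core; later ones shrink it.
        core = set(genes) if core is None else core & genes
        sizes.append(len(core))
    return sizes


print(core_decay(genomes))  # [6, 5, 4, 4]
```

In this toy case the count drops with the second and third genome and then stays at four, which is the plateau behaviour the decay curve describes.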
The variable genome
The variable genome is the set of genes present in single strains – or a subset of strains – of a bacterial species. For this, Tettelin et al. had originally introduced the contentious term 'dispensable genome', which was soon replaced by the more neutral term 'variable genome'. They found that many of the strain-specific genes fell into the class of hypothetical / unknown function proteins or, interestingly, are located in genomic islands. Although these islands do not have the classical features of pathogenicity islands, they too are often flanked by insertion elements and display an atypical nucleotide composition, which suggests that they may have been acquired through horizontal transfer.
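The two flavours of 'variable' (genes in a subset of strains versus genes in a single strain) can be told apart by counting how many strains carry each gene. This is a minimal sketch with hypothetical strain names and gene contents, again assuming genes have already been assigned to shared family identifiers.

```python
from collections import Counter

# Toy gene sets; strain names and gene contents are hypothetical.
gene_sets = {
    "strain_A": {"dnaA", "dnaN", "recF", "gyrB", "capA"},
    "strain_B": {"dnaA", "dnaN", "recF", "gyrB", "tetM"},
    "strain_C": {"dnaA", "dnaN", "recF", "gyrB", "capA", "phiX1"},
}


def gene_counts(gene_sets):
    """Count in how many strains each gene occurs."""
    return Counter(g for genes in gene_sets.values() for g in genes)


def variable_genome(gene_sets):
    """Genes present in at least one strain but not in all of them."""
    counts = gene_counts(gene_sets)
    n_strains = len(gene_sets)
    return {g for g, c in counts.items() if c < n_strains}


def strain_specific(gene_sets):
    """Map each strain to the genes found in that strain only."""
    counts = gene_counts(gene_sets)
    return {
        strain: {g for g in genes if counts[g] == 1}
        for strain, genes in gene_sets.items()
    }


print(sorted(variable_genome(gene_sets)))
print(strain_specific(gene_sets))
```

Here capA is variable but not strain-specific, because two of the three toy strains carry it, while tetM and phiX1 each occur in one strain only.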
One stunning observation from most pan-genome studies: adding more strains to the analyses allows one to extrapolate that – in the long run – the pan-genome of a species 'grows' by another 30–50 unique genes for every additional strain. No saturation is reached (Fig. 3). It is not particularly inventive to speculate that phages contribute to the physical maintenance of this genetic 'dark matter'. If so, the variable genomes of bacterial species overlap at the 'outer orbits', which makes their distinction as separate species impossible unless one allows for gradation.
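What 'no saturation' means can be made concrete by counting, for each added strain, the genes not seen in any earlier strain. The sketch below uses invented toy genomes in one fixed order. The numbers are illustrative only and are unrelated to the 30–50 genes per strain reported in the studies.

```python
# Toy genomes in a fixed order of addition; gene names are hypothetical.
genomes = [
    {"a", "b", "c", "d", "e"},
    {"a", "b", "c", "d", "x"},
    {"a", "b", "c", "d", "y", "z"},
    {"a", "b", "c", "d", "w"},
]


def new_genes_per_strain(genomes):
    """Return how many previously unseen genes each added genome brings."""
    seen = set()
    added = []
    for genes in genomes:
        fresh = genes - seen      # genes absent from all earlier genomes
        added.append(len(fresh))
        seen |= genes             # the running pan-genome
    return added


print(new_genes_per_strain(genomes))  # [5, 1, 2, 1]
```

A pan-genome would count as saturated if these values settled at zero. In the toy data, as in the studies described above, every further strain still contributes something new.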
Can the concept of the pan-genome solve the 'bacterial species' enigma?
No, sorry, not the way you might expect! The present methods of defining bacterial species by DNA·DNA reassociation kinetics, 16S rRNA typing, or MLST – alone or in combination – cover features mainly associated with the core genome. At first glance, the 'pan-genome approach' to defining species seems attractive since it includes those traits that are mostly linked to the variable genome: virulence, capsular serotype, adaptation, and antibiotic resistance, among others. Therefore, Tettelin et al. claim that: "...sequencing of multiple strains is necessary to understand the virulence of pathogenic bacteria and to provide a more consistent definition of the species itself." But there is a high price for it, as the authors point out: "...because, in reality, only species with an open pan-genome are species,...". This is an inherent contradiction that, although embarrassing from the scientific point of view, sounds like a more realistic approach to coping with the amazing genome dynamics of the small things we love to consider here.
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, & Fraser CM (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci, 102 (39), 13950-13955. PMID 16172379