by Merry Youle
Given the streamlined genomes and the frugal nature of the Bacteria and Archaea, one might expect their proteins to be short and to the point. However, a survey of the 580 sequenced prokaryotic genomes available in 2008 found many genes apparently encoding large proteins. Specifically, 0.2% of the ORFs (3732 genes) were longer than 5 kb. Of those, 80 were true giants, more than 20 kb long! These mammoths were found scattered among 47 taxa in 8 different phyla. The longest two, both from the green sulfur bacterium Chlorobium chlorochromatii CaD3, encode proteins containing 36,806 and 20,647 amino acids, respectively.
Does this set an all-domain record? Not quite. They are bested by one of our own, ttn, the gene for titin, an abundant protein of critical importance in vertebrate striated muscles. This single gene contains 363 exons that, upon translation, yield a protein with 38,138 amino acid residues and 244 individually folded protein domains. Each titin molecule spans half the length of an entire sarcomere. Score one for the eukaryotes.
Back in the prokaryotic world, one can't help but wonder whether these giant genes actually encode proteins and, if so, what their functions might be. Answers are sketchy at best. However, more than 90% of them are annotated as encoding either a surface protein or a polyketide/nonribosomal peptide synthetase (PKS/NRPS). Both PKS and NRPS are multienzyme systems that are usually encoded by 3-6 genes in an operon. The giant genes of this type appear to be the result of gene fusion, with the multiple enzymes replaced by domains within a single protein. Does this arrangement work well? If it did, one would expect these fused genes to be more widely distributed, not strain-specific as they are.
Among the predicted surface proteins is one that has been experimentally demonstrated to be transcribed as one full-length mRNA: the halomucin gene of Haloquadratum walsbyi. (H. walsbyi is a square halophilic archaeon found in extremely high-salt environments.) At 9195 amino acids, this halomucin is the largest archaeal protein known so far, and it is thought to be exported from the cell to serve as a surface shield against the salty environment.
More food for thought. The 80 giant genes have tetranucleotide signatures that differ from the rest of the organism's genome. Their protein products all represent a major investment of resources and time, suggesting that they are doing something important. The maximum rate of bacterial protein synthesis under the best of conditions is estimated to be 40 amino acids added per second. On that basis, to make an average E. coli protein of 400 amino acids would take 10 seconds, while the 36,806 amino acid giant (the size of the largest one from Chlorobium chlorochromatii) would take a minimum of 15 minutes, and likely much longer. And besides, whatever their functions, handling mRNAs and proteins of this length must surely pose some knotty problems.
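The back-of-the-envelope timing above is easy to reproduce. The short Python sketch below simply divides protein length by the estimated maximum elongation rate of 40 amino acids per second. The function name and the dictionary of example proteins are illustrative choices, not anything from the survey. It also assumes a constant rate with no pauses, which is why the result is a lower bound.

```python
def translation_time_seconds(length_aa: int, rate_aa_per_s: float = 40.0) -> float:
    """Minimum time to synthesize a protein at a constant elongation rate.

    Assumes the ribosome never pauses, so the result is a lower bound.
    """
    if rate_aa_per_s <= 0:
        raise ValueError("rate must be positive")
    return length_aa / rate_aa_per_s


# Protein lengths (in amino acids) quoted in the text above.
proteins = {
    "average E. coli protein": 400,
    "halomucin (H. walsbyi)": 9195,
    "largest C. chlorochromatii protein": 36806,
}

for name, length in proteins.items():
    seconds = translation_time_seconds(length)
    print(f"{name}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

At 40 amino acids per second, the 400-residue protein takes 10 seconds and the 36,806-residue giant takes about 920 seconds, a little over 15 minutes. That is roughly 92 times longer, and it assumes the ribosome never stalls.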
Reva O, & Tümmler B (2008). Think big – giant genes in bacteria. Environmental Microbiology, 10(3), 768-777. PMID: 18237309