This is a longread again (23 min). It dives deeper into "translational initiation" and is, therefore, arranged like a messenger RNA (mRNA) with proper ribosome-binding site (RBS), AUG start codon, open reading frame (orf), and amber (UAG) stop codon.
When did you first hear about the "genetic code"? In high school? Or in college, in an undergraduate course in molecular biology and/or genetics? In either case you were certainly shown the famous codon table that specifies which of the 64 possible triplets in messenger RNA (mRNA) are translated into one of the 20 canonical or "proteinogenic" L-amino acids. I'm sure your teacher mentioned that the genetic code is virtually identical in E. coli and the elephant, and, therefore, prokaryotes and eukaryotes are much more closely related than was imagined before the code was elucidated in the 1960s (~15 years before Woese's "three domains"). Maybe you had to struggle, a little, to come to terms with degeneracy of the genetic code, meaning that it is redundant yet not ambiguous: there are four codons for proline (P or Pro) but none of them codes for another amino acid. Maybe you were even told the fun story that the first identified "non-sense" or "translational termination" or simply stop codon (UAG) was called amber stop for Harris Bernstein (German for amber), a grad student who helped with the experiments that led to its detection. With a wink, the other two stop codons were called ochre (UAA) and opal (UGA). Lastly, your teacher probably mentioned that AUG codes for methionine (M, or Met) but is also used as "translational initiation codon", in short: start codon. Which may sound trivial, but it's not.
Transfer RNAs (tRNA) are the actual "simultaneous interpreters" of the code presented for translation in a gene's transcript, the mRNA. The tightly folded, L-shaped tRNA molecules have two "business ends": an exposed 3‑letter "anticodon" that matches the 3-letter codon for an amino acid on one end, and the respective amino acid covalently linked to the other end. Yet to make sense of the numerous amino acid-loaded tRNA "words" by polymerizing them into a polypeptide "sentence", a third type of RNA is required: ribosomal RNA, in the shape of the ribosome. Think of the ribosome as a highly ordered tangle of three RNAs adorned with a number of proteins (55 in E. coli), which forms peptide bonds between amino acids via its peptidyltransferase, a ribozyme arranged as a pocket within its 23S rRNA (in the 50S subunit, see a diagram here). Ribosomes slide along the mRNA that is "clamped" between their two subunits (30S + 50S in bacteria) in a ratchet-like fashion. Specifically, they expose an mRNA codon in their "A site" for binding to the cognate tRNA while the growing polypeptide chain remains covalently attached to the tRNA that had previously been bound to the A site and now, after one movement of the gear, sits "next door" in their "P site" (see here for a schematic diagram of the translation cycle). During ongoing translation, ribosomes move along the mRNA in 5'→3' direction in jolts of three bases and, since they don't hop back and forth, the translated polypeptide is co-linear with the mRNA, from its N- to its C-terminus (this is an oversimplification, and there are well-known examples of programmed frameshifts). When you add to this the fact that the genetic code lacks a "word divider" (a comma or a space) for the codon triplets, it is immediately obvious that ribosomes had better start translating mRNA in the "correct" reading frame, because the two other possible reading frames can easily result in peptide gibberish.
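To make the "no word divider" point tangible, here is a minimal Python sketch that translates one short mRNA in all three reading frames. The sequence is made up for illustration, and `translate_frame` is a hypothetical helper, not something from the study; only the standard codon table is taken as given.

```python
# Standard genetic code, laid out in the conventional U, C, A, G order
# of the codon table ("*" marks the three stop codons).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_frame(mrna: str, frame: int) -> str:
    """Translate an mRNA string starting at offset `frame` (0, 1 or 2).

    Incomplete trailing codons are ignored; stop codons are shown as "*".
    """
    codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Invented example sequence (assumption, for illustration only).
example_mrna = "AUGGCUAAAGGUUAG"

for frame in range(3):
    print(frame, translate_frame(example_mrna, frame))
```

Only frame 0 reads as a sensible little peptide ending in a stop codon; shifting by one or two bases yields entirely different residues, which is the "gibberish" alluded to above.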
But how can ribosomes distinguish between AUG codons that signal addition of methionine (M, or Met) to a nascent peptide chain and those that are "meant" to be start codons?
E. coli solves this problem by "diversification", and here's a brief explainer. Its genome codes for two types of methionine-specific tRNAs, tRNAMet and tRNAfMet (mind the tiny "f", which means "formyl-"). Both have the CAU anticodon, complementary to the AUG codon in mRNA, and both are "charged" with methionine by methionyl-tRNA synthetase, MetG. After charging, Met-tRNAMet is "intended for immediate consumption" by translational elongation, whereas Met-tRNAfMet is first converted by transformylase, Fmt, to fMet-tRNAfMet (N‑formylmethionine-tRNAfMet). The N‑formyl group of the modified methionine prevents its use in peptide-bond formation during translational elongation, but not its use for translational initiation. However, another "diversification" is even more important. Although both methionine-specific tRNAs share the common cloverleaf structure of tRNAs, their lack of significant sequence homology leads to more-than-subtle differences in their respective 3D structures. As a consequence of these structural differences, Met-tRNAMet binds readily to elongation factor Tu (EF-Tu), and their joint complex fits neatly into the ribosomal A site. In contrast, fMet‑tRNAfMet has a high affinity for initiation factors 2 (IF-2) and 3 (IF-3), which coordinate the formation of an "initiation complex" comprising mRNA, a 30S ribosomal subunit, initiation factors (IF-1, IF-2, and IF-3), and fMet‑tRNAfMet (see here for a schematic diagram sans the IFs). In this initiation complex, the mRNA is "kept in place" through base-pairing of a short stretch, the ribosome-binding site (RBS, or "Shine-Dalgarno sequence"), with the complementary 3' end of 16S rRNA (30S subunit). This exposes the AUG start codon ~7 nt downstream on the mRNA (see here for a diagram), the position onto which IF-1 and IF-2 guide fMet-tRNAfMet.
So, when the initiation factors dissociate from the initiation complex and the 50S subunit docks on, the ribosomal P site is formed with the initiator tRNA already "in place" and the reading frame set.
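The geometry of the initiation complex (RBS, a short spacer, then the start codon) can be mimicked with a naive sequence scan. This is a rough sketch under loud assumptions: `AGGAGG` stands in as an idealized Shine-Dalgarno core (real RBSs vary and are often shorter), the 5 to 9 nt spacer window around the "~7 nt" mentioned above is my own choice, and `find_initiation_sites` and the example mRNA are hypothetical.

```python
import re

# Assumption: an idealized Shine-Dalgarno core; real sites are degenerate.
SD_CORE = "AGGAGG"
# The three start codons that together dominate in E. coli.
START_CODONS = ("AUG", "GUG", "UUG")


def find_initiation_sites(mrna, min_spacer=5, max_spacer=9):
    """Return (sd_position, start_position, start_codon) tuples.

    A site is reported when a start codon begins min_spacer..max_spacer
    nucleotides after the end of the SD core (window is an assumption).
    """
    sites = []
    # The lookahead allows overlapping SD matches to be found as well.
    for match in re.finditer(f"(?={SD_CORE})", mrna):
        sd_end = match.start() + len(SD_CORE)
        for spacer in range(min_spacer, max_spacer + 1):
            start = sd_end + spacer
            codon = mrna[start:start + 3]
            if codon in START_CODONS:
                sites.append((match.start(), start, codon))
    return sites


# Invented example: SD core, 7-nt spacer, AUG, then a few codons and a stop.
example_mrna = "UUUAGGAGGUAACAUAAUGGCUAAAUAA"
print(find_initiation_sites(example_mrna))
```

A scan like this says nothing about affinities or initiation factors; it only captures the spacing rule, which is why the experiments discussed next are needed to learn how well a given codon actually works.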
Employing two different AUG‑decoding tRNAs for translational elongation and initiation, respectively, is certainly an ingenious solution! (An aside: Archaea and Eukaryotes were no less inventive and evolved different mechanisms.) However, only 83% of all known E. coli genes have AUG start codons for the encoded protein. So, the nagging question remains why a disturbingly high percentage of genes have "alternate" start codons (14% GUG, 3% UUG, 1×AUU, 1×CUG). And E. coli is not particularly eccentric when it comes to start codons. In a number of other bacterial genomes the ratio of standard vs. alternate start codons was found to be much the same. I say "disturbingly", because ~15% alternate start codons cannot be simply dismissed as some weird "coding error" as was done when the GUG or UUG start codons for the N‑terminal methionine residues in the E. coli LacI (GUG), NusB (GUG), and Ndh (UUG) proteins were experimentally confirmed in 1974, 1981, and 1984, respectively (all 30+ years ago, think of that!).
Enter Ariel Hecht and colleagues, who asked in a recent study what would happen to the translation of a reporter gene if its start codon were replaced by each of the 64 codons, one at a time. They constructed multi-copy plasmids with a green fluorescent protein (GFP) gene as reporter, whose transcription was driven by a strong, inducible promoter. Translation of the transcripts was triggered by an RBS and, at the optimal distance ('spacer'), a fully "permutated" set of 64 start codons (Figure 1). These 64 plasmids and a non-expressing control plasmid were introduced into E. coli (no, the plasmids were not transformed into the cells, as the authors write, a mind-twisting habit found all too often in papers and heard in talks. Actually, phenotypically antibiotic-sensitive cells were transformed by the added plasmid DNA into an antibiotic-resistant phenotype that allowed selection of plasmid-carrying transformants (Avery et al. 1944)). Transformants were grown and induced under carefully controlled conditions in microtiter plates. Following induction, optical density and fluorescence were measured in the cultures. In roughly two thirds of the cultures the fluorescence exceeded the auto-fluorescence of the non-expressing control culture (Figure 2). The logarithmic scale used in this figure for the x-axis gives the visual impression of a smooth, almost continuous variation of the measured fluorescence across all tested "start codons". In actuality, the values suggest that the 64 codons would better be "binned" into four groups: three "canonical start codons" (AUG, GUG, UUG), from which translation initiated at 10 – 100% of AUG levels (and that together account for 99% of all start codons in E. coli, see above); four "near-cognates" (AUA, AUC, AUU and CUG), from which translation initiated at 0.1 – 1% relative to AUG; 40 codons, from which translation initiated at 0.01 – 0.1% relative to AUG; and 17 codons, from which translation initiation could not be detected at a level significantly above that of the non-expressing control cells. Note that no translation above background occurred from the UAA and UAG (stop) codons, and only very weak translation from the UGA (stop) codon, although in all three cases an interference with the peptide chain release factors RF1 (prfA) and RF2 (prfB) can be excluded, as these would recognize an "empty" ribosomal A site and not an initiation complex.
They took a different look at the full range of their results (from Figure 2) by projecting them onto a codon table (Figure 3). Apparently, the strongest start codons have U as the second base. NAU (N=A,U,C,G) is an unexpectedly strong set of start codons, and G as the first base results in stronger start codons than C as the first base in almost all cases. They had checked before (by prediction) that the alternate start codons do not significantly tweak the mRNA's secondary structure, and that no other in‑frame start codons are present in the sequence preceding the actual reporter gene sequence (an in‑frame GUG at the 16th codon in the GFP coding sequence would result in a truncated, non‑fluorescent polypeptide). They could therefore safely conclude that the trends they saw in the "alternate start codon table" reflect all possible variations in codon::anticodon (CAU) efficiency for translational initiation without being blurred by too much noise of translational errors. A good starting point for model building in the Ångström scale and, much later, co‑crystallization experiments!
Hecht et al. worried that their results with GFP expression from a strong promoter on a high-copy plasmid might have been biased by "over-stressing" the cell's translational capacity. In order to get closer to physiological conditions they checked a set of 12 codons (from the "bins" mentioned above) in additional plasmid constructs, with another inducible promoter, another reporter gene, and plasmid backbones with lower copy numbers (Figure 4). For various experimental flaws that they don't fail to mention, none of these new constructs worked satisfactorily (Figure 4 A,B), except for the constructs with a NanoLuc(iferase) reporter gene on mini-F plasmids (1 – 2 copies per cell). For these constructs, they measured luminescence values that reproduced the corresponding values found for the GFP reporter on high-copy plasmids (Figure 2), albeit at considerably lower signal levels (Figure 4 C). These results confirmed that the first set of measurements was in fact not biased, but I wonder why they did not choose chromosomal integration of their reporter gene in the first place. This would have allowed them to test their constructs under a full range of different physiological conditions without having to account for varying plasmid copy numbers at different growth temperatures, for example (I'm aware that it's cheap to mention such tricks post festum).
Finally, they set out to determine via mass spectrometry how translation of the reporter gene began at five selected codons with 1 – 3 "mismatches" with respect to AUG (AUC, ACG, CAU, GGA and CGC). For the experts: they used C-terminal 6×His-tags for protein purification and fragmented the proteins using Asp-N endoprotease. They recovered significant amounts of protein for all but the CGC variant (the expression level was too low for purification), and all four had intact N-termini, that is, according to the predicted sequence, and, importantly, included an N-terminal methionine. This confirmed that fMet-tRNAfMet was indeed used for initiation, a result in line with those obtained for LacI and NusB earlier (see above). They almost hid an important result in a half-sentence: "...In cultures with ACG as the start codon a small fraction of spectra (1 of 8) indicated that the N-terminal [amino acid of the] peptide might be the cognate amino acid, threonine (Mr = 119), with a mass shift of −30 Da relative to methionine (Mr = 149)" (author's addition in square brackets). This strongly suggests that free Thr-tRNAThr ("free" = not fully complexed by EF-Tu under conditions of gene expression from a multi-copy plasmid) competes with fMet-tRNAfMet for binding to the ACG start codon in the mRNA that is "kept in place" by the 30S ribosomal subunit. Apparently, the perfect codon-anticodon interaction for Thr-tRNAThr is strong enough to overcome the low affinity of the initiation factors IF-1 and IF-2 for non-initiator tRNAs with a probability of ~1:8. Conversely, this means that during initiation complex formation, the strong interaction of initiation factors IF-1 and IF-2 with fMet-tRNAfMet is sufficient to overcome the imperfect codon-anticodon interaction for the AUC, CAU, and GGA codons, and, in about 7 of 8 cases, also for ACG (whether this holds for CGC could not be tested, as too little protein was recovered).
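The two bits of arithmetic in this paragraph, the number of "mismatches" relative to AUG and the methionine-to-threonine mass shift, are easy to check in a few lines of Python. The helper `mismatches` is a hypothetical name of mine; the codons and Mr values are the ones quoted above.

```python
REFERENCE = "AUG"
TESTED_CODONS = ["AUC", "ACG", "CAU", "GGA", "CGC"]


def mismatches(codon, reference=REFERENCE):
    """Count positions at which `codon` differs from `reference`."""
    return sum(a != b for a, b in zip(codon, reference))


for codon in TESTED_CODONS:
    print(codon, mismatches(codon))

# Molecular masses (Mr) of the free amino acids as quoted in the text.
MR_MET = 149
MR_THR = 119
mass_shift = MR_THR - MR_MET  # expected shift if Thr replaces Met
print("Mass shift Thr vs. Met:", mass_shift, "Da")
```

Counted this way, AUC and ACG each differ from AUG at one position, while CAU, GGA and CGC differ at all three, so the selected set spans the extremes rather than every step in between.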
So, it might actually be better to understand the "genetic code" not as a matrix of binary yes/no decisions but each codon::anticodon:amino acid combination as an integral of the affinities of all molecules involved. And "all molecules" does not only include mRNA, tRNA and ribosome, but elongation and initiation factors as well. "All molecules" are highly dynamic during translation, and the conformational changes (mechanical work) are driven by GTP hydrolysis, which of course results in varying affinities throughout the entire process. This sounds complicated but looking at the cherished "codon table" from another perspective seems timely.
Hecht et al.'s experiments may seem like mere diligence, but, as they point out, their results have more unexpected consequences than just questioning the way we understand the genetic code: "Average per-cell abundances of proteins in bacteria and mammalian cells span five to seven orders of magnitude. Given that the non-canonical translation initiation shown in this paper spans about four orders of magnitude, it is possible that this [variance in] level of expression could be physiologically significant and may serve as an additional mechanism for controlling protein synthesis" (author's addition in square brackets). Indeed, this doesn't seem far-fetched. Regulating protein expression by tweaking the mRNA's start codon would add to the notion that cells follow a variation of Murphy's law when it comes to controlling protein synthesis: everything that can be regulated will be regulated. Historically, the variation of promoter strength by sequence alterations and regulation of promoter activity by transcription factors was observed first, think lac operon. Since then it has become clear that, in addition, mRNA half-life (codon-choice dependent mRNA secondary structure determines sRNA and RNase accessibility), mRNA "translatability" (codon-choice dependent translation influences the "drain" rate of tRNA pools), and protein half-life (amino acid-sequence dependent 3D folding determines protease accessibility) control the expression level of a protein – and, of course, this regulatory net "folds & stretches" with the cell's physiology (temperature, growth rate). That growth temperature plays an important role in gene expression is old hat, but indirect evidence suggests that this also applies to translational initiation: a mutant rIIB gene of phage T4 has the wild-type AUG start codon replaced by AUA, which results in reduced translation at 37°C in vivo and in vitro (to 10 – 15% of the wild-type level) and abolishes it completely at 42°C.
And there are even evolutionary implications, again in their own words: "...there may be evolutionary utility to translation initiation from non-canonical start codons. Research with yeast has shown gradual transitions of genetic sequences between genes and non-genic ORFs in related species. We can imagine a scenario wherein, over evolutionary time scales, point mutations could create a weak non-canonical initiation codon downstream of a RBS. The small amounts of protein produced from such an ORF, if beneficial to the organism, could select for further mutations that increased translation efficiency up to a point where the gene product more directly impacted organismal fitness." (see here in STC for an example of step-wise de novo gene evolution in S. cerevisiae ).
Finding the "right" initiation codon in an mRNA is also the task of bioinformaticians who comb through piles of DNA sequence files when annotating protein-coding genes in newly sequenced genomes. They have long since given up doing this by eye and use "annotation pipelines" instead, that is, software packages that find open reading frames (orfs) in sequence contigs, among other things. State‑of‑the‑art pipelines can deal with known variants of the genetic code, for example, the code for vertebrate mitochondria. However, even advanced programs mostly fall short of detecting "exotic" start codons like AUU – the E. coli gene for IF-3, infC, contains an AUU start codon – but they routinely find the "standard" rare initiation codons (GUG, UUG) in bacterial genomes. Not always, though, and here's an example. When you align enterobacterial sequences of the (highly conserved) replication initiator protein, DnaA, you will see virtually perfect sequence conservation but fluttering N-termini in two cases (Figure 5). For the Enterobacter cancerogenus dnaA gene, GTG is annotated as the initiation codon (GUG=V, or Val, see the codon table), and you find GTG codons at corresponding positions in the Enterobacter cloacae ATCC 13047 and Lelliottia sp. PFL01 genome sequences *). In addition, you easily spot the complete conservation of the entire N-terminal sequence up to the annotated ATG start codon. Thus, a brief look at the DNA sequence confirms that these two seemingly "aberrant" DnaA proteins are in fact "complete". If you had looked at >500 DnaA sequences – as I did, and it wasn't boring at all – you would be familiar with the occurrence of rare start codons in >30% of the dnaA genes from bacteria across all phyla. But there's a catch. As a DnaA expert, you would probably also know that even small deletions in the N-terminus render DnaA proteins incapable of dimerization via their N-termini (and thus ineffective for replication initiation, inhibitory even).
So, you have to check the DNA sequence for completeness in those cases where the automatic annotation looks flawed. I don't have to add that this caveat does not only apply to dnaA genes. The results of Hecht et al. suggest, in addition, that algorithm-driven detection of start codons will not make the manual curation of sequence data – in some rare cases even experiments, say, protein sequencing of N-termini – obsolete anytime soon.
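For a feel of what the simplest possible start-codon detection looks like, and why it needs curation, here is a toy forward-strand ORF scanner in Python. It is a sketch under stated assumptions, not how any real annotation pipeline works: `find_orfs` is a hypothetical function, only ATG, GTG and TTG are accepted as starts, the reverse strand and RBS evidence are ignored, and the example sequence is invented.

```python
# Assumption: only the three common bacterial start codons are considered;
# "exotic" ones such as ATT (AUU) would be missed, as discussed above.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, start_codon) for ORFs on the forward strand.

    For each stop codon, the most upstream in-frame start codon is used,
    which is not necessarily the start the ribosome really picks.
    `end` is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length counts codons before the stop, start included.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:start + 3]))
                start = None
    return orfs


# Invented example with a GTG start in the third reading frame.
print(find_orfs("CCGTGAAACGTCTGTAACC"))
```

Note the built-in bias: the scanner always picks the most upstream candidate, whereas a pipeline that only trusts ATG would truncate the very same gene. Either choice can be wrong, which is exactly why a look at the alignment and the DNA sequence remains necessary.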
*) Despite having a GTG start codon, the three mRNAs are translated with an N-terminal Met residue because fMet-tRNAfMet recognizes this codon. The "V" indicated in Figure 5 as the N-terminal residue of the E. cancerogenus DnaA protein is formally correct, but Met would be found when sequencing the protein's N-terminus, just as was experimentally found for E. coli DnaA, whose dnaA gene also has a GTG start codon (unpublished).