by Janie
Figure 1. A few pieces in the grand puzzle of lifeforms. Source: Janie Kim. Frontispiece: DNA containing an unnatural base-pair, X-Y. Source
What had once seemed discrete has turned fuzzy around the edges these days, whether work versus home, or minutes versus days (and pajamas versus all other alternatives). But there's been a sliver of a silver lining in this en masse jumbling. Having recently taken a class on nucleic acid chemistry and, after looking for a way to take my mind off things, now trying to learn Esperanto from Duolingo, I've been gently reminded of how much of a language the genetic code is. Much like “artificial” languages such as Esperanto or Tolkien's fifteen (!) Elvish tongues, genomes, too, are subject to manipulation and modulation by the scientist's hand.
There's been a smattering of posts here on the blog that touched on synthetic genomes, but the last post devoted to the topic dates all the way back to May 2009, not long after Craig Venter announced his first synthetic Mycoplasma genome. In that piece, Shmuel Razin wrote about Harold Morowitz's visionary 1964 proposal to synthesize a living cell from its components, and then about Venter's later announcement.
Much more nucleic acid tinkering has transpired in the years since 2009. So, here's a look at where the field of build-your-own-genome, bacterial edition, stands now as of 2020, with a little spritz of linguistics.
Linguistics and Nucleic Acids, From Loanwords to Wordplay
The intertidal zone between nucleic acids and human languages bustles with analogies. There's a lot shared by all those strands of spiraling chemical lacework and the words that roll off our tongues.
Both genomes and human languages possess heritable units and a hierarchy of building meaning, starting from letters to codons/words to genetic elements/phrases and so on, subject to rules that can be framed mathematically. In the way a single insertion or deletion of a nucleotide alters the function of a gene, a single comma makes all the difference in meaning and morbidity between "Let's eat, Grandma!" and "Let's eat Grandma!" Gene duplication leading to functional diversification is not unlike linguistic reduplication: in the Maia language spoken in parts of Papua New Guinea, 'maia' means 'thing' and its reduplicated form 'maiamaia' came to mean the plural 'things.' Integration of foreign gene snippets through horizontal gene transfer, too, has a linguistic equivalent: the Korean word for 'trench coat,' which is pronounced buh-buh-ree (버버리), is a loanword of the English brand name Burberry. All those modular compound words in German are evocative of NRPS pathways. Parallels can even be drawn between epigenetic modifications and language: the sequence of letters doesn't change, but the context that the word is in shapes its meaning, as in 'lead' absorption by bacteria could 'lead' to bioremediation. Perhaps even wordplay and puns can be thought of as having a biological analogy: a split between the face-value meaning and the meaning that is actually interpreted, like mRNA transcripts pre- and post-processing, to cap it all off (bad pun #1).
An especially fascinating shared trait: both languages and genomes evolve. Both exist in a dynamic equilibrium that balances size and complexity against efficiency. Both grow and shrink with changing circumstances, whether through the genome reduction of an endosymbiont or the shrinking of the English lexicon, driven by the trade-off between size and energy cost.
Figure 2. Photo 51, entangled in the Franklin/Watson/Crick saga, kicked genetics into high gear in the mid-1900s. The image that started it all. Source
Both are also probed by the relentless human curiosity about the bare bones: what fundamental properties make them tick? Linguists in pursuit of linguistic universals and the ever-elusive Universal Grammar, piecing apart languages into their simplest units... Scientists moving from elucidating the structure of DNA and deciphering the genetic code all the way to synthesizing brand new nucleobases and from-scratch minimal genomes…
How far can this genetics analogy be taken? The crossover (bad pun #2) between synthetic genomics and Elvish goes pretty far.
Changing the Genetic Alphabet: Unnatural Base Pairs
The genetic "alphabet" classically consists of four letters and the modern English alphabet consists of twenty-six. The diversity that arises from such a small set of nucleic acid building blocks is astounding, but it is not the end-all. This open-endedness was embraced by chemists in the second half of the last century. By understanding how standard Watson-Crick base pairing works, could an entirely new set of nucleobases be designed to function smoothly in replication, transcription, and translation? Perhaps A-T and C-G base-pairing are not so special.
Figure 3. “DNA”, courtesy of xkcd. Source
Replace Tolkien with a scientist, the pen with a pipette, and you have a recipe for a biological fantasy language. The field of unnatural base pairs (UBPs) was sparked to life in 1962 when Alexander Rich first proposed the possibility of a third base pair. His idea was picked up years later and materialized by Steve Benner and his group, who shuffled around hydrogen bond donors and acceptors on nucleobases to generate new noncanonical bonding patterns. UBPs like these could be integrated into DNA and RNA by their respective polymerases, and the nucleobase alphabet was thereby expanded.
But hydrogen bonding between bases was not all that it was chalked up to be. As it turned out, polymerases were happy to work with non-polar, hydrophobic analogs of nucleobases. The determining factors for incorporation by polymerases were found to be base-stacking and steric fit, and hydrogen bonding was knocked off its pedestal. The UBP playing field had at that point been unexpectedly expanded to include hydrophobic bases.
Figure 4. A sampling of nucleobases, both natural and unnatural. Source
More recent advances have embraced both hydrophobic pairs and hydrogen-bonding pairs, ironing out initial wrinkles with the original Benner system. The Romesberg group sifted through multitudes of UBPs to discover that the hydrophobic 5SICS and NaM pair outperformed its predecessors in PCR and in vitro transcription. The development of 'hachimoji' DNA and RNA came even more recently, which added four more synthetic nucleotides. With hachimoji, the letter count now sits at a provisional eight, poised to change as more "xeno-nucleic acids" trickle into the lexicon.
Ever the genetics parallel, the English A-to-Z has also been in flux. An interesting historical aside: several letters have been removed or added since Ye Olde days of Chaucer, a trend visualized in antique children's ABC books, which sometimes included the ampersand "&" for a total of 27 letters (and see here for the history of the "Ye" of "Ye Olde Shoppe" fame, its origins tied with a mix-up involving what is now an obsolete letter of the alphabet). Evolution in all its natural and unnatural forms is relentless.
Designer Genomes
Unnatural bases and expanding coding capacity comprise one subset of the field of synthetic genomics, and cobbling together whole genomes is another. In contrast to the expansionist ambitions of the UBP world, genome synthesis moved along a more reductionist trajectory.
Following Khorana's 1972 synthesis of a complete gene, the yeast alanine tRNA, came revolutions in sequencing and DNA synthesis technologies. Genome after genome was sequenced until the focus shifted at the turn of the century, redistributing interest from sequencing toward synthesizing genomes. The year 2002 saw the first synthetic genome: the de novo chemical synthesis of the 7.4-kb poliovirus genome. Far behind were the days of skyscraper sequencing gels and of pointillism in radioactive black.
In 2008, Craig Venter and his lab unveiled a synthetic Mycoplasma genome, all 583-kb worth of spiraling ladders stitched together by a combination of ligases, yeast intermediates, and forty million dollars. The inevitable leap from in vitro to in vivo came just two years later. In 2010, Venter announced the first self-replicating Franken-cell: a lab-synthesized Mycoplasma mycoides genome encased in a Mycoplasma capricolum husk. Putting together this 1.08-Mbp genome involved splurging on over a thousand commercially manufactured sequences, some containing "watermarks" spelling out an email address, people's names, and quotations, to distinguish the artificial genome from the natural one.
As whole-genome synthesis reached one milestone after another, interest in genome minimization also grew in earnest: what is the absolute minimum genetic information a free-living cell needs to survive and replicate? Thus began the reductionist quest for the bare-bones genome.
In 2016, Venter announced Syn3.0, a Mycoplasma with its genome whittled down to a mere 473 genes in a total of 531 kbp, the fruit of many rounds of transposon mutagenesis screens. Starting with Mycoplasma genitalium, the free-living bacterium with the smallest genome then known at a mere 525 genes, they had stripped away 10% of its genes. For comparison, the E. coli genome consists of approximately 4,000 genes in a total of 4.6 Mbp. The minimalist's game of limbo (how minimal can you go?) was well on its way.
But the cost of DNA synthesis is a major bottleneck to progress in the field. Fabricating long strands of DNA extending thousands of bases de novo is a finicky affair, hence the forty-million-dollar price tag on the Venter Mycoplasma. If synthesis costs could be lowered, progress in whole-genome synthesis would be much faster. To this end, researchers at ETH Zurich developed a computer program to generate a synthesis-friendly bacterial genome. Their program performed a major overhaul of the 785-kb "essential genome" of Caulobacter crescentus by cutting out sequences that hinder synthesis, such as repetitive regions and regions of high GC content, and by speeding through the genetic thesaurus to swap out over 10,000 bases and insert synonyms for over 120,000 codons. The payoff? The Swiss Caulobacter required a small fraction of the resources of its 2008 Mycoplasma predecessor. Ringing in at $123,000, the project was a mere 0.3% blip of the Venter cost, and took just one year, as opposed to twenty, to complete.
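For the curious, here is a toy sketch of what "speeding through the genetic thesaurus" might look like in Python. It is not the ETH program: the greedy rule (pick the synonym with the fewest G and C bases) and the function names `translate`, `gc_count`, and `lower_gc` are my own assumptions, chosen only to show that a sequence can be rewritten without changing the protein it spells. It assumes the input is a single in-frame coding sequence and uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence into one-letter amino acids ('*' = stop)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def gc_count(codon: str) -> int:
    """Count the G and C bases in a codon."""
    return sum(base in "GC" for base in codon)


def lower_gc(seq: str) -> str:
    """Hypothetical greedy recoder: swap each codon for its lowest-GC synonym.

    Ties are broken in favor of the original codon, then by table order.
    """
    recoded = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        synonyms = [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[codon]]
        recoded.append(min(synonyms, key=lambda c: (gc_count(c), c != codon)))
    return "".join(recoded)


original = "ATGGCCCGGCTGTAA"          # made-up example: Met-Ala-Arg-Leu-stop
rewritten = lower_gc(original)
print(rewritten, translate(rewritten) == translate(original))

# The cost comparison quoted above, as plain arithmetic.
cost_fraction = 123_000 / 40_000_000   # about 0.003, i.e. roughly 0.3%
```

A real design tool would also have to worry about repeats, regulatory sequences hidden inside genes, and codon usage, none of which this sketch attempts.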
The genomes built from scratch mentioned so far range in size from kilobases up to the 1.08-Mbp Mycoplasma. The year 2019 marked the toppling of that record. In work published just one month after the Swiss study, researchers in the UK and elsewhere produced a 4-Mbp E. coli genome with only 61 codon triplets, a result of 18,214 synonymous codon substitutions. To assemble this simplified genome, the group took a page from the synthetic chemists' book. In a retrosynthesis of sorts, they split their recoded genome into fragments, synthesized these 10-kb pieces, and then assembled their way back up to the full genome.
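The recode-split-reassemble idea can be sketched in a few lines of Python. Everything here is illustrative: the `RECODE` mapping (two serine codons and one stop codon swapped for synonyms) is an assumption of mine about how 64 codons might be trimmed to 61, the helper names are hypothetical, and "assembly" is plain string concatenation rather than anything a cell or a yeast intermediate would actually do.

```python
# Assumed, illustrative mapping: three codons are retired in favor of synonyms,
# leaving 61 of the 64 triplets in use.
RECODE = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}


def recode(seq: str) -> str:
    """Replace every retired codon in an in-frame sequence with its synonym."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(RECODE.get(codon, codon) for codon in codons)


def split_fragments(genome: str, size: int) -> list[str]:
    """Cut a sequence into consecutive pieces of at most `size` bases."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def assemble(fragments: list[str]) -> str:
    """Toy assembly: join the pieces back together in order."""
    return "".join(fragments)


gene = "ATGTCGTCAAAATAG"                 # made-up 15-base example
recoded = recode(gene)
pieces = split_fragments(recoded, 6)     # the real pieces were about 10 kb
print(recoded, pieces, assemble(pieces) == recoded)
print(64 - len(RECODE))                  # codons still in use
```

In a real genome the swap would apply only inside open reading frames, in the correct frame, and overlapping genes would need special handling; the sketch ignores all of that.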
The UBP-Genome Merger and a Sighting of Xenobiology
In the last decade, UBPs and synthetic genomes finally met on their inevitable collision course. As reported in a 2014 publication, the unnatural pair d5SICS-dNaM was first incorporated into a plasmid with otherwise normal A-T and C-G pairs. This plasmid was then introduced into a special strain of E. coli equipped with an NTP transporter from the marine diatom Phaeodactylum tricornutum, of biofuel fame. When provided with the UBPs exogenously, the resulting bacterium could funnel them in through its NTP transporter and copy the plasmid with endogenous machinery, and just like that, go to town metabolizing and jiving and doing its replicating thing.
Figure 5. Translation with an unnatural codon, anticodon, and amino acid. Source
The next bomb in synthetic genomics dropped only a couple years later. Several labs, including Romesberg’s, began creating semi-synthetic bacteria whose unnatural nucleobases themselves encoded unnatural amino acids. All within the confines of a living bacterium, RNA polymerase transcribed mRNA and tRNA containing the UBPs, the ribosome decoded unnatural codons with the right unnatural anticodons, and the right unnatural proteins with unnatural amino acids were produced at the end of the day. Here was a shiny new-fangled reboot of the Central Dogma.
These semi-synthetic organisms drew a great deal of media attention and speculation that the next foreseeable step might be to drop the "semi." How long would it be until the first fully synthetic free-living microbe with a purely UBP genome?
A concise answer: a long time. The technical issues are tricky, as Prof. Ralph Kleiner, a chemistry professor at Princeton University who taught the aforementioned nucleic acids class, described: one of the major hurdles is coaxing cells to metabolize the unnatural building blocks necessary for UBP installation. Romesberg's work with NTP transporters represents one possible solution, although it is too early to say how general this strategy will be, and it would be far preferable to supply simpler building blocks like artificial nucleosides rather than the fully formed nucleoside triphosphates. But even with a reliable system for UBP installation, in order for these pairs to be useful, they need to be uniquely recognized by the protein translation machinery and accessory factors. From compatibility with the ribosome and the aminoacyl-tRNA synthetases to codon orthogonality, there is a lot of intracellular circuitry to align with. Remember, Nature has had billions of years to fine-tune the current system. But even with just one pair of unnatural bases, a cell's coding capacity jumps from 4³ to 6³, that is, from 64 to 216 possible codons, with a parallel jump in potential applications.
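The arithmetic behind that jump is simple enough to check in Python: with triplet codons, an alphabet of n letters allows n × n × n combinations. The function name `codon_count` is just a label for this sketch, and the counts are theoretical maxima, not the number of codons a cell can actually decode.

```python
def codon_count(alphabet_size: int, codon_length: int = 3) -> int:
    """Number of distinct codons of a given length over an alphabet."""
    return alphabet_size ** codon_length


natural = codon_count(4)     # A, C, G, T
one_ubp = codon_count(6)     # four natural letters plus one unnatural pair
hachimoji = codon_count(8)   # the eight-letter hachimoji alphabet

print(natural, one_ubp, hachimoji)
```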
As of 2020, the highest published number of unnatural codons that can be simultaneously decoded in a living cell stands at three, for a total of 67 decoded codons. There is a slight decrease in protein yield by the semi-synthetic cell, albeit a relatively insignificant decrease compared to that in cells engineered with stop codon suppression. This highlights another issue to overcome: hijacking the system often leads to lower efficiency, although one group is tackling this problem by subjecting E. coli with reduced genomes to adaptive laboratory evolution.
Some Extra Thoughts, Scattered Across the Reality-to-Science-Fiction Spectrum
Looking past the showmanship that seems at times to tinge the field, finding minimal genomes and creating minimal bacteria will help us understand fundamentals about what makes life tick. Perhaps DIY-ing bacteria that are just-alive will further illuminate endosymbionts, their drastically reduced genomes, and their physiology. Import of nucleotide building blocks, for example, is a process shared by the Romesberg group's d5SICS-dNaM E. coli, endosymbionts, and intracellular parasites. Indeed, the organisms shortlisted for transporter-borrowing included the amoeba endosymbiont P. amoebophila. To co-opt Richard Feynman's words as synthetic biology co-opts parts of organisms: "What I cannot create, I do not understand." What does it take at the genome level for a cell to forfeit its independence and take up permanent residence within another, in what seems a case of microbial Stockholm Syndrome?
There's also more to gain from the recent codon-swapping experiments done by the Swiss and UK groups than cheaper bacterial genome synthesis. Enter synthetic viruses. Manmade viruses, at the core of conspiracy theories running amok these days, have historically been produced with a more benign motive: developing vaccines. For example, swapping codons in the poliovirus genome to run counter to its normal codon bias attenuates virulence, and this customized ineffectual virus can be used in vaccines. From poliovirus to influenza to Ebola, recreating and recoding viral genomes allows researchers to develop and test drugs, vaccines, and diagnostics. The same holds for SARS-CoV-2, as labs work to recreate the virus.
This theme of piecing apart and assembling genome sequences is reminiscent of the new mechanical vision of life illustrated in Membranes to Molecular Machines. The book was previously covered on the blog here. Truly, the "cell simulacrum" has come a long way since the 1970s liposomes, those little ATP-producing globules jigsawed together from chicken eggs, soybean plants, beef heart, and Halobacteria.
Today's semi-synthetic microbes could "serve as a platform for the creation of new life forms and functions," as so starkly worded in a publication from the Romesberg group. But a bacterium with a completely unnatural alphabet still seems a long way off. It's an idea evocative of world-building in fiction, with structures spun into existence by imagination and invented words. Maybe a fully synthetic genome is a pot of gold at the end of the rainbow, but in the meantime, we can imagine. And we can continue to write speculative fiction stories, which so often slip into reality years down the line…