One curated database, that of the Joint Genome Institute (JGI), has collected 10,000+ complete genomes of prokaryotes and eukaryotes over the last two decades. Multiply this number by 10 for 'draft' or 'unfinished' genomes and you understand why we sometimes talk of 'the sequencing craze'. Impressive as this number is, for the even larger number of known genes I tend to wave the white flag. Consider that even the genome of Pelagibacter ubique, a tiny bacterium with one of the smallest genomes among free-living bacteria, has ~1,400 genes (1,354 annotated, that is, predicted protein-encoding genes, 1 rRNA operon, and 32 tRNA genes). Given that biology today has at best informed guesses about the number of extant prokaryotic species – the known genomes representing just a small fraction thereof – it is unquestionable that cells have made genes in abundance during their evolution. It seems almost frivolous then to ask: do contemporary cells still make new genes from scratch (that is, not by gene duplication and divergence, or gene fusion/domain swapping)? And if so: how could we possibly detect such novel genes 'in the making'?
To tackle these questions I should briefly recap what biologists mean when they talk of 'genes' (for a more thorough discussion see here; the historical development (evolution?) of the term 'gene' is neatly summarized in this infogram). Basically, a gene is a stretch of genetic material (double-stranded DNA (dsDNA) in the case of cells, dsDNA/ssDNA or RNA in the case of viruses) that can potentially be expressed. The first step in expression is to generate a transcript, a single-stranded RNA (Figure 1). Such transcripts can have one of two fates. They can serve as messenger RNA (mRNA), blueprints that instruct ribosomes to synthesize a polypeptide chain, a protein, according to the succession of bases of an open reading frame (ORF) that the ribosomes together with tRNAs decode as triplets (Fig. 1, left part). Or they are enzymatically processed to yield ribosomal RNAs, tRNAs, a plethora of small regulatory RNAs (miRNAs, siRNAs, antisense RNAs, to name a few), or long non-coding RNAs (lncRNAs) (Fig. 1, right part). To make things more complicated, some transcripts can serve both as non-coding RNA (ncRNA) for further processing to yield a small regulatory RNA and as mRNA coding for a small polypeptide. But in either case, the primary transcripts are always longer than required to let cells express the encoded function(s) (thus the stippled line I added on both ends of the 'gene' in Fig. 1). For example, the long transcripts of the rrnB operon in E. coli yield, after processing, the mature 16S, 23S, and 5S rRNAs and one tRNA.
In mRNAs, decoding of the ORF that yields the "expressed" protein requires not only the Start and (untranslated) Stop codons but in addition so-called 5'-UTRs and 3'-UTRs, untranslated regions of varying length that carry docking and un-docking signals for the ribosomes, respectively, and, in eukaryotic mRNAs, the 5'- and 3'-splice sites for the removal of introns (the unprocessed primary transcript is therefore usually called 'pre-mRNA').
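To make 'just an open reading frame' concrete, here is a minimal Python sketch of how one might scan a DNA strand for ORFs, reading triplets from a Start codon to the first in-frame Stop codon. It assumes the standard genetic code (ATG as Start; TAA, TAG, TGA as Stop) and looks at the given strand only; the function name `find_orfs`, the `min_codons` threshold, and the toy sequence are illustrative inventions, not taken from any of the studies discussed here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return a sorted list of (start, end, n_codons) for ORFs on this strand.

    start    : index of the A of the ATG
    end      : index just past the Stop codon
    n_codons : sense codons, counting the Start but not the Stop
    Only the first ATG after the previous in-frame Stop opens an ORF;
    ORFs that run off the end without a Stop are not reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        open_start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if open_start is None:
                if codon == START_CODON:
                    open_start = i              # ORF opens here
            elif codon in STOP_CODONS:
                n_codons = (i - open_start) // 3
                if n_codons >= min_codons:
                    orfs.append((open_start, i + 3, n_codons))
                open_start = None               # ORF closed, keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGGCTAAATGACC"                    # made-up sequence
    for start, end, n in find_orfs(toy):
        print(start, end, n, toy[start:end])
```

A scan like this says nothing about expression: whether a promoter, a ribosome docking site, or a terminator is present has to be established separately.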
This fuzziness of the actual 'borders' of genes doesn't end here; it gets even worse! Genes can only be expressed – and when we are looking for novel genes made from scratch they should be expressible, and not be 'just open reading frames' – if the flanking regions contain binding sites for RNA polymerase (promoters, with transcription start sites) and also detachment sites (terminators). Although these regulatory DNA regions vary considerably in length, promoters are, conveniently, located upstream of genes because RNA polymerases synthesize transcripts in the 5'→3' direction (one of the rare facts in molecular biology without exception, yeah!), but they may well lie within the coding region of an upstream gene. As promoters are, in most cases, not simple on/off switches, 'promoter regions' tend to be decorated with binding sites for activating and/or repressing transcription factors (recall the lac operon promoter, with binding sites for the LacI repressor and the Crp activator) or come in tandem. For eukaryotes, delimiting the boundaries of genes the classical way, that is, in two dimensions along a stretch of DNA, becomes obsolete because they expertly loop their chromosome domains in such a way that transcription factor binding sites (enhancers) are brought into contact with a transcription start site even when located at distances of >100 kb (!) (for an example, see here in STC). (An aside: prokaryotes also organize their chromosome(s) in larger loops, but the extent to which this contributes to regulation of gene expression is presently debated.) All this jazz for the fine-tuning of gene expression in cells... and for dazzling molecular biology students.
In their 2008 study, Wen Wang and collaborators described, as a case in point, their detection of a novel gene in the yeast Saccharomyces cerevisiae. They focused on the protein-encoding gene BSC4, for which they were unable to find homologs in closely related Saccharomyces species by protein database searches. Also, in hybridization experiments with a BSC4 DNA probe they did not get signals in the chromosomal DNA of S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, four yeast species closely related to S. cerevisiae (Figure 2). As this result did not exclude the possibility that S. cerevisiae had acquired this gene by horizontal gene transfer (HGT), the researchers studied the synteny, that is, the gene context of the BSC4 gene in S. cerevisiae, other yeasts, and selected Ascomycetes (Figure 3). The genes flanking BSC4, LYP1 and ALP1, are paralogs, that is, genes that arose through duplication of an ancestral gene, concomitant with a head-to-head arrangement of the duplicates and including a ~800 bp long intergenic region (LYP1 and ALP1 retain 63% identity at the amino acid level but diverged functionally to a lysine permease and asparagine permease, respectively). The specific arrangement of LYP1 and ALP1, including the intergenic region, can be traced back to the last common ancestor of S. cerevisiae and Ashbya gossypii, whose lineages separated from each other >100 million years ago. The LYP1–ALP1 intergenic regions of S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus are of comparable lengths and contain several blocks of 30–40 well-conserved residues, but show no more than ~36% overall similarity on the DNA level due to numerous small indels. These findings explain the lack of hybridization signals and, which is the point here, make it unlikely that the BSC4 gene was acquired by S. cerevisiae via HGT.
Wen Wang et al. successfully identified the BSC4 transcript and, applying the RACE technique, found it to be considerably longer than expected (a 3'-UTR of 512 bp instead of the ~200 bp usual in yeast). This was interesting because it had been found earlier that low-level translational readthrough (TR) at the BSC4 Stop codon results in a protein that is longer by 107 residues than the 131 residues of the annotated ORF (this longer ORF also had no protein homologs in the database, as of April 2017, so it's a true ORFan). What was even more interesting: they found by RT-PCR (cDNA synthesis combined with PCR) that the LYP1–ALP1 intergenic regions are also actively transcribed in S. paradoxus, S. mikatae, and S. bayanus (Figure 4). In contrast to S. cerevisiae, however, there are no detectable ORFs in these intergenic regions, and any function(s) of these transcripts are presently unknown.
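Translational readthrough simply means that the ribosome occasionally ignores the first Stop codon and keeps decoding triplets until it hits the next in-frame Stop; with the numbers given above, the readthrough product would be 131 + 107 = 238 residues long. The following Python sketch illustrates the bookkeeping on a made-up sequence. The function name `readthrough_extension` and the toy DNA are hypothetical, the standard Stop codons are assumed, and whether the read-through Stop position itself is counted as a residue is deliberately left out of the returned numbers.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def readthrough_extension(dna, orf_start):
    """Count codons of an ORF and of its potential readthrough extension.

    Returns None if no in-frame Stop is found, otherwise a tuple
    (annotated_codons, codons_between_stops):
      annotated_codons     : sense codons from orf_start up to the first Stop
      codons_between_stops : sense codons between the first and the second
                             in-frame Stop, or None if there is no second Stop
    """
    dna = dna.upper()
    stops = []
    for i in range(orf_start, len(dna) - 2, 3):   # walk the frame in triplets
        if dna[i:i + 3] in STOP_CODONS:
            stops.append(i)
            if len(stops) == 2:
                break
    if not stops:
        return None
    annotated = (stops[0] - orf_start) // 3
    if len(stops) < 2:
        return annotated, None
    between = (stops[1] - stops[0]) // 3 - 1      # exclude the first Stop itself
    return annotated, between


if __name__ == "__main__":
    toy = "ATGGCTTAAGGTCCATGA"   # ATG GCT TAA GGT CCA TGA, made-up sequence
    print(readthrough_extension(toy, 0))
```

For a real transcript one would take `orf_start` from the annotation and run this on the mRNA-sense strand; the sketch makes no claim about how often readthrough actually happens.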
The 'experimental protocol' followed here by S. cerevisiae seems straightforward though time-consuming: to make a novel protein-coding gene, a distant ancestor has to acquire a short stretch of several hundred base pairs of DNA in the first place, in this case by a recombination event that resolved an accidental duplication (inverted duplications are not uncommon 'accidents' in yeast). Transcription of this stretch of DNA must start at some nearby promoter. Then, multiple mutations within this transcribed stretch eventually allow for the translation of the transcript, or parts of it. If this translation product, the protein, contributes a favorable functionality – even if only incremental – to any of the existing metabolic pathways of the cell, the coding region is further optimized by selection (again by additional mutations, of course, one or two at a time). If, on the other hand, the protein turns out to mess with existing functions, it will soon be pseudogenized (for example, by accumulation of premature stop codons), or deleted if the transcript itself is dispensable for the cell.
So what might be the function of BSC4? Digging in a collection of yeast transcriptomic data for different growth conditions, Wen Wang et al. found a steep increase of BSC4 transcription during stationary phase. Interestingly, the transcription pattern of BSC4 correlates partially with that of RPN4, a transcriptional activator of various DNA repair genes and proteasome genes in yeast (Figure 5). Both genes, BSC4 and RPN4, had been found in a separate study by Pan et al. to be synthetically lethal (single knockout (KO) mutations are viable but not the double knockout), which led Wen Wang and his collaborators to suggest that BSC4 may also function in a DNA repair pathway (note that in the Pan et al. paper you will find BSC4 by its systematic ORF name YNL269W, as I was kindly informed by Jef Boeke, the corresponding author).
The answer to the initial question whether cells still make new genes is: yes, they do. For eukaryotes, several de novo genes are now well established, and, although they appear to be rare, we can safely expect that more will be revealed by future studies. An answer to the follow-up question of whether prokaryotes also still make new genes would be tl;dr (= too long; didn't read) here, and will be dealt with in a couple of sequels.
Frontpage picture by Buddhini Samarasinghe, from her blog jargonwall.