by Jamie Henzy
Although it's a regular occurrence for bacteria to acquire new genes whole cloth, by horizontal gene transfer, such events are rather rare for complex eukaryotes like us. Eukaryotes are vastly more inclined to take advantage of gene duplication events (see Christoph's series on gene generation tactics of prokaryotes and eukaryotes). An exception to this rule is the presence in vertebrate genomes of numerous retroviral sequences (endogenous retroviruses, or ERVs, for short) that were acquired presumably when a retrovirus infected a sperm or egg cell. Retroviruses make a double-stranded DNA copy of their genome and insert it into the host DNA to be replicated by the usual host replication machinery. One would imagine that an actively replicating retrovirus would interfere with the development of the fertilized egg, and it probably does more often than not. But sometimes the host can effectively silence the provirus (as the viral sequence is known post-insertion), or the provirus has a mutation that prevents its replication, and then the egg develops into a wee baby with half its mother's genes, half its father's genes, plus a wee bit more — a copy of the retrovirus in every nucleated cell of its body. Sounds like horizontal gene transfer to me.
What happens next? The broken or silenced ERV, freed from purifying selection, will begin accumulating mutations, hampering its potential for replication. During an initial honeymoon phase, however, it may replicate just enough to reinfect some new cells or copy itself into new genomic locations, or it can be copied by other retrotransposons such as LINE-1 elements, with the same result: copies in new locations. Eventually the sequence, behaving generally as a neutral allele, will be lost or fixed according to the whims of genetic drift. The amount of time that chance deliberates between fixation and loss varies with the effective size of the population and the length of the generation time. If we take the average generation time for humans to be twenty years (likely applicable for most human history, though it is lengthening in the present), then a neutral allele is expected to take ~800,000 years to reach fixation. During this unsettled phase, some individuals may carry an ERV where others have an unoccupied site. This state is known as insertional polymorphism.
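For readers who want to see where a figure like ~800,000 years can come from, here is a minimal sketch. It uses the textbook approximation that a new neutral allele that does reach fixation in a diploid population takes about 4 × Ne generations to get there, where Ne is the effective population size. The value Ne = 10,000 is an assumption on my part (a commonly quoted long-term estimate for humans); the post itself only states the twenty-year generation time, and the function name is purely illustrative.

```python
def neutral_fixation_time_years(effective_pop_size, generation_years):
    """Approximate expected time for a new neutral allele to reach fixation,
    given that it fixes at all: about 4 * Ne generations (diploid population)."""
    generations = 4 * effective_pop_size
    return generations * generation_years


# Assumed effective population size (not stated in the post): 10,000.
# Generation time from the post: 20 years.
years = neutral_fixation_time_years(effective_pop_size=10_000, generation_years=20)
print(f"{years:,} years")  # prints "800,000 years"
```

Note that this is the expected wait for the lucky alleles that fix; most new neutral insertions are lost long before then.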
Another insertionally polymorphic state for ERVs involves a sleight of hand based on a recombination event. When a retrovirus first inserts, the gene-coding region is flanked by long terminal repeats (LTRs) of several hundred to a thousand base pairs that are identical. These long regions of identity frequently recombine such that the intervening region is deleted and what is left is a so-called "solo LTR". In fact, the odds of this happening are as high as one hundred to one. Thus, the vast majority of ERVs in our genomes exist as solo LTRs. However, some individuals may carry a solo LTR where others harbor either an unoccupied site or a full-length ERV.
As the LTRs acquire random mutations, they become less similar, making them less likely to mediate recombination. Thus, it's possible for an ERV to make it through this period of adolescent recklessness and survive as a full-length provirus. Regardless of whether LTR recombination occurs, mutations bombard the ERV sequences and, like meteors pockmarking the moon's surface, eventually pulverize the information content. So, the possible states of an ERV locus include three main forms: a full-length copy with two LTRs and coding sequence between; a solo-LTR; or an unoccupied site, all in various stages of sequence degeneration, including deletions that may shorten the sequences significantly (Figure 1).
Among the sixty-plus families of ERVs found in the reference human genome, only one is known to have insertionally polymorphic loci: HERV-K(HML-2) ('HERV' for human ERV, 'K' for the lysine tRNA used to prime reverse transcription, and 'HML' for Human MMTV-Like, reflecting the virus's relatedness to Mouse Mammary Tumor Virus). HERV-K began infecting the germline of our primate ancestors ~35 million years ago, and hung in there, continuing to expand, either by reinfection or retrotransposition, even after the human-chimp divergence ~6 million years ago. This means that we share some HERV-K loci with chimps, but have others that are human-specific. More than 120 HERV-K insertions have become full citizens of human genomes, fixed in the population as a whole, so that every person carries them. However, at least a few copies of this family appear to be such recent arrivals that they are still drifting in the population, not yet having met their fate of fixation or loss (green card holders?). This situation has led to the idea that HERV-K is mediating an ongoing, smoldering infection of the human genome even today.
However, the true extent of polymorphism has been difficult to assess. Why? Alleles at drift in a population vary in frequency, and the less frequent copies will go undetected unless a large enough sample is examined. You must pick through a whole heap of genomes to find them. However, copies that are not yet fixed are also the most recent, and therefore the most likely to still have functional elements — open reading frames (ORFs) from which proteins can be produced, or intact promoter regions that can interfere with transcription of nearby host genes. Therefore, these rare copies are the very ones that would be the most likely culprits if HERV-K were involved in disease.
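To make the sampling problem concrete, here is a rough sketch of how the chance of seeing a rare insertion at least once grows with the number of genomes examined. It rests on simplifying assumptions of my own: individuals are sampled at random, each diploid genome contributes two independent chromosomes, the locus is autosomal, and an insertion is always detected when present. The frequency and sample sizes are illustrative, and the function name is hypothetical.

```python
def detection_probability(allele_freq, n_genomes):
    """Probability that at least one of the 2 * n sampled chromosomes
    carries an insertion whose population frequency is allele_freq."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    chromosomes = 2 * n_genomes  # diploid: two copies of the locus per genome
    return 1.0 - (1.0 - allele_freq) ** chromosomes


# Illustrative only: an insertion present on 0.05% of chromosomes.
for n in (10, 100, 2500):
    print(n, round(detection_probability(0.0005, n), 3))
```

Under these assumptions, ten genomes give only about a 1% chance of catching such an insertion, while 2,500 genomes push the chance above 90%.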
Last year a group set out to hunt down these elusive, unfixed ERVs. Their haystack consisted of >2500 sequenced human genomes from the 1000 Genomes Project and the Human Genome Diversity Project. Sifting through this heap they spotted some previously unidentified rare insertions, plus some not-so-rare insertions that just happened to be missing among the handful of genomes that were pieced together for the Human Genome Project. In fact, compared to the reference human genome sequence, they identified 36 "nonreference" HERV-K polymorphisms (remember, we're talking about insertional polymorphism) that ranged in frequency from <0.05% to >75% of the population. "What?" you ask. How did researchers previously miss copies that were in >75% of the population? Simply because the few genomes behind the reference sequence happened to lack them. The group also found the converse: empty sites in some individuals at loci that are occupied by HERV-K insertions in the reference human genome and had wrongly been assumed to be fixed. In other words, the frequency of those reference insertions is not 100% after all, but less.
Generally, African populations were found to have the highest frequencies of nonreference HERV-K alleles, and all of them but two were confirmed in these populations. This finding is consistent with the Out-of-Africa theory, whereby modern humans evolved in Africa ~200,000 years ago, with groups migrating outward 50,000 – 100,000 years ago. Migrating groups carried only a portion of overall human genetic diversity with them, and subgroups of these groups carried even less, creating a progressive decrease in diversity in populations according to their distance from Africa. Thus, African populations remain the most genetically diverse, and their HERV-K alleles are no exception. Similarly, the presence of all but two HERV-K alleles in African populations indicates that these insertions occurred before subgroups began migrating outward.
Other HERV-K insertions were found to match those reported in Neanderthals and Denisovans, archaic humans with whom modern humans interbred after migrating from Africa. An outcome of such encounters is 2 – 3% Neanderthal sequences in the genomes of non-African populations, and as much as 5% Denisovan sequences in Pacific Islanders (Figure 2). So, were some of these HERV-K sequences acquired from our kissing cousins? Unlikely, since nearly all the insertions are also found in indigenous African populations, which did not interbreed with Neanderthals and Denisovans.
The group also found lurking within some X chromosomes a previously unreported full-length provirus, LTRs and all, with intact ORFs across all the viral genes (Figure 3). Only one other intact full-length ERV has been reported in the human genome. Their newly-discovered ERV is very rare, and most prevalent in African populations. Because LTRs are identical upon insertion, counting the mutations that differ between them and calibrating this to the mutation rate allows a rough estimate of time spent in the genome. By this neat trick, the ERV probably inserted sometime between 670,000 and 1.3 million years ago. With no apparent inactivating mutations, this provirus may very well be capable of replication. As the authors point out, even if it remains silenced by the host under normal conditions, disease states that interfere with silencing could result in its resurrection as an infectious retrovirus. Needless to say, this one is being looked at more closely in the lab!
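The LTR dating trick described above can be sketched in a few lines. Because the two LTRs start out identical and then each accumulates mutations independently, the per-site divergence d between them corresponds to roughly twice the elapsed time, so T ≈ d / (2 × μ), where μ is the substitution rate per site per year. Everything numeric in the example is hypothetical: the difference count, the LTR length, and the rate are placeholders chosen for illustration, not values from the study, and the simple formula ignores multiple hits at the same site and gene conversion between the LTRs.

```python
def ltr_age_years(differences, ltr_length, subs_per_site_per_year):
    """Rough insertion age from 5'/3' LTR divergence: T = d / (2 * mu).

    differences            -- number of sites that differ between the two LTRs
    ltr_length             -- number of aligned sites compared
    subs_per_site_per_year -- assumed neutral substitution rate (mu)
    """
    if ltr_length <= 0 or subs_per_site_per_year <= 0:
        raise ValueError("ltr_length and rate must be positive")
    divergence = differences / ltr_length  # d, differences per site
    # Both LTRs mutate independently, hence the factor of 2.
    return divergence / (2.0 * subs_per_site_per_year)


# Hypothetical inputs: 5 differences across 1,000 aligned bases,
# with an assumed rate of 1e-9 substitutions per site per year.
age = ltr_age_years(differences=5, ltr_length=1000, subs_per_site_per_year=1e-9)
print(f"about {age:,.0f} years")
```

With so few differences to count, the estimate is very sensitive to the assumed rate, which is one reason such dates are usually reported as a range rather than a single number.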
The finding of an even higher level of insertional polymorphism than was previously known underscores how recently active the HERV-K family has been. In particular, some members may still be active, possibly with disease implications for the host that have yet to be discovered. The presence of these unfixed sites also highlights the importance of amassing ever larger databases of human genomes representing various populations. Who knows what rare, but possibly biologically important, ERVs are still lurking within the human population, or within your own cells as you read this!
Jamie teaches genomics and bioinformatics at Boston College, investigates ERVs, and is an Associate Blogger at STC.
Wildschutte JH, Williams ZH, Montesion M, Subramanian RP, Kidd JM, Coffin JM. Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc Natl Acad Sci U S A. 2016 Apr 19;113(16):E2326-34. doi: 10.1073/pnas.1602336113. Epub 2016 Mar 21. PubMed PMID: 27001843; PubMed Central PMCID: PMC4843416.