On June 15, Dr. Marcia McNutt, Dr. John Anderson, and Dr. Victor Dzau, presidents of the National Academy of Sciences, National Academy of Engineering, and National Academy of Medicine, respectively, published a letter whose title says it all: “Let Scientific Evidence Determine Origin of SARS-CoV-2” (asamonitor.pub/3vNiwi2). The letter is both a call for scientific equipoise and a call for those who seek political gain from the dialog to stand down so the tools of science can be applied dispassionately.

There are three competing hypotheses for the origins of SARS-CoV-2:

  1. In late 2019, a bat coronavirus underwent one or more recombination events, creating SARS-CoV-2. Although horseshoe bats were the likely source, the recombination could have occurred in an intermediary species such as a pangolin. The virus was then transmitted to a human, where its high affinity for the ACE2 receptor launched the COVID-19 pandemic. This is the “zoonotic theory.”

  2. In late 2019, SARS-CoV-2 residing in a bat or other wild creature escaped from either the Wuhan Institute of Virology or the nearby Wuhan Center for Disease Control, possibly by infecting a lab worker or perhaps by being wheeled out in infected guano. This is the “lab-leak theory.”

  3. Scientists intentionally created a superbug to study just how infectious coronaviruses can be. Or, perhaps, to make a biological weapon. Regardless of motivation, they succeeded spectacularly. Unfortunately, their superbug broke through their viral containment, unleashing a global pandemic. This is the “engineered virus” theory.

Background

We don't know the origin of SARS-CoV, the virus responsible for the 2002-2004 SARS outbreak. Initially, the source was thought to be civets (Science 2003;302:276-8). In 2017, Dr. Zheng-Li Shi, director of the Wuhan Institute of Virology, published the results of nearly a decade of research into bat coronaviruses (asamonitor.pub/3h9UsAP; asamonitor.pub/3wMus4Q; PLoS Pathog 2017;13:e1006698). In a single cave in Yunnan province, nearly 2,000 miles from the city of Foshan where the first cases of SARS-CoV were identified, Dr. Shi and her colleagues identified 11 new strains of coronaviruses. These are in addition to the 41 novel strains of bat beta coronaviruses identified in 2017 by Lin and colleagues in bats taken from Guizhou, Henan, and Zhejiang provinces (Virology 2017;507:1-10).

Many of these strains were similar to SARS-CoV, including expressing spike (“S”) proteins that avidly bound the human ACE2 receptor. SARS-CoV itself was not found, but as the 2017 report documents, Dr. Shi's team identified every component of the SARS-CoV genome among the strains from Yunnan province. The authors concluded that “all building blocks of the pandemic SARS-CoV genome are present in bat SARSr-CoVs (SARS-related coronaviruses) from a single location in Yunnan ... SARS-CoV most likely originated from horseshoe bats via recombination events among existing SARSr-CoVs.” So much for civets! The civets likely caught it from bats.

In 2018, Dr. Shi and colleagues performed serological assessments of residents living near the Yunnan bat caves. Six individuals (out of 218) were seropositive for SARS-CoV, as opposed to 0 out of 240 in a control sample from Wuhan (ironically) (Virol Sin 2018;33:104-7). These individuals had never been infected with the original SARS-CoV, so seropositivity was likely the result of infection from the local bat coronaviruses.
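The serology counts above are small, so it is fair to ask how easily a 6-versus-0 split could arise by chance. The following is a minimal sketch, not an analysis taken from the cited paper: it applies a one-sided Fisher exact test to the two counts reported above. The function name `fisher_one_sided` is introduced here for illustration only.

```python
from math import comb


def fisher_one_sided(pos_a: int, n_a: int, pos_b: int, n_b: int) -> float:
    """Probability that group A holds at least pos_a of all positives
    if the positives were scattered at random across both groups."""
    total_pos = pos_a + pos_b
    n_total = n_a + n_b
    denominator = comb(n_total, total_pos)
    upper = min(total_pos, n_a)
    numerator = sum(
        comb(n_a, k) * comb(n_b, total_pos - k)
        for k in range(pos_a, upper + 1)
    )
    return numerator / denominator


# Counts reported in the text: 6 of 218 near the caves, 0 of 240 in Wuhan.
yunnan_pos, yunnan_n = 6, 218
wuhan_pos, wuhan_n = 0, 240

prevalence = yunnan_pos / yunnan_n
p_value = fisher_one_sided(yunnan_pos, yunnan_n, wuhan_pos, wuhan_n)

print(f"Seroprevalence near the caves: {prevalence:.1%}")
print(f"One-sided Fisher exact p-value: {p_value:.4f}")
```

A small p-value here says only that the split is unlikely to be a sampling accident; it says nothing about which virus produced the antibodies.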

These studies, published prior to the COVID-19 pandemic, demonstrate that: 1) there is sufficient genomic diversity among bat coronaviruses to explain the origin of SARS-CoV, 2) bat coronaviruses can directly infect humans living in proximity to bats, and 3) we have likely sequenced only a tiny fraction of the coronavirus strains circulating in horseshoe bats.

A year later, the SARS-CoV-2 epidemic started in Wuhan, more than 1,000 miles from the caves, but just across the Yangtze River from the Wuhan Institute of Virology. Shi and colleagues immediately sequenced the genome (Virol Sin 2016;31:31-40). The closest match to a known coronavirus was RaTG13, a bat coronavirus first identified in 2013. The RaTG13 genome is 96% similar to SARS-CoV-2 (Figure 1) (Nature 2020;579:270-3). As seen in Figure 1, the 4% difference is widely distributed over the genome. Using known coronavirus mutation rates, this 4% difference represents at least four decades of evolution (Nat Microbiol 2020;5:1408-17; Virus Evol 2020;7:veaa098). Using similar genomic clocks, scientists have pinpointed the likely emergence of SARS-CoV-2 to November 2019 (Science 2021;372:412-7). Thus, RaTG13 cannot be an immediate predecessor of SARS-CoV-2. Additionally, if a “lab-leak” happened, it must have happened no earlier than November 2019.
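The "four decades" figure follows from simple clock arithmetic. The sketch below is a back-of-the-envelope illustration, not the method of the cited papers. It assumes a substitution rate of 5 × 10⁻⁴ per site per year (an assumed value, not stated in this article) and assumes that both lineages have been accumulating changes since their common ancestor, so the elapsed time is the divergence divided by twice the rate.

```python
def years_since_common_ancestor(divergence: float, rate_per_site_per_year: float) -> float:
    """Simple molecular clock: both lineages accumulate substitutions
    independently, so divergence grows at twice the per-lineage rate."""
    return divergence / (2.0 * rate_per_site_per_year)


GENOME_LENGTH = 30_000   # approximate, from the "30-kb" genome size in the text
ASSUMED_RATE = 5e-4      # substitutions per site per year (assumption)

divergence = 1.0 - 0.96  # RaTG13 is 96% similar to SARS-CoV-2
differing_sites = round(divergence * GENOME_LENGTH)
years = years_since_common_ancestor(divergence, ASSUMED_RATE)

print(f"Roughly {differing_sites} differing nucleotides")
print(f"Roughly {years:.0f} years since the common ancestor")
```

This simple version ignores repeated substitutions at the same site and variation in rate across the genome, both of which the published analyses account for.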

Figure 1:

A comparison of the SARS-CoV-2 genome (GISAID ID: EPI_ISL_402124) to the genome RaTG13 (GISAID ID EPI_ISL_402131) from horseshoe bats. Encoding regions for the spike (S) protein, its S1 and S2 subunits, and the receptor binding domain (RBD) are marked. The dotted line shows the 96% average similarity of RaTG13 to SARS-CoV-2. The genotypes were aligned using COVID-Align-hCOV-19 (asamonitor.pub/3gYggim). The RNA sequences and analysis R code are available from the author.
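A similarity profile of the kind plotted in Figure 1 can be computed with a sliding window over two aligned sequences. The following is a minimal sketch in Python, not the author's R code. The function name `window_identity`, the window sizes, and the two short sequences are invented for illustration; real input would be the aligned genomes.

```python
def window_identity(seq_a: str, seq_b: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, fraction identical) for each window of an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Ignore alignment columns where either sequence has a gap.
        pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if pairs:
            matches = sum(x == y for x, y in pairs)
            results.append((start, matches / len(pairs)))
    return results


# Toy aligned sequences (invented): identical except for one divergent stretch.
ref = "ACGTACGTACGTACGTACGT"
other = "ACGTACGTTTTTACGTACGT"

profile = window_identity(ref, other, window=4, step=4)
for start, identity in profile:
    print(f"position {start:2d}: {identity:.0%} identical")
```

A localized dip in such a profile is what marks a region, such as the receptor binding domain, where one genome departs sharply from the other.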

The red box in Figure 1 shows that RaTG13 abruptly deviates from SARS-CoV-2 in a region coding for the “receptor binding domain.” This is the region that recognizes the human ACE2 receptor. The closest genomic match for the receptor binding region comes from a pangolin coronavirus, as shown in Figure 2 (PLoS Pathog 2021;17:e1009664). As might be expected from the similarities in the receptor binding domain, the pangolin spike protein binds the human ACE2 nearly as well as SARS-CoV-2 (Nat Commun 2021;12:1607). However, it is still only 90% homologous. Additionally, it is missing something called the “furin cleavage site.”

Figure 2:

Close-up of the receptor binding domain, demonstrating that the pangolin genome more closely matches about half of this region (specifically, the exact portion that defines binding sites to the ACE2 receptor).

Furin is a human enzyme that cleaves proteins, typically at arginine-arginine amino acid pairs. The S1 and S2 segments of the spike protein are joined together by a furin cleavage site (Figure 3), and this site is essential to infectivity (Mol Cell 2020;78:779-784.e5; Nature 2021;591:293-9). Much has been made about this polybasic cleavage site in the popular press, in part because it is not found in the genomes of SARS-CoV (the original SARS), pangolin CoV, nor RaTG13. Some have suggested that the presence of this site shows intentional manipulation.

Figure 3:

A “furin cleavage site” on the HKU9 genome almost exactly matches the site on SARS-CoV-2. The synonymous substitution at the 12th nucleotide from adenine (A) to cytosine (C) is inconsequential. Redrawn from reference 17 (Arch Virol 2020;165:2341-8). For both SARS-CoV-2 and HKU9, it is downstream of the sequence CAGAC, a sequence prone to causing copy-choice recombination events.

Poppycock! As shown in Figure 3, a nearly exact genomic sequence of the furin cleavage site from SARS-CoV-2 can be found in the bat HKU9 coronavirus genome (Arch Virol 2020;165:2341-8). The author, Dr. William Gallaher, noted that in both SARS-CoV-2 and HKU9, the furin cleavage site is immediately downstream of the palindromic (i.e., reading the same forward and backward) CAGAC nucleotide sequence. He explains that the 30-kb SARS-CoV-2 genome is organized in a way that requires replication in segments. These segments are often separated by the sequence CAGAC, noted in blue in Figure 3. Between segments, the RNA polymerase may jump to a nearby RNA template, which might be from a different coronavirus strain infecting the same cell. This is called a “copy-choice” error because the RNA polymerase makes a choice for the RNA template to copy, and it picks the wrong one. This helps explain why coronaviruses undergo so much recombination: whenever there are multiple infections in a single animal, the RNA polymerase is quite happy to jump from one genome to another, particularly at CAGAC sites, as shown in Figure 3.
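The copy-choice mechanism can be mimicked with plain string operations. The following is a toy sketch only: the two "strains" are invented sequences, the helper names `motif_positions` and `copy_choice` are hypothetical, and a real polymerase jump is a biochemical event rather than a string splice. The sketch locates CAGAC sites and builds a recombinant by copying one template up to the motif and continuing on the other.

```python
MOTIF = "CAGAC"


def motif_positions(genome: str, motif: str = MOTIF) -> list[int]:
    """Return every index at which the motif starts (overlaps allowed)."""
    positions = []
    start = genome.find(motif)
    while start != -1:
        positions.append(start)
        start = genome.find(motif, start + 1)
    return positions


def copy_choice(template_a: str, template_b: str, motif: str = MOTIF) -> str:
    """Copy template_a through its first motif, then continue copying
    template_b from just after its first motif."""
    i = template_a.find(motif)
    j = template_b.find(motif)
    if i == -1 or j == -1:
        return template_a  # no shared junction, so no template switch
    return template_a[: i + len(motif)] + template_b[j + len(motif):]


# Invented sequences standing in for two strains infecting the same cell.
strain_a = "GGTTCAGACAAAAAA"
strain_b = "CCTTCAGACGGGCCC"

print("Motif reads the same in both directions:", MOTIF == MOTIF[::-1])
print("Junction sites in strain A:", motif_positions(strain_a))
recombinant = copy_choice(strain_a, strain_b)
print("Recombinant:", recombinant)
```

The recombinant carries the upstream portion of one strain and the downstream portion of the other, which is how a sequence such as a furin cleavage site could move between coronaviruses without any laboratory involvement.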

Additional analysis by Gallaher and Garry demonstrates that furin cleavage sites are a reasonably common feature of coronaviruses, as shown in Figure 4 (asamonitor.pub/3h0Ggdk). In the past few months, additional furin cleavage sites have been identified in horseshoe bats from Cambodia (asamonitor.pub/3d1gmEM). Despite claims by multiple authors that the presence of the furin cleavage site is evidence of genetic manipulation, furin cleavage sites are found reasonably often in coronaviruses.

Figure 4:

Multiple coronavirus genomes have furin cleavage sites, shown in red. Redrawn from reference 18 (asamonitor.pub/3h0Ggdk).

The analysis of the evolutionary clock assumes that there is no “unnatural” selection pressure (e.g., from a laboratory setup designed to rapidly evolve high levels of human infectivity). The ratio of non-synonymous mutations (mutations that change the amino acid) to synonymous mutations (mutations that don't change the amino acid) can distinguish “purifying selection” (very few changes in amino acids, because the virus fits the environment well) from “Darwinian selection” (lots of amino acid changes reflecting strong selection pressure) (asamonitor.pub/3gYoWFJ). This analysis has been done for SARS-CoV-2 (asamonitor.pub/3zPZhYc). SARS-CoV-2 has relatively few amino acid substitutions for the number of silent mutations. This is strong evidence that the predecessor of SARS-CoV-2 evolved gradually over decades until jumping into humans in late 2019.
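The distinction between synonymous and non-synonymous changes can be made concrete with the standard genetic code. The following is a simplified sketch, not the method of the cited analyses: it classifies each differing codon between two aligned coding sequences, whereas a formal dN/dS estimate also normalizes by the number of possible synonymous and non-synonymous sites. The sequences are invented, and RNA bases are written in the DNA alphabet (T in place of U).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(cds_a: str, cds_b: str) -> tuple[int, int]:
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("inputs must be aligned, in-frame coding sequences")
    synonymous = non_synonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3], cds_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1       # same amino acid: a silent change
        else:
            non_synonymous += 1   # different amino acid
    return synonymous, non_synonymous


# Invented example: GCA -> GCC is silent (both alanine); TTT -> TAT is not.
ancestor = "ATGGCACGGTTT"
descendant = "ATGGCCCGGTAT"

syn, nonsyn = count_substitutions(ancestor, descendant)
print(f"synonymous: {syn}, non-synonymous: {nonsyn}")
```

A genome under purifying selection shows many silent changes relative to amino acid changes, which is the pattern described above for SARS-CoV-2.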

Summary

  1. Scientists have never found the original SARS-CoV in an animal species. All of the genomic components for SARS-CoV can be found in bats living in a single cave with hundreds of coronavirus strains circulating among them.

  2. There are likely many thousands of beta coronavirus genomes. We will probably never find the proximate predecessor of SARS-CoV-2. This is expected.

  3. Coronaviruses recombine genomic elements frequently because the RNA genome is transcribed in segments, leading to “copy-choice” errors.

  4. Some naturally occurring coronavirus spike proteins bind avidly to human ACE2 receptors.

  5. The “furin cleavage site” is not uncommon in coronaviruses. The HKU9 beta coronavirus, sequenced in 2007, has a nearly identical RNA sequence (Figure 3). Both are immediately downstream of CAGAC, which predisposes that location to recombination.

  6. The SARS-CoV-2 genome is entirely consistent with natural selection and random mutation/recombination acting on constantly mixing populations of bats and other animals (e.g., civets, pangolins, you, me).

The scientific evidence strongly supports the conclusion that SARS-CoV-2 arose when a virus circulating in animals suddenly transferred to humans.

There is no evidence to support the “lab-leak” theory, although that does not rule out the possibility that SARS-CoV-2 came to Wuhan on the shoe of a researcher who visited the bat caves in Yunnan province. The presumed epicenter, the Huanan Seafood Wholesale Market, is about 11 miles from the Wuhan Institute of Virology and located on the opposite side of the Yangtze River. The Huanan Seafood Wholesale Market is about the same distance from the Wuhan train station, one of the four most important transit hubs in China.

As to the “engineered virus” theory, the available evidence rules that out (Nat Med 2020;26:450-2).

The zoonotic theory is consistent with every other emergent infectious disease in the past 100 years, including rabies, HIV, H1N1, Ebola Sudan, Ebola Makona, Hantavirus, West Nile virus, Zika virus, Lyme disease, yellow fever, avian flu, SARS, MERS, and plague. Our numbers, our biomass, and relentless travel make us the jackpot for viral evolution. We are endlessly encroaching on wild habitats harboring perhaps hundreds of thousands of unknown viruses. Future pandemics will happen. However, by doggedly searching for the origins of SARS-CoV-2, as Dr. Shi and others have done for SARS-CoV, we can better understand how to anticipate and prepare for future pandemics.

Additional reading

The Benhur Lee Lab has an outstanding three-part tutorial on the origins of SARS-CoV-2, which I used as a starting point for my analysis above (asamonitor.pub/3zMaSrm; asamonitor.pub/3gSeAHr; asamonitor.pub/3gJnyYC). The assistance of Christian Stevens, an MD/PhD student in the Lee research team, is acknowledged with gratitude.

Dr. Kristian Andersen, lead author of the April 2020 paper in Nature Medicine on the origins of SARS-CoV-2, has maintained an active blog on the subject (Nat Med 2020;26:450-2; asamonitor.pub/3gLybtO). Dr. Andersen's blog pointed to several of the papers included in this review that document furin cleavage sites in multiple coronaviruses.

References omitted by intent because they lack scientific foundation.