WikiJournal Preprints/The Proteins of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS CoV-2 or n-COV19), the Cause of COVID-19

Introduction
Severe acute respiratory syndrome coronavirus-2 (SARS CoV-2) is the virus that caused the global pandemic that was first reported on December 31, 2019. Taxonomically, SARS CoV-2 belongs to the realm Riboviria, order Nidovirales, suborder Cornidovirineae, family Coronaviridae, subfamily Orthocoronavirinae, genus Betacoronavirus (lineage B), subgenus Sarbecovirus, and the species Severe acute respiratory syndrome-related coronavirus.

The genome of SARS CoV-2 (NCBI Reference Sequence: NC_045512.2) is similar to the genome of the coronavirus that caused the SARS epidemic in 2003 (SARS CoV, NCBI Reference Sequence: NC_004718.3). Much of the understanding of the proteins found in SARS CoV-2 is based on the numerous research studies reported on SARS CoV and other related viruses (e.g. MERS CoV). However, among the coronavirus outbreaks of the new millennium (SARS CoV: 2002-2003, MERS CoV: 2012, SARS CoV-2: 2020), SARS CoV-2 has had the most devastating global impact, for reasons that are not yet fully understood. Understanding the proteins present in these viruses enables a more rational approach to designing effective antiviral drugs.

The majority of the proteins of SARS CoV have been characterized in detail. The proteins of SARS CoV consist of two large polyproteins, ORF1a and ORF1ab (which are proteolytically cleaved to form 16 nonstructural proteins); four structural proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N); and eight accessory proteins: ORF3a, ORF3b (NP_828853.1, not present in SARS CoV-2), ORF6, ORF7a, ORF7b, ORF8a, ORF8b, and ORF9b (NP_828859.1, not present in SARS CoV-2). Although accessory proteins have been viewed as dispensable for viral replication in vitro, some have been shown to play an important role in virus-host interactions in vivo. Similar to SARS CoV, SARS CoV-2 lacks the hemagglutinin esterase gene, which is found in human coronavirus (hCoV) HKU1, a lineage A betacoronavirus. The spike protein, envelope protein, membrane protein, nucleocapsid protein, 3CL protease, papain-like protease, RNA polymerase, and helicase protein have been suggested to be viable antiviral drug targets.

SARS CoV-2 is an RNA virus with an RNA genome approximately 30 kb in length. SARS CoV-2 is thought to have originated from its closest known relative, BatCov RaTG13 (GenBank: MN996532), which was isolated from horseshoe bats.

Discussion: Proteins of SARS CoV-2
SARS CoV-2 (NC_045512.2) has a total of 11 genes with 11 open reading frames (ORFs) (Table 1): ORF1ab, ORF2 (Spike protein), ORF3a, ORF4 (Envelope protein), ORF5 (Membrane protein), ORF6, ORF7a, ORF7b, ORF8, ORF9 (Nucleocapsid protein), and ORF10.

1. Polyprotein expressed by ORF1ab
The first gene (ORF1ab) expresses a polyprotein. The ORF1ab polyprotein comprises 16 nonstructural proteins (NSPs) (Table 2).

(i) NSP1 (Leader protein)
Nonstructural protein 1 (NSP1), also known as the leader protein, is the first protein of the polyprotein of SARS CoV-2. This protein is also found in SARS CoV, where it is known to be a potent inhibitor of host gene expression. NSP1 binds to the 40S ribosomal subunit of the host cell to inactivate translation and selectively promotes the degradation of host mRNA, while the viral SARS CoV mRNA remains intact. Figure 1 shows the amino acid sequence alignment for the NSP1 proteins of SARS CoV (from genome: NCBI Reference Sequence: NC_004718.3) and SARS CoV-2.

(ii) NSP2
Nonstructural protein 2 (NSP2) is the second protein of the polyprotein of SARS CoV-2 (Figure 2). This protein is conserved in SARS CoV, the betacoronavirus related to SARS CoV-2. In SARS CoV, NSP2 was found to bind to two host proteins: prohibitin 1 and prohibitin 2 (PHB1 and PHB2). PHB1 and PHB2 are known to play roles in cell cycle progression, cell migration, cellular differentiation, apoptosis, and mitochondrial biogenesis. The binding of NSP2 to PHB1 and PHB2 suggests that NSP2 plays a role in disrupting the host cell environment. The amino acid sequence alignment for NSP2 of SARS CoV and SARS CoV-2 is shown in Figure 2.

(iii) NSP3 (Papain like proteinase)
NSP3 is the papain-like proteinase protein (Figure 3). This protein is nearly 200 kDa in size and is the largest protein (not including the polyproteins ORF1a and ORF1ab) encoded by the coronaviruses. With such a long sequence, it possesses several conserved domains: ssRNA binding, ADPr binding, G-quadruplex binding, ssRNA binding, protease (papain-like protease), NSP4 binding, and transmembrane domains. Among the 16 nonstructural proteins, NSP3, NSP4, and NSP6 have transmembrane domains. The papain-like protease 1 (PL1 protease) of the alpha coronavirus (alpha CoV) Transmissible Gastroenteritis Virus (TGEV), which is part of NSP3, was shown to cleave the site between NSP2 and NSP3. Furthermore, this papain-like protease domain is responsible for the release of NSP1, NSP2, and NSP3 from the N-terminal region of polyproteins 1a and 1ab of coronaviruses. Because this protease activity releases proteins that are essential for viral activity, inhibition of the NSP3 protease is an important antiviral strategy. Tanshinones, a class of natural products, have been found to inhibit NSP3 protease activity. The amino acid sequence alignment for the NSP3 protein of SARS CoV and SARS CoV-2 is shown in Figure 3.

(iv) NSP4 (contains transmembrane domain 2)
NSP4 interacts with NSP3 and possibly host proteins to confer a role related to membrane rearrangement in SARS CoV. Moreover, the interaction between NSP4 and NSP3 is essential for viral replication. The sequence alignment for NSP4 proteins for SARS CoV and SARS CoV-2 is shown in Figure 4.

(v) NSP5 (3C-like proteinase)
The NSP5 protein has been characterized in the Middle East Respiratory Syndrome (MERS) coronavirus. NSP5 cleaves at 11 distinct sites to yield mature and intermediate nonstructural proteins (NSPs). The amino acid sequence alignment for NSP5 of SARS CoV and SARS CoV-2 is shown in Figure 5.

(vi) NSP6 (putative transmembrane domain)
The NSP6 protein of the avian coronavirus (infectious bronchitis virus, IBV) was shown to generate autophagosomes from the endoplasmic reticulum (ER) (Figure 6B shows sequence alignment with SARS CoV-2 NSP6). Autophagosomes facilitate assembly of replicase proteins. Furthermore, NSP6 limited autophagosome/lysosome expansion, which in turn prevents autophagosomes from delivering viral components for degradation in lysosomes. With SARS CoV, NSP6 was shown to induce membrane vesicles. The amino acid sequence alignment for NSP6 of SARS CoV and SARS CoV-2 is shown in Figure 6.

(vii) NSP7
NSP7 is required to form a complex with NSP8 (described in the next section) and NSP12 to yield the RNA polymerase activity of NSP8. The primary amino acid sequence alignment for the NSP7 proteins of SARS CoV and SARS CoV-2 is shown in Figure 7. Only one amino acid residue is different (arginine vs. lysine), but the charge is conserved at this location.

(viii) NSP8
NSP8 is a peptide cofactor that makes a heterodimer with NSP7 (the other peptide cofactor), and this NSP7-NSP8 heterodimer complexes with NSP12. In addition to the NSP7-NSP8 heterodimer, an NSP8 monomer unit also complexes with NSP12, which ultimately forms the RNA polymerase complex. The cryo-EM structure of this complex has been solved. The amino acid sequence alignment for NSP8 of SARS CoV and SARS CoV-2 is shown in Figure 8.

(ix) NSP9
NSP9 from the porcine reproductive and respiratory syndrome virus (PRRSV) has been found to interact with the DEAD-box RNA helicase 5 (DDX5) cellular protein. This interaction between NSP9 and DDX5 has been shown to be important for viral replication – when the DDX5 gene was silenced in MARC-145 cells, the virus titers were lower by 10-fold. Figure 9 shows the amino acid sequence alignment between the two NSP9 proteins from SARS CoV and SARS CoV-2.

(x) NSP10
NSP10 has been shown to interact with NSP14 in SARS coronavirus, and this interaction stimulates the activity of NSP14. NSP14 is known to function as an S-adenosylmethionine (SAM)-dependent (guanine-N7) methyltransferase (N7-MTase). Furthermore, NSP10 has also been shown to stimulate the activity of NSP16, which is a 2’-O-methyltransferase. Figure 10 shows the amino acid sequence alignment between the two NSP10 proteins from SARS CoV and SARS CoV-2.

(xi) NSP11
The function of NSP11 is unknown. NSP11 is made of thirteen amino acids, and the first nine amino acids (sadaqsfln) are identical to the first nine in NSP12. Figure 11 shows the amino acid sequence alignment between the two NSP11 proteins from SARS CoV and SARS CoV-2.

(xii) NSP12 (RNA dependent RNA polymerase)
NSP12 is the RNA-dependent RNA polymerase that copies viral RNA. As mentioned, NSP12 forms a complex with an NSP7-NSP8 heterodimer and an NSP8 monomer, which confers processivity on NSP12. On its own, NSP12 exhibits poor processivity in RNA synthesis; the presence of NSP7 and NSP8 lowers the dissociation rate of NSP12 from RNA. The amino acid sequence alignment between the two NSP12 proteins from SARS CoV and SARS CoV-2 is shown in Figure 12.

(xiii) NSP13 (helicase)
SARS CoV was used to characterize the helicase enzyme, NSP13, which unwinds duplex RNA. The crystal structure of NSP13 of SARS CoV has been reported. Furthermore, it has been shown that binding of NSP12 to NSP13 can enhance the helicase activity of NSP13. In addition to its helicase activity, NSP13 of SARS CoV is also known to possess 5’-triphosphatase activity, which carries out the first step in introducing the 5’-terminal cap of the viral mRNA. Both eukaryotic and most viral mRNAs have a 5’-terminal cap structure: m7G(5’)ppp(5’)N-. This 5’-terminal cap is the site of recognition for translation initiation and plays a role in the splicing, nuclear export, and stability of mRNA. The process of incorporating the 5’-terminal cap is discussed in the next section: (xiv) NSP14. The sequence alignment for NSP13 of SARS CoV and SARS CoV-2 is shown in Figure 13. Interestingly, only one amino acid residue is different out of the 601 amino acids in these two proteins (isoleucine vs. valine).

(xiv) NSP14 (3’ to 5’ exonuclease, N7-Methyltransferase)
NSP14 from coronavirus is known to have 3’-5’ exoribonuclease activity and N7-methyltransferase activity. The guanine-N7-methyltransferase activity is part of the process for introducing the 5’-cap of the virus, which involves multiple steps: (1) the gamma-phosphate of the 5’ end of nascent mRNA is removed by the RNA triphosphatase (NSP13), (2) a GMP moiety derived from a covalent enzyme-GMP intermediate is transferred to the resulting mRNA with a diphosphate end, (3) the GpppA cap is methylated with S-adenosylmethionine, which is catalyzed by the guanine-N7-methyltransferase (NSP14) to yield the cap-0 structure, and (4) 2’-O-methylation of the adenine nucleotide by NSP16 gives the cap-1 structure. It is currently unknown which enzyme incorporates the GMP group involved in the second step, and it is possible that the virus uses the host guanylyltransferase enzyme. Figure 14 shows the amino acid sequence alignment between the NSP14 proteins of SARS CoV and SARS CoV-2.
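
The four capping steps described above can be summarized as a small ordered table. The following Python sketch is only an illustration: the string notation used for the 5’ end (pppA, ppA, GpppA, m7GpppA, m7GpppAm) and the names CAPPING_STEPS and apply_capping are assumptions introduced here, and the enzyme for step 2 is left unidentified, as in the text.

```python
# Toy model of the four capping steps; the string notation for the 5' end
# ("pppA", "m7GpppAm", ...) is an illustrative assumption, not a standard format.
CAPPING_STEPS = [
    # (step, enzyme named in the text, substrate 5' end, product 5' end)
    (1, "NSP13 (RNA triphosphatase)", "pppA", "ppA"),
    (2, "guanylyltransferase (enzyme not yet identified)", "ppA", "GpppA"),
    (3, "NSP14 (guanine-N7-methyltransferase)", "GpppA", "m7GpppA"),  # cap-0
    (4, "NSP16 (2'-O-methyltransferase)", "m7GpppA", "m7GpppAm"),     # cap-1
]


def apply_capping(five_prime_end):
    """Return every 5'-end state, from the nascent end to the cap-1 structure."""
    states = [five_prime_end]
    for step, enzyme, substrate, product in CAPPING_STEPS:
        # Each step only acts on the product of the previous step.
        if states[-1] != substrate:
            raise ValueError(f"step {step} ({enzyme}) expects {substrate!r}")
        states.append(product)
    return states


states = apply_capping("pppA")
print(" -> ".join(states))
```

Encoding the substrate of each step makes the ordering explicit: NSP16 cannot act in this model until NSP14 has produced the cap-0 structure.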

(xv) NSP15 (endoRNAse)
NSP15 of SARS coronavirus has been biochemically characterized as an endoribonuclease that cleaves RNA at uridylates at the 3’-position to form a 2’-3’ cyclic phosphodiester product. The NSP15 protein specifically targets and degrades the viral polyuridine sequences to prevent the host immune sensing system from detecting the virus. The crystal structure of NSP15 has been reported for SARS CoV and SARS CoV-2. NSP15 uses manganese as a cofactor to promote endoribonuclease activity. It has been suggested that NSP15 degrades viral dsRNA to prevent host recognition. The amino acid sequence alignment of NSP15 from SARS CoV and SARS CoV-2 is shown in Figure 15.

(xvi) NSP16 (2’-O-ribose-methyltransferase)
NSP16 for coronavirus has been biochemically (feline coronavirus, FCoV) and structurally (complex of NSP10-NSP16 for SARS CoV) characterized. The viral RNA has a 5’-cap, which protects it from mRNA degradation by 5’-exoribonucleases, promotes mRNA translation, and prevents the viral RNA from being recognized by innate immunity mechanisms. The RNA cap is an N7-methylated guanine nucleotide connected through a 5’-5’ triphosphate bridge to the first transcribed nucleotide (adenine). NSP16 methylates the 2’-hydroxy group of adenine using S-adenosylmethionine as the methyl source. Figure 16 shows the amino acid sequence alignment between the two NSP16 proteins from SARS CoV and SARS CoV-2.

2. Spike Protein (surface glycoprotein)
The spike protein (Figure 17: sequence alignment between SARS CoV and SARS CoV-2) is a glycoprotein that mediates attachment of the virus to the host cell. The electron microscopy structures of the spike (S) protein of both SARS CoV-2 and SARS CoV have been determined. This protein recognizes the human angiotensin-converting enzyme 2 (ACE2) protein on the host cell surface. Mouse polyclonal antibodies raised against the SARS CoV spike protein potently inhibited SARS CoV-2 spike protein mediated entry into cells. Interestingly, a furin cleavage site (highlighted in Figure 17: QTQTNSPRRARSVASQSIIA) is present in the S protein of SARS CoV-2 but is lacking in the S protein of SARS CoV. This difference could possibly explain the difference in pathogenicity between these two viruses. This particular furin cleavage site ([R/K]-[2X]n-[R/K]↓) has been identified in the spike proteins of other coronaviruses (HCoV-OC43, MERS-CoV, and HKU1) but is not present in SARS CoV (Figure 17 sequence alignment).
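
The furin motif [R/K]-[2X]n-[R/K]↓ can be written as a regular expression and searched against the highlighted SARS CoV-2 peptide. The Python sketch below is a minimal illustration under stated assumptions: the range n = 0 to 3 is not given in the text and is assumed here, the names FURIN_MOTIF and find_furin_site are hypothetical, and the control peptide is made up for the example and is not a real spike sequence.

```python
import re

# Peptide highlighted in the text for the SARS CoV-2 spike protein.
SARS_COV2_PEPTIDE = "QTQTNSPRRARSVASQSIIA"

# [R/K]-[2X]n-[R/K]: a basic residue, n pairs of arbitrary residues, then a
# basic residue. The range n = 0..3 is an assumption of this sketch.
FURIN_MOTIF = re.compile(r"[RK](?:..){0,3}[RK]")


def find_furin_site(peptide):
    """Return (motif, N-terminal fragment, C-terminal fragment), or None."""
    match = FURIN_MOTIF.search(peptide)
    if match is None:
        return None
    cut = match.end()  # cleavage occurs directly after the final R/K
    return match.group(), peptide[:cut], peptide[cut:]


print(find_furin_site(SARS_COV2_PEPTIDE))

# Hypothetical control peptide without basic residues (not a real sequence).
print(find_furin_site("QTQTNSPAAAASVASQ"))
```

A pattern match only flags a candidate site; whether furin actually cleaves there depends on structural context that a regular expression cannot capture.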

3. ORF3a protein
The ORF3a protein from SARS CoV is an ion channel protein related to NLRP3 inflammasome activation. ORF3a interacts with TRAF3, which in turn activates ASC ubiquitination and, as a result, leads to activation of caspase 1 and IL-1β maturation. The amino acid sequence alignment between the two ORF3a proteins from SARS CoV and SARS CoV-2 is shown in Figure 18.

4. Envelope protein
The envelope protein is a small integral membrane protein in coronaviruses, which can oligomerize and create an ion channel. The four structural proteins of coronaviruses are: S protein, M protein, E protein, and N protein. The E protein has been shown to play multiple roles in the viral replication cycle: (1) viral assembly, (2) virion release, and (3) viral pathogenesis. Interestingly, in the sequence alignment of the E proteins from SARS CoV and SARS CoV-2 (Figure 19), there is a glutamate residue (E69) with a negative charge in SARS CoV that corresponds to a positively charged arginine in SARS CoV-2 (R69).
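
The distinction between a substitution that reverses charge (E69 to R69 in the E protein) and one that conserves it (arginine vs. lysine in NSP7, isoleucine vs. valine in NSP13) can be made explicit with a simple lookup. The Python sketch below uses approximate side-chain charges near physiological pH; treating histidine as neutral is a simplification, and the names SIDE_CHAIN_CHARGE and classify_substitution are introduced here for illustration only.

```python
# Approximate side-chain charge near physiological pH. Histidine is treated
# as neutral here, which is a simplification made for this sketch.
SIDE_CHAIN_CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}


def classify_substitution(residue_a, residue_b):
    """Classify a substitution by its effect on side-chain charge."""
    charge_a = SIDE_CHAIN_CHARGE.get(residue_a, 0)
    charge_b = SIDE_CHAIN_CHARGE.get(residue_b, 0)
    if charge_a == charge_b:
        return "charge conserved"
    if charge_a * charge_b < 0:
        return "charge reversed"
    return "charge gained or lost"


# Substitutions mentioned in the text (one-letter amino acid codes).
for pair in [("E", "R"), ("R", "K"), ("I", "V")]:
    print(pair, classify_substitution(*pair))
```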

5. Membrane protein
The SARS coronavirus membrane (M) protein is an integral membrane protein that plays an important role in viral assembly. In addition, the SARS coronavirus M protein has been shown to induce apoptosis. The M protein interacts with the nucleocapsid (N) protein to encapsidate the RNA genome. Figure 20 shows the amino acid sequence alignment of the two ORF5 proteins from SARS CoV and SARS CoV-2.

6. ORF6 protein
The ORF6 protein from SARS coronavirus is an accessory protein that plays an important role in viral pathogenesis. Using a yeast two-hybrid system, ORF6 was shown to interact with NSP8, the nonstructural protein related to promoting RNA polymerase activity. Figure 21 shows the amino acid sequence alignment of the two ORF6 proteins from SARS CoV and SARS CoV-2.

7. ORF7a protein
ORF7a from SARS coronavirus is an accessory protein that is a type I transmembrane protein and its crystal structure has been determined. Figure 22 shows the amino acid sequence alignment between the two ORF7a proteins of SARS CoV and SARS CoV-2.

8. ORF7b protein
The ORF7b accessory protein from SARS coronavirus is localized in the Golgi compartment. Figure 23 shows the sequence alignment between the two ORF7b proteins of SARS CoV and SARS CoV-2.

9. ORF8 protein
SARS CoV-2 has a single ORF8 protein while SARS CoV has two ORF8 proteins: ORF8a and ORF8b. In SARS CoV, the ORF8b protein binds to the IRF association domain (IAD) region of interferon regulatory factor 3 (IRF3), which in turn inactivates interferon signaling. Interestingly, L84S and S62L missense mutations have been reported in various SARS CoV-2 sequences. Figure 24 shows the alignment between the ORF8 protein of SARS CoV-2 with the ORF8a and ORF8b proteins of SARS CoV.
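
Missense labels such as L84S and S62L encode the reference residue, the 1-based position, and the substituted residue. The Python sketch below shows one way to parse such labels and apply them to a sequence; the function names are hypothetical, the toy peptide is invented for the example (it is not the ORF8 sequence), and only simple single-residue substitutions are handled.

```python
import re

# Reference residue, 1-based position, substituted residue (e.g. "L84S").
MISSENSE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_missense(label):
    """Split a label such as 'L84S' into ('L', 84, 'S')."""
    match = MISSENSE.match(label)
    if match is None:
        raise ValueError(f"not a simple missense label: {label!r}")
    ref, position, alt = match.groups()
    return ref, int(position), alt


def apply_missense(protein, label):
    """Return the protein with the substitution applied, checking the reference."""
    ref, position, alt = parse_missense(label)
    if protein[position - 1] != ref:
        raise ValueError(f"position {position} is not {ref}")
    return protein[: position - 1] + alt + protein[position:]


print(parse_missense("L84S"))
print(parse_missense("S62L"))

# Toy peptide for illustration only (not a real viral sequence).
print(apply_missense("MKLFT", "L3S"))
```

Checking the reference residue before substituting guards against applying a label to the wrong sequence or the wrong numbering scheme.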

10. Nucleocapsid protein
The nucleocapsid (N) protein of coronaviruses is a structural protein that binds directly to viral RNA and provides stability. Furthermore, the N protein of SARS CoV-2 (Figure 25) has been found to antagonize antiviral RNAi. In another study, the nucleocapsid protein of SARS CoV was found to inhibit the activity of the cyclin-cyclin-dependent kinase (cyclin-CDK) complex. Inactivation of the cyclin-CDK complex results in hypophosphorylation of the retinoblastoma protein, which in turn inhibits S phase (genome replication) progression in the cell cycle. Figure 25 shows the amino acid sequence alignment between the two N proteins of SARS CoV and SARS CoV-2.

11. ORF10 protein
The ORF10 protein from SARS CoV-2 comprises 38 amino acids, and its function is unknown. Interestingly, SARS CoV possesses an ORF9b protein (NP_828859.1), which is not present in SARS CoV-2. Figure 26 shows the sequence alignment between ORF10 of SARS CoV-2 and ORF9b of SARS CoV. SARS CoV does not have an ORF10 protein. A summary of the sequence identities and similarities of the discussed proteins from SARS CoV and SARS CoV-2 is shown in Table 4.

Available Structures of SARS CoV-2 Proteins
A major thrust of research of SARS CoV-2 proteins is the elucidation of the 3-dimensional structures of the proteins determined through X-ray crystallography or electron microscopy. Table 5 summarizes efforts to elucidate the structures of each of these proteins. The images of structures were produced using Chimera software.

Overlapping Genes: ORF9b and Two Proteins with Variation Among SARS CoV-2 Sequences: ORF3b and ORF9c
Overlapping genes in coronavirus have been previously observed. For example, in SARS CoV, the start and end positions in the nucleotide sequence of the N-protein are 28,120 and 29,388 respectively while the ORF9b gene of SARS CoV starts and ends at positions: 28,130 and 28,426 (within the gene sequence of the N-protein). Similarly, there is a putative ORF9b protein in SARS CoV-2 located within the gene encoding the N-protein, which does not yet have an accession number.
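
The coordinates quoted above are enough to check that ORF9b lies entirely within the N gene and to see which reading frame it uses. The Python sketch below assumes the positions are 1-based and inclusive, as in GenBank-style annotations; the helper names are introduced here for illustration.

```python
# Start and end positions quoted in the text for SARS CoV, assumed to be
# 1-based and inclusive (GenBank-style coordinates).
N_GENE = (28120, 29388)
ORF9B = (28130, 28426)


def length_nt(gene):
    """Length of the gene in nucleotides."""
    start, end = gene
    return end - start + 1


def is_nested(inner, outer):
    """True if the inner gene lies entirely within the outer gene."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def frame_offset(inner, outer):
    """Reading-frame shift (0, 1, or 2) of the inner gene relative to the outer."""
    return (inner[0] - outer[0]) % 3


print("ORF9b nested in N:", is_nested(ORF9B, N_GENE))
print("N length (nt, codons):", length_nt(N_GENE), length_nt(N_GENE) // 3)
print("ORF9b length (nt, codons):", length_nt(ORF9B), length_nt(ORF9B) // 3)
print("frame offset:", frame_offset(ORF9B, N_GENE))
```

A non-zero frame offset means that ORF9b is read in a different frame from N, so its protein is not simply a shorter fragment of the N protein even though the nucleotides are shared.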

In a gene alignment of 2,784 SARS CoV-2 sequences, two variations were recognized in the SARS CoV-2 genome. First, a premature stop codon was found at position 14 of ORF3b (position E14) in 17.6% of isolates. Second, two mutations gave rise to premature stop codons in ORF9c (at position Q41 in 0.7% of the sequences and at position Q44 in 1.4% of the sequences). These stop codons suggest that ORF3b and ORF9c may not be bona fide genes in SARS CoV-2. In the putative SARS CoV-2 ORF3b protein, only 12 out of 57 overlapping amino acid residues were identical (21% sequence identity) to the ORF3b protein of SARS CoV. ORF3b and ORF9c of SARS CoV-2 were not included in the analysis in the sections above. Another protein lacking an accession number is ORF14.
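
The 21% figure follows directly from 12 identical residues out of 57 overlapping positions. The Python sketch below shows one common way to compute percent identity from a pairwise alignment, counting only columns where neither sequence has a gap; this choice of denominator is an assumption (alignment tools differ on it), and the two aligned strings are toy sequences, not the ORF3b proteins.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, compared, percent) over the non-gap alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no overlapping residues to compare")
    return identical, compared, 100.0 * identical / compared


# Toy alignment for illustration only (not real viral sequences).
print(percent_identity("MKT-LLA", "MRTALIA"))

# The figure quoted in the text: 12 identical residues out of 57 compared.
print(round(100.0 * 12 / 57))
```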

Exploration of Treatment Options for COVID-19
An intense effort has been put forth to discover potential treatment options for COVID-19, the disease caused by SARS CoV-2. For instance, the FDA-approved drug ivermectin is known to inhibit nuclear transport and has been shown to inhibit the replication of SARS CoV-2. Other drugs have been repurposed and tested against COVID-19. Remdesivir is an antiviral drug that was originally developed to treat Ebola and has been used to treat COVID-19 by inhibiting viral RNA polymerase activity. Hydroxychloroquine and chloroquine have also been used as potential treatments for COVID-19. However, the use of these drugs has been known to result in cardiotoxicity. In fact, in a recent observational study, it was determined that hydroxychloroquine administration was not associated with a greatly lowered risk of death from COVID-19.

Traditional Chinese Medicine (TCM) has also been employed in China to treat COVID-19. However, due to potential toxic components present in TCM remedies, the use of this strategy should be handled with caution. Ironically, it has been suggested that TCM could have potentially been the cause of COVID-19.

In addition to small molecules, vaccines are also currently being developed against SARS CoV-2, and convalescent plasma transfusions have been used to treat COVID-19. Nevertheless, more research is needed to develop effective treatments against SARS CoV-2 especially in the context of future outbreaks.

Conclusion
Although there is some variation in the protein sequences, many of the proteins found in SARS CoV-2 (NC_045512.2) are also found in SARS CoV (AY515512.1 or NC_004718.3), with 77.1% of the protein sequences shared between their proteomes. Thus, previous research on related coronavirus proteins enables a better understanding of SARS CoV-2, the coronavirus that caused the current global pandemic (COVID-19). The general structures of most of the proteins from SARS CoV-2 can be visualized from homology models. Advances in the knowledge of the structures and functions of the proteins in SARS CoV-2 will enable researchers to design better antiviral drugs that target this virus.

Acknowledgements
Francis K. Yoshimoto, PhD holds a Voelcker Fund Young Investigator Award from the MAX AND MINNIE TOMERLIN VOELCKER FUND.

Competing interests
The author has no competing interest.

Ethics statement
Not applicable.