Study: CORSID enables de novo identification of transcription regulatory sequences and genes in coronaviruses. Image Credit: vchal/ Shutterstock

Identification of transcription regulatory sequences and genes in previously unannotated coronaviruses

The ongoing coronavirus disease 2019 (COVID-19) pandemic is caused by a novel coronavirus, namely, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). Other highly transmissible coronaviruses that infect humans include the viruses that cause Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS).

Translation of coronavirus genome

Coronaviruses have single-stranded, positive-sense RNA genomes that are translated by the host ribosome. The coronavirus genome consists of multiple genes, which are expressed via two different mechanisms. In the first, after the virus enters the host cell, the viral genome is translated directly by the host's machinery to produce polypeptides that correspond to one or two overlapping open reading frames (ORFs). These polypeptides then auto-cleave into several non-structural proteins, including RNA-dependent RNA polymerase (RdRP). In the second mechanism, RdRP mediates the expression of the remaining viral genes via discontinuous transcription.

Previous studies have revealed that RdRP tends to switch templates after encountering transcription regulatory sequences (TRSs). One TRS is positioned in the 5’ untranslated region (UTR) of the genome and is known as TRS-L (L stands for leader); the others lie upstream of each viral gene and are called TRS-B (B stands for body). This template switching produces many subgenomic mRNAs that are translated into the structural and accessory viral proteins essential for the viral life cycle. Hence, identifying and characterizing the TRS sites is essential to elucidate how the expression of the viral proteins is regulated.

Scientists have hypothesized that the presence of these regulatory sequences could be used to simultaneously and accurately identify TRS sites as well as the associated viral genes in unannotated coronavirus genomes. This study is available on the bioRxiv* preprint server.

Although previous studies have formulated methods to identify either TRS sites or viral genes, to date, researchers have not developed a method to identify both simultaneously. Earlier studies have revealed that TRSs contain conserved sequences of 6-7 nt (core sequences), and that both TRS-L and TRS-Bs can be identified in coronaviruses using general-purpose motif-finding methods.
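The idea of a short core sequence shared by the leader and the gene bodies can be illustrated with a very simple k-mer count. The sketch below is only an illustration and is not how MEME, SuPER, or CORSID work: it assumes a toy genome, a hypothetical leader length, and exact matching of 6-nt words, and it ranks the words found in the leader by how often they recur downstream.

```python
from collections import Counter

def candidate_core_sequences(genome, leader_len, k=6):
    """Rank k-mers from the leader region by their exact recurrence downstream.

    `leader_len` and `k` are illustrative assumptions; real TRS sites tolerate
    mismatches, which this exact-match toy ignores.
    """
    leader, body = genome[:leader_len], genome[leader_len:]
    # Count every k-mer in the region downstream of the leader.
    body_counts = Counter(body[i:i + k] for i in range(len(body) - k + 1))
    # Collect the distinct k-mers that occur in the leader.
    leader_kmers = {leader[i:i + k] for i in range(len(leader) - k + 1)}
    # Most frequent downstream first; ties broken alphabetically.
    ranked = sorted(leader_kmers, key=lambda kmer: (-body_counts[kmer], kmer))
    return [(kmer, body_counts[kmer]) for kmer in ranked]

# Made-up toy sequence: a 14-nt "leader" followed by a "body" that
# contains three copies of the same 6-nt word.
toy_leader = "TTTTACGAACTTTT"
toy_body = "GGGGACGAACGGGGCCACGAACCCGGACGAACGG"
top_hit = candidate_core_sequences(toy_leader + toy_body, len(toy_leader))[0]
print(top_hit)
```

On real genomes, exact counting of this kind would be far too crude, which is one reason dedicated motif-finding and alignment methods are used instead.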

MEME is a commonly used method based on expectation maximization that simultaneously locates the occurrences of multiple motifs. The scientists indicated that the only method available to date for identifying TRS sites specifically in coronaviruses is SuPER. This method takes as input a coronavirus genome sequence with specified gene locations, together with taxonomic and secondary structure information. Another gap highlighted by the researchers is the lack of methods to identify viral genes in unannotated coronavirus genome sequences.

Gene identification

Two commonly used gene prediction tools are Glimmer3 and Prodigal. Glimmer3 uses a Markov model to score candidate ORFs, after which it resolves overlapping genes to generate the list of predicted genes. In contrast, Prodigal is based on a heuristic approach with fine-tuned parameters, optimized to identify genes in prokaryotes. However, neither tool takes into account the regulatory sequences, that is, the TRS sites located upstream of the genes in the genome.
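To make the notion of a candidate ORF concrete, here is a minimal sketch of a naive ORF scan. It is not the method used by Glimmer3 or Prodigal, which score and filter candidates; it only assumes the standard ATG start codon and the three standard stop codons, scans the forward strand in a DNA alphabet, and uses a hypothetical minimum-length cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` is an
    illustrative cutoff, not a value taken from any published tool.
    """
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Made-up toy sequence with a single short ORF (ATG AAA TTT TAG).
print(find_orfs("CCATGAAATTTTAGCC"))
```

A scan like this typically returns many overlapping candidates on a real genome, so a gene finder still has to decide which of them to keep.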

In this study, the researchers introduced two problems: the TRS Identification (TRS-ID) problem, which is to locate TRS sites in a coronavirus genome with specified gene annotations, and the TRS and Gene Identification (TRS-GENE-ID) problem, which is to identify both TRS sites and genes simultaneously in an unannotated coronavirus genome. They also introduced CORSID-A (CORe Sequence IDentifier), a dynamic programming (DP) algorithm that extends the classical Smith-Waterman recurrence to solve the TRS-ID problem.
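For readers unfamiliar with the classical Smith-Waterman recurrence mentioned above, the following is a minimal sketch of the textbook local alignment algorithm. It is not CORSID-A, whose extension of the recurrence is not described in this article; the scoring values and the toy sequences are assumptions chosen purely for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Classical local alignment with a linear gap penalty.

    Returns (best score, aligned part of a, aligned part of b).
    The scoring parameters are illustrative defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, which makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Made-up toy "leader" and "upstream-of-gene" fragments sharing a 6-nt word.
print(smith_waterman("TTACGAACTT", "GGACGAACGG"))
```

In this toy case the shared word is recovered as the best local alignment, which conveys why a local alignment recurrence is a natural starting point for matching a leader sequence against regions upstream of genes.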

CORSID, in turn, was developed to solve the TRS-GENE-ID problem. It incorporates a maximum-weight independent set formulation on an interval graph to locate TRS sites and genes. The researchers assessed the performance of the newly developed methods on coronavirus genomes obtained from GenBank. They found that CORSID-A performed better than MEME and SuPER in identifying TRS sites. Furthermore, CORSID showed better results than the two gene prediction tools mentioned above, Glimmer3 and Prodigal. The method can also identify recombination events in a genome. Additionally, the scientists showed that CORSID allows de novo identification of TRS sites and genes in previously unannotated coronaviruses.
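A maximum-weight independent set on an interval graph amounts to picking non-overlapping intervals with the largest total weight, which can be solved with a standard dynamic program. The sketch below shows that generic technique only; how CORSID defines its intervals and weights is not described in this article, so the candidate "genes" and scores here are invented for illustration.

```python
from bisect import bisect_right

def max_weight_nonoverlapping(intervals):
    """Pick non-overlapping half-open intervals [start, end) of maximum total weight.

    `intervals` is a list of (start, end, weight) tuples. Returns
    (total weight, chosen intervals sorted by end position).
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ordered]
    n = len(ordered)
    best = [0] * (n + 1)      # best[i]: optimum over the first i intervals
    take = [False] * (n + 1)  # whether interval i is used in that optimum
    prev = [0] * (n + 1)      # number of earlier intervals ending <= start of i
    for i, (start, _end, weight) in enumerate(ordered, 1):
        prev[i] = bisect_right(ends, start, 0, i - 1)
        with_i = best[prev[i]] + weight
        if with_i > best[i - 1]:
            best[i], take[i] = with_i, True
        else:
            best[i] = best[i - 1]
    # Walk backwards to recover the chosen intervals.
    chosen, i = [], n
    while i > 0:
        if take[i]:
            chosen.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chosen[::-1]

# Hypothetical candidate genes as (start, end, score); values are made up.
candidates = [(0, 5, 4), (3, 8, 6), (8, 12, 2), (5, 9, 5)]
print(max_weight_nonoverlapping(candidates))
```

Because the intervals are sorted by end position, each candidate only needs to look up the best solution among candidates that finish before it starts, which keeps the computation efficient even with many overlapping candidates.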

Conclusion and future research

The authors stated that CORSID is the first method that can simultaneously and accurately identify TRS sites as well as genes in coronavirus genomes without requiring taxonomic or secondary structure information.

The authors recommended several avenues for future research. For instance, CORSID currently requires the complete genome as input to identify the TRS sites and the genes. The researchers aim to modify their method so that it can perform gene identification using partial genomes. This could be achieved by leveraging information from other coronaviruses that have complete genomes with similar TRS sites. At present, the method is focused on gene identification in coronaviruses; however, it could be extended to other viruses as well.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.