A 6.4 million genome analysis predicts dominant variants of SARS-CoV-2

Scientists have studied various aspects of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the ongoing coronavirus disease 2019 (COVID-19) pandemic, and have developed pharmaceutical and non-pharmaceutical measures to combat the disease. The pandemic has been characterized by repeated waves of COVID-19 cases driven by the emergence of new SARS-CoV-2 variants. These new lineages show increased viral fitness, where fitness relates to traits associated with lineage growth, including the basic reproduction number (R0), the ability to evade immune responses, and generation time.

Study: Analysis of 6.4 million SARS-CoV-2 genomes identifies fitness-associated mutations. Image credit: Science journal

Background

As new SARS-CoV-2 lineages emerge, scientists warn that it is essential to identify them and predict their potential to cause an outbreak. One of the main difficulties is analyzing a huge and growing data set, about 7.5 million virus genomes, that varies both geographically and temporally.

The researchers stated that current phylogenetic approaches to assessing the relative fitness of newly emerged lineages are computationally inefficient for large data sets, i.e., more than 5,000 samples. Furthermore, although the ad hoc methods used to assess the relative fitness of specific SARS-CoV-2 lineages are computationally efficient, they depend on models that can only compare one or two lineages of interest with all others. Therefore, they do not capture the complex dynamics of multiple co-circulating SARS-CoV-2 lineages.

Mutation-based analysis could help identify specific genetic determinants associated with improved transmission and pathogenesis. This analysis would help predict the phenotypes of newly emerged lineages. For example, the D614G mutation in the SARS-CoV-2 spike protein is associated with a high viral load. Other mutations found in the spike protein of SARS-CoV-2 variants of concern (VOCs) relative to the original strain, such as N439R, E484K, and N501Y, are linked to improved transmissibility, antibody escape, and increased binding affinity of the virus for host ACE2.

A new study

Researchers believe it is essential to identify functionally important mutations, and their phenotypic outcomes, among the large number of SARS-CoV-2 variants that have emerged since the onset of the pandemic. Recently, scientists have modeled the relative fitness of SARS-CoV-2 lineages, based on viral growth, as a linear combination of the effects of each mutation.
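To make the idea of a linear combination of mutation effects concrete, here is a minimal sketch. It is not the study's code: the feature matrix, the effect sizes, and the function name are all hypothetical toy values chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 lineages x 4 mutations (1 = lineage carries the mutation).
features = np.array([
    [0, 0, 0, 0],   # reference lineage with none of the tracked mutations
    [1, 0, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Hypothetical per-mutation effects on log relative fitness (assumed values).
mutation_effects = np.array([0.10, 0.25, -0.05, 0.30])

def lineage_log_fitness(features, mutation_effects):
    """Log relative fitness of each lineage as the sum of the effects
    of the mutations it carries (a linear combination)."""
    return features @ mutation_effects

log_fitness = lineage_log_fitness(features, mutation_effects)
# Exponentiate to get fitness relative to the reference lineage (value 1.0).
relative_fitness = np.exp(log_fitness)
```

In this toy setup, two lineages that share most of their mutations automatically receive similar fitness values, because they share most of the terms in the sum.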

In this new study in the journal Science, researchers developed a hierarchical Bayesian regression model, called PyR0, that can analyze the complete set of publicly available SARS-CoV-2 genomes. The authors stated that the model is also applicable to any set of viral genomic data. One advantage of this regression model is that it estimates growth rates directly from genomic sequences and can therefore share statistical strength among genetically similar lineages without relying entirely on phylogeny.

The authors modeled the multinomial proportions of different lineages rather than the absolute number of samples for each lineage. In this study, they fitted PyR0 to 6,466,300 SARS-CoV-2 genomes obtained from GISAID. The model contained 3,000 clusters derived from 1,544 PANGO lineages and 2,904 non-synonymous mutations. The output of this regression model is a posterior distribution over the relative fitness of each SARS-CoV-2 lineage and the impact of each mutation on fitness.
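Modeling lineage proportions rather than absolute counts can be sketched as a multinomial likelihood over softmax-normalized lineage weights. The sketch below is a simplified single-region illustration, not the PyR0 implementation; the intercepts, growth rates, counts, and function names are assumptions.

```python
import numpy as np

def lineage_proportions(intercepts, growth_rates, t):
    """Expected share of each lineage at time t (in days): a softmax over
    per-lineage log weights that grow linearly in time."""
    logits = intercepts + growth_rates * t
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def multinomial_log_likelihood(counts, proportions):
    """Multinomial log-likelihood of observed lineage counts, up to an
    additive constant that does not depend on the proportions."""
    return float(np.sum(counts * np.log(proportions)))

# Hypothetical toy parameters for three lineages (assumed values).
intercepts = np.array([0.0, -3.0, -5.0])
growth_rates = np.array([0.0, 0.05, 0.12])   # per day, relative to lineage 0

early = lineage_proportions(intercepts, growth_rates, t=0)
late = lineage_proportions(intercepts, growth_rates, t=100)

# Hypothetical sequenced-sample counts observed at day 100.
counts = np.array([2, 10, 988])
loglik = multinomial_log_likelihood(counts, late)
```

Because only proportions enter the likelihood, changes in how many genomes a region sequences over time do not by themselves shift the inferred growth rates in this formulation.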

The computational challenge inherent in such a large model led the researchers to use an approximate inference method, stochastic variational inference. The fitted model made it possible to infer lineage fitness, predict the fitness of completely new lineages, and estimate the impact of individual mutations on fitness.

Relative fitness versus date of emergence of each lineage. Circle size is proportional to the cumulative case count inferred from the lineage proportion estimates and the confirmed case count. The inset table lists the 10 fittest lineages inferred by the model. R/R_A is the fold increase in relative fitness over the Wuhan (A) lineage, assuming a fixed generation time of 5.5 days.
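The role of the fixed generation time in the caption above can be illustrated with a common approximation: if a lineage's exponential growth rate exceeds that of the reference by some amount per day, the ratio of reproduction numbers is roughly the exponential of that difference multiplied by the generation time. This conversion is an assumption made for illustration and is not taken from the article; the function name and the example growth-rate difference are hypothetical.

```python
import math

GENERATION_TIME_DAYS = 5.5  # fixed generation time stated in the figure caption

def fold_increase_in_fitness(growth_rate_diff_per_day,
                             generation_time_days=GENERATION_TIME_DAYS):
    """Approximate R / R_A from a per-day growth-rate advantage over the
    reference lineage, assuming a fixed generation time."""
    return math.exp(growth_rate_diff_per_day * generation_time_days)

# Hypothetical example: a lineage growing 0.1 per day faster than lineage A.
example = fold_increase_in_fitness(0.1)
```

Under this approximation, a lineage with no growth advantage has a ratio of exactly 1, and a shorter assumed generation time would give a smaller ratio for the same growth advantage.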

Key findings

The model showed a modest upward trend in fitness over time across all lineages, with some lineages showing greater fitness than others. Sensitivity analyses showed that the fitness estimates were qualitatively consistent across spatial subsets of the data.

The rapid spread of SARS-CoV-2 through the human population from the onset of the pandemic until early 2022 was marked by rapid evolution of fitness and an increase in COVID-19 cases. In addition, in some geographical regions, certain PANGO lineages showed multiple successive peaks, suggesting that the sublineages within them varied in fitness. This is why the researchers algorithmically refined PANGO lineages into smaller clusters.

It is important to note that the scientists tested the predictive ability of the model and found that its forecasts were reliable one to two months into the future for SARS-CoV-2 VOCs; predictions may differ for other new strains. The findings were consistent with the World Health Organization report that Omicron (PANGO BA.2) has the highest fitness among the VOCs.

In this study, PyR0 identified three hotspots in the S (spike) gene that are associated with viral fitness. These regions are the receptor binding domain (RBD), the N-terminal domain, and the furin cleavage site. The researchers identified two mutations, namely T478K and S477N, that significantly affect viral fitness. In addition, PyR0 predicted the growth of a new variant of SARS-CoV-2.

Manhattan plot of the amino acid changes evaluated in this study. (A) Genome-wide changes. (B) Changes in the first 850 amino acids of S. (C) Changes in the first 250 amino acids of N. In each of (A) to (C), the y-axis shows the effect size Δ log R, the estimated change in log relative fitness due to each amino acid change. The lower three axes show the background density of all observed amino acid changes, the density associated with growth (weighted by |Δ log R|), and the ratio of the two. The top 55 amino acid changes are labeled. See fig. S13 for detailed views of S, N, ORF1a, and ORF1b. (D) Structure of the spike-ACE2 complex (PDB: 7KNB). Spike subunits are shown in light blue, light orange, and gray. Top-ranked mutations are shown as red spheres. ACE2 is shown in magenta. (E) Close-up view of the RBD interface. (F) Top-ranked mutations in the N-terminal RNA binding domain of N. N-44 residues (PDB: 7ACT) are shown in light blue. The positions of the amino acids corresponding to the top mutations in this region are shown as red spheres. A bound 10-nt RNA is shown in gray.

Conclusion

By modeling millions of viral sequences across multiple regions, PyR0 provided mechanistic insight into how mutations improve viral fitness. In addition, this regression model provided a panoramic view of viral evolution. According to the authors, the model could have accurately predicted the rise of VOCs or provided early warning of them. Because the model can be refitted quickly to incorporate mutations from new lineages, it can help reduce the uncertainty that newly emerging variants create for public health.
