Association of disease severity and genetic variation during primary Respiratory Syncytial Virus infections

Background: Respiratory Syncytial Virus (RSV) disease in young children ranges from mild cold symptoms to severe symptoms that require hospitalization and sometimes result in death. Studies have shown a statistical association between RSV subtype or phylogenetic lineage and RSV disease severity, although these results have been inconsistent. Associations between variation within RSV gene coding regions or residues and RSV disease severity have been largely unexplored.
Methods: Nasal swabs were used from children (< 8 months old) infected with RSV in Rochester, NY between 1977 and 1998 who presented clinically with either mild or severe disease during their first cold season. Whole-genome RSV sequences were obtained using overlapping PCR and next-generation sequencing. Both whole-genome phylogenetic and non-phylogenetic statistical approaches were used to associate RSV genotype with disease severity.
Results: The RSVB subtype was statistically associated with disease severity. A significant association between phylogenetic clustering of mild/severe traits and disease severity was also found. GA1 clade sequences were associated with severe disease, while GB1 was significantly associated with mild disease. Variation in both the G and M2-2 genes was significantly associated with disease severity. We identified 16 residues in the G gene and 3 in the M2-2 gene associated with disease severity.
Conclusion: These results suggest that phylogenetic lineage and genetic variability in the G or M2-2 genes of RSV may contribute to disease severity in young children undergoing their first infection.


Introduction
RSV infection most often presents clinically as a mild respiratory disease with symptoms of rhinitis and cough. For some individuals, especially young children during their first cold season, the virus presents as severe disease with severe fever, cough, and wheeze, leading to significant morbidity and sometimes death [1][2][3][4]. Studies have shown an association between severe RSV disease and increased incidence of asthma and allergic disease in young children [3,5]. Understanding the risk factors associated with RSV disease severity in children has been a continuous subject of study in the RSV field.
RSV can be further subclassified into phylogenetic clade genotypes [19,20] using the second hypervariable region of the RSV G protein [21,22]. Associations between RSV genotype and disease severity have also varied between studies, with some showing that clade GA3, but no other A or B clade, was associated with severity [16], and others showing that NA1 genotypes caused more frequent lower respiratory tract infections [23]. Following the emergence of a new RSVA genotype (ON1), which arose from a duplication in the G protein, studies have shown variable results: some found increased severity in individuals infected with ON1 compared to NA1 [24], others found milder disease with ON1 [15,25], and others found no association with genotype [26]. Still others showed associations with disease severity in some sub-clades of ON1, but not others [27]. Further studies have shown associations with specific mutations in the RSV G gene [28,29].
Several positively selected sites have been identified in RSV, suggesting adaptation to external pressure [30]. Twelve positively selected sites were identified in G protein ectodomain sequences of RSV B, suggesting immune pressure [31]. Infection with RSV A viruses containing 5 specific amino acid substitutions in G has been shown to present less often with wheezing [29]. Moreover, mutations in the central conserved domain of G alter the host immune response and decrease severity [32][33][34][35]. Additionally, mutations in the F protein result in changes in the host immune response to RSV infection and increased RSV-induced disease severity in mice [12]. Significant genetic variation also occurs in RSV genes other than G, including NS2, M2-1, and M2-2 [36]. Moreover, M2-2 has been found to be under positive selection [22].
Whether variation in RSV genes is associated with disease severity has been largely unexplored. Furthermore, genetic variation that does not affect phylogeny may still increase the likelihood of severe disease, but this has not been assessed [37].

Sample selection
Subject identifiers were obtained for cryogenically frozen nasal swab samples from RSV-positive individuals. These samples had been collected routinely from patients in both inpatient and outpatient settings in the Rochester, New York area from 1977 to 1998 and used for a variety of research studies. Subjects were limited to those with phenotypic and clinical information, including age and notes on clinical presentation. Samples were randomly selected from subjects who were in their first RSV season and had no documented previous RSV infection, resulting in subjects between 0 and 8 months of age.

Mild and severe disease grouping
Severe disease cases were defined as those admitted to the hospital for RSV infection (inpatient), while mild cases consisted of RSV-positive cases not requiring hospitalization (outpatient). Using archived medical records, subjects were first grouped into severe and mild based on inpatient or outpatient status. Second, individual records were clinically evaluated, and subjects were removed if their symptoms did not match the mild or severe phenotype or if their records did not contain adequate notes to determine severity level.

Whole-genome RSV sequencing
Whole-genome RSV sequences were generated using overlapping PCR amplicons spanning the RSV genome. The amplicons were pooled by sample, barcoded, and sequenced using the Ion Torrent PGM Next Generation Sequencing platform (355.5× sequencing depth). The RSV reference (clc_ref_assemble_long v. 3.22.55705) was used for assembly. The consensus sequences of the internal PCR primer hybridization sites were manually verified using reads from amplicons that spanned across the sites. The final dataset contained 160 whole-genome sequences.

Phylogenetic tree
Whole-genome RSV sequences were translated in silico into amino acid sequences and aligned with ClustalOmega using the MSA package [38] for R Statistical Software version 4.2.2. BEAUti v1.10.4 [39] was used to create an XML document from the aligned amino acid sequences. Tip dates were set to the sample collection year, an HKY substitution model was used, and the site heterogeneity model was set to Gamma with 4 gamma categories. An uncorrelated relaxed clock was used. The tree prior was set to Coalescent: Bayesian SkyGrid, with the number of parameters equal to the number of sequences and the time at last transition point set to 1.0. The MCMC chain length was set to 10,000,000, with the echo-state and log-parameter intervals set to 1000. XML files were input into BEAST v1.10.4 [39]; 5 independent runs were performed and combined with LogCombiner, and the highest credibility tree was determined with TreeAnnotator. The phylogenetic tree was visualized using FigTree v1.4.4.

Phylogeny and trait association
The Bayesian Tip-association Significance testing (BaTS) software [40] was used to statistically associate phylogenetic topology with disease severity (mild/severe). The BaTS algorithm applies three statistics to test the association between phylogeny and a trait: parsimony score, association index, and maximum exclusive single-state clade size [40].
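To make the parsimony score concrete, the following is a minimal sketch and not the BaTS implementation. It assumes a single, fully bifurcating tree written as nested tuples, counts trait changes with Fitch's algorithm, and builds a null distribution by shuffling trait labels across tips. BaTS itself works over a posterior sample of trees; the tree, tip names, and trait labels below are hypothetical.

```python
import random

def fitch_parsimony(tree, trait):
    """Return (possible_states, n_changes) for a binary tree of nested tuples.

    Tips are strings; `trait` maps each tip name to its state.
    """
    if isinstance(tree, str):
        return {trait[tree]}, 0
    left_states, left_changes = fitch_parsimony(tree[0], trait)
    right_states, right_changes = fitch_parsimony(tree[1], trait)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # Children disagree: one state change is required at this node.
    return left_states | right_states, left_changes + right_changes + 1

def tips(tree):
    """List tip names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return tips(tree[0]) + tips(tree[1])

def parsimony_permutation_test(tree, trait, n_perm=999, seed=0):
    """Observed parsimony score and a one-sided permutation p-value.

    Fewer changes than expected by chance indicates that the trait
    clusters on the tree.
    """
    rng = random.Random(seed)
    names = tips(tree)
    observed = fitch_parsimony(tree, trait)[1]
    states = [trait[name] for name in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(states)
        shuffled = dict(zip(names, states))
        if fitch_parsimony(tree, shuffled)[1] <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical 8-tip tree in which each trait forms its own clade.
toy_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
toy_trait = {"a": "severe", "b": "severe", "c": "severe", "d": "severe",
             "e": "mild", "f": "mild", "g": "mild", "h": "mild"}
observed_ps, p_value = parsimony_permutation_test(toy_tree, toy_trait)
print(observed_ps, p_value)
```

In this toy tree a single state change explains the trait distribution, which is the minimum possible when two states are present, so the permutation p-value is small.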

Genotype assignment
Sequences were assigned to RSV genotype clades using genotype reference sequences [21]. Statistical association of RSV genotype and disease severity (mild/severe) was performed using Pearson's Chi-Square test and a permutation test (R v4.2.2).
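As an illustration of how a Pearson Chi-Square statistic can be paired with a permutation test, the sketch below computes the statistic from a genotype-by-severity table and estimates a p-value by shuffling severity labels. The analysis in this study was run in R; this Python version is only a sketch, and the genotype counts are hypothetical, not study data.

```python
import random
from collections import Counter

def chi_square_statistic(groups, outcomes):
    """Pearson chi-square statistic for two parallel lists of labels."""
    n = len(groups)
    observed = Counter(zip(groups, outcomes))
    row_totals = Counter(groups)
    col_totals = Counter(outcomes)
    stat = 0.0
    for group, row_n in row_totals.items():
        for outcome, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(group, outcome)] - expected) ** 2 / expected
    return stat

def permutation_p_value(groups, outcomes, n_perm=9999, seed=1):
    """Observed statistic and the share of shuffles at least as extreme."""
    rng = random.Random(seed)
    observed = chi_square_statistic(groups, outcomes)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if chi_square_statistic(groups, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical counts: 10 sequences per genotype.
genotypes = ["GA1"] * 10 + ["GB1"] * 10
severity = (["severe"] * 8 + ["mild"] * 2) + (["severe"] * 2 + ["mild"] * 8)
chi2_stat, perm_p = permutation_p_value(genotypes, severity, n_perm=999)
print(round(chi2_stat, 3), perm_p)
```

The permutation p-value does not rely on the large-sample chi-square approximation, which is useful when some genotypes are represented by only a few sequences.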

Association of viral-gene coding-sequence variation and disease severity using ssTA
Here we introduce a statistical approach based on an immunological shape space [41], called the Shape Space Trait Association (ssTA) algorithm (https://github.com/wbender1/ssTA). ssTA uses viral-gene coding sequences as input and places each sequence in a genetic distance space in which the distance between any two sequences represents the number of amino acid differences between the coding sequences [42,43]. Categorical traits, such as mild/severe disease, can then be associated with the distribution of sequences within the genetic distance space using spatial permutation tests [44].
Sequences were translated in silico for each of the 11 protein-coding regions of each RSV genome. Protein sequences were aligned using the MUSCLE algorithm [45]. Pairwise Hamming distances between all aligned sequences were determined using the "stringdist" package in R version 3.4.4, resulting in a 160 × 160 distance matrix representing all pairwise genetic distances. To determine whether the distribution of sequences in the 160-dimensional space was associated with the disease severity trait (mild/severe), we used two statistical methods (Adonis2 [46] and Anosim [47]; Vegan package, R version 3.4.4). For the adonis2 method, 9999 permutations were performed to determine the empirical null. For the anosim method, which is less affected by limited degrees of freedom, 9999 permutations were performed to determine the empirical null. P values of less than 0.05 after Benjamini-Hochberg (BH) adjustment were considered significant.
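The distance-then-permutation logic can be sketched as follows. This is not the ssTA code or the vegan implementation; it is a minimal Python illustration that computes pairwise Hamming distances on aligned sequences, ranks them, and evaluates the ANOSIM R statistic, R = (mean between-group rank − mean within-group rank) / (M/2), where M is the number of sequence pairs. The adonis2 test is not shown, and the toy sequences and labels are hypothetical.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def anosim_r(ranks, pairs, labels):
    """ANOSIM R from ranked pairwise distances and group labels."""
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)

def anosim(seqs, labels, n_perm=9999, seed=2):
    """Observed R and a permutation p-value from shuffled labels."""
    pairs = list(combinations(range(len(seqs)), 2))
    ranks = average_ranks([hamming(seqs[i], seqs[j]) for i, j in pairs])
    observed = anosim_r(ranks, pairs, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if anosim_r(ranks, pairs, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical aligned peptides: two well-separated groups of four.
toy_seqs = ["AAAA", "AAAA", "AAAT", "AAAT", "CCCC", "CCCC", "CCCG", "CCCG"]
toy_labels = ["severe"] * 4 + ["mild"] * 4
r_stat, p_val = anosim(toy_seqs, toy_labels, n_perm=999)
print(r_stat, p_val)
```

Because every between-group distance in the toy data exceeds every within-group distance, R reaches its maximum of 1. With real data, the resulting p-values for each protein would then be adjusted for multiple testing.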

Association of disease severity and amino acid usage at each residue
The meta-CATS algorithm [48] was used to statistically associate residue positions of the RSV amino acid sequences with disease severity status (mild/severe). Protein sequences were aligned using MUSCLE. Subtypes (RSVA and RSVB) were tested separately. P values of less than 0.05 were considered significant.
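A per-residue association scan of this kind can be sketched as below. This is not the meta-CATS code; it assumes that each variable alignment column is tested with a residue-by-severity contingency table and a Pearson chi-square statistic, and it reports the statistic and degrees of freedom rather than a p-value. The function names, toy alignment, and labels are hypothetical.

```python
from collections import Counter

def column_chi_square(residues, groups):
    """Chi-square statistic and degrees of freedom for one alignment column."""
    n = len(residues)
    observed = Counter(zip(residues, groups))
    residue_totals = Counter(residues)
    group_totals = Counter(groups)
    stat = 0.0
    for residue, res_n in residue_totals.items():
        for group, grp_n in group_totals.items():
            expected = res_n * grp_n / n
            stat += (observed[(residue, group)] - expected) ** 2 / expected
    dof = (len(residue_totals) - 1) * (len(group_totals) - 1)
    return stat, dof

def scan_alignment(aligned, groups):
    """Rank variable columns (1-based positions) by chi-square statistic."""
    results = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        if len(set(column)) < 2:
            continue  # invariant columns carry no association signal
        stat, dof = column_chi_square(column, groups)
        results.append((pos + 1, stat, dof))
    return sorted(results, key=lambda item: -item[1])

# Hypothetical aligned peptides; position 2 tracks severity, position 4 does not.
toy_alignment = ["MSKT", "MSKT", "MSKA", "MPKT", "MPKA", "MPKT"]
toy_groups = ["severe"] * 3 + ["mild"] * 3
ranked = scan_alignment(toy_alignment, toy_groups)
print(ranked)
```

Running the scan separately for RSVA and RSVB, as done in this study, avoids flagging residues that merely distinguish the two subtypes.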

Results
Phylogeny-trait association testing demonstrated a significant association between the distribution of mild/severe traits and tree topology (Table 3). The association index, which tests the association between traits (mild/severe) and phylogenetic clustering, was significant (p-value = 0.023). Additionally, the parsimony score, which counts the number of state changes required to explain the observed trait distribution in the phylogenetic tree, showed a significant association (p-value = 0.012) between disease severity and phylogeny. Lastly, the maximum exclusive single-state clade size, which is expected to be larger when tips all share the same trait, was significant for the severe trait (p-value = 0.027), but not for the mild trait (p-value = 0.503).
The G protein of RSVA was significantly associated with disease severity in both statistical tests (Table 4; Fig. 3). The M2-2 proteins of RSVA and RSVB were significantly associated with disease severity in both statistical tests. The NS2 protein was also significantly associated with disease severity in the RSVB subtype, but only in one statistical test.
The association between disease severity and amino acid usage at each residue position was assessed for G and M2-2 sequences (Table 5). The RSVA G protein had four amino acids associated with severity status: three in the Mucin-like-1 domain and one in the heparin-binding domain (HBD). The RSVB G protein had nine amino acids associated with severity status: three in the Mucin-like-1 domain, two in the HBD, and four in the Mucin-like-2 domain. The RSVA M2-2 protein had one amino acid associated with severity status, while the RSVB M2-2 protein had two.

Discussion
Severe RSV disease is multifactorial and the contribution of virus genetics is still debated. Here we sought to provide evidence that RSV-associated severe respiratory disease in young children (0-8 months old) experiencing their primary infection is associated with virus genotype. In this study, we assessed genomic variation of RSV viruses that circulated in Rochester, New York from 1977 to 1998. Our findings agree with others that the RSV genotype changes over time and multiple genotypes circulate.
We compared RSV sequence variation and disease severity using both phylogenetic and non-phylogenetic approaches. Phylogenetic approaches demonstrated that specific RSVA and RSVB genotypes (GA1 and GB1) were associated with disease severity and GB4 was exclusively seen in severe cases. This is in contrast with other studies showing that GA3 was more associated with increased disease severity [16], but given the differences in genotypes and in the years in which sequences were sampled, it is difficult to make direct comparisons. For instance, in our study both GB3 and GB4 had fewer than ten sequences, making comparisons between mild and severe cases difficult to interpret. Interestingly, phylogenetic tree topology, including monophyletic clades, was associated with severe disease, suggesting that disease severity may be tied to RSV evolution as previously suggested [16,52,53].
Fig. 2 Comparison of Amino Acid Across RSV Proteins. The number of amino acid substitutions between each gene coding region of the whole-genome RSV sequences was calculated. A Boxplot of the number of amino acid substitutions within subtype for each RSV gene coding region. B Boxplot of the number of amino acid substitutions, as a percentage of the amino acid length of the coding sequence, within subtype for each RSV gene coding region.
Using a non-phylogenetic approach, we found that variation in specific RSV genes was associated with disease severity. Specifically, the G protein from RSVA was associated with disease severity, but not G from RSVB. It is perhaps not surprising that G variation was associated with disease severity given the effect the G protein has on attachment to the host [32,33], immune cell migration [32], host response [34,54,55], and antigenic differences [56]. RSVA also showed greater variation over the time period than RSVB, which may have contributed to this difference.
We found that M2-2 protein variation was associated with disease severity for both RSVA and RSVB. M2-2 has been shown to be involved in regulating the balance between viral RNA transcription and replication. RSV viruses with a deletion of M2-2 show decreased virion production and increased protein production [57]. For that reason, a current RSV vaccine candidate uses an attenuated virus with an M2-2 gene deletion [57][58][59]. This work suggests that variation in the M2-2 gene may affect transcription/replication regulation, leading to differences in host disease severity.
A significant association between RSVB NS2 variation and disease severity was found, but only for the adonis2 test, possibly because the NS2 gene showed limited variability and the Anosim test is affected when there are limited degrees of freedom. The inconsistency between the two statistical tests makes the NS2 results inconclusive. Given the role NS2 plays in immunomodulation [60,61], additional studies will be needed to determine if NS2 plays a role in severe disease during primary infection.
The impact that specific amino acid substitutions in RSV have on disease severity is still largely unexplored. State-of-the-art methods were used to associate specific amino acid substitutions in the G or M2-2 proteins with disease severity in young children. Multiple studies have examined positively selected sites in RSV and have found significantly drifting sites in both G and M2-2. Several have shown positively selected sites in G [22,23,31,62], and in a cohort of RSV-positive infants in Vietnam, M2-2 was found to be under positive selection [22].
Our results suggest that RSV variation in the G or M2-2 genes can impact disease severity in the very young experiencing a primary RSV infection. It is worth noting that age was significantly different between mild and severe subjects, and we cannot rule out that our results indicate variants more likely to infect younger children, who are more susceptible to severe disease. Although our studies were not designed to investigate mechanism or causality, they do suggest RSV genes that may play a role in disease severity. Whether these changes that are associated with disease severity arise due to adaptive pressures, or just random genetic drift, is still unknown, and future studies will be needed to confirm mechanistically the effect this variation has on RSV infection and disease. As we approach a new era with newly licensed therapeutics and vaccines, their ability to affect public health may depend on their ability to protect against severe variants.
Fig. 3 Amino Acid Variability Among RSV G and M2-2 Proteins. Protein sequences for G and M2-2 proteins from RSVA and RSVB subtypes were aligned separately. The number of amino acid substitutions was calculated between all strains, and Principal Coordinate Analysis was performed to demonstrate amino acid variability in reduced dimensional space. Ellipses are centered on centroids with 1 standard deviation. Points are colored by disease severity status; red = mild, black = severe. When a point contains multiple sequences from patients of both disease types, it is colored by the more numerous disease type.

Table 2
Association of Genotype and Severity

Table 3
Association of Phylogeny and Severity

Table 4
Association of Protein Variation and Severity