Molecular epidemiology of SARS-CoV-2 isolated from COVID-19 family clusters

Background Transmission within families and multiple spike protein mutations have been associated with the rapid transmission of SARS-CoV-2. We aimed to: (1) describe full genome characterization of SARS-CoV-2 and correlate the sequences with epidemiological data within family clusters, and (2) conduct phylogenetic analysis of all samples from Yogyakarta and Central Java, Indonesia and other countries. Methods The study involved 17 patients with COVID-19, including two family clusters. We determined the full-genome sequences of SARS-CoV-2 using the Illumina MiSeq next-generation sequencer. Phylogenetic analysis was performed using a dataset of 142 full-genomes of SARS-CoV-2 from different regions. Results Ninety-four SNPs were detected throughout the open reading frame (ORF) of SARS-CoV-2 samples with 58% (54/94) of the nucleic acid changes resulting in amino acid mutations. About 94% (16/17) of the virus samples showed D614G on spike protein and 56% of these (9/16) showed other various amino acid mutations on this protein, including L5F, V83L, V213A, W258R, Q677H, and N811I. The virus samples from family cluster-1 (n = 3) belong to the same clade GH, in which two were collected from deceased patients, and the other from the survived patient. All samples from this family cluster revealed a combination of spike protein mutations of D614G and V213A. Virus samples from family cluster-2 (n = 3) also belonged to the clade GH and showed other spike protein mutations of L5F alongside the D614G mutation. Conclusions Our study is the first comprehensive report associating the full-genome sequences of SARS-CoV-2 with the epidemiological data within family clusters. Phylogenetic analysis revealed that the three viruses from family cluster-1 formed a monophyletic group, whereas viruses from family cluster-2 formed a polyphyletic group indicating there is the possibility of different sources of infection. This study highlights how the same spike protein mutations among members of the same family might show different disease outcomes. Supplementary Information The online version contains supplementary material available at 10.1186/s12920-021-00990-3.

One of the most important factors affecting the rapid spreading of COVID-19 is transmission within families [4,5]. Genomic epidemiology has been suggested to be important to fill the gaps in identifying the SARS-CoV-2 infection sources [6]. However, to our best knowledge, no reports have described the genomic epidemiology within family clusters [6][7][8]. Moreover, multiple spike protein mutations have been associated with a higher transmissibility of SARS-CoV-2 [9]. In this study, we aimed to: (1) perform full genome characterization of SARS-CoV-2 and correlate the sequences with the epidemiological data within family clusters in Indonesia, and (2) conduct phylogenetic analysis of all samples from Yogyakarta and Central Java, Indonesia, involving the family clusters, and virus data from other regions in Indonesia.

SARS-CoV-2 samples
We collected all virus samples of confirmed COVID-19 patients from Yogyakarta and Central Java provinces from June to November 2020. All nasopharyngeal samples were collected in viral transport media (DNA/RNA Shield ™ Collection Tube with Swab, Zymo Research, CA, United States) and transported to four COVID-19 diagnostic laboratories in Yogyakarta province: (

Full-genome sequencing
First, we performed RNA extraction of 19 nasopharyngeal swab samples by a QiAMP Viral RNA mini kit (Qiagen, Hilden, Germany), synthesized the doublestranded cDNA by Maxima H Minus Double-Stranded cDNA Synthesis (Thermo Fisher Scientific, MA, United States), and purified the cDNA using a GeneJET PCR Purification Kit (Thermo Fisher Scientific, MA, United States). For library preparations, we utilized the Nextera DNA Flex for Enrichment using Respiratory Virus Oligos Panel, whereas for full-genome sequencing, we used next generation sequencing (NGS) applied in the Illumina MiSeq instrument (Illumina, San Diego, CA, United States) with Illumina MiSeq reagents v3 150 cycles (2 × 75 cycles). We excluded two samples for further bioinformatics analysis because of low coverages. Our sample genomes were assembled by mapping to the reference genome from Wuhan, China (hCoV-19/Wuhan/ Hu-1/2019, GenBank accession number: NC_045512.2) using Burrow-Wheeler Aligner (BWA) algorithm embedded in UGENE v. 1.30 [10]. Identification of single nucleotide polymorphisms (SNPs) was performed using the number of high confidence base calls (consensus sequence variations of the assembly) that disagree with the reference bases for the genome position of interest, then all SNPs were exported to a vcf. file and visualized in MS Excel . The following accession IDs for the 17 samples  are: EPI_ISL_516800, EPI_ISL_516806, EPI_ISL_516829,  EPI_ISL_525492, EPI_ISL_576383, EPI_ISL_632936,  EPI_ISL_610161, EPI_ISL_610162, EPI_ISL_576145,  EPI_ISL_632937, EPI_ISL_575331, EPI_ISL_576113,  EPI_ISL_576114, EPI_ISL_576115, EPI_ISL_576116, EPI_ISL_576128, and EPI_ISL_576130 [11]. The first four IDs have been reported in our previous study [12].

Phylogenetic analysis
We used the reference genome of hCoV-19/Wuhan/ Hu-1/2019 (NC_045512.2) for annotation of our sequences. A dataset of 142 available SARS-CoV-2 genomes (89 sequences from Indonesia and 53 from other countries) was retrieved from GISAID to conduct a phylogenetic analysis (Acknowledgment Table is provided in Additional file 2: Table S2). We only used the full-genome sequences of several strains representing SARS-CoV-2 clades from some countries that had complete genome data and no long stretches of 'NNNN' for the phylogenetic analysis. The MAFFT program server was utilized for multiple nucleotide sequence alignment (https:// mafft. cbrc. jp/ align ment/ server/). A phylogenetic there is the possibility of different sources of infection. This study highlights how the same spike protein mutations among members of the same family might show different disease outcomes.
Keywords: COVID-19 severity, Family cluster, Multiple spike protein mutations, Phylogenetic analysis, SARS-CoV-2 transmission, Whole genome sequencing tree was constructed from 29.409 nt length of the open reading frame (ORF) of 142 SARS-CoV-2 virus sequences using Neighbor Joining statistical method with 2000 bootstrap replications. The evolutionary distances were computed using the Kimura 2-parameter method and the rate variation among sites was modelled using a gamma distribution with estimated shape parameter (α) for the dataset. The estimation of α gamma distribution was calculated in DAMBE version 7 [13], whereas all the other analyses were performed in MEGA version 10 (MEGA X) [14].
Our study was approved by the Medical and Health Research Ethics Committee of the Faculty of Medicine, Public Health and Nursing, Universitas Gadjah Mada/ Dr. Sardjito Hospital (KE/FK/0563/EC/2020). All participants or guardians signed a written informed consent for participating in this study.

Molecular analysis
Ninety-four SNPs were detected throughout the ORP of the SARS-CoV-2 virus samples with 60% (54/94) of the nucleic acid changes resulting in amino acid substitutions (missense mutations) ( Table 1, detailed in Additional file 1: Table S1). The types of nucleic acid base changes were more often detected as transitions (70%) compared to transversions (30%). Higher entropy values were observed more from nucleic acids that carried more frequent base changes; however, nucleic acid changes that caused missense mutation could have lower entropy values than those that resulted in synonymous mutation.

COVID-19's severity and spike protein mutations of COVID-19 samples
Based on the case definition of COVID-19 severity developed for this study, 3 of 17 virus samples (17.6%) were collected each from asymptomatic cases (people) and critical cases, 5 virus samples (29.4%) from mild cases, and 6 virus samples (35.3%) from moderate cases ( Table 2). Two of the patients with critical stages eventually died. A range of Ct values was found amongst different stages of severity, nevertheless all the virus samples with D614G mutations, except one (YO-UGM-10004/2020|EPI_ISL_576116), showed lower Ct values    (clade GH, GR, and O, Ct range 16.9-24.7) than those with no mutation in this position (clade L, Ct 27.9). Dual mutations of V213A and D614G on spike protein were detected in four patients, and two of these eventually died after a period of hospitalization.

Disease outcomes of COVID-19's family clusters
The epidemiological and clinical data of COVID-19's family clusters, including clinical symptoms, date of first symptoms appeared, diagnostic results, abnormal findings, comorbidity background are provided by timeline and tabulation in Fig. 2 and Table 3, respectively. In family cluster-1, all three patients showed critical COVID-19 and eventually, two died (YO-UGM-10001|EPI_ISL_576113 and YO-UGM-10003|EPI_ISL_576115) and one survived (YO-UGM-10002|EPI_ISL_576114). The disease began from patient-1.1, a 28-year-old male, who had a history of He complained of fever, sore throat, cough and malaise on August 8th, 2020, and was tested for PCR three days afterward with the result of COVID-19 positive. His father (patient-1.2, 58-yo), who was living in the same house, showed fever, dyspnea and diarrhea on August 13th, then was followed by his grandfather (patient-1.3, 88-yo) who showed fever and dyspnea on August 18th. The PCR tests for both patients showed positive for COVID-19. All patients developed severe disease outcomes including bilateral pneumonia, cardiomegaly and ARDS. Several comorbidities were recorded from patient-1.1 (obesity), patient-1.2 (diabetes mellitus, and obesity), and patient-1.3 (type 2 diabetes mellitus, geriatric syndrome, history of infarction stroke). Patient-1.1 was uneventfully discharged from the hospital on day 29 of hospitalization, but sadly, patient-1.2 and patient-1.3 passed away in the hospital after 7 and 12 days of hospitalization, respectively.
Family cluster-2 involved three patients which were comprised of a son, 8-yo (patient-2.1), father, 35-yo (patient-2.2) and mother, 33-yo, with the following virus samples: YO-UGM-10006 (EPI_ISL_576130), YO-UGM-10004/2020 (EPI_ISL_576114), and YO-UGM-10005/2020 (EPI_ISL_576128), respectively. Prior to the index case, this family travelled from West Nusa Tenggara to Yogyakarta on August 2nd, 2020, in order to obtain medical treatment for patient-2.1 who had an autoimmune disorder in a hospital in Yogyakarta. Patient-2.1 had firstly exhibited symptoms of fever, cough, and runny nose on August 11th, 2020 and he was diagnosed COVID-19 positive on August 18th. His parents showed clinical signs of cold, headache and dry cough (patient-2.2) and cough, sore throat, and runny nose (patient-2.3) at the same day on August 20th and the PCR results of both patients were positive on August 21st. Patient-2.1 developed moderate severity with bilateral paracardial infiltrate, whereas patient-2.2 and patient-2.3 developed mild disease without any abnormalities in their chest X-rays and other laboratory findings, except eosinophilia (6.4%), increased levels in pH of the blood (7.45), PaO 2 (83.2), PaCO 2 (39.2) with SaO 2 96.8% and PaO 2 /FiO 2 value was 416 from the arterial blood glass analysis of patient-2.2. All three patients uneventfully recovered and were discharged from the hospital on September 2nd, 2020.

Molecular characterizations of virus samples collected from family clusters
Phylogenetic analysis revealed that the three viruses from family cluster-1 were grouped together from a single node. A matrix of nucleic acid difference showed that YO-UGM-10001|EPI_ISL_576113 and YO-UGM-10003|EPI_ISL_576115 were identical on their ORF (nucleic acid and protein levels) and both virus strains had differences of 2 nucleic acids and 1 amino acid in the NSP2 protein which correspond with V247A substitution in YO-UGM-10001|EPI_ISL_576113 and YO-UGM-10003|EPI_ISL_576115 and T256I substitution in YO-UGM-10002|EPI_ISL_576114, respectively (Table 4). Other unique mutations in the other viral proteins were detected in these three virus strains which were not shown in the other study viruses, including V213A (Spike), K12R (NSP5), T248I (NSP12/RdRp), A119S and S193I (N). The virus samples from family cluster-2 were separated in different nodes in the phylogenetic tree (Fig. 3). The tree and the matrix sequence showed that YO-UGM-10005/2020|EPI_ISL_576128 and YO-UGM-10006|EPI_ ISL_576130 were genetically identical. Both virus strains had 15 nucleic acid differences compared to YO-UGM-10004|EPI_ISL_576114 which resulted in amino acid variations detected in several viral proteins (Table 4).

Discussion
Our study provides evidence of SARS-CoV-2 transmission within families, in which the same mutation of the spike protein in each family cluster was identified. It is important to understand the transmission routes of SARS-CoV-2 to prevent and control its spreading [4]. Families have been reported as the most dominant infection cluster of COVID-19 [16]. Family clusters have a higher risk of cross-infection because of frequent and close contact among each family member [4]. Our study also documented that although all family members showed the same multiple S protein mutations, however, they revealed different outcomes. While multiple S protein mutations, i.e. B.1.1.7 variant, have been associated with the severity of COVID-19 [17,18], this is not the case for our patients. Our samples did not consist of B.1.1.7 variant. In addition, several prognostic factors have been associated with increased risk of severity and mortality of COVID-19, including increasing age, obesity, and comorbidities such as hypertension, diabetes and cerebrovascular disease [15]. Our patients who eventually died (YO-UGM-10001|EPI_ISL_576113 and YO-UGM-10003|EPI_ISL_576115) have more prognostic factors than the patient who survived (YO-UGM-10002|EPI_ ISL_576114) (Table 3). Besides the SARS-CoV-2 variants and prognostic factors, a recent GWAS identified rs11385942 at locus 3p21.31 and rs657152 at locus 9q34.2 as a genetic risk factor for severe COVID-19 [19]. Further study is necessary to confirm whether these polymorphisms might be as susceptible factors in our patients.
Double mutations of V213A and D614G on spike protein were detected in four patients, but three of them (75%) developed severe diseases causing critical conditions and two (50%) with fatal outcome. Another interesting finding was documented from family cluster-1, in which all the virus samples in this family cluster belong to the same clade GH, but two patients died and one survived ( Table 2). All samples from family cluster-1 revealed another spike protein mutation, V213A, besides D614G. However, virus samples isolated from fatal disease outcomes carried V247A mutation in the NSP2 protein, while those from the recovered patient did not. In conjunction with D614G mutation, substitution of valine (V) to alanine (A) in position 247 and 213 of NSP2 and spike protein, respectively, were detected in the patients with fatal disease outcomes. While both V and A, as well as G are in the non-polar hydrophobic amino acid group and no evidence shows that the double mutations of V213A and D614G affect the severity and lethality of COVID-19 patients, further investigations are necessary to determine whether these dual mutations (V213A and D614G in spike protein) or even triple mutations (V213A and D614G in spike protein and V47A in NS2) associated with increased risk of mortality in COVID-19 patients. Moreover, due to limited number of sample size in this study, it is very difficult to associate between the number of mutations on the spike protein or other proteins or SNPs and severity of COVID-19.
Phylogenetic analysis revealed that three viruses from family cluster-1 formed a monophyletic group.
The epidemiological and genetic data indicated that local transmission occurred in family cluster-1 in which patient-1.1 (YO-UGM-10002|EPI_ISL_576114) was initially infected and then transmitted the virus Table 4 Amino acid mutations detected in SARS-CoV-2 viruses collected from two family cluster cases in Yogyakarta and Central Java provinces Family clusters are indicated in asterisks: Case-1 (*) and Case-2 (**). In family cluster-1 and 2, variations in the amino acids of the NSP2 protein (V247A and T256I, and T256I and Q321K, respectively) were noted, but not in others. The virus samples isolated from fatal disease outcomes of family cluster-1 (YO-UGM-10001|EPI_ ISL_576113 and YO-UGM-10003|EPI_ISL_576115) carried V247A mutation in the NSP2 protein, while those from the recovered patient (YO-UGM-10002|EPI_ ISL_576114) did not. Bold, amino acids were different from their reference (  Recently, more than 50% of the viral genome sequences in the UK were reported to have a new single phylogenetic cluster, i.e. B.1.1.7 variant (multiple spike protein mutations: deletion 69-70, deletion 144, N501Y, A570D, D614G, P681H, T716I, S982A, D1118H) [9]. These new variants have been associated with a higher transmissibility of SARS-CoV-2 up to 70% [9]. Until the submission date of April 2021 in GISAID, these variants were also detected in Asia, including Indonesia [11]. Interestingly, we detected other spike protein mutations in our collected virus strains, including those from the family clusters, i.e. L5F, V213A, W258R, Q677H, and K811I. Noteworthy, the V213A variant was identified in all patients from family cluster-1. V213A was detected in 4/17 (23.5%) of our samples. This variant is only found in only 0.01% of samples in four countries, including Indonesia [11]. Whether this variant is due to a founder effect needs further study.
Currently, besides the D614G variant, several mutations within the receptor binding domains (RBD) of the S protein have attracted most scientists' attention due to their increased frequency in certain countries, including S477N (Australia and some Central European), N439K (UK and European), and N501Y (part of the new UK variant B.1.1.7, the new South Africa variant 501.V2 and the new Brazil variant P.1) [11]. These variants might be associated with some potential advantages for these viruses. While the B.1.1.7 variant has been associated with COVID-19 clinical severity [17,18], the 501.V2 and P.1 variants have not [20].
In addition, among eight clades in the GISAID classification, we only detected five clades, i.e. L, G, GH, GR, and O, in the SARS-CoV-2 samples from Indonesia and most of them (~ 60%) contained D614G. Globally, D614G has been detected in ~ 97% samples in 182 countries [11]. While a recent study showed that D614G mutation is significantly associated with the increase of SARS-CoV-2 infectivity, competitive fitness, and transmission in primary human airway epithelial cells and hamsters [21], it does not associate with the clinical severity of COVID-19 patients [22]. Moreover, it is difficult to assess the convergent evolution of D614G mutation in our samples since all samples were from Yogyakarta and Central Java and D614G has been already found in most samples (97%) from all over the world [11]. These findings were compatible with previous studies [22,23]. The hypothesis of convergent evolution for D614G mutation is not supported by the sequence data since almost all 614G variants derived from the same ancestor [23]. Volz et al. [22] proposed a more complex selective landscape in the spike protein for the co-occurring variants between D614G and the neighbouring sites (615 and 613).
Phylogenetic analysis showed that the full-genome sequences of SARS-CoV-2 identified within these family clusters are identical, which strongly indicates a direct transmission within these families. Moreover, our study is also able to determine the virus clades of COVID-19 cases with unknown contact history with a confirmed COVID-19 case. Our findings support a previous suggestion regarding the importance of genomic epidemiology in filling the gaps of identifying SARS-CoV-2 infection sources [6]. Therefore, a full-genome surveillance of SARS-CoV-2 in Indonesia is essential to prevent further transmission of SARS-CoV-2 and to identify any established or new variant that might affect the SARS-CoV-2 transmission and severity.
Notably, our study only included a limited number of family clusters from Yogyakarta and Central Java, Indonesia. These limitations should be considered for interpretations of our findings.

Conclusions
This is the first molecular epidemiology study associating the full-genome sequences of SARS-CoV-2 with the epidemiological and clinical data within family clusters. Phylogenetic analysis revealed that the three viruses from family cluster-1 formed a monophyletic group, whereas viruses from family cluster-2 formed a polyphyletic group indicating there is the possibility of different sources of infection. This study highlights how the same spike protein mutations among members of the same family might show different disease outcomes. Moreover, we also detected multiple spike protein mutations in our samples. Further studies are necessary to clarify the impact of these multiple spike protein mutations in the transmission and severity of SARS-CoV-2 infection, especially in Indonesia.