Novel disease syndromes unveiled by integrative multiscale network analysis of diseases sharing molecular effectors and comorbidities
BMC Medical Genomicsvolume 11, Article number: 112 (2018)
Forty-two percent of patients experience disease comorbidity, contributing substantially to mortality rates and increased healthcare costs. Yet, the possibility of underlying shared mechanisms for diseases remains not well established, and few studies have confirmed their molecular predictions with clinical datasets.
In this work, we integrated genome-wide association study (GWAS) associating diseases and single nucleotide polymorphisms (SNPs) with transcript regulatory activity from expression quantitative trait loci (eQTL). This allowed novel mechanistic insights for noncoding and intergenic regions. We then analyzed pairs of SNPs across diseases to identify shared molecular effectors robust to multiple test correction (False Discovery Rate FDReRNA < 0.05). We hypothesized that disease pairs found to be molecularly convergent would also be significantly overrepresented among comorbidities in clinical datasets. To assess our hypothesis, we used clinical claims datasets from the Healthcare Cost and Utilization Project (HCUP) and calculated significant disease comorbidities (FDRcomorbidity < 0.05). We finally verified if disease pairs resulting molecularly convergent were also statistically comorbid more than by chance using the Fisher’s Exact Test.
Our approach integrates: (i) 6175 SNPs associated with 238 diseases from ~ 1000 GWAS, (ii) eQTL associations from 19 tissues, and (iii) claims data for 35 million patients from HCUP. Logistic regression (controlled for age, gender, and race) identified comorbidities in HCUP, while enrichment analyses identified cis- and trans-eQTL downstream effectors of GWAS-identified variants. Among ~ 16,000 combinations of diseases, 398 disease-pairs were prioritized by both convergent eQTL-genetics (RNA overlap enrichment, FDReRNA < 0.05) and clinical comorbidities (OR > 1.5, FDRcomorbidity < 0.05). Case studies of comorbidities illustrate specific convergent noncoding regulatory elements. An intergenic architecture of disease comorbidity was unveiled due to GWAS and eQTL-derived convergent mechanisms between distinct diseases being overrepresented among observed comorbidities in clinical datasets (OR = 8.6, p-value = 6.4 × 10− 5 FET).
These comorbid diseases with convergent eQTL genetic mechanisms suggest clinical syndromes. While it took over a decade to confirm the genetic underpinning of the metabolic syndrome, this study is likely highlighting hundreds of new ones. Further, this knowledge may improve the clinical management of comorbidities with precision and shed light on novel approaches of drug repositioning or SNP-guided precision molecular therapy inclusive of intergenic risks.
Comorbidity, or the co-occurrence of two or more diseases with each other, is a widespread phenomenon with estimates suggesting that 42% of all patients have at least one comorbidity . For example, comorbid metabolic syndrome, presenting at least two diseases from a cluster of metabolic underlying medical conditions, afflicts 47 million people in the US alone . Comorbidities do not occur randomly, and the excess observation of specific disease-disease co-occurrence in clinical records can imply shared underlying pathophysiological mechanisms . Another recent study showed that 120 disease-trait pairs (e.g., acute lymphoblastic leukemia-mean corpuscular volume) with shared genetic architecture using curated gene association studies significantly co-occurred in electronic health records .
Big data science approaches have begun to characterize biomolecular mechanisms that lead to cross-disease relationships; however, these have primarily focused on specific disease-disease hypotheses or required biochemically well-described disease-protein associations as input (e.g., using protein-protein interactions [5, 6]). In addition, very few studies have validated their predictions of comorbid syndromes (a group of medical conditions consistently occurring together) by observation in clinical datasets . Specifically, these few computational biology studies have shown to correlate with clinical comorbidity: (i) the presence of shared gene expression and flux coupling in metabolic pathways of disease-causing genes , (ii) the overlap of disease-associated host genes of polymorphisms and their interacting proteins or functional annotations [9, 10], (iii) the comorbidity of diseases sharing Mendelian genetics , (iv) the overrepresentation of Mendelian disease genes in differentially expressed genes of cancers , and (v) the genetic or phenotypic network proximity observed in databases of complex and Mendelian genetics . However, all of these approaches focused on candidates within the protein-coding genome, with intergenic regions neglected by design. Yet, intergenic and noncoding regions of the genome contain the majority of genetic variants associated with common diseases according to genome-wide association studies (GWAS) [7, 14]. Despite this abundance of noncoding genetic signals, the role of these intergenic polymorphisms in the pathology of clinical comorbidities remains insufficiently characterized [15, 16].
We hypothesize that intergenic variants explain in part the emergence of comorbidities through their role in gene regulation. Variants located far from coding regions and noncoding variants could cause comorbidity by regulating expression levels of messenger RNA transcripts, which are associated with both diseases. These regulated genes can presumably be imputed using expression quantitative trait loci (eQTL) that associate single nucleotide polymorphisms (SNPs) with altered levels of messenger RNAs, microRNAs, and/or noncoding RNA transcript expression. Note, we will refer eQTL-associated RNAs as eQTL RNA hereafter. Statistically-significant overlap of eQTL-associated RNAs whose expression is regulated by distinct disease-associated SNPs could identify a novel shared biomolecular mechanism between a disease-disease pair, or comorbid syndromes in general. This relationship could then be validated by the overrepresentation of that disease-disease pair (comorbidity) in clinical care.
To examine this hypothesis, we integrated summary statistics from the NHGRI-EBI GWAS Catalog of published genome-wide association studies  with eQTL calculated by Fagny and colleagues  from the Gene-Tissue Expression (GTEx) project  to investigate the effects of intergenic variants with an emphasis on trans-regulation of gene expression. We previously published precursors to this approach which were limited to lymphoblastic cell line and liver tissue-derived eQTL data [19, 20]. In these two studies, we could accurately predict genetic synergy and antagonism between within-disease SNP pairs which were validated by an independent GWAS. This demonstrates that within-disease-SNP pairs, for which a statistical relationship has been predicted between their downstream messenger RNAs are associated by eQTL, were substantially overrepresented among interacting chromatin elements found in ChIA-PET data drawn from The Encyclopedia of DNA Elements . With that established, the study in this paper now focuses on convergent mechanisms between diseases and leverages a large clinical dataset from 35 million patients.
An overview of this study is shown in Fig. 1 and consists of six major steps: 1) data preprocessing (Fig. 1a), 2) computation of convergent molecular mechanisms between SNP pairs associated with distinct diseases via GWAS and eQTL studies (Fig. 1b), 3) disease comorbidity calculations using Healthcare Cost and Utilization Project (HCUP) clinical datasets (Fig. 1c), 4) comparative study and validation (Fig. 1d), 5) network construction (Fig. 1e), and 6) additional validation through manual curation (Fig. 1f). In detail, the preprocessing step (Methods- Data preprocessing to define disease bundles and map heterogeneous diseases representation) consisted of defining the terminology between the disease classes of interest, represented in the GWAS studies, for the purposes of the study, i.e., disease-bundles. This disease list was obtained through a semi-automated procedure consisting of expert curation and the use of different data sources, such as the Systematized Nomenclature of Medicine--Clinical Terms (SNOMED-CT) [22, 23] and the Unified Medical Language System (UMLS) . This also enabled the harmonization of discrepant disease nomenclatures of the GWAS studies and the claim records by defining a controlled disease terminology between the molecular and clinical datasets. The second step included the imputation of molecular convergence between disease bundle pairs and the related statistical significance using eQTL associations of SNPs associated via GWAS to the two diseases in the pair (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs). Next, we used the HCUP Electronic Health Records (EHR) datasets to calculate disease comorbidities and their statistical significance (Methods- Calculation of disease comorbidity based on HCUP). Finally, to verify our hypothesis, we studied the concordance between the genetic convergence and clinical comorbidity and discuss our findings (Methods- Comparative studies between eQTLs and HCUP - Curation of prioritized comorbidities).
The study integrates nine data sources (Table 1). GWAS diseases and their reproducible disease-associated SNPs were downloaded from the NHGRI-EBI GWAS catalog. To extract eQTL associations linking SNPs to expressed genes (RNAs), we downloaded files calculated by Fagny and colleagues  from the GTEx project V6  spanning 19 tissues. The authors determined cis- and trans-eQTL at p-value< 0.2 for 19 tissues. These files were chosen to conduct a more in-depth analysis of trans-eQTLs. For clinical data, we acquired the Healthcare Cost and Utilization Project (HCUP) claim datasets. Both the HCUP National Inpatient Sample (NIS13) and the HCUP Nationwide Emergency Department Sample (NEDS13) datasets were employed to compute and ensure the reproducibility of comorbidities. For the definition of the disease bundles and for mapping purposes, we integrated several phenotype datasets, including EMBL-EBI Experimental Factor Ontology (EFO) , the SNOMED-CT, and the Unified Medical Language System (UMLS) MRCONSO file. Finally, other datasets were downloaded as needed for biological study, such as the Single Nucleotide Polymorphism database (dbSNP)  for intragenic (within gene coordinates) and intergenic (between genes) categorization, and HapMap , 1000 Genomes Project , and LDlink  for linkage disequilibrium (LD) in case studies.
Data preprocessing to define disease bundles and map heterogeneous diseases representation
The HCUP clinical datasets use ICD-9-CM diagnosis codes while the genetic GWAS dataset uses heterogeneous descriptive language for disease names that vary even in related studies of the same trait. To bridge the heterogeneous disease names between datasets, as a first step (Fig. 1a), we performed semi-automated curation and normalization of the disease names into the SNOMED-CT concepts. The use of SNOMED-CT, i.e., a comprehensive ontology organized as a directed acyclic graph (DAG), allowed for the automatic calculation of a semantic relatedness/similarity between the diseases by leveraging the SNOMED-CT hierarchy . Thanks to this procedure, all disease names in the GWAS Catalog were first regrouped into SNOMED-CT classes of proper semantic granularity, i.e., disease-bundles (Methods- Creation of the SNOMED-coded disease-bundles from GWAS terms). Similarly, each ICD-9-CM code  relating to a disease in the HCUP datasets was also mapped to the corresponding SNOMED-CT code, facilitating the comparison with the diseases represented in the GWAS studies (Methods- Mapping HCUP diseases to disease-bundles).
Creation of the SNOMED-coded disease-bundles from GWAS terms
To focus only on diseases from the GWAS Catalog corresponding to disease phenotypes (e.g., removing phenotypes associated with response to therapy or non-disease traits such as skin color), we kept only GWAS traits under the disease branch of the EMBL-EBI EFO, reducing the number of unique traits from 1622 to 533. Next, a first round of curation was performed by physicians, which further reduced these 533 GWAS disease traits to 481 of interest for purposes of the study. Through the disease-EFO mapping, we linked (whenever available) each GWAS disease-trait to external terminology IDs, i.e., SNOMED-CT and ICD-9-CM. To automatically augment the coverage of GWAS trait-to-SNOMED-CT mapping, we first mapped the EFO-linked external IDs to UMLS Concept Unique Identifiers (CUIs) using the UMLS MRCONSO table and, second, we obtained the final SNOMED-CT list by retaining all the SNOMED-CT IDs covered under each CUI. After the augmented mapping, we linked 431 disease terms to at least one SNOMED-CT ID.
Next, starting from the resulting mapping, we defined a list of non-redundant and clinically meaningful disease-bundle candidates by applying two criteria: 1) merging pairs of diseases into a disease-bundle if they were mapped to an identical set of SNOMED IDs; and 2) merging pairs of diseases into a disease-bundle if the disease names were identical after removing the ending parenthetical qualifier of the GWAS term, e.g., “Glaucoma” and “Glaucoma (high intraocular pressure)”. For quality control, the disease-bundles underwent iterative curation by a physician, a geneticist, and a clinical informatician. Following this procedure, we finally selected from the GWAS Catalog a total of 238 disease-bundles associated with 6175 SNPs.
Mapping HCUP diseases to disease-bundles (coded in SNOMED-CT).
To compare the disease comorbidity obtained using the HCUP datasets with the disease pairs sharing convergent mechanisms via GWAS and eQTL associations, we had to convert the ICD-9-CM diagnosis codes in the HCUP datasets to the previously identified disease-bundles. Since ICD-9-CM is a classification and SNOMED-CT is a nomenclature, our mapped SNOMED-CT concepts are more comprehensive than ICD-9-CM codes . Therefore, we proceeded by identifying all descendant terms of a SNOMED-CT term associated with a GWAS disease-bundle. Next, the most precise correspondence between SNOMED-CT codes and ICD-9-CM codes were identified using the UMLS dataset. Inconsistent SNOMED-CT-to-ICD-9-CM mappings were detected through a final curation by experts. This procedure enabled the mapping of 2454 ICD-9-CM codes into 188 bundles that occur in the two HCUP datasets.
Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs (eQTL RNA overlap)
To identify the effect size and significance of shared genetic risk mechanisms between two diseases, we assessed the statistical overlap of expressed RNAs associated with disease-associated SNPs by eQTL studies (eQTL RNA overlap model, proxy for shared transcript mechanisms). As our focus was specifically on interpretable downstream mechanistic insight through the incorporation of eQTL, we investigated beyond the simple model of SNPs or loci shared by two different diseases (eQTL SNP overlap model). Examples of studies on disease-disease relationships through common shared risk loci (e.g., HLA) can be found in [30,31,32,33].
In our model, we integrated disease-to-SNP associations via GWAS with SNP-to-eQTL RNA associations via eQTL. As each disease may have one or many associated SNPs by GWAS and each SNP may regulate the expression of multiple genes via eQTL (eQTL RNAs), acting either in cis or trans, we used SNP-SNP pairs associated with two distinct diseases and with a statistically-significant overlap of RNA(s) to prioritize the disease pairs (Fig. 1b) (described below).
For each SNP-SNP pair where the SNPs are associated with two distinct diseases, we determined whether or not both SNPs were independently associated with regulating the same eQTL RNA transcript within a specific tissue/cell line. Since we focused on SNP-SNP pairs associated to distinct disease pairs, we could determine disease pairs with significant eQTL RNA overlapping using the Fisher’s Exact Test (FETeRNA). In detail, for each of the 19 tissues, we used the FET by taking all expressed eQTL RNAs as background and evaluating whether or not a eQTL RNA was associated with each of the two diseases via the associated SNPs. This allowed us to construct a contingency table (Fig. 1b) and derive a statistical significance (p-value) between any pair of diseases for each tissue. P-values were adjusted for multiple comparisons by False Discovery Rate (FDReRNA) using the Benjamini–Hochberg procedure .
In our approach, we retained all disease pairs surpassing the stringent cutoff of 0.05 for the FDR values.
Calculation of disease comorbidity based on HCUP
As mentioned, the comorbidity between bundled diseases from two HCUP datasets, the National Inpatient Sample (NIS13) and the Nationwide Emergency Department Sample (NEDS13), were assessed for robust findings. Two directional comorbidities for every pair of disease-bundles were evaluated separately (Fig. 1c). For each pair of disease-bundles, denoted D1 and D2, we tested the following logistic regression models (E: expectation):
The two models correspond to the comorbidity risk of D1 given D2 and D2 given D1 respectively. Covariates (confounders), such as race, sex, and age, were adjusted in both models. β ij are logistic coefficients of each variable to be estimated from the data. We chose not to use alternative methods, such as graphical modeling  and LASSO [36, 37] since graphical models are typically used to understand the joint dependence structure for a set of variables, while LASSO is commonly used for regularization when there are large numbers of potential effects in the model.
In implementation, we collected the most specific disease codes (all five ICD-9-CM digits) mapped to the disease-bundles. Then, for each patient, we determined whether the patient has a billing code among the mapped ICD-9-CM codes, through which the statuses for a pair of disease-bundles were determined. From all patients without missing required data (e.g., missing sex information), the parameters within the two models were estimated and the significance of the directional comorbidity (β11 and β21) was inferred. We adjusted multiple comparisons from both models and across all pairs of disease-bundles using Benjamini-Hochberg procedure to derive the False Discovery Rate , which derived two directional FDRs for each pair of disease-bundles (e.g., disease A may cause disease B and not the reverse). Disease pairs with FDR < 0.05 in either directional tests were considered as comorbidities. Larger odds ratio (OR) were estimated from the power of two coefficients (β11 and β21) as the OR of comorbidity between each pair of diseases. These calculations were conducted in both HCUP datasets separately and the best p-value is reported in Additional file 1 as FDRcomorbidity. We further confirmed the ORs from an additional bivariate logistic model within the Zelig R package for the retained disease pairs, where the bivariate logistic model is unidirectional and able to estimate the dependence of the two disease variables subjected to covariates, yielding a mean odds ratio between a pair of diseases .
Comparative studies between eQTLs and HCUP
To verify the hypothesis that diseases sharing convergent downstream are more likely to show comorbidities and that an association exists between disease comorbidity and genetic/genomic architectures, we investigated the concordance between disease pairs showing significant downstream eQTL convergence in GWAS (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs) and the pairs of diseases resulted prioritized as comorbid using the HCUP clinical data (Methods- Calculation of disease comorbidity based on HCUP). To this end, we performed a FET to test the enrichment of comorbid disease-pairs with the disease pairs sharing eQTL mechanisms (FETfinal, Fig. 1d). In the FETfinal test, we examined the number of disease pairs resulted statistically significant (FDR < 0.05)/not significant in the molecular dataset and in the clinical comorbid data. Therefore, the FETfinal test was conducted by counting the number of disease pairs under each of the four combinative conditions (contingency table shown in Fig. 1d). The signal robustness was verified across different conditions and datasets, different FDR cutoffs (ranging from 0.02 to 1) were evaluated for both comorbidity and eQTL RNA overlap, and the reproducibility and the enrichment trend were examined with respect to the strength of the cutoffs (Fig. 2). The reproducibility from multiple HCUP datasets was also examined using the comorbidity observed in multiple datasets.
In addition, we performed a further validation applying the FET test as previously described but excluded the disease pairs involving similar diseases. This way we could assess the significance of diseases not similar resulted sharing convergent mechanisms as well as comorbid.
To identify similar pairs, we computed a clinical-ontology-based semantic similarity (or distance) between the disease-bundle pairs applying Lin’s similarity metric  with Sanchez et al.’s information content estimation . This approach takes into account the SNOMED-CT ontological structure, and SNOMED-CT are used as proxy for phenotypic relatedness of diseases. For each bundle pair, a score between 0 and 1 was derived. We considered two diseases similar if their similarity score was higher than 0.9.
Network visualization of the comorbidities sharing intergenic genetic risks through eQTL RNA overlap
First, we created a network representing disease pairs (Fig. 1e) including (i) disease pairs sharing genetic mechanism through eQTL RNA overlap of eQTL associations to disease-associated SNPs (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs), and (ii) disease comorbidities in at least one of the two HCUP datasets (Methods - Calculation of disease comorbidity based on HCUP). Here, network nodes represent diseases and two nodes (diseases) are linked (network edge) if the related disease pair meet both the above-mentioned comorbidity and eQTL RNA overlap criteria. To every edge, we assigned a weight corresponding to the number of distinct tissues that yielded the significance of the disease pairs. We colored the nodes according to the clinical organ-system classes as defined by Han et al.  and adjusted edge width according to the edge-related weight value.
Second, for any interesting overlapping disease pairs, we built a related network to represent the biomolecular mechanisms underlying the comorbidity between the two diseases. Therefore, a network was built for each comorbid pair by connecting each disease in the pair to the related SNPs via GWAS and then associating each SNP to the overlapping RNAs resulted via the eQTL associations. Nodes of the resulting network can represent a disease, a SNP, or a RNA, and edges can correspond to disease-to-SNP or SNP-to-eQTL RNA associations. The biomolecular network of a comorbid disease pair therefore includes the common downstream RNAs (prioritized at FDR < 0.05) between the corresponding prioritized and disease-associated SNPs (Methods- Calculation of disease comorbidity based on HCUP). Irrelevant information for the comorbidity (e.g., insignificant eQTL RNA overlap) that can be derived from other edges is not shown. Note, we grouped as a single locus (LD ≥ 0.8) the SNP pairs associated with the same disease and that are in Linkage Disequilibrium (LD). All networks were visualized using Cytoscape .
Curation of prioritized comorbidities
For the comorbidities discovered from HCUP datasets (FDR < 0.05, OR > 3) that have convergent eQTL downstream genes, we conducted a systematic curation of the literature using PUBMED and Google Scholar (Fig. 1f). Disease names were searched in PUBMED and Google Scholar and abstracts of the resulted papers were checked for comorbidity evidence. Full texts were examined if a conclusion of comorbidity could not be concluded in the abstracts. For quality control, three independent curators carried out the curation and resolved. In addition, 15% of random disease pairs (controls), selected among pairs with no comorbidity in either HCUP datasets nor convergent eQTL mechanisms in any GTEx tissue, were added to the curation list (blind to curators). An inter-rater agreement was computed using the Spearman correlation test, while disagreement was thereafter solved under the supervision of an expert physician. Next, we categorized the resulting curation evidence of a disease pair into six levels: 1 = well-performed controlled studies confirming the positive comorbidity, 2 = evidence from studies with important limitation (e.g., small sample size), 3 = anecdotal case reports, 4 = strong evidence of absence of association (well-controlled studies, no significance), 5 = absence of studies in the literature, and 6 = strong evidence for an anti-correlation or non-coexistence of the diseases. Finally, to assess our results, we compared the frequencies of levels of evidence confirming the “prioritized comorbidities sharing molecular mechanisms” with those of the random disease-pair controls using the Chi-square test since multiple levels were involved in the test.
We mapped the diseases extracted from the GWAS Catalog and HCUP datasets into disease bundles as described in Methods- Data preprocessing to define disease bundles and map heterogeneous diseases representation. In detail, 262 disease bundles were associated with 429 diseases collected from the GWAS (Additional file 2). Out of these 262 diseases, 188 are associated with SNPs having eQTL associations with at least one eQTL RNA. On the other hand, using the clinical HCUP datasets, we associated 238 disease bundles with 2454 ICD-9-CM diseases (Additional file 3).
Disease pairs with convergent eQTL-mechanisms of genetic polymorphisms
The eQTL RNA overlap model allowed for the identification of shared RNAs associated with eQTL SNPs that were also significantly associated with two distinct diseases (FETeQTL). We conducted the eQTL RNA enrichment among all the possible combinations of the 188 disease bundles associated with SNPs present in at least one eQTL study, using each of the 19 GTEx tissues. The possible disease pairs combinations resulting from the 19 tissues were 16,320 (Note: this number is not the possible theoretical combination between all the 188 diseases since not all the pairs were present in all tissues). Starting from the GTEx studies, for each tissue, we extracted the eQTL SNPs and RNAs associated to the diseases (Table 2; Input columns) and computed the eQTL mRNA overlap model to extract significant disease pairs with convergent molecular mechanisms (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs). Overall, we prioritized 2043 distinct disease pairs with significant eQTL RNA overlap (FDReRNA < 0.05; across all tissues) out of these possible pairs. In our calculation, we considered SNP-SNP pairs and the related eQTL RNAs where each SNP was associated to a distinct disease. Therefore, we specifically tested for cis- and trans-eQTL overrepresentation from coding as well as noncoding and intergenic SNPs. Note that the number of significant disease pairs varied from tissue to tissue, with lung tissue yielding the minimal number (45) of disease pairs and skin tissue yielding the maximal number (1775) of disease pairs (Table 2; Output column).
From the clinical NIS13 and the NEDS13 HCUP datasets (7 and 29 million patients observed at the hospital and emergency department, respectively and involving 237 disease bundles), 3032 and 4430 significant comorbidities were prioritized (OR > 1.5; FDRcomorbidity < 0.05; Additional file 4) accordingly, after adjusting for age, gender, and race. Overall, the union of the two findings resulted in 5200 significant comorbidities (HCUP OR > 1.5; FDRcomorbidity < 0.05), of which 2346 were found at OR > 3.
Convergent genetic mechanisms between disease-pairs are enriched among comorbid diseases
Among 5200 and 2346 comorbidities respectively observed in the HCUP clinical datasets at OR > 1.5 and OR > 3 (FDRcomorbidity < 0.05; Results- Disease pairs with convergent eQTL-mechanisms of genetic polymorphisms), 398 (Additional file 1) and 211 were also predicted as sharing common mechanisms among 2043 distinct disease pairs resulted with significant eQTL RNA overlap (FDReRNA < 0.05; Results - Disease pairs with convergent eQTL-mechanisms of genetic polymorphisms). Thus, the enrichment of clinical comorbidities with convergent genetics resulted significant (OR = 1.6; p-value = 2 × 10− 9; Fig. 2). Moreover, the enrichment of convergent downstream eQTL RNA (overlap) reaches an odds ratio as high as 8.6 (p-value = 6.4 × 10− 5 FET) when eQTL RNA overlap is FDReRNA < 10− 18, and comorbidity OR > 3 and FDRcomorbidity < 10− 11. The removal of highly-related diseases using information theoretic similarity in SNOMED yielded similar results (result not shown) confirming the robustness of the approach. These results suggest that convergent eQTL regulation by distinct genetic variants may contribute in part to comorbid syndromes.
Visualization of comorbidities sharing intergenic genetic risks through eQTL RNA overlap
The network of disease pairs prioritized as both (i) sharing eQTL RNAs significantly through their respective disease-associated eQTLs (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs) and (ii) comorbid in HCUP datasets (Methods- Calculation of disease comorbidity based on HCUP) is shown in Fig. 3. Two hundred and eleven disease-pairs were observed under these conservative criteria, most of which have been prioritized by eQTL associations derived from multiple relevant tissues, with each being analyzed independently (shown by the thickness of the edges). The network forms several clusters, corresponding to major disease classes (encoded by colors) such as immune-mediated diseases (the largest one), neurological diseases, cancers, metabolic diseases, cardiovascular diseases, among others. Specific examples of the patterns observed are further discussed in the caption of Fig. 3.
Curated literature review of the prioritized eQTL-driven comorbidity network
We curated the 211 prioritized comorbidities with eQTL underpinning (Fig. 3) along with 31 (about 15% of the comorbidities) random pairs of diseases added to blind the curators (Methods- Curation of prioritized comorbidities). The curation results are shown in Fig. 4. Compared to controls (random set among non-prioritized disease pairs), truly prioritized disease comorbidities were significantly enriched in validated positive comorbidities in the literature (Levels 1 or 2), while controlled pairs were significantly enriched in a no-correlation curation category (Levels 5 or 6) (Chi-square test; p = 0.001). The results indicate among our prioritized disease pairs, molecular mechanisms are proposed for ~ 42% of disease-pairs with known comorbidity (Levels 1 or 2) and ~ 30% for novel disease pairs (Level 5). The latter predictions had eluded clinicians and population health specialists and unveil novel clinical syndromes as well as opportunities for new therapies.
Case studies of biologically convergent disease comorbidities
The majority of GWAS variants in complex diseases are thought to manifest their effect through regulation , but the majority of identified variants lie far from genes or transcription start sites making candidate molecules and functions difficult to identify or study. The evidence from ENCODE suggests that as much as 80% of the non-coding intergenic regions are biologically interacting . Figure 5 illustrates two cases of reproducible comorbidities in clinical datasets for which the shared molecular underpinning had not been identified in the literature, indicating this provides testable novel mechanisms. The networks were built as described in Methods- Network visualization of the comorbidities sharing intergenic genetic risks through eQTL RNA overlap.
Figure 5a shows the psoriasis and polycystic ovary syndrome disease pair resulted as clinically comorbid (bivariate logistic regression OR = 3.3, 95% CI: 2.8–3.9 in NIS13; OR = 68.5, 95% CI: 63.2–74.3 in NEDS13). Their independent disease-associated variants are both located in chromosome 12 but have negligible linkage disequilibrium (LD r2 ≤ 0.05 for any pairs of SNPs shown in the panel). rs12580100 associated with psoriasis and rs705702 with polycystic ovary syndrome are both associated by eQTL studies with the expression change of the eQTL RNA AC004057.1 on chromosome 4, a transcribed processed pseudogene, whose functions have yet to be characterized. Another example, in Fig. 5b, illustrates the prioritized molecular mechanisms between Parkinson’s disease and schizophrenia, which also are reproducibly comorbid (bivariate logistic regression OR = 4.2, 95% CI: 4.0–4.3 in NIS13; OR = 4.3, 95% CI: 4.1–4.5 in NEDS13). Immune-mediated mechanisms are being prioritized by eQTL associations as common between intergenic SNPs of these diseases, with 14 common downstream RNAs associated by eQTL with their respective distinct variants in chromosome 6. The two diseases share a cis-regulated gene IGSF9B which may inhibit synapse development . Besides, two SNPs, rs17115100 and rs11191419 (LD R2 = 0.15), are associated with the expression of CYP17A1-AS1, a ncRNA within a P450 enzyme protein CYP17A1, which catalyzes many reactions and synthesis of cholesterol, steroids, and other lipids. In addition, they also share a microRNA MIR1307 , which further regulates a series of downstream genes. The prioritized mechanisms also recapitulate many known HLA mechanisms, thus warranting further experimental validation of the previously stated novel ones.
In summary, we computationally integrated and mined two clinical datasets jointly with GWAS/GTEx datasets and identified hundreds of comorbid diseases that also presented associations with convergent eQTL regulation, which may contribute in part to their common progression and pathophysiology. The substantial enrichment of convergent genetic mechanisms among comorbid diseases provides an internal validation to the methodology. The curation against the literature provides an external validation, controlled by non-prioritized diseases pairs submitted in a blinded way to curators. In addition, the disease association network recapitulated many known clinical syndromes, such as the metabolic syndrome, and identified published comorbidities for which the underpinning genetic mechanisms have yet to be unveiled. We are pursuing a systematic curation of the mechanisms results to identify known vs. novel molecular underpinnings of comorbidities. Associating diseases by their common cis-eQTL downstream mechanisms has been recently reported , however, their predictions have not been confirmed in clinical datasets. In addition, Hauberg et al. did not investigate trans-eQTL associations as we did in this study . Altogether, such studies may provide novel mechanisms of comorbidities and provide insight for disease prevention or new therapeutic interventions.
Future studies will extend predictions beyond exact eQTL RNA overlap to use broader shared pathway memberships for distinct RNAs [19, 20, 44]. Other types of biological data (e.g., epigenetic assays and chromatin interactions in ENCODE) and clinical data (e.g., UK Biobank) could further contribute distinct and perhaps, a more accurate biological roadmap for disease comorbidity. Future studies should also focus on novel clinical syndromes that share genetic underpinning, which we are currently curating the results to identify those as well. Further, we are planning to build a publicly-accessible database with the resulting comorbid disease pairs having convergent molecular mechanisms, the network results, and an interactive visualization tool for the resulting network.
This proof-of-concept study highlights the promise of integrating multiscale genomics datasets to unveil the shared molecular mechanisms of disease comorbidities. We first integrated GWAS studies with eQTL associations to discover diseases showing significantly convergent mechanisms. Then, the parallel computation of disease comorbidity using clinical datasets enabled the identification of relationships between convergent mechanisms and disease comorbidities. This subset of comorbid diseases, with convergent eQTL genetic mechanisms underpinning them, highly suggests the novel or established clinical syndromes. While it took over a decade to confirm the genetic underpinning of the metabolic syndrome, this study is likely highlighting hundreds of new ones. In addition, this knowledge could potentially improve the clinical management of comorbid syndromes with precision (using SNPs that interact), as well as shed light on novel approaches of drug repositioning, or SNP-guided precision molecular therapy even when risk variants are intergenic.
Concept unique identifiers
Directed acyclic graph
Single nucleotide polymorphism database
Experimental factor ontology
Electronic health records
expression Quantitative Trait Loci
- eQTL RNA:
False discovery rate
Genome-Wide Association Study
Healthcare Cost and Utilization Project
The HCUP Nationwide Emergency Department Sample
HCUP National Inpatient Sample
Systematized Nomenclature of Medicine-Clinical Terms
Single nucleotide polymorphism
Unified medical language system
Barnett K, Mercer SW, Norbury M, Watt G, Wyke S, Guthrie B. Epidemiology of multimorbidity and implications for health care, research, and medical education: a cross-sectional study. Lancet. 2012;380(9836):37–43.
Ford ES, Giles WH, Dietz WH. Prevalence of the metabolic syndrome among US adults: findings from the third National Health and nutrition examination survey. Jama. 2002;287(3):356–9.
Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56.
Li L, Ruau DJ, Patel CJ, Weber SC, Chen R, Tatonetti NP, Dudley JT, Butte AJ. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med. 2014;6(234):234ra257.
Chen J, Sam L, Huang Y, Lee Y, Li J, Liu Y, Xing HR, Lussier YA. Protein interaction network underpins concordant prognosis among heterogeneous breast cancer signatures. J Biomed Inform. 2010;43(3):385–96.
Lage K, Karlberg EO, Størling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007;25(3):309–16.
Pouladi N, Achour I, Li H, Berghout J, Kenost C, Gonzalez-Garay M, Lussier Y. Biomechanisms of comorbidity: reviewing integrative analyses of multi-omics datasets and electronic health records. Yearb Med Inform. 2016;(1):194–206.
Lee D-S, Park J, Kay K, Christakis N, Oltvai Z, Barabási A-L. The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci. 2008;105(29):9880–5.
Rubio-Perez C, Guney E, Aguilar D, Piñero J, Garcia-Garcia J, Iadarola B, Sanz F, Fernandez-Fuentes N, Furlong LI, Oliva B. Genetic and functional characterization of disease associations explains comorbidity. Sci Rep. 2017;7(1):6207.
Bagley SC, Sirota M, Chen R, Butte AJ, Altman RB. Constraints on biological mechanism from disease comorbidity using electronic medical records and database of genetic variants. PLoS Comput Biol. 2016;12(4):e1004885.
Park J, Lee DS, Christakis NA, Barabási AL. The impact of cellular networks on disease comorbidity. Mol Syst Biol. 2009;5(1):262.
Melamed RD, Emmett KJ, Madubata C, Rzhetsky A, Rabadan R. Genetic similarity between cancers and comorbid Mendelian diseases identifies candidate driver genes. Nat Commun. 2015;6:7033.
Ko Y, Cho M, Lee J-S, Kim J. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep. 2016;6:39433.
MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Res. 2017;45(D1):D896–901.
Kang HP, Morgan AA, Chen R, Schadt EE, Butte AJ. Coanalysis of GWAS with eQTLs reveals disease-tissue associations. AMIA Jt Summits Transl Sci Proc. 2012;2012:35.
Hauberg ME, Zhang W, Giambartolomei C, Franzén O, Morris DL, Vyse TJ, Ruusalepp A, Sklar P, Schadt EE, Björkegren JL. Large-scale identification of common trait and disease variants affecting gene expression. Am J Hum Genet. 2017;101:157.
Fagny M, Paulson JN, Kuijjer ML, Sonawane AR, Chen C-Y, Lopes-Ramos CM, Glass K, Quackenbush J, Platig J. Exploring regulation in tissues with eQTL networks. Proc Natl Acad Sci. 2017;201707375.
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204.
Han J, Li J, Achour I, Pesce L, Foster I, Li H, Lussier YA. Convergent downstream candidate mechanisms of independent intergenic polymorphisms between co-classified diseases implicate epistasis among noncoding elements. In: Pacific symposium on Biocomputing: 2018. Hawaii; 2018. p. 524–35.
Li H, Achour I, Bastarache L, Berghout J, Gardeux V, Li J, Lee Y, Pesce L, Yang X, Ramos KS. Integrative genomics analyses unveil downstream biological effectors of disease-specific polymorphisms buried in intergenic regions. NPJ Genom Med. 2016;1.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
Lussier YA, Rothwell D, Cote R. The SNOMED model: a knowledge source for the controlled terminology of the computerized patient record. Methods Inf Med. 1998;37(02):161–4.
Spackman KA, Campbell KE, Côté RA. SNOMED RT: a reference terminology for health care. In: Proceedings of the AMIA annual fall symposium: 1997: American Medical Informatics Association; 1997. p. 640.
Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl_1):D267–70.
Sherry ST, Ward M-H, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11.
International HapMap Consortium. The international HapMap project. Nature. 2003;426(6968):789.
Siva N. 1000 Genomes project. In: Nature Publishing Group; 2008.
Machiela MJ, Chanock SJ. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics. 2015;31(21):3555–7.
Percy C, Holten VV, Muir CS, Organization WH: International classification of diseases for oncology. 1990.
Pickrell JK, Berisa T, Liu JZ, Segurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016;48(7):709–17.
Gratten J, Visscher PM. Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med. 2016;8(1):78.
Li MJ, Liu Z, Wang P, Wong MP, Nelson MR, Kocher JP, Yeager M, Sham PC, Chanock SJ, Xia Z, et al. GWASdb v2: an update database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res. 2016;44(D1):D869–76.
Cross-Disorder Group of the Psychiatric Genomics C, Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, Perlis RH, Mowry BJ, Thapar A, Goddard ME, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet. 2013;45(9):984–94.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995:289–300.
Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004;303(5659):799–805.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996:267–88.
Omranian N, Eloundou-Mbebi JM, Mueller-Roeber B, Nikoloski Z. Gene regulatory network inference using fused LASSO on multiple data sets. Sci Rep. 2016;6:20533.
Imai K, King G, Lau O. Toward a common framework for statistical analysis and development. J Comput Graph Stat. 2008;17(4):892–913.
Lin D. An information-theoretic definition of similarity. In: Icml: 1998; 1998. p. 296–304.
Sánchez D, Batet M, Isern D. Ontology-based information content computation. Knowl-Based Syst. 2011;24(2):297–303.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Consortium GT: Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.
Woo J, Kwon S-K, Nam J, Choi S, Takahashi H, Krueger D, Park J, Lee Y, Bae JY, Lee D. The adhesion protein IgSF9b is coupled to neuroligin 2 via S-SCAM to promote inhibitory synapse development. J Cell Biol. 2013;201(6):929–44.
Li H, Lee Y, Chen JL, Rebman E, Li J, Lussier YA. Complex-disease networks of trait-associated single-nucleotide polymorphisms (SNPs) unveiled by information theory. J Am Med Inform Assoc. 2012;19(2):295–305.
Moro F, De Simone C, Morciano A, Tropea A, Sagnella F, Palla C, Scarinci E, Teti A, Caldarola G, D'Agostino M, et al. Psoriatic patients have an increased risk of polycystic ovary syndrome: results of a cross-sectional analysis. Fertil Steril. 2013;99(3):936–42.
We gratefully acknowledge the generous assistance of Maud Fagny, Joseph N Paulson, John Quackenbush, and John Platig for providing us with tissue specific eQTL relationships beyond that which is in their recent paper, which was used to build the biomolecular network layer.
This work has been supported in part by The University of Arizona Health Sciences Center for Biomedical Informatics and Biostatistics, the BIO5 Institute, the KEYS Research Internship Program, and the NIH (U01AI122275, HL132532, CA023074, 1UG3OD023171, 1S10RR029030). This article did not receive sponsorship for publication.
Availability of data and materials
Software will be provided upon publication.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 11 Supplement 6, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-11-supplement-6.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
File S1. Comorbid disease pairs (OR > 1.5) sharing molecular mechanisms. In the table we show all the disease pairs resulting comorbid (OR > 1.5 and FDRcomorbidity < 0.05) and with significant convergent mechanisms (FDReRNA < 0.05). Disease1 and disease2 columns represent the disease pair. The related FDRs obtained from the clinical and the molecular datasets are reported in FDRcomorbidity and FDReRNA columns, respectively. We reported also the resulting ORs and FDRs for both the HCUP datasets separated, i.e. NIS13 and NEDS13. For each single dataset there are two OR since they correspond to the beta related to the diseases and we derived two directional FDRs (see Methods). (XLSX 103 kb)
File S2. Mapping between disease bundles and diseases from the GWAS Catalog. (TXT 19 kb)
File S3. Mapping between disease bundles and ICD-9-CM diseases from the HCUP clinical datasets. (TXT 24 kb)
Figure S4. Reproducibility of comorbidity odds ratios observed in NIS13 (hospitalizations) and NEDS13 (emergency departments) HCUP datasets. The correlation R2 is 0.62 and 0.63 respectively. Top disease comorbidity was measured and compared directionally, and odds ratios are shown in a log scale. (DOCX 54 kb)