Using UMLS string-matching algorithms, we mapped the Peterson et al. [1] collection of human variants to the UMLS [11] phenotype ontology, linking almost 70,000 (87%) unique human genetic variants to disorder concepts. UMLS [11] offers wide coverage of variants and higher-level disease categorization through MeSH, which allowed us to cluster phenotypes into broader categories and make inferences about each subgroup. Unique genetic variants whose associated phenotypes could not otherwise be mapped to UMLS [11] were passed through a manual curation protocol; to date, 56% of variants have been manually curated through this protocol. Mapping to UMLS [11] allowed human variants to be grouped by identical or similar phenotype, alleviating many of the difficulties caused by the lack of a standardized vocabulary.
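The mapping step can be illustrated with a minimal sketch. It is not the UMLS matching algorithm itself: it assumes a simple normalized exact match (lowercasing, punctuation removal, token sorting), and the concept table, the synonym "Type 2 diabetes", and the function names are hypothetical. Phenotype strings that fail to match are collected for manual curation.

```python
import re


def normalize(term):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(tokens))


def build_index(concepts):
    """Map each normalized concept name to its CUI (first CUI wins on collisions)."""
    index = {}
    for cui, names in concepts.items():
        for name in names:
            index.setdefault(normalize(name), cui)
    return index


def map_phenotypes(phenotypes, index):
    """Split phenotype strings into mapped (string -> CUI) and unmapped lists."""
    mapped, unmapped = {}, []
    for phenotype in phenotypes:
        cui = index.get(normalize(phenotype))
        if cui is None:
            unmapped.append(phenotype)  # candidate for manual curation
        else:
            mapped[phenotype] = cui
    return mapped, unmapped


# Toy concept table (illustrative only; synonyms are assumptions).
concepts = {
    "C0011860": ["Diabetes, type 2", "Type 2 diabetes"],
    "C0028326": ["Noonan Syndrome"],
}
index = build_index(concepts)
mapped, unmapped = map_phenotypes(
    ["type 2 Diabetes", "noonan syndrome", "Unlabelled phenotype"], index
)
print(mapped)
print(unmapped)
```

Sorting tokens makes the match insensitive to word order ("Diabetes, type 2" versus "type 2 Diabetes"), at the cost of occasionally conflating terms whose meaning depends on order.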
Disease-level and pathway-level bipartite networks were constructed using KEGG pathway enrichment, linking human genetic variants by common pathways or CUIs. In both networks, a single central cluster of nodes was the most highly connected. The nodes in this cluster generally represented pathways involved in essential processes (e.g., reproduction, survival), many of which may be altered in and/or directly linked to cancer [10]. Network analysis [15] of the pathway-level network showed that KEGG Pathways in Cancer was highly connected within the central cluster, with the highest node degree, 52 (Table 1).
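As a rough sketch of how node degree can be read off a variant-pathway bipartite network, the example below stores the network as two adjacency maps and ranks pathways by the number of distinct variants linked to them. The variant identifiers and edges are hypothetical toy data, and the assumption that degree counts distinct neighbors is ours, not a statement of how the network analysis tool [15] computes it.

```python
from collections import defaultdict


def build_bipartite(edges):
    """Build both adjacency maps from (variant, pathway) pairs."""
    variant_to_pathways = defaultdict(set)
    pathway_to_variants = defaultdict(set)
    for variant, pathway in edges:
        variant_to_pathways[variant].add(pathway)
        pathway_to_variants[pathway].add(variant)
    return variant_to_pathways, pathway_to_variants


def rank_by_degree(adjacency):
    """Return (node, degree) pairs, highest degree first, ties broken by name."""
    return sorted(
        ((node, len(neighbors)) for node, neighbors in adjacency.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Toy edges (hypothetical variant identifiers).
edges = [
    ("variant_1", "Pathways in Cancer"),
    ("variant_1", "MAPK Signaling Pathway"),
    ("variant_2", "Pathways in Cancer"),
    ("variant_3", "Pathways in Cancer"),
    ("variant_3", "Melanoma"),
    ("variant_3", "Pathways in Cancer"),  # duplicate edge, counted once
]
_, pathway_to_variants = build_bipartite(edges)
ranking = rank_by_degree(pathway_to_variants)
print(ranking)
```

Using sets for the neighbor lists means repeated variant-pathway pairs do not inflate a pathway's degree.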
The disease-level network contained 223 CUIs connected by 2548 unique disease associations derived from gene- or pathway-level analysis. Of these, 1338 (53%) connections were observed only through disease-pathway associations and were not connected at the gene level. Additionally, 461 (18%) connections were observed through disease-gene associations only, and 741 (29%) connections were observed through both gene- and pathway-level associations. When CUIs are connected through both common genes and common pathways, the two lines of evidence reinforce each other and increase confidence in the disease association. For example, Hypoglycaemia, hyperinsulinaemic (C1864903) and Diabetes, type 2 (C0011860) were connected through the genes HNF1A, ABCC8, HNF4A, and GCK, as well as the KEGG pathway Type II Diabetes Mellitus. This association is expected, as hypoglycemia is known to affect type II diabetes patients near insulin deficiency [16]. CUI connections made through pathways but not through genes extend the functional context of variants and suggest new potential disease associations. For example, Noonan Syndrome (C0028326) and Essential Hypertension (C0085580) were connected through the KEGG pathway Vascular Smooth Muscle Contraction, even though their associated variants share no genes in our repository. A common symptom of Noonan syndrome is hypertrophic cardiomyopathy [17], which is closely related to hypertension and often co-occurs with it in elderly patients [18], suggesting a logical connection between the Noonan Syndrome and Essential Hypertension concepts in our network.
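The three evidence classes described above (pathway only, gene only, both) can be sketched as a set-intersection test over each pair of CUIs. The example reuses the two CUI pairs discussed in the text; the gene placeholders GENE_A and GENE_B and the function name are hypothetical, since the genes associated with Noonan Syndrome and Essential Hypertension in the repository are not listed here.

```python
from itertools import combinations


def classify_connections(disease_genes, disease_pathways):
    """Label each connected CUI pair as 'gene_only', 'pathway_only', or 'both'."""
    labels = {}
    diseases = sorted(set(disease_genes) | set(disease_pathways))
    for a, b in combinations(diseases, 2):
        shared_genes = disease_genes.get(a, set()) & disease_genes.get(b, set())
        shared_pathways = disease_pathways.get(a, set()) & disease_pathways.get(b, set())
        if shared_genes and shared_pathways:
            labels[(a, b)] = "both"
        elif shared_genes:
            labels[(a, b)] = "gene_only"
        elif shared_pathways:
            labels[(a, b)] = "pathway_only"
        # Pairs sharing neither genes nor pathways are not connected.
    return labels


disease_genes = {
    "C1864903": {"HNF1A", "ABCC8", "HNF4A", "GCK"},
    "C0011860": {"HNF1A", "ABCC8", "HNF4A", "GCK"},
    "C0028326": {"GENE_A"},  # hypothetical placeholder
    "C0085580": {"GENE_B"},  # hypothetical placeholder
}
disease_pathways = {
    "C1864903": {"Type II Diabetes Mellitus"},
    "C0011860": {"Type II Diabetes Mellitus"},
    "C0028326": {"Vascular Smooth Muscle Contraction"},
    "C0085580": {"Vascular Smooth Muscle Contraction"},
}
labels = classify_connections(disease_genes, disease_pathways)
print(labels)
```

In this toy input the diabetes-related pair is labeled "both" and the Noonan Syndrome and Essential Hypertension pair is labeled "pathway_only", mirroring the two cases described in the text.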
As shown in Fig. 7, the Cardiovascular (C14) and Skin and Connective Tissue (C17) networks overlap substantially in the largest cluster of KEGG pathways, which includes basic cellular functions such as cell signaling, growth, and maintenance. Many of these pathways are also altered in different types of cancer, as reflected by the connections and enrichment of cancer-related pathways in the main network cluster; examples include Melanoma, MAPK Signaling Pathway, and Pathways in Cancer. The high similarity between C14 and C17 is expected, as many cardiac disorders involve the connective tissue within or surrounding the heart, and relationships have been observed between normal development of connective tissue and the cardiovascular system [19]. Similarly, the Hemic and Lymphatic (C15) and Immune System (C20) networks overlap substantially in a cluster of immunological KEGG pathways, including Primary Immunodeficiency, Type I Diabetes Mellitus, and Autoimmune Thyroid Disease. This intersection is also expected to be significant, as lymphatic diseases are closely linked to immune surveillance and adaptation [20].
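One simple way to quantify the overlap between two category-level networks is the Jaccard index of their pathway sets. This is an illustrative choice, not necessarily the comparison used to produce Fig. 7, and the pathway memberships below are toy values rather than the actual contents of the C14 and C17 networks.

```python
def pathway_overlap(pathways_a, pathways_b):
    """Return the shared pathways and the Jaccard index of two pathway sets."""
    a, b = set(pathways_a), set(pathways_b)
    shared = a & b
    union = a | b
    score = len(shared) / len(union) if union else 0.0
    return shared, score


# Toy pathway memberships per MeSH category (illustrative only).
category_pathways = {
    "C14": {
        "MAPK Signaling Pathway",
        "Pathways in Cancer",
        "Melanoma",
        "Vascular Smooth Muscle Contraction",
    },
    "C17": {"MAPK Signaling Pathway", "Pathways in Cancer", "Melanoma"},
}
shared, score = pathway_overlap(category_pathways["C14"], category_pathways["C17"])
print(sorted(shared), score)
```

The Jaccard index ranges from 0 (no shared pathways) to 1 (identical pathway sets), which makes overlaps comparable across category pairs of different sizes.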
Our next step is to continue analyzing human genetic variants at different levels of clustering, expanding our classifications and extending functional context to find new disease connections. If a pathway is found to link to multiple diseases, a drug used to treat one disease could potentially be repurposed to treat another disease connected through the same pathway [7]. In addition, if a disease is found to link to multiple pathways, a patient with this disease may benefit from a pathway-guided combination therapy [7]. With the addition of patient data, variant-based disease-pathway associations can be compared across individuals and can provide a platform for incorporating new variant data into our database. In the future, this will allow us to develop computational tools that facilitate the optimization of personalized diagnosis, prognosis, and disease treatment.