
Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information



The genetic contributions to human common disorders and mouse genetic models of disease are complex and often overlapping. In common human diseases, unlike classical Mendelian disorders, genetic factors generally have small effect sizes, are multifactorial, and are highly pleiotropic. Likewise, mouse genetic models of disease often have pleiotropic and overlapping phenotypes. Moreover, phenotypic descriptions in the literature in both human and mouse are often poorly characterized and difficult to compare directly.


In this report, human genetic association results from the literature are summarized with regard to replication, disease phenotype, and gene specific results; and organized in the context of a systematic disease ontology. Similarly summarized mouse genetic disease models are organized within the Mammalian Phenotype ontology. Human and mouse disease and phenotype based gene sets are identified. These disease gene sets are then compared individually and in large groups through dendrogram analysis and hierarchical clustering analysis.


Human disease and mouse phenotype gene sets are shown to group into disease and phenotypically relevant groups at both a coarse and fine level based on gene sharing.


This analysis provides a systematic and global perspective on the genetics of common human disease as compared to itself and in the context of mouse genetic models of disease.



Common complex diseases such as cardiovascular disease, cancer, and autoimmune disorders, metabolic conditions such as diabetes and obesity, and neurological and psychiatric disorders account for the majority of morbidity and mortality in developed countries. The specific genetic contributions to disease etiology and their relationships to environmental factors in common disorders are unclear. They are complicated by many factors, such as gene-gene interactions, the balance between susceptibility and protective alleles, copy number variation, the low relative risk contributed by each gene, and a myriad of complex environmental inputs.

Genetic association studies using a candidate gene approach and, more recently, genome-wide association studies (GWAS) have produced a large and rapidly increasing amount of information on the genetics of common disease. In parallel, mouse genetic models for human disease have provided a wealth of genetic and phenotypic information. While not always perfect models for human common complex disorders, the genetic purity and experimental flexibility of mouse disease models have produced valuable insights relevant to human disease.

Gene nomenclature standardization[1], database efforts [2-4], and phenotype ontology projects[5] in both human and mouse over the past decade have provided the foundation for integrating information on genetic contributions to disease and phenotypes. This creates the opportunity for systematic comparison and higher order systems analysis of disease and phenotypic information. In this report, we summarize and integrate large scale human genetic association data and mouse genetically determined phenotypic information with the goal of identifying fundamental relationships in human disease and mouse models of human disease.


The Genetic Association Database

The Genetic Association Database [2] (GAD) is an archive of summary data of published human genetic association studies of many common disease types. GAD is primarily focused on archiving information on common complex human disease rather than rare Mendelian disorders as found in the Online Mendelian Inheritance in Man (OMIM)[6]. GAD contains curated information on candidate gene studies and more recently on genome wide association studies. It builds on the curation of the CDC HuGENet info literature database [3] in part by adding molecular and ontological annotation, creating a bridge between epidemiological and molecular information. This allows the large-scale integration of disease based genetic association information with genomic and molecular information as well as with the software tools and computational approaches that use genomic information [7-12]. This report is a summary and analysis of the genes and diseases with positive associations in the Genetic Association Database with regard to replication, comparisons between diseases, and within broad phenotypic disease classes. Although GAD contains information on gene variation, this report is at the gene level only and does not consider specific gene variation or genetic polymorphism.

The Genetic Association Database (GAD) currently contains approximately 40,000 individual gene records of genetic association studies taken from over 23,000 independent publications. Importantly, a large number (11,568) of the records in GAD have a designation of whether the gene of record was reported to be associated (Y) or not associated (N) with the disease phenotype for that specific record. Many records, for various reasons, do not have such a designation. In addition, a portion of the database records have been annotated with standardized disease phenotype keywords from the MeSH vocabulary. The GAD summations shown below are a subset of the records in GAD. They include only those records that both a) are positively associated with a disease phenotype and b) have a MeSH disease phenotype annotation. This represents a subset of 10,324 records having both positive associations to disease and MeSH annotations. Records designated as not associated (N) with a disease phenotype and those without MeSH disease annotation are not considered at this time in this report.

Mouse phenotypic database

The mouse phenotypic information described here was obtained from the Mouse Genome Informatics (MGI) database [4] Phenotypes, Alleles and Disease Models section. The file used for mouse phenotypic information (see methods) is comprised of 5011 unique genes and 5142 unique phenotypic terms derived from information from specific gene mutations in multiple mouse strains. The mouse phenotypic information had been annotated to the mouse gene mutation records using Mammalian Phenotype terms and codes in the mouse phenotype database as a component of the Mouse Phenotyping Project [5, 13].

Quantitation of genes and disease phenotypes

Quantitation of how often a disease phenotype was positively associated with a gene was performed as follows. GAD records having both recorded positive associations and annotated MeSH disease keywords were extracted and stored in a database according to their relationships. Using a perl script, the number of times each MeSH disease keyword was positively associated with a specific gene in the GAD database was recorded. For each unique gene, the associated disease MeSH terms were then sorted by these counts in declining order.
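The counting step described above can be illustrated with a minimal Python sketch. This is not the original perl script; the record layout (gene, MeSH term, association flag) and the example rows are assumptions made only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records, one per GAD entry: (gene, MeSH term, association flag).
# The layout and the rows are illustrative assumptions, not actual GAD content.
records = [
    ("IL13", "Asthma", "Y"),
    ("IL13", "Asthma", "Y"),
    ("IL13", "Hypersensitivity, Immediate", "Y"),
    ("CFH", "Macular Degeneration", "Y"),
    ("CFH", "Macular Degeneration", "N"),  # not positively associated: skipped
    ("ACE", "", "Y"),                      # no MeSH annotation: skipped
]

def count_positive_associations(records):
    """Count gene/MeSH-term co-occurrences among positive, annotated records."""
    counts = defaultdict(Counter)
    for gene, mesh_term, flag in records:
        if flag == "Y" and mesh_term:
            counts[gene][mesh_term] += 1
    # For each gene, list its MeSH terms in declining order of count.
    return {gene: terms.most_common() for gene, terms in counts.items()}

summary = count_positive_associations(records)
for gene, terms in summary.items():
    print(gene, terms)
```

Records flagged N or lacking a MeSH term are dropped before counting, mirroring the subset definition given earlier.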

Mouse phenotypic information

The mouse phenotypic information described here was obtained from the Mouse Genome Informatics (MGI); Phenotypes, Alleles and Disease Models section;

Using these three files downloaded on 4-4-2008

The mouse phenotype files were extracted using a perl script annotating each gene with the phenotype term associated with each Mammalian Phenotype (MP) code.
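As a rough illustration of this annotation step, the sketch below swaps phenotype codes for their terms using a lookup table. The codes, terms, gene names, and in-memory layout are placeholders invented for the example; the actual MGI file formats are not described here.

```python
# Made-up placeholder codes and genes; real MP identifiers and MGI file
# layouts differ and are not reproduced here.
code_to_term = {
    "MP:X000001": "insulin resistance",
    "MP:X000002": "abnormal spatial learning",
}
gene_to_codes = {
    "GeneA": ["MP:X000001"],
    "GeneB": ["MP:X000002", "MP:X000003"],  # last code is missing from the lookup
}

def annotate_genes(gene_to_codes, code_to_term):
    """Replace each phenotype code with its term, keeping unknown codes as-is."""
    return {
        gene: [code_to_term.get(code, code) for code in codes]
        for gene, codes in gene_to_codes.items()
    }

annotated = annotate_genes(gene_to_codes, code_to_term)
print(annotated)
```

Keeping unmatched codes unchanged, rather than dropping them, is a design choice of this sketch so that missing lookups remain visible.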

Venn Diagram overlap of individual gene lists

Individual GAD primary gene sets were analyzed using Venny[14]. Pathway Venn diagram comparisons were performed by placing individual GAD primary gene sets into WebGestalt [15] to identify KEGG pathways, then placing the resulting pathway names into Venny.

Dendrogram analysis of gene sets

Relationships between diseases were identified by a unique method similar to phylogenetic classification. First, the distance between diseases was calculated by pairwise comparison: the number of genes common to a pair of disease gene sets was divided by the number of genes in the smaller set of the pair. This number was then subtracted from 1, so that two identical lists (100% match) have a resultant distance of 0. This is represented in the formula:

$$ d_{ij} = 1 - \frac{N(C_i \cap C_j)}{\min\left(N(C_i),\, N(C_j)\right)} $$

Where: $C_k$: genes in each disease set (where k = i, j); $N(C_k)$: number of genes in each disease set (where k = i, j); $N(C_i \cap C_j)$: number of genes common to both sets; $d_{ij}$ is the pairwise distance; i, j: indices of the disease sets being compared, where i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., m
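The pairwise distance described above can be sketched in a few lines of Python. The gene set names and members below are toy placeholders, not GAD content, and the nested-dictionary output is an assumption of this sketch rather than the input format expected by Phylip.

```python
from itertools import combinations

def gene_sharing_distance(set_i, set_j):
    """Return 1 minus the shared gene count divided by the smaller set size."""
    if not set_i or not set_j:
        raise ValueError("gene sets must not be empty")
    shared = len(set_i & set_j)
    smallest = min(len(set_i), len(set_j))
    return 1.0 - shared / smallest

def distance_matrix(gene_sets):
    """Build a symmetric distance table for a dict of name -> set of genes."""
    names = sorted(gene_sets)
    dist = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = gene_sharing_distance(gene_sets[a], gene_sets[b])
        dist[a][b] = dist[b][a] = d
    return dist

# Toy gene sets with placeholder gene names (illustrative only).
gene_sets = {
    "disease_A": {"G1", "G2", "G3", "G4"},
    "disease_B": {"G3", "G4"},
    "disease_C": {"G5", "G6", "G7"},
    "disease_D": {"G1", "G5", "G8", "G9"},
}
dist = distance_matrix(gene_sets)
print(dist["disease_A"]["disease_D"])
```

Because the denominator is the size of the smaller set, a gene set that is entirely contained in a larger one (disease_B within disease_A here) has distance 0 from it, just as two identical lists do.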

The disease relationships were calculated from the distance matrix using the Fitch program from the Phylip package[16]. It calculates the relationships based on the Fitch and Margoliash method of constructing phylogenetic trees[17] using the following formula (from the Phylip manual):

$$ \text{Sum of squares} = \sum_{i} \sum_{j} \frac{n_{ij}\,\left(D_{ij} - d_{ij}\right)^{2}}{D_{ij}^{P}} $$

where $D_{ij}$ is the observed distance between gene sets i and j and $d_{ij}$ is the expected distance, computed as the sum of the lengths of the segments of the tree from gene set i to gene set j. The quantity $n_{ij}$ is the number of times each distance has been replicated. In simple cases n is taken to be one. If n is chosen to be more than 1, the distance is then assumed to be a mean of those replicates. The power P is what distinguishes between the Fitch and Neighbor-Joining methods. For the Fitch-Margoliash method P is 2.0 and for the Neighbor-Joining method it is 0.0. Because Fitch took a long time to run when the gene-set size was very large (weeks for the human gene sets and months for the mouse gene sets), the Neighbor-Joining method was used to create the replicate dendrograms (not shown) after randomizing the input order for greater confidence. The resulting coefficient matrix files were displayed using the Phylodraw graphics program[18].
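To make the criterion concrete, the following sketch evaluates the weighted sum of squares for a given set of observed and tree-derived distances. It only scores one candidate tree; it does not search over tree topologies as the Fitch program does. The function name, the dictionary layout, and the numbers are assumptions chosen for illustration.

```python
def weighted_sum_of_squares(observed, expected, power=2.0, replicates=None):
    """Score tree distances against observed distances.

    observed and expected map a pair (i, j) to a distance; replicates
    optionally maps the same pairs to n (taken as 1 when omitted).
    """
    total = 0.0
    for pair, D in observed.items():
        d = expected[pair]
        n = 1 if replicates is None else replicates[pair]
        # With power > 0, an observed distance of exactly 0 would divide by
        # zero; this sketch assumes such pairs are excluded beforehand.
        total += n * (D - d) ** 2 / D ** power
    return total

# Illustrative numbers only.
observed = {("A", "B"): 0.5, ("A", "C"): 1.0, ("B", "C"): 0.8}
expected = {("A", "B"): 0.4, ("A", "C"): 1.0, ("B", "C"): 1.0}

score_p2 = weighted_sum_of_squares(observed, expected, power=2.0)
score_p0 = weighted_sum_of_squares(observed, expected, power=0.0)
print(score_p2, score_p0)
```

With P = 2.0, errors on small observed distances are weighted more heavily; with P = 0.0, every squared error counts equally.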

Hierarchical clustering of gene sets

Ward's minimum variance method[19] was used to find the distance between two diseases. The distance between the clusters is the ANOVA sum of squares between the two clusters added up over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. Ward's method joins clusters to maximize the likelihood at each level of the hierarchy under the assumptions of multivariate normal mixtures, spherical covariance matrices, and equal sampling probabilities. Distance for Ward's method is (taken from the JMP Manual):

$$ D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^{2}}{\frac{1}{N_K} + \frac{1}{N_L}} $$

where $N_K$ is the number of observations in $C_K$ (the Kth cluster, a subset of {1, 2, ..., n}, where n is the number of observations) and $\bar{x}_K$ is the mean vector for cluster $C_K$.
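A minimal Python sketch of Ward's distance between two clusters is given below, assuming each observation is a numeric vector. How gene sets were encoded as observations for clustering in JMP is not described here, so the example points are arbitrary.

```python
def mean_vector(cluster):
    """Component-wise mean of a list of equal-length numeric vectors."""
    n = len(cluster)
    dims = len(cluster[0])
    return [sum(point[k] for point in cluster) / n for k in range(dims)]

def ward_distance(cluster_k, cluster_l):
    """Squared distance between cluster means divided by (1/N_K + 1/N_L)."""
    mean_k = mean_vector(cluster_k)
    mean_l = mean_vector(cluster_l)
    squared_gap = sum((a - b) ** 2 for a, b in zip(mean_k, mean_l))
    return squared_gap / (1.0 / len(cluster_k) + 1.0 / len(cluster_l))

# Arbitrary example points (illustrative only).
cluster_k = [[0.0, 0.0], [2.0, 0.0]]  # mean is (1, 0)
cluster_l = [[4.0, 4.0]]              # mean is (4, 4)
print(ward_distance(cluster_k, cluster_l))
```

At each step of the hierarchy, the pair of clusters with the smallest such distance would be merged, which keeps the increase in within-cluster sum of squares as small as possible.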


Each record in GAD represents a specific gene from a unique publication of a human population based genetic association study and is categorized into one of 24 general disease classes corresponding to broad MeSH disease or disease phenotypic groupings. Table 1 is a summary of the number of positively associated human genes in each MeSH human disease class. As represented by these disease classes, the GAD database covers a broad selection of diseases falling into major disease classes including aging studies, cancer, immune disorders, psychiatric diseases, metabolic conditions, pharmacogenomic studies, and studies of chemical dependency, among others. Similarly, each record in the phenotype files from the MGI phenotype database represents a unique mouse gene specific genetic model. Table 2 shows the general categories represented by the mouse phenotype summary files and the number of mouse genes found in each top level phenotype class. The mouse files contain a greater number of intermediate developmental and morphological phenotypes (e.g. insulin resistance, absent CD4+ T cells, abnormal spatial learning) while the human files tend to comprise a greater number of end stage clinical disease phenotypes (e.g. Type 2 Diabetes, multiple sclerosis, autism).

Table 1 Number of human genes associated in each Disease Class
Table 2 Number of Mouse genes in each General Phenotypic Class

Table 3 introduces examples of human genes from fundamental biological pathways that have been consistently associated with major disease phenotypes, highlighting the sometimes-broad pleiotropic effects that major regulatory molecules have on multiple disease phenotypes. Genes such as NOS3, nitric oxide synthase 3, regulating nitric oxide production; HLA-DQB1, the MHC class II molecule DQ beta 1, involved in antigen presentation; ACE, angiotensin I converting enzyme, central to the renin-angiotensin system; and PPARG, peroxisome proliferator-activated receptor gamma, regulating transcription in pathways important in lipid metabolism, are examples of genes that affect multiple tissues and different organ systems through the complex course of disease progression. Importantly, all the mouse orthologs of the human genes in Table 3 have experimentally determined phenotypes that are similar or broadly overlapping with human clinical disease phenotypes (see below).

Table 3 Selected Major Genes and Disease Phenotypes

Summaries of genes and phenotypes in human and mouse

The majority of this report is built upon large non-redundant general summary lists for both human and mouse, shown below. These lists take two complementary forms in both human and mouse. The first sets are GENE-to-Disease/Phenotype lists. These are non-redundant lists of genes showing the diseases or phenotypes that have been associated with each gene (Table 4 human, Table 5 mouse, and Table 6 human-mouse). The second sets of basic lists are DISEASE/PHENOTYPE-to-Gene lists. These are non-redundant lists of diseases or phenotypes with the genes that have been associated with that disease or phenotype (Table 7 human and Table 8 mouse).

Table 4 Selected Human Genes and Disease Phenotype (MeSH counts), positive associations
Table 5 Selected Mouse Genes-Disease Phenotypes
Table 6 Selected Human-Mouse Phenotype Overlap
Table 7 Selected Human Disease Phenotypes (MeSH) and Gene counts, positive associations
Table 8 Selected Mouse Disease Related Phenotypes


Table 4 shows examples of selected genes in each row that have been positively associated with specific disease phenotype keywords. Each human gene symbol is followed by a specific MeSH disease term and the number of times that gene has been positively associated with the term, in declining order. A major feature of Table 4 is that individual genes have been positively associated with sometimes overlapping disease phenotypes over a broad range from more frequently to less frequently. Table 4 is a small representative subset, truncated in the number of genes (rows) and the number of MeSH terms (columns). The complete list of 1,584 human genes with additional information can be found in Table S1a [20]. An interactive version of the same list can be found in Table S1b[21].

Quite often the resulting list of phenotypes associated with a specific gene may include the major disease phenotype followed by specific sub-phenotypes of the disease that contribute distinct aspects to the overall clinical disease phenotype. For example, IL13 has been associated with asthma at least 11 times as well as with the asthma sub-phenotype immediate hypersensitivity 4 times. Similarly, the gene CFH has been associated with macular degeneration at least 19 times, as well as with choroidal neovascularization, an endo-phenotype of macular degeneration, 3 times. Although replication in genetic association studies has been widely debated[22], consistent replication by independent groups, although sometimes with both modest risk and significance values[23], suggests a fundamental measure of scientific validity. This is true for both candidate gene and GWAS studies.

In other cases, individual genes have been associated with independent but related disorders that may share fundamental biological pathways in disease etiology, such as HLA-DQB1, CTLA4, and PTPN22 as in the case of autoimmune disorders. This gene overlap emphasizes the fundamental, often step-wise biochemical role each gene plays in shared disease etiology [24-27]. That is, HLA-DQB1 in antigen presentation, CTLA4 in regulation of the expansion of T cell subsets, and PTPN22 in T cell receptor signaling, all contributing to immunological aberrations and progression to clinical disease, as in rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes. In other cases, the same gene has been associated with quite different clinical phenotypes, suggesting sharing of complex biological mechanisms at a more underlying level. For example, the gene CFTR, widely recognized as the cause of cystic fibrosis, has been consistently associated with pancreatitis, may be implicated in chronic rhinitis [28], and may play a protective role in gastrointestinal disorders [29].


Tables 5 and S2 are the mouse equivalents of the human GENE-to-Disease/Phenotype lists (tables 4 and S1 for human). These were developed from the mouse phenotype table of genes with mouse phenotype ontological codes, downloaded on 4-4-08. To build tables 5 and S2, the matching phenotypic terms were exchanged for each Mammalian Phenotype code (MP:#). This resulted in the mouse GENE-to-Disease/Phenotype tables (tables 5 and S2) similar in structure to human GENE-to-Disease/Phenotype tables (tables 4 and S1). Unlike the human tables, the mouse GENE-to-Disease/Phenotype tables come from individual mouse experimental knockout or other genetic studies. They are not based on population based epidemiological studies. They also do not have the quantitative aspect of the human tables with publication frequency counts tagged to each record. In addition, although they include a wide variety of physiological, neurological, and behavioral phenotypes, they do emphasize developmental studies and observational morphological phenotypes common in mouse knockout studies. Table 5 is a small representative subset, truncated in the number of genes (rows) and the number of Phenotype terms (columns). The complete list of 5011 mouse genes with annotated phenotypes and additional information can be found in Table S2a[30]. An interactive version of the same list can be found in Table S2b[31].

Direct comparison of human and mouse genes disease/phenotypes

We can now compare these tables directly, allowing gene-by-gene comparison of human disease phenotypes and mouse genetic phenotypes. Tables 6 and S3 are comparisons of the genes that overlap between the human and mouse gene lists (Table S1 and Table S2), showing mouse gene symbols and their human orthologs. Table 6 is a small subset of selected gene-phenotype cross species comparisons. Even though in some cases the human studies have not been replicated, there is often a striking concordance between human disease phenotypes and mouse genetically determined phenotypes. For example, the human gene inhibin alpha (INHA) has been associated with premature ovarian failure[32], and its mouse ortholog shows phenotypes of abnormal ovarian follicle morphology, female infertility, and ovarian hemorrhage[33], among other phenotypes relevant to human disease. Similarly, in humans the engrailed homeobox 2 gene (EN2) has been associated with autistic disorder[34], while mutations in mouse En2 produce phenotypes including abnormal social integration, spatial learning, and social/conspecific interaction, among others[35]. Importantly, the few mouse studies highlighted above, and many found in the main Table S3, were published after the corresponding human genetic population based epidemiological studies. Given concerns of false positives and publication bias in human genetic association studies, direct comparisons to related mouse phenotypes may provide supporting evidence that a given gene may be relevant to a specific human disease phenotype. Table S3[36] is a full listing of the 1104 shared genes between the human disease and mouse phenotype summaries.

Summaries of phenotypes and genes in human and mouse

The second type of main summary tables are DISEASE/PHENOTYPE-to-Gene lists. Disease/Phenotype gene summaries are essentially transposed versions of the GENE-to-Disease/Phenotype summaries (Tables S1 & S2) that allow different types of comparisons. These are non-redundant lists of phenotype keywords, MeSH disease terms in the case of human and Mammalian Phenotype Terms (MP) in the case of mouse, followed by the genes associated or annotated to those disease phenotype keywords.
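The transposition from a GENE-to-Disease/Phenotype summary to a DISEASE/PHENOTYPE-to-Gene summary can be sketched as follows. The nested-dictionary layout and the counts are illustrative assumptions, not the format or content of the supplementary tables.

```python
from collections import defaultdict

def transpose_summary(gene_to_terms):
    """Turn gene -> {term: count} into term -> [(gene, count), ...].

    Genes under each term are ordered by decreasing count, then by name.
    """
    term_to_genes = defaultdict(dict)
    for gene, term_counts in gene_to_terms.items():
        for term, count in term_counts.items():
            term_to_genes[term][gene] = count
    return {
        term: sorted(genes.items(), key=lambda item: (-item[1], item[0]))
        for term, genes in term_to_genes.items()
    }

# Illustrative counts only.
gene_to_terms = {
    "GENE1": {"Disease X": 5, "Disease Y": 1},
    "GENE2": {"Disease X": 2},
    "GENE3": {"Disease Y": 1},
}
term_to_genes = transpose_summary(gene_to_terms)
print(term_to_genes)
```

For the mouse lists, which carry no publication counts, the same transposition would apply with every count set to 1 or omitted.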


Table 7 shows examples of selected human disease phenotypes in each row positively associated with specific human genes for 8 major MeSH disease classes including cardiovascular, digestive system diseases, diseases of environmental origin, immune system diseases, mental disorders, nervous system diseases, nutritional and metabolic diseases, and eye diseases. Each MeSH phenotype term is followed by the number of times that a specific disease term has been positively associated with a particular gene in each row, in decreasing order. Table 7 is a small representative set, truncated in the number of disease phenotypes (rows) and the number of genes (columns). The complete list of 1,318 MeSH disease phenotype terms with additional information can be found in Table S4a[37]. An interactive version of the complete list can be found in Table S4b[38].


Tables 8 and S5 constitute the mouse DISEASE/PHENOTYPE-to-Gene summaries. Table 8 consists of selected mouse phenotypes which fall into similar general classes of the human table 7 followed by 6 representative genes that have been assigned to the appropriate phenotypic term due to a specific mouse genetic model. Unlike the human Disease/Phenotype-to-gene tables 7 and S4, the mouse tables 8 and S5 do not have quantitative information. Table 8 is also a small representative set, truncated in the number of disease phenotypes (rows) and the number of genes (columns). The complete list of 5,142 mouse phenotype terms with their corresponding Mammalian PhenoCode designations can be found in Table S5a[39]. An interactive version of the complete list can be found in Table S5b[40].

Using disease and gene lists

The purpose of this project is not simply to generate lists and information. It is to provide a distillation of disease and phenotype information that can be used in dissecting the complexities of human disease and mouse biology. Now that we have generated GENE-to-disease/phenotype summaries and DISEASE/PHENOTYPE-to-gene summaries for both mouse and human, they can be used for systematic analysis, comparison, and integration of orthologous data with the goal of providing higher order interpretations of human disease and mouse genetically determined phenotypes.

Human disease and mouse phenotype based gene sets

Gene sets have been defined simply as groups of genes that share common biological function, chromosomal location, or regulation[41]. Gene sets are used in high-throughput systematic analysis of microarray data using a priori knowledge. Unlike previously defined gene sets based on biological pathways or differentially expressed genes[41], GAD disease gene sets are unique in that they are composed of genes that have been previously shown to be polymorphic and to be positively genetically associated with a specific disease phenotype in a human population based genetic association study. Similarly, Table S5a[39], the mouse DISEASE/PHENOTYPE-to-Gene list, is used as a source of gene sets for mouse phenotypes (MP gene sets), comprised of genes from unique gene based mouse genetic models. These gene set files are currently the largest set of gene set files publicly available and the only gene set files where each gene is based on direct human or mouse genetic studies.

Comparison of individual GAD disease gene sets

One aspect of common complex disease is that diseases and disease phenotypes quite often present along a broad spectrum of symptoms and share clinical characteristics, endo-phenotypes, or quantitative traits with closely related disorders [25]. This is evident in gene sharing, as mentioned above, and equally in the overlap of biological pathways between related disorders. Using GAD disease gene sets, Venn diagram comparisons among related disorders show modest gene sharing. However, when gene sets are then placed into biological pathways and compared by Venn analysis, there is a marked increase in the overlap in pathways between related disorders. This was not found in gene sets from unrelated disorders. For example, major autoimmune disorders quite often share endophenotypes of lymphoproliferation, autoantibody production, and alterations in apoptosis, as well as other immune cellular and biochemical aberrations. As shown in Figure 1a, genes that have been positively associated with type 1 diabetes, rheumatoid arthritis, and Crohn's disease show a modest overlap. However, when individual gene sets are fitted into biological pathways, then compared for overlap of pathway membership, there is a striking increase in the overlap at the pathway level. This is also true in a comparison of genes and pathways for type 2 diabetes, insulin resistance, and obesity (Figure 1b). This pattern of major pathway overlap does not seem to occur between unrelated disorders, such as insulin resistance, rheumatoid arthritis and bipolar disorder (Figure 1c). This disease related sharing at the pathway level suggests common regulatory mechanisms between these disorders and that the original positive associations are not necessarily due to random chance alone.
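The contrast between gene-level and pathway-level overlap can be illustrated with a toy Python sketch. The gene names, pathway names, and gene-to-pathway map below are invented placeholders; in the analysis described here, pathway membership came from KEGG via WebGestalt and overlaps were drawn with Venny.

```python
def shared_fraction(sets):
    """Fraction of all members (union) that are common to every set."""
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)

def pathways_for(genes, gene_to_pathways):
    """Collect every pathway that contains at least one gene of the set."""
    found = set()
    for gene in genes:
        found.update(gene_to_pathways.get(gene, ()))
    return found

# Invented placeholder data (illustrative only).
gene_to_pathways = {
    "g1": {"P1"}, "g2": {"P2"}, "g3": {"P1"}, "g4": {"P2"}, "g5": {"P1"},
}
disease_x = {"g1", "g2", "g3"}
disease_y = {"g3", "g4", "g5"}

gene_overlap = shared_fraction([disease_x, disease_y])
pathway_overlap = shared_fraction([
    pathways_for(disease_x, gene_to_pathways),
    pathways_for(disease_y, gene_to_pathways),
])
print(gene_overlap, pathway_overlap)
```

In this toy case the two sets share only one of five genes, yet map to the same two pathways, which is the kind of pattern reported for related disorders in Figure 1.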

Figure 1

Venn Diagram analysis of individual GAD disease gene sets (circles) versus pathways (rectangles) produced from the corresponding gene set. All Venn Diagrams were produced with Venny.

Group analysis of GAD disease gene sets between major classes of disease/phenotypes

Dendrogram analysis of human disease gene sets

As archival information grows, analysis of complex molecular and genetic datasets using clustering or network approaches has become increasingly useful [13, 42-45]. Therefore, in addition to comparisons between individual diseases using human and mouse gene sets, we analyzed large gene groups using dendrogram and clustering approaches based on gene sharing between gene sets. Figure 2 shows a broad based dendrogram comparison based on gene sharing between 480 GAD disease gene sets, each containing at least 3 genes. A striking feature of this analysis is that at a coarse level, major disease groups cluster together in space, demonstrating shared genes between major clinically important disease groups. Disease domains are represented by groups such as cardiovascular disorders, metabolic disorders, cancer, immune and inflammatory disorders, vision, and chemical dependency. At finer detail within a specific broader group, it becomes clear that individual diseases with overlapping phenotypes are found close in space, such as asthma, allergic rhinitis, and atopic dermatitis. This overlap due to gene sharing recapitulates an overlap in clinical characteristics between these related disorders. Similarly, phenotypes within the metabolic group related to diabetes are closely aligned in space, including insulin resistance, hyperglycemia, hyperinsulinemia, and hyperlipidemia. This close apposition of related disease phenotypes and sub-phenotypes at both a coarse and fine level is a consistent feature of the overall display. The human gene sets used in creating this tree diagram can be found in Table S6[46]. It is important to emphasize that this display and the distance relationships between diseases are calculated through an unbiased gene-sharing algorithm independent of disease phenotype labels and not as a result of an imposed logical hierarchy or an ontological annotation system.
This grouping of major disease phenotypes based solely on gene sharing provides supporting evidence that the underlying disease based gene sets may have a fundamental relevance to disease and may not be reported in the literature by chance alone.

Figure 2

Human dendrogram comparison of 480 GAD disease gene sets based on gene sharing. The input GAD gene set file for this figure can be found in Table S6[46].

Dendrogram analysis of mouse phenotypic gene sets

Figure 3 is a dendrogram similar to the human tree, built from 1056 mouse phenotypic gene sets, each containing at least 10 genes. This was produced using the same gene sharing algorithm as for the human gene sets in Figure 2. As with the human dendrogram, the mouse tree displays informative groupings at both a coarse and fine level. This tree groups into major groupings nominally assigned as brain development and brain function, embryonic development, cardiovascular, reproduction, inflammation, renal function, bone development, metabolism, and skin/hair development. The identification of major groupings emphasizing developmental processes reflects the emphasis of gene knockouts and developmental models resulting in observable morphological traits and less so with regard to end stage clinical diseases as in the human dendrogram. As in the human dendrogram (Figure 2), discrete major functional groupings in the mouse dendrogram suggest that individual experimental observations are not random. Fundamental complex processes such as metabolism, cardiovascular phenomena, and developmental processes are integrated by extensive sharing of related pleiotropic genes. Moreover, like the human tree, fine structure in the mouse tree shows related mouse phenotypes are closely positioned in space. For example, in the metabolism major grouping, the individual phenotypes of body mass, adipose phenotypes, and weight gain are closely positioned. Similarly, in the brain function group, the behavioral phenotypes of anxiety, exploration, and responses to novel objects are found next to one another. This pattern is a fundamental feature of this tree. Like the human tree, the mouse dendrogram shown here is based solely on a gene sharing algorithm using genes assigned to individual phenotypes. It is not based on an imposed predetermined hierarchy or ontology.
Importantly, unlike the human tree, the information contained in the mouse tree is derived from individual independent mouse genetic studies and phenotypic observations and not from large case controlled population based epidemiological studies. Controversial issues such as publication bias or study size which confound human genetic association studies are not as relevant here in the context of studies of experimentally determined individual mouse gene knockouts and related studies. The mouse gene sets used in creating this tree diagram can be found in Table S7[47].

Figure 3

Mouse dendrogram comparison of 1056 mouse phenotype (MP) gene sets based on gene sharing. The input MP gene set file for this figure can be found in Table S7[47].

Hierarchical clustering of human and mouse gene sets

Hierarchical clustering has become a common tool in the analysis of large molecular data sets[48], allowing identification of similar patterns in a scalable fashion from the whole experiment down to a level of fine structure. To provide further evidence of disease relevance and biological content contained in both the human and mouse gene sets, hierarchical clustering was performed on both human and mouse. Four hundred and eighty human gene sets were clustered, producing 46 major disease clusters. In the mouse, clustering was performed on 2067 mouse phenotype gene sets, each containing at least 3 genes. This resulted in 165 major subgroups of functional phenotypic specificity. Hierarchical clustering is shown for human [Additional file 1 and Additional file 2] and for mouse [Additional file 3 and Additional file 4]. Like the human and mouse dendrograms, this hierarchical clustering showed functional disease grouping at both a coarse group level and at a fine level within major phenotypic groupings. That these clusters in both human and mouse fall into closely defined broad functional groups as well as closely related clinical, physiological, and developmental phenotypes demonstrates a general pattern of relevance to disease in their original underlying genetic associations. As in the dendrogram displays, this suggests that the genes nominally positively associated with these disorders, drawn from the medical literature, are not pervasively randomly assigned or due to a widespread pattern of random false positive associations.

Discussion and Conclusion

This report describes a summary of the positive genetic associations to disease phenotypes found in the Genetic Association Database as well as a summary of mouse genetically determined phenotypes from the MGI phenotypes database. The gene and disease lists described here were derived from a broad literature mining approach. We have shown disease relevance in three distinct ways: a) in comparing individual gene lists and pathways, b) in comparing between species, and c) in broad based comparative analysis utilizing complex systems approaches. Moreover, we identify disease based gene sets for 1,317 human disease phenotypes as well as 5,142 mouse experimentally determined phenotypes. These resources are the largest gene set files currently publicly available and the only gene set files derived from population based human epidemiological genetic studies and mouse genetic models of disease.

Each individual GAD disease gene set (i.e. a single disease term followed by a string of genes) or mouse phenotype gene set becomes a candidate for a number of uses and applications including:

  a) contributing to complex (additive, multiplicative, gene-environment) statistical models for any given disease phenotype [49-53];
  b) use in comparative analysis of disease between disease phenotypes;
  c) use in interrogating other related data types, such as microarray (see below), proteomic, or SNP data [54-56]; and
  d) integration into annotation engines [57], genome browsers [58], or other analytical software to add disease information in comparative genomic analysis.

In a sense, each individual human or mouse disease/phenotype gene set becomes a unique hypothesis, testable in a variety of ways. Increasingly, combinations of genes may have important predictive value as combinatorial biomarkers in predicting disease risk, as opposed to single candidate genes [59, 60].
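The gene set layout described above (a single disease term followed by a string of genes) lends itself to simple comparisons between disease phenotypes. The sketch below assumes a tab-delimited line format, which is not specified in this report; the function names and the example lines are hypothetical and are not drawn from the supplementary tables.

```python
def parse_gene_sets(lines, sep="\t"):
    """Parse lines of the form 'term<sep>GENE1<sep>GENE2...' into {term: set of genes}.

    The tab delimiter is an assumption. Blank lines are skipped, and repeated
    terms are merged into a single gene set.
    """
    gene_sets = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split(sep)]
        term = fields[0]
        genes = {g for g in fields[1:] if g}
        if term:
            gene_sets.setdefault(term, set()).update(genes)
    return gene_sets


def shared_genes(gene_sets, term_a, term_b):
    """Return the alphabetically sorted genes common to two disease terms."""
    return sorted(gene_sets[term_a] & gene_sets[term_b])


# Hypothetical example lines, for illustration only.
example_lines = [
    "asthma\tIL4\tIL13\tADRB2\n",
    "\n",
    "atopic dermatitis\tIL4\tIL13\tFLG\n",
]
example_sets = parse_gene_sets(example_lines)
example_overlap = shared_genes(example_sets, "asthma", "atopic dermatitis")
```

The same dictionary of sets can feed any of the uses listed above, for example intersecting a disease gene set with a list of differentially expressed genes.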

In addition, in an ongoing parallel set of experiments, we applied a Gene Set Analysis (GSA) approach to orthologous microarray data using the web tool Disease/Phenotype web-PAGE (De S, Zhang Y, Garner JR, Wang SA, Becker KG: Disease and phenotype gene set analysis of disease based gene expression, unpublished). In PAGE [61] gene set analysis of previously published microarray based gene expression studies from numerous independent laboratories, both the human and mouse disease/phenotype gene sets defined above demonstrate striking disease specificity, in both a species specific and a cross species manner. This was true for gene expression studies of type 2 diabetes, obesity, myocardial infarction, and sepsis, among others, providing further evidence of the disease and clinical relevance of both the human and mouse gene sets.
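For readers unfamiliar with PAGE, the sketch below illustrates the kind of statistic involved. It follows the general form of the PAGE Z score described by Kim and Volsky [61], Z = (S_m - mu) * sqrt(m) / sigma, where mu and sigma are the mean and standard deviation of the fold changes of all genes, and S_m is the mean fold change of the m genes in the set. It is not the web-PAGE implementation; the use of the population standard deviation, the function names, and the toy values are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def page_z_score(fold_changes, gene_set):
    """Compute a PAGE-style Z score for one gene set.

    fold_changes: dict mapping gene symbol -> log fold change for all genes.
    gene_set: iterable of gene symbols; genes absent from fold_changes are ignored.
    Population standard deviation (pstdev) is an assumption of this sketch.
    """
    all_values = list(fold_changes.values())
    set_values = [fold_changes[g] for g in gene_set if g in fold_changes]
    sigma = pstdev(all_values)
    if not set_values or sigma == 0:
        raise ValueError("need at least one measured gene and non-zero variance")
    mu = mean(all_values)
    return (mean(set_values) - mu) * math.sqrt(len(set_values)) / sigma


def two_sided_p(z):
    """Two-sided p-value for a Z score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy data, hypothetical values only.
toy_fold_changes = {"G1": -2.0, "G2": -1.0, "G3": 0.0, "G4": 1.0, "G5": 2.0}
toy_z = page_z_score(toy_fold_changes, ["G4", "G5", "NOT_MEASURED"])
toy_p = two_sided_p(toy_z)
```

The normal approximation behind the p-value relies on reasonably large gene sets, so a two-gene toy set like this one only demonstrates the arithmetic.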

This approach is limited in a number of ways. In particular, the GAD database compares the results of human population based epidemiological studies performed using different sample sizes, populations, and statistical models, and at different times over approximately the last 16 years. In addition, the GAD database draws on association studies of widely varying quality, with different degrees of detail provided. Although all human genetic association studies discussed here have been individually reported as positively associated with a disease or phenotype in a peer reviewed journal, we make no assertion that any individual study is correct, and we recognize the controversy in the genetics community regarding the statistical and biological significance of genetic association studies. Moreover, although the GAD database contains information on polymorphism and variation, and each GAD record is fundamentally based on polymorphism, this report does not consider variation or polymorphism in the summaries shown. Likewise, mouse genetic models in many cases are weighted toward gene knockouts, which may not necessarily be directly representative of multifactorial human common complex disease.

However, even with these limitations, we believe valuable insights can be gained from broad based literature assessments of the genetic contribution in human common complex disease and in mouse phenotypic biology. More importantly, this suggests greater opportunities for systematic mining and analysis of published data and in cross comparison of archival molecular databases in both human and animal models of disease with regard to genetic variation, population comparisons, and integration with many different types of orthologous information.



Abbreviations

GAD: Genetic Association Database
MGI: Mouse Genome Informatics
MeSH: Medical Subject Headings
GWAS: Genome Wide Association Study
CDC: Centers for Disease Control and Prevention
HuGENet: Human Genome Epidemiology Network.


References

  1. Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. 2006, 34 (Database issue): D319-321. 10.1093/nar/gkj147.
  2. Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nat Genet. 2004, 36 (5): 431-432. 10.1038/ng0504-431.
  3. Lin BK, Clyne M, Walsh M, Gomez O, Yu W, Gwinn M, Khoury MJ: Tracking the epidemiology of human genes in the literature: the HuGE Published Literature database. Am J Epidemiol. 2006, 164 (1): 1-4. 10.1093/aje/kwj175.
  4. Bult CJ, Eppig JT, Kadin JA, Richardson JE, Blake JA: The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res. 2008, 36 (Database issue): D724-728.
  5. Hancock JM, Adams NC, Aidinis V, Blake A, Bogue M, Brown SD, Chesler EJ, Davidson D, Duran C, Eppig JT, et al: Mouse Phenotype Database Integration Consortium: integration [corrected] of mouse phenome data resources. Mamm Genome. 2007, 18 (3): 157-163. 10.1007/s00335-007-9004-x.
  6. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet. 2007, 80 (4): 588-604. 10.1086/514346.
  7. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics. 2006, 22 (6): 773-774. 10.1093/bioinformatics/btk031.
  8. Yue P, Melamud E, Moult J: SNPs3D: candidate gene and SNP selection for association studies. BMC Bioinformatics. 2006, 7: 166. 10.1186/1471-2105-7-166.
  9. Smink LJ, Helton EM, Healy BC, Cavnor CC, Lam AC, Flamez D, Burren OS, Wang Y, Dolman GE, Burdick DB, et al: T1DBase, a community web-based resource for type 1 diabetes research. Nucleic Acids Res. 2005, 33 (Database issue): D544-549.
  10. Sherman BT, Huang DW, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics. 2007, 8 (1): 426. 10.1186/1471-2105-8-426.
  11. Yi M, Horton JD, Cohen JC, Hobbs HH, Stephens RM: WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data. BMC Bioinformatics. 2006, 7: 30. 10.1186/1471-2105-7-30.
  12. Jegga AG, Chen J, Gowrisankar S, Deshmukh MA, Gudivada R, Kong S, Kaimal V, Aronow BJ: GenomeTrafac: a whole genome resource for the detection of transcription factor binding site clusters associated with conventional and microRNA encoding genes conserved between mouse and human gene orthologs. Nucleic Acids Res. 2007, 35 (Database issue): D116-121. 10.1093/nar/gkl1011.
  13. Butte AJ, Kohane IS: Creation and implications of a phenome-genome network. Nat Biotechnol. 2006, 24 (1): 55-62. 10.1038/nbt1150.
  14. VENNY. An interactive tool for comparing lists with Venn Diagrams.
  15. Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005, 33 (Web Server issue): W741-748. 10.1093/nar/gki475.
  16. PHYLIP.
  17. Fitch WM, Margoliash E: Construction of phylogenetic trees. Science. 1967, 155 (760): 279-284. 10.1126/science.155.3760.279.
  18. Choi JH, Jung HY, Kim HS, Cho HG: PhyloDraw: a phylogenetic tree drawing system. Bioinformatics. 2000, 16 (11): 1056-1058. 10.1093/bioinformatics/16.11.1056.
  19. Ward J: Hierarchical Grouping to optimize an objective function. Journal of American Statistical Association. 1963, 58 (301): 236-244. 10.2307/2282967.
  20. Table S1a-Human GENE-to-Disease/Phenotype. A file of Human Genes followed by Disease Phenotype MeSH terms.
  21. Table S1b-Human GENE-to-Disease/Phenotype interactive. The same list as Table S1a, but with direct searches back to GAD.
  22. Ioannidis JP: Why most published research findings are false. PLoS Med. 2005, 2 (8): e124. 10.1371/journal.pmed.0020124.
  23. Khoury MJ, Little J, Gwinn M, Ioannidis JP: On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol. 2007, 36 (2): 439-445. 10.1093/ije/dyl253.
  24. Becker KG, Simon RM, Bailey-Wilson JE, Freidlin B, Biddison WE, McFarland HF, Trent JM: Clustering of non-major histocompatibility complex susceptibility candidate loci in human autoimmune diseases. Proc Natl Acad Sci USA. 1998, 95 (17): 9979-9984. 10.1073/pnas.95.17.9979.
  25. Becker KG: The common variants/multiple disease hypothesis of common complex genetic disorders. Med Hypotheses. 2004, 62 (2): 309-317. 10.1016/S0306-9877(03)00332-3.
  26. Lee YH, Rho YH, Choi SJ, Ji JD, Song GG, Nath SK, Harley JB: The PTPN22 C1858T functional polymorphism and autoimmune diseases--a meta-analysis. Rheumatology (Oxford). 2007, 46 (1): 49-56. 10.1093/rheumatology/kel170.
  27. Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, et al: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet. 2005, 77 (6): 1044-1060. 10.1086/498651.
  28. Wang X, Kim J, McWilliams R, Cutting GR: Increased prevalence of chronic rhinosinusitis in carriers of a cystic fibrosis mutation. Arch Otolaryngol Head Neck Surg. 2005, 131 (3): 237-240. 10.1001/archotol.131.3.237.
  29. Bresso F, Askling J, Astegiano M, Demarchi B, Sapone N, Rizzetto M, Gionchetti P, Lammers KM, de Leone A, Riegler G, et al: Potential role for the common cystic fibrosis DeltaF508 mutation in Crohn's disease. Inflamm Bowel Dis. 2007, 13 (5): 531-536. 10.1002/ibd.20067.
  30. Table S2a-Mouse GENE-to-Disease/Phenotype. A file of Mouse Genes followed by Disease Phenotype Mammalian Phenotype (MP) terms.
  31. Table S2b-Mouse GENE-to-Disease/Phenotype interactive. The same list as Table S2a, but with direct searches back to MGI and GAD.
  32. Harris SE, Chand AL, Winship IM, Gersak K, Nishi Y, Yanase T, Nawata H, Shelling AN: INHA promoter polymorphisms are associated with premature ovarian failure. Mol Hum Reprod. 2005, 11 (11): 779-784. 10.1093/molehr/gah219.
  33. Wu X, Chen L, Brown CA, Yan C, Matzuk MM: Interrelationship of growth differentiation factor 9 and inhibin in early folliculogenesis and ovarian tumorigenesis in mice. Mol Endocrinol. 2004, 18 (6): 1509-1519. 10.1210/me.2003-0399.
  34. Gharani N, Benayed R, Mancuso V, Brzustowicz LM, Millonig JH: Association of the homeobox transcription factor, ENGRAILED 2, 3, with autism spectrum disorder. Mol Psychiatry. 2004, 9 (5): 474-484. 10.1038/
  35. Cheh MA, Millonig JH, Roselli LM, Ming X, Jacobsen E, Kamdar S, Wagner GC: En2 knockout mice display neurobehavioral and neurochemical alterations relevant to autism spectrum disorder. Brain Res. 2006, 1116 (1): 166-176. 10.1016/j.brainres.2006.07.086.
  36. Table S3-Human-Mouse Gene Overlap. A list of 1105 genes that overlap between the Human GENE-to-Disease Phenotype list (S1) and the Mouse GENE-to-Disease phenotype list (S2).
  37. Table S4a-Human DISEASE/PHENOTYPE-to-Gene. A file of Human Disease Phenotype MeSH terms followed by associated genes.
  38. Table S4b-Human DISEASE/PHENOTYPE-to-Gene Interactive. A file of Human Disease Phenotype MeSH terms followed by associated genes, but with direct searches back to GAD.
  39. Table S5a-Mouse DISEASE/PHENOTYPE-to-Gene (mouse). A file of Mouse Disease-Phenotype Mammalian Phenotype (MP) terms followed by assigned mouse genes.
  40. Table S5b-Mouse DISEASE/PHENOTYPE-to-Gene (mouse) Interactive. A file of Mouse Disease-Phenotype Mammalian Phenotype (MP) terms followed by assigned mouse genes, but with direct searches back to MGI.
  41. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.
  42. Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007, 3 (6): e96. 10.1371/journal.pgen.0030096.
  43. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104.
  44. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, et al: Genetics of gene expression and its effect on disease. Nature. 2008, 452 (7186): 423-428. 10.1038/nature06758.
  45. Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG: A genomewide functional network for the laboratory mouse. PLoS Comput Biol. 2008, 4 (9): e1000165. 10.1371/journal.pcbi.1000165.
  46. Table S6-Human Dendrogram Gene Sets. A file of the GAD Human gene sets used in the dendrogram fig 2.
  47. Table S7-Mouse Dendrogram Gene Sets. A file of the Mouse gene sets used to build the mouse dendrogram fig 3.
  48. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
  49. Evans DM, Visscher PM, Wray NR: Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum Mol Genet. 2009, 18 (18): 3525-3531. 10.1093/hmg/ddp295.
  50. Wray NR, Goddard ME, Visscher PM: Prediction of individual genetic risk of complex disease. Curr Opin Genet Dev. 2008, 18 (3): 257-263. 10.1016/j.gde.2008.07.006.
  51. Heidema AG, Boer JM, Nagelkerke N, Mariman EC, van der AD, Feskens EJ: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23. 10.1186/1471-2156-7-23.
  52. Mei H, Cuccaro ML, Martin ER: Multifactor dimensionality reduction-phenomics: a novel method to capture genetic heterogeneity with use of phenotypic variables. Am J Hum Genet. 2007, 81 (6): 1251-1261. 10.1086/522307.
  53. Slatkin M: Exchangeable models of complex inherited diseases. Genetics. 2008, 179 (4): 2253-2261. 10.1534/genetics.107.077719.
  54. Chasman DI: On the utility of gene set methods in genomewide association studies of quantitative traits. Genet Epidemiol. 2008, 32 (7): 658-668. 10.1002/gepi.20334.
  55. Holden M, Deng S, Wojnowski L, Kulle B: GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008, 24 (23): 2784-2785. 10.1093/bioinformatics/btn516.
  56. Chai HS, Sicotte H, Bailey KR, Turner ST, Asmann YW, Kocher JP: GLOSSI: a method to assess the association of genetic loci-sets with complex diseases. BMC Bioinformatics. 2009, 10 (1): 102. 10.1186/1471-2105-10-102.
  57. Huang da W, Sherman BT, Zheng X, Yang J, Imamichi T, Stephens R, Lempicki RA: Extracting biological meaning from large gene lists with DAVID. Curr Protoc Bioinformatics. 2009, Chapter 13 (Unit 13): 11.
  58. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al: The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 2009, 37 (Database issue): D755-761. 10.1093/nar/gkn875.
  59. Ray S, Britschgi M, Herbert C, Takeda-Uchimura Y, Boxer A, Blennow K, Friedman LF, Galasko DR, Jutel M, Karydas A, et al: Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins. Nat Med. 2007, 13 (11): 1359-1362. 10.1038/nm1653.
  60. Zheng SL, Sun J, Wiklund F, Smith S, Stattin P, Li G, Adami HO, Hsu FC, Zhu Y, Balter K, et al: Cumulative Association of Five Genetic Variants with Prostate Cancer. N Engl J Med. 2008, 358 (9): 910-9. 10.1056/NEJMoa075819.
  61. Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005, 6: 144. 10.1186/1471-2105-6-144.




The authors would like to thank Dr. Ilya Goldberg for helpful discussions, and Drs. Goldberg, David Schlessinger, and Chris Cheadle for critical reading of the manuscript.

This research was supported by the Intramural Research Program of the NIH, National Institute on Aging and Center for Information Technology.

Author information



Corresponding author

Correspondence to Kevin G Becker.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YZ performed statistical analysis, gene set assembly, and contributed to the manuscript. SD performed dendrogram and clustering analysis and contributed to the manuscript. JG, KS, and SAW did database curation and analysis. KGB organized the project, did database curation, performed comparisons, and wrote the manuscript. All authors read and approved the manuscript.

Yonqing Zhang and Supriyo De contributed equally to this work.

Electronic supplementary material

Additional file 1: Hierarchical clustering of 480 Human GAD disease gene sets. This file contains a display of hierarchical clustering of 480 Human GAD disease gene sets, each containing at least 3 genes. (PDF 816 KB)

Additional file 2: Individual human disease functional clusters. This file contains selected subsets of Additional File 1, including: a. tumorigenesis, b. autoimmune, c. cardiovascular, d. metabolism, and e. behavior. (PDF 419 KB)

Additional file 3: Hierarchical clustering of 2067 Mouse phenotypic gene sets. This file contains a display of hierarchical clustering of 2067 Mouse phenotypic gene sets, each containing at least 10 genes. (PDF 3 MB)

Additional file 4: Individual mouse phenotypic functional clusters. This file contains selected subsets of Additional File 3, including: a. immune function, b. metabolism, c. neurological function/behavior, d. DNA replication/tumorigenesis, e. development, and f. cardiovascular. (PDF 469 KB)



About this article

Cite this article

Zhang, Y., De, S., Garner, J.R. et al. Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC Med Genomics 3, 1 (2010).



  • Disease Phenotype
  • Mouse Genetic Model
  • Mammalian Phenotype
  • Mouse Phenotype
  • Genetic Association Database