
Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity



Using annotations to the articles in MEDLINE®/PubMed®, over six thousand chemical compounds with pharmacological actions have been tracked since 1996. Medical Subject Heading Over-representation Profiles (MeSHOPs) quantitatively leverage the literature associated with biological entities such as diseases or drugs, providing the opportunity to reposition known compounds towards novel disease applications.


A MeSHOP is constructed by counting the number of times each medical subject term is assigned to an entity-related research publication in the MEDLINE database and calculating the significance of the count by comparing against the count of the term in a background set of publications. Based on the expectation that drugs suitable for treatment of a disease (or disease symptom) will have similar annotation properties to the disease, we successfully predict drug-disease associations by comparing MeSHOPs of diseases and drugs.


The MeSHOP comparison approach delivers an 11% improvement over bibliometric baselines. However, novel drug-disease associations are observed to be biased towards drugs and diseases with more publications. To account for the annotation biases, a correction procedure is introduced and evaluated.


By explicitly accounting for the annotation bias, unexpectedly similar drug-disease pairs are highlighted as candidates for drug repositioning research. MeSHOPs are shown to provide a literature-supported perspective for discovery of new links between drugs and diseases based on pre-existing knowledge.


Using previously studied and approved pharmaceutical compounds and applying them towards novel diseases or phenotypes - so-called 'drug repositioning' - has emerged as a key issue in biomedical research [1, 2]. The cost of developing a new chemical or molecular entity with proven therapeutic benefit and established safety was estimated at over $1.8 billion in 2010, and continues to rise rapidly [3]. Therefore, using compounds with known biochemical mechanism of action and an established safety record for new purposes is an alternative to the high cost of de novo compound research [4]. Advances in drug repositioning research have identified potential treatments for Crohn's disease [5, 6], and have raised hopes for advances in the treatment of rare, orphan disorders [7].

Informatics-based approaches to drug repositioning are exemplified by the identification of known drug targets in genes arising in genome-wide association studies [8], the prediction of structural suitability of a known compound for a new protein target [9, 10], systems biology using gene expression patterns [6, 11], and the study of side effects [12]. Underlying many of these informatics approaches has been the availability of reference databases containing information about the relationship between genes, drugs and diseases, such as DrugBank [13], Pharmacogenomics Knowledge Base [14, 15], and the Comparative Toxicogenomics Database [16]. The broader informatics approaches to drug repositioning have been recently reviewed [2]. Advances in literature and text analysis methods offer a promising path to drug repositioning based on established knowledge. Text analysis methods have been applied to FDA package inserts in the SIDER database [17] to identify side effects, to the comparison of word utilization between drug-related and disease-related abstracts [18, 19], and to the analysis of similarity between gene ontology process annotations assigned to a known drug target and genes in disease-associated pathways. Literature-based drug repositioning has been reviewed [20, 21].

The foundation of any text-based analysis is an organized resource of the primary research literature describing the properties contained in the text. The central information source for biomedical literature is the MEDLINE®/PubMed® database encompassing over 20 million indexed articles in 2012. PubMed provides a citation resource tailored to biomedical researchers, globally accessible at no charge. This comprehensive database of medically relevant citations is curated by expert annotators at the National Library of Medicine. Each article is indexed with topics from the controlled vocabulary of Medical Subject Headings (MeSH) [22] by domain experts at the National Library of Medicine. MeSH terms include medically relevant categories such as Anatomy, Disease, Chemical Compounds (including pharmacologic compounds) and Psychiatric Disorders. In addition to the topics in the main MeSH hierarchy, additional chemical compounds are indexed through the Supplementary MeSH vocabulary.

Despite the increasing wealth of raw literature knowledge, having means to evaluate and navigate the entirety of this knowledge becomes progressively more challenging. We previously introduced Medical Subject Heading Over-Representation Profiles (MeSHOPs) as a convenient quantitative representation of the properties enriched in a bibliography of scientific literature from MEDLINE [23]. MeSHOPs succinctly describe the most highly associated MeSH terms for an entity of interest. The quantitative comparison of MeSHOPs is shown here to allow the predictive inference of entity-entity relationships in a study of relationships between drugs and diseases. However, we observe that the magnitude of research literature introduces a strong bias into the study of entity-entity relationships, with the most popular diseases more likely to be linked to drugs in the future, and vice-versa. This bias parallels the effect seen when predicting gene-disease relationships via MeSHOPs, where the most popular genes are more likely to be linked to diseases, and vice-versa [24]. It is important to be aware of biases and trends in research that may influence the results of text analysis, and to correct for these biases to better direct research efforts [25, 26].

In this report, we investigate the capacity of MeSHOP comparisons to detect functional relationships between pharmaceutical compounds and diseases with an emphasis on the ranking of candidates for drug repositioning research. We demonstrate that MeSHOPs capture the properties of drugs, and that such information can be compared to disease MeSHOPs to reveal functionally relevant relationships. Entities with limited associated literature, such as some rare diseases, are shown to have disproportionate scores in initial MeSHOP comparisons. To account for existing annotation levels of drug and disease entities and identify MeSHOP similarity, we measure the annotation strength for drug and disease entities and incorporate this prior information into the scoring of prediction strength. Using this improved comparison metric we demonstrate that drug and disease MeSHOP comparisons are improved, as validated by the identification of novel associations observed in future publications and against a curated reference collection.


Pharmacological substances

In this paper, we examine the set of drugs, defined as all chemical compounds annotated as having a Pharmacologic Action, taken from both the Medical Subject Headings (MeSH) and Supplemental MeSH vocabularies. Since 1996, indexers at the National Library of Medicine have tracked articles in which the action of a drug is discussed (MeSH Basics). In 2003, the MeSH Category "Pharmacologic Action" was created to delineate chemical compounds that are used therapeutically as pharmacologic agents. Such annotations are conservatively assigned, requiring a minimum of 20 supporting research articles. We analyze these 6 512 drugs with respect to the diseases in the MeSH hierarchy.

Constructing drug and disease MeSHOPs

The construction of MeSHOPs has been previously described in detail [23], but we provide a description here for the reader's convenience. A MeSHOP is a quantitative representation of the MeSH annotations associated with a set of articles, where the unifying property of the articles is that each addresses the same, specific entity (for example, all articles discussing the entity "Acetaminophen"). Each article has a curator-assigned set of MeSH terms available in MEDLINE. Comparing the observed frequency of each MeSH term annotated to the set of articles relative to the background rate for each term returns a measure of over-representation (see below for additional details). A MeSHOP is a vector of tuples $\langle (t_1, m_1), (t_2, m_2), \ldots, (t_n, m_n) \rangle$. For each tuple $(t_i, m_i)$ in a MeSHOP, $t_i$ is a distinct MeSH term in the MeSH vocabulary and $m_i$ is the over-representation measure for the term $t_i$. To account for the tree structure of MeSH, for each MeSH term associated with an article, the article is considered associated to all of the parent terms of that MeSH term.

We consider the 6 512 pharmacologic compounds identified in MeSH 2007 as the drug entities. The 4 229 terms in MeSH 2007 in Category C "Diseases" compose the set of disease entities. We take as the set of articles for a specific entity all the MEDLINE articles annotated by the associated MeSH term. These MeSH annotations are manually curated by domain experts at the National Library of Medicine.
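The counting step that feeds a MeSHOP, including the propagation of each annotation to its parent terms, can be sketched as follows. This is a minimal illustration rather than the published implementation: the `PARENTS` mapping is a small hand-written stand-in for the MeSH tree, the three articles are invented, and the function names are hypothetical.

```python
from collections import Counter

# Hand-written stand-in for a fragment of the MeSH tree (child -> parents).
# Illustrative only; the real hierarchy comes from the MeSH release files.
PARENTS = {
    "Acetaminophen": ["Acetanilides"],
    "Acetanilides": ["Anilides"],
    "Anilides": ["Amides"],
    "Amides": [],
    "Liver Failure, Acute": ["Liver Failure"],
    "Liver Failure": ["Liver Diseases"],
    "Liver Diseases": [],
}


def with_ancestors(term, parents):
    """Return the term together with every ancestor reachable in the tree."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def term_article_counts(articles, parents):
    """Count, per MeSH term, the articles annotated with it or a descendant."""
    counts = Counter()
    for mesh_terms in articles:
        expanded = set()
        for term in mesh_terms:
            expanded |= with_ancestors(term, parents)
        counts.update(expanded)  # an article counts at most once per term
    return counts


# Three invented articles about one entity, each a set of assigned MeSH terms.
entity_articles = [
    {"Acetaminophen", "Liver Failure, Acute"},
    {"Acetaminophen"},
    {"Liver Failure"},
]
counts = term_article_counts(entity_articles, PARENTS)
```

Collecting the expanded terms in a set before updating the counter ensures that an article annotated with both a term and its child is still counted only once for the parent.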

Predicting drug-disease associations

A drug and a responsive disease are anticipated to share common literature annotations, such as metabolic pathways, cellular processes and symptoms, even if no links between the drug and the disease have been previously reported in the literature. To infer novel relationships between a drug and a disease, we perform quantitative pairwise comparisons of MeSHOPs between members of each class. We hypothesize that a previously unassociated disease $t$ is likely to be associated with a drug $d$ if the MeSHOP $P_t$ for the disease $t$ is highly similar to the drug's MeSHOP $P_d$. When many biomedical terms are common between two profiles, the likelihood for a future association between the entities profiled is expected to increase.

Sixteen distinct similarity measures were evaluated using Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores, from counting measures such as term overlap and term coverage to calculated measures such as Euclidean (L2) and cosine distance of p-value profiles (See Table 1). The scores evaluate the shared characteristics from both the drug and the disease MeSHOPs to make predictions. Two baselines are presented for comparison: the number of terms in the drug MeSHOP, and the number of terms in the disease MeSHOP. These baselines consider only the drug MeSHOP alone, or the disease MeSHOP alone, respectively, not using any information from the other MeSHOP.

Table 1 Explanation of the scoring functions evaluated.

After implementing and evaluating the scoring metrics using AUC scores, a consistently effective metric was determined to be the Euclidean distance of the log of the p-value for the overlapping terms between the drug and the disease. P-values were reported by Fisher's Exact Test based on a hypergeometric distribution of term utilization across a background set of articles. For this report, two background sets are considered. When working within a specific class of entities (e.g. drugs), the background is most appropriately all articles that are associated with one or more members of the entity class. For comparisons between entity classes, a universal background is used. For this study, the universal set contained 17 million MEDLINE articles assigned MeSH terms in MEDLINE 2007.
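The over-representation p-value for a single term can be illustrated with the upper tail of the hypergeometric distribution. The sketch below assumes a one-sided test (the text does not state the sidedness), uses only the Python standard library rather than the R package used in the study, and the function names and example counts are invented for illustration.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def overrepresentation_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: entity articles annotated with the term
    n: all entity articles
    K: background articles annotated with the term
    N: all background articles
    """
    lower = max(0, n + K - N)  # smallest overlap that is possible
    upper = min(n, K)          # largest overlap that is possible
    if k <= lower:
        return 1.0
    if k > upper:
        return 0.0
    log_total = log_choose(N, n)
    log_terms = [
        log_choose(K, i) + log_choose(N - K, n - i) - log_total
        for i in range(k, upper + 1)
    ]
    peak = max(log_terms)
    # log-sum-exp keeps the sum stable when individual terms are tiny
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


# Invented counts: 12 of 50 entity articles carry a term that annotates
# 2 000 of 1 000 000 background articles.
p_value = overrepresentation_p(k=12, n=50, K=2000, N=1_000_000)
```

Working in log space matters here because a background of millions of articles makes the individual binomial coefficients far too large to evaluate directly.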

Correcting for pre-existing literature annotation

Given the significant impact of annotation bias on pairwise MeSHOP comparison, we introduce a correction of our similarity scores for these pre-existing literature effects. This correction aims to normalize the scores with respect to existing literature annotation, correcting for inherent biases in the scoring methods and revealing associations that are due to the similarity of annotation rather than the amount of annotation (the research "popularity" of the entity).

Expressed formally, let us consider drug-disease relationships, with scores $X_s$, drug annotation levels $X_c$ and disease annotation levels $X_d$, where the annotation level is the number of MeSH terms annotated to articles in MEDLINE for the drug or disease. For a given drug $c$ and disease $d$ with drug annotation level $x_c$ and disease annotation level $x_d$ and a drug-disease score $x_s$, we want to determine the probability that $x_s$ is more extreme than a random drug-disease relationship score with drug annotation level $x_c$ and disease annotation level $x_d$:

$$P(X_s > x_s \mid (X_c = x_c) \wedge (X_d = x_d))$$

However, this probability can only be directly computed when the set of drugs and diseases is sufficiently large that there are many drugs and many diseases with the same level of annotation. In order to correct for the previously observed bias, we will seek to adjust the significance based on the local distribution of scores observed for similarly annotated entities:

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d))$$

This can be computed by incorporating the properties of conditional probability as

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d)) = \frac{P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))}{P((X_c \approx x_c) \wedge (X_d \approx x_d))}$$

As well, since $P(X_c \approx x_c)$ and $P(X_d \approx x_d)$ are independent, this can be further simplified to

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d)) = \frac{P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))}{P(X_c \approx x_c) \, P(X_d \approx x_d)}$$

We select $P(X_c \approx x_c) = P(X_d \approx x_d) = 0.1$, and compare against the 10% of the drugs that are most similar, annotation level-wise, to the drugs in the relationship of interest, and likewise for 10% of the diseases. Specifically, we take the drugs within ±5 percentile of annotated term counts, and likewise the diseases within ±5 percentile of annotated term counts. The similarity scores for each possible drug-disease pairing between these selected groups are extracted. By comparison against these scores, an empirical significance score of the candidate drug-disease pairing is assigned. Given the 4 229 diseases and 6 512 drugs, selecting 10% yields hundreds of drug and disease peers, and several hundred thousand scores with which to compare.

$P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))$ is computed by dividing the number of drug-disease relationships with score greater than $x_s$ and with drug and disease annotation similar to $x_c$ and $x_d$ respectively, by the total number of drug-disease relationships. The correction described allows us to separate the effect of the level of annotation for the drug and disease from the similarity of the concepts and allows the user to distinguish high-scoring drug-disease relationships that are primarily due to the annotation level of the drug or disease concept, from high-scoring relationships that arise due to sharing significant profile similarity.
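The peer-based correction can be sketched in a few lines. This is an illustrative reading of the procedure, not the published code: the function names, the percentile definition, and the toy data are assumptions. Dividing the joint probability by the two peer fractions is treated here as equivalent to taking the fraction of peer pairings whose score exceeds the candidate's score.

```python
from bisect import bisect_left


def percentile_ranks(levels):
    """Map each entity to a percentile (0-100) of its annotation level."""
    ordered = sorted(levels.values())
    n = len(ordered)
    return {name: 100.0 * bisect_left(ordered, value) / n
            for name, value in levels.items()}


def peers(name, ranks, window=5.0):
    """Entities whose percentile lies within +/- window of the named entity."""
    centre = ranks[name]
    return [other for other, rank in ranks.items()
            if abs(rank - centre) <= window]


def corrected_score(drug, disease, scores, drug_levels, disease_levels,
                    window=5.0):
    """Fraction of peer drug-disease scores greater than the candidate's."""
    drug_peers = peers(drug, percentile_ranks(drug_levels), window)
    disease_peers = peers(disease, percentile_ranks(disease_levels), window)
    x_s = scores[(drug, disease)]
    peer_scores = [scores[(c, d)]
                   for c in drug_peers
                   for d in disease_peers
                   if (c, d) in scores]
    exceeding = sum(1 for s in peer_scores if s > x_s)
    return exceeding / len(peer_scores)


# Toy data: 4 drugs, 2 diseases, raw scores 1..8 (all values invented).
drug_levels = {"c1": 5, "c2": 6, "c3": 7, "c4": 8}
disease_levels = {"d1": 3, "d2": 4}
scores = {}
value = 1
for c in drug_levels:
    for d in disease_levels:
        scores[(c, d)] = value
        value += 1

# A window of 100 percentiles makes every entity a peer in this tiny example.
example = corrected_score("c2", "d1", scores, drug_levels, disease_levels,
                          window=100.0)
```

The comparison direction follows the inequality $X_s > x_s$ as written; for a distance-like score, where smaller values indicate greater similarity, the inequality in `corrected_score` would presumably need to be flipped.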

Validating drug-disease associations

To evaluate drug-disease associations predicted by MeSHOP similarity, we analyzed the 2007 baseline release of MEDLINE to generate predictions, and measured our predictive performance against annotations appearing in future releases of MEDLINE. The annual MEDLINE baseline releases 2007 and 2010 were used as the source of MeSH annotations for articles and were obtained directly from the NLM. The drug and disease MeSHOPs, computed for the MEDLINE baseline 2007, were compared using a panel of 16 similarity scores.

Future disease-drug relationships are predicted if MeSHOP comparison similarity scores exceed an applied threshold. Predictions were validated against drug-disease co-occurrences that appeared in the future MEDLINE releases which had not appeared in articles before 2007. A true positive novel association means an article referring to a previously unconnected drug-disease pair was published in the interim period between the 2007 and 2010 MEDLINE baselines.

As a second validation set, the Comparative Toxicogenomics Database (CTD) was used as a source of curated drug-disease relationships. We matched drugs from the 2011 CTD to the drugs defined in MEDLINE 2007, and defined a reference collection of 291 novel drug-disease relationships for those entries in CTD that were defined by publications appearing in the period of 2007-2011. The reference collection contains 191 unique drugs and 150 unique diseases.

Using these validation sets, we evaluate the candidate scoring methods by computing the Receiver Operating Characteristic (ROC) curve for predictions from analysis of the baseline 2007 data and reporting the Area Under the ROC Curve (AUC). Novel drug-disease pairs from the two reference sets are defined as "true positives", and all other drug-disease pairings are defined as "true negatives" (which is recognized to be conservative, as such pairs may be validated in future studies). All drug-disease pairs reported prior to 2007 are excluded from the AUC analysis.
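The AUC used throughout the evaluation can be computed from ranks alone. The sketch below is a generic rank-based (Mann-Whitney) formulation, not the evaluation code released with the study; it assumes that higher scores indicate predicted associations (a distance would be negated first) and that pairs reported prior to 2007 have already been removed. The function name and example pairs are hypothetical.

```python
def roc_auc(scored_pairs, positives):
    """Rank-based AUC: probability that a positive outscores a negative.

    scored_pairs: dict mapping a drug-disease pair to its prediction score
    positives: set of pairs treated as true positives
    """
    ranked = sorted(scored_pairs.items(), key=lambda item: item[1])
    n = len(ranked)
    n_pos = sum(1 for pair in scored_pairs if pair in positives)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    rank_sum = 0.0
    i = 0
    while i < n:
        # find the run of tied scores starting at position i
        j = i
        while j + 1 < n and ranked[j + 1][1] == ranked[i][1]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ties share their mean rank
        for k in range(i, j + 1):
            if ranked[k][0] in positives:
                rank_sum += average_rank
        i = j + 1

    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Invented example: two novel pairs among four candidates.
toy_scores = {
    ("drugA", "disease1"): 0.9,
    ("drugB", "disease1"): 0.8,
    ("drugA", "disease2"): 0.3,
    ("drugB", "disease2"): 0.1,
}
toy_positives = {("drugA", "disease1"), ("drugA", "disease2")}
auc = roc_auc(toy_scores, toy_positives)
```

Sharing the mean rank among tied scores gives half credit to each tied positive-negative pair, so a scoring function that assigns every pair the same value yields an AUC of 0.5.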

The gold standard dataset analysed by the PREDICT algorithm [27] was mapped, with 574 of the 593 drugs mapping to 2007 MeSH pharmacologic compounds and 190 of the 313 diseases mapping to MeSH Category C disease terms. A small number of drugs were not identified as pharmacologic compounds in 2007 MeSH. Diseases mapping to a combination of multiple MeSH disease phenotypes, or mapping to MeSH terms that were not in the Disease Category of MeSH, were not included. Overall, 924 of the 1933 associations from the gold standard were mapped, comprising 406 drugs and 160 diseases. For the purposes of calculating the ROC validation curves, all other drug-disease associations between the mapped drugs and diseases are considered to be false positives. All the drug and disease mapped terms, as well as all the mapped gold standard drug-disease relationships, are available for download.


The analysis was performed using Python, XSLT, and the MySQL database system. Fisher's Exact Test p-values were computed using the R statistics package. Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine. A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM.

Data was leased and downloaded from MEDLINE/PubMed. The Comparative Toxicogenomics Database validation set was taken from the drug-disease relationships dataset.

Results are freely accessible on the web. Source code implemented in Python is available for the drug and disease profile analysis and for the evaluation and validation of results.


Generation of drug MeSH Over-representation Profiles (MeSHOPs)

MeSHOPs provide a quantitative overview of the biomedical knowledge associated with an entity of interest through the indexed biomedical terms. Following the described methods, MeSHOPs for all indexed diseases and drugs in MEDLINE were generated using archived MEDLINE data up until 2007. A drug MeSHOP is presented for acetaminophen (Figure 1), and a disease MeSHOP is presented for Aniridia (Figure 2). The scores within MeSHOPs are influenced by the background correction for the expectation of MeSH term frequency. If one takes the background rate from all articles in MEDLINE, MeSH terms preferentially associated with drugs are likely to be emphasized in the drug MeSHOPs, such as 'pharmaceutical preparation'. The strong scores for such drug-related terms can be corrected for by using class-specific backgrounds - such as the subset of articles that address one or more drugs. For comparisons of MeSHOPs across categories, as will follow, we select the universal background as a common background for all entities being compared.

Figure 1

MeSHOP for Acetaminophen. This word cloud presents all terms in the Acetaminophen MeSHOP that have a p-value of 0. The size of each term in the word cloud is proportional to the number of related articles for the term.

Figure 2

MeSHOP for Aniridia. The top 150 terms in the profile for the disease Aniridia are shown, where the font size of each MeSH term is proportional to the negative log p-value for the term.

Predicting drug-disease associations

We examine the utility of drug-disease MeSHOP similarity scores for the prediction of drug-disease co-annotation in future publications. Table 2 demonstrates that comparison of drug and disease MeSHOPs predicts future drug-disease co-occurrence in subsequent years (2007-2011). The most effective similarity score is the Euclidean distance of log-p of overlapping terms only, which produces an AUC score of 0.95 for the prediction of future co-occurrence in publications:

$$\sqrt{\sum_{i \in (C \cap D)} \left( \log(c_p(i)) - \log(d_p(i)) \right)^2}$$

($C$ and $D$ refer to the MeSH terms of drug and disease MeSHOPs respectively, $c_p(i)$ and $d_p(i)$ refer to the p-value for the MeSH term $i$ of the drug or disease profile respectively).

Table 2 Performance of a selection of drug-disease similarity scores.
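The Euclidean distance of log p-values over the overlapping terms can be sketched as below. Profiles are represented as plain dictionaries from MeSH term to p-value. The `P_FLOOR` constant is an assumption added for this illustration, because p-values reported as 0 cannot be log-transformed. The function name and example profiles are likewise invented.

```python
import math

P_FLOOR = 1e-300  # assumed floor so that log(0) is never evaluated


def log_p_distance(drug_profile, disease_profile, floor=P_FLOOR):
    """Euclidean distance between log p-values over the shared MeSH terms."""
    shared = drug_profile.keys() & disease_profile.keys()
    if not shared:
        return None  # no overlapping terms; the score is left undefined here
    total = 0.0
    for term in shared:
        diff = (math.log(max(drug_profile[term], floor))
                - math.log(max(disease_profile[term], floor)))
        total += diff * diff
    return math.sqrt(total)


# Invented profiles: only "Inflammation" and "Pain" are shared.
drug = {"Inflammation": 0.01, "Pain": 0.1, "Liver": 0.5}
disease = {"Inflammation": 0.001, "Pain": 0.1, "Joints": 0.2}
distance = log_p_distance(drug, disease)
```

Because the sum runs only over shared terms, two profiles with little overlap can obtain a small distance, which is one way the number of annotated terms can influence the raw score.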

Enthusiasm for the performance is tempered, however, by the fact that a simple metric of the number of MeSH terms associated with a disease when used as a prediction ranking produces an AUC score of 0.84 (and counts for drug-associated MeSH terms produce a score of 0.80). Randomly assigned scores will produce an AUC of 0.5. These results are consistent with a process in which well-studied diseases (or drugs) are more likely to be the subject of future research publications and therefore more likely to co-occur with drugs than diseases that have few publications. These scores reflect a systematic limitation in the scoring procedure that needed to be resolved to allow for the identification of drugs suitable for orphan disorders, as well as to produce a more refined list of candidates to pursue.

When we examine the mapped validation set evaluated by the PREDICT algorithm, we see a non-random but weaker predictive ability from the number of terms for the disease (AUC of 0.60) and the number of terms for the drug (AUC of 0.58).

Comparing drug-disease MeSHOP profiles can yield an AUC of up to 0.87, approaching the AUC of 0.90 reported by the PREDICT algorithm on the unmapped gold standard dataset (see Table 2).

Annotation bias observed for curated drug-disease relationships

Predicted novel drug-disease relationships were alternatively assessed against a curated reference collection from CTD that contains bona fide drug uses (i.e. not just co-occurrence in a paper, but manually assessed evidence that the drug is used as a treatment for a disorder). As seen in Table 2, comparing the MeSHOPs of drugs and diseases accurately predicts novel associations, achieving a ROC AUC of 0.93 (for the sum of the log of combined p-values). The Euclidean distance of overlapping terms, the metric that performed best in the previous MeSHOP comparison tests, produces a similar ROC AUC of 0.92. As displayed in Figure 3, well-studied drugs and diseases are over-represented in the validation set. Over half of the 191 drugs are in the top 10% of all drugs in terms of amount of associated MeSH annotation (the peak to the left of the histogram). The diseases are only slightly less biased: of the 150 diseases, over half are in the top 15% of diseases in terms of associated MeSH annotation. Consistent with these properties, using the baseline MeSH term counts for drug or disease annotation levels as scores, a ROC AUC of 0.83 is achieved. As for the co-occurrence measure, it is clear that annotation bias is a strong predictor for bona fide interactions.

Figure 3

Distribution of drug annotation (A) and disease annotation (B) in the new drug-disease associations of the CTD validation set. The x-axis represents the quantile of the MeSH term counts for the drugs (part A) and diseases (part B) in the CTD reference collection. The histograms indicate that both drugs and diseases within the CTD reference collection are biased toward greater numbers of associated MeSH terms.

Controlling for annotation bias

The influence of annotation on the MeSHOP comparison scores can be visualized using heatmaps. As seen in Figure 4, and fully consistent with the AUC scores above, there is a high degree of correlation between the amount of annotation for the disease (as measured by the number of MeSH terms in the disease profile), and the drug-disease score (Pearson correlation of -0.82). A correlation of -0.33 is observed when comparing drug-disease scores against the degree of drug annotation (see Figure 5). For a candidate list for drug repositioning, this annotation bias must be eliminated to allow for more rarely studied drugs or diseases to emerge from the analysis as candidates. We introduce a corrected scoring procedure for MeSHOP comparisons that computes the significance of similarity scores based on the distribution of scores for drug-disease tuples with similar annotation levels. In short, the observed similarity score should be remarkable given the level of annotation of the drug and disease in the tuple. After applying this correction for drug-disease annotation bias, both disease annotation level and drug annotation level have very low correlation to the drug-disease score (0.08 and 0.05 respectively) (see Figure 6 and Figure 7). Table 3 demonstrates how the correction re-ranks the candidate drugs, shifting focus away from general compounds like monoclonal antibodies, immunoglobulin G, epinephrine and iron to compounds more directly related to Arthritis and Gout. This also highlights some similar compounds that have not previously been linked to gout, such as glucametacin and imidazole-2-hydroxybenzoate. We see similar results for the candidate drug lists for Asthma, Cardiac Arrhythmias, Jaundice and Lupus and provide the entire list of drug-disease relationships with raw and corrected scores online (see Additional file 1 and Supplementary Table 2).
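The correlations quoted above relate one annotation level and one score per drug-disease tuple. A minimal Pearson correlation, written with the standard library only, is sketched below; the function name and input values are invented and do not reproduce any reported figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented tuples: annotation level of the disease and the raw comparison score.
annotation_levels = [100, 200, 300, 400]
raw_scores = [8.0, 6.0, 4.0, 2.0]
r = pearson(annotation_levels, raw_scores)
```

A strongly negative value on raw scores, followed by a value near zero on corrected scores, is the pattern the correction is intended to produce.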

Figure 4

The degree of disease annotation plotted against MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using Euclidean Distance. The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is highly inversely correlated (Pearson correlation score of -0.82) with the similarity score.

Figure 5

The degree of drug annotation vs. MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance. The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is inversely correlated (Pearson correlation score of -0.33) with the similarity score.

Figure 6

Disease annotation vs. corrected MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels. The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is no longer correlated (Pearson correlation score of 0.08) once corrected.

Figure 7

Drug annotation vs. corrected MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels. The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is no longer correlated (Pearson correlation score of 0.05) once corrected.

Table 3 Comparison of top drug candidates for gout.


In this report, we introduce a new literature-based procedure for the analysis of drug-disease similarity with a focus on the identification of candidates for drug-repositioning. Using MeSH Over-representation Profiles (MeSHOPs) as quantitative representatives for biological entities, we seek to identify drugs and diseases with similar annotation under the expectation that such similarity may be suggestive of potential for repositioning. Drug-disease MeSHOP similarity scores, using a panel of metrics, are found to be strongly influenced by the level of annotation of drugs and diseases. The most heavily studied diseases and drugs are disproportionately emphasized by the comparison scores. A new corrected scoring procedure is introduced to account for the background expectation of similarity scores for comparably annotated drugs and diseases. The new procedure is demonstrated to account for the bias. Application of the MeSHOP similarity scoring procedure reveals a set of candidate drugs for future repositioning research.

The assessment of drug repositioning candidate predictions is necessarily problematic. Given the expense of validating drug efficacy, there is no reference collection against which to measure performance. In this report we elected to use two reference approaches. First, we predicted future co-occurrence in the research literature. This measure is indirect, as co-occurrence does not necessarily reflect a functional tie between the drug and disease. Furthermore, this measure is particularly susceptible to annotation influence - well-studied drugs and diseases have a higher rate of future publications and are thus more likely to be linked. The second reference collection tested was extracted from the CTD, which records bona fide drug-disease links. The performance measurements reflect a similar literature bias on the CTD results, which may reflect a tendency for well-studied drugs to be tested for utility in well-studied disease therapy.

Within this report, we observe that the MeSHOP comparisons perform better than simple annotation measures, which indicates that the similarity assessment has value. Furthermore, we were able to identify and correct for the annotation bias influence on the analysis. It is our hope that future annotation-based similarity measures will be evaluated for the biases we observe here.

The source of the annotation biases identified in the validation sets may lie in methodological bias or be intrinsic to the nature of drug-disease relationships. The case for methodological bias notes the relationship between the existence of experimental protocols and the publication of related research. The study of disease involves the availability of appropriate animal models, a family with a history of the condition, a large-scale association study, and an accurate protocol to diagnose the condition. The rarity and severity of the disease will also change the degree of research interest. Likewise, the study of drugs benefits from animal models, bioassays to detect the compound, the ability and ease to generate the compound, and the ability to deliver an appropriate dosage of the compound to the targets of interest. Other factors motivating research directions are availability of funding and the focus of existing lab personnel and their research towards more popular directions of research.

However, the bias may also be intrinsic to the nature of the disease or of the drug. Gillis and Pavlidis [28] have previously observed that multifunctional genes are a strong driver in gene function prediction. They identify gene multifunctionality through protein interaction and co-expression datasets, which encompass previous definitions of the "hub-ness" of a particular gene. A drug may have a more global effectiveness, due to targeting these multifunctional genes or their pathways, and thereby be involved in more drug-disease associations. Similarly, there may be diseases that are involved in key processes, and are therefore the target of many potential drugs. Whether the biases are intrinsic to the biology of drugs and diseases, primarily introduced by human nature in the research, or some combination of these factors will ultimately be revealed by the results of future research. As our knowledge of the nature of drugs and diseases increases and matures, the human elements and methodological biases will become less significant, leaving us to identify the degree to which this bias is due to the biological mechanism and nature of the drugs and diseases.

The principle underlying the comparison approach to revealing novel drug repositioning candidates is that drug actions and disease properties will share characteristics. While the current approach compares profiles across all MeSH terms, it may be beneficial to restrict the analysis to a subset of more relevant MeSH terms. Development of a procedure to restrict the terms (the features) of MeSHOPs may allow more specific drug repositioning candidates to emerge in the future.
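One way such a restriction could look is sketched below. It is only an illustration, not a procedure from this work: it assumes a MeSHOP is held as a dictionary mapping each MeSH term to its p-value, and that a lookup from term to MeSH tree numbers is available. The function name `restrict_meshop`, the term labels and the tree numbers in the example are hypothetical placeholders, not values taken from MeSH.

```python
def restrict_meshop(meshop, tree_numbers, allowed_prefixes):
    """Keep only the terms that sit under at least one allowed MeSH branch.

    meshop           -- dict: MeSH term -> p-value (assumed representation)
    tree_numbers     -- dict: MeSH term -> list of tree numbers (assumed lookup)
    allowed_prefixes -- tuple of tree-number prefixes, e.g. ("C",) for Diseases
    """
    kept = {}
    for term, p_value in meshop.items():
        numbers = tree_numbers.get(term, [])
        # A term may appear in several branches; one match is enough.
        if any(number.startswith(allowed_prefixes) for number in numbers):
            kept[term] = p_value
    return kept


# Toy data: labels and tree numbers are made up for illustration only.
example_meshop = {"Term A": 1e-4, "Term B": 0.02, "Term C": 0.3}
example_tree_numbers = {
    "Term A": ["C01.100"],
    "Term B": ["D02.200", "C05.300"],
    "Term C": ["G03.400"],
}
disease_only = restrict_meshop(example_meshop, example_tree_numbers, ("C",))
```

Restricting by tree-number prefix is just one option; a curated list of allowed terms would work with the same dictionary representation.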

Future work

MeSH covers a wide spectrum of medically relevant topics; however, some applications may be better served by a vocabulary with more specific terms in the field of interest. For example, only eight terms in MeSH (Akathisia, Drug-Induced; Drug Eruptions; Drug Toxicity; Dyskinesia, Drug-Induced; Epidermal Necrolysis, Toxic; Erythema Nodosum; Serotonin Syndrome; Serum Sickness) relate directly to adverse drug events. Instead, adverse events are largely captured through subheadings: "adverse effects", "poisoning", "toxicity" and "contraindications" can occur with drug terms, while "chemically induced" and "complications" occur with adverse outcomes. Expanding the analysis to look specifically for these subheading modifiers could allow us to extract a subset of articles directly relevant to adverse drug reactions for MeSHOP analysis. Alternatively, another source linking side effects to articles could be employed to supplement our existing analysis with side-effect data.
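The subheading-based selection described above might be sketched as follows. This is a minimal illustration under stated assumptions rather than an implemented part of the method: each article is assumed to be available as a list of (descriptor, subheading) pairs, a set of drug descriptors is assumed to be known in advance, and any non-drug descriptor is treated as a candidate outcome. The function names and the article identifiers are hypothetical.

```python
# Subheadings named in the text; matching is case-insensitive.
DRUG_SUBHEADINGS = {"adverse effects", "poisoning", "toxicity", "contraindications"}
OUTCOME_SUBHEADINGS = {"chemically induced", "complications"}


def is_adverse_event_article(annotations, drug_terms):
    """Return True if any annotation pairs a relevant subheading with its term.

    annotations -- iterable of (descriptor, subheading or None) pairs
    drug_terms  -- set of descriptors regarded as drugs (assumed to be given)
    """
    for descriptor, subheading in annotations:
        if subheading is None:
            continue
        subheading = subheading.lower()
        if descriptor in drug_terms and subheading in DRUG_SUBHEADINGS:
            return True
        # Assumption: every non-drug descriptor may describe an outcome.
        if descriptor not in drug_terms and subheading in OUTCOME_SUBHEADINGS:
            return True
    return False


def select_adverse_event_articles(articles, drug_terms):
    """Return the identifiers of articles flagged as adverse-event related."""
    return [
        article_id
        for article_id, annotations in articles.items()
        if is_adverse_event_article(annotations, drug_terms)
    ]


# Toy annotations with placeholder article identifiers.
example_articles = {
    "article-1": [("Aspirin", "adverse effects"), ("Headache", "drug therapy")],
    "article-2": [("Aspirin", "therapeutic use")],
    "article-3": [("Liver Failure", "chemically induced")],
    "article-4": [("Asthma", None)],
}
selected = select_adverse_event_articles(example_articles, drug_terms={"Aspirin"})
```

The selected articles could then serve as the entity-related publication set from which an adverse-reaction MeSHOP is built.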

CitationRank [29] was used to highlight genes involved in adverse drug reactions by analyzing the co-occurrence of genes in articles relating to an adverse drug reaction. Examining the comprehensive network of MeSHOP similarity between genes, drugs and diseases would allow a similar network-style analysis that incorporates information about the gene entities.

Rather than predicting drug-disease associations directly, another application of the method could be to highlight potential links between drugs and mechanisms of action. Drug therapies can be effective even when the understanding of the underlying mechanism of action is incomplete. These predicted drug-mechanism links could also be related back to relevant diseases, indirectly helping to generate hypotheses about the biology of a disease and effective mechanisms for treatment.


Comparing MeSHOPs allows quantitative analysis of the MeSH biomedical topics shared between drugs and diseases through their MEDLINE-indexed primary literature. Quantitative measures of MeSHOP similarity are shown to capture functional relationships between drugs and diseases. Specifically, the similarity between drug MeSHOPs and disease MeSHOPs is highly predictive of future drug-disease ties. The best similarity metric, using Euclidean distance of the log-p of overlapping terms, achieves a mean AUC of 0.94, an 11% improvement over baseline. However, bibliometric characteristics, such as the number of terms in the disease MeSHOP, are demonstrated to strongly bias the predicted drug-disease associations. We describe here a correction that eliminates this bias in the scoring metrics, separating the effects of the similarity scoring from the annotation bias.
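For illustration, the best-performing metric named above can be sketched as below. The exact formulation is not given in this section, so the sketch makes several assumptions: a MeSHOP is represented as a dictionary from MeSH term to p-value, the logarithm is base 10, p-values are floored at a small constant to avoid taking the log of zero, and pairs with no overlapping terms return `None`. The function name `log_p_euclidean_distance` is hypothetical.

```python
import math


def log_p_euclidean_distance(drug_meshop, disease_meshop, floor=1e-300):
    """Euclidean distance between log p-values over the terms both profiles share.

    drug_meshop, disease_meshop -- dict: MeSH term -> p-value (assumed representation)
    floor                       -- lower bound on p-values before taking the log (assumption)
    """
    shared_terms = drug_meshop.keys() & disease_meshop.keys()
    if not shared_terms:
        return None  # no overlapping terms, so the distance is left undefined here
    squared_sum = 0.0
    for term in shared_terms:
        log_p_drug = math.log10(max(drug_meshop[term], floor))
        log_p_disease = math.log10(max(disease_meshop[term], floor))
        squared_sum += (log_p_drug - log_p_disease) ** 2
    return math.sqrt(squared_sum)


# Toy profiles with placeholder term labels and made-up p-values.
example_drug = {"Term A": 1e-3, "Term B": 1e-5, "Term C": 0.5}
example_disease = {"Term A": 1e-6, "Term B": 1e-5, "Term D": 0.1}
distance = log_p_euclidean_distance(example_drug, example_disease)
```

In the toy example only "Term A" and "Term B" overlap, and only "Term A" differs (by three log units), so the distance is approximately 3. This sketch covers the raw similarity score only; the bias correction described in this work is a separate step and is not reproduced here.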


This work was supported by the Canadian Institutes of Health Research [to WWW]; the Ontario Institute for Cancer Research through funding by the government of Ontario [to BFFO]; the Natural Sciences and Engineering Research Council of Canada [to WWW and WAC]; the Michael Smith Foundation for Health Research (MSFHR) [to WWW and WAC]; the National Institute of General Medical Sciences [R01GM084875 to WWW]; and the Canadian Institutes of Health Research/MSFHR Strategic Training Program in Bioinformatics [to WAC].


The authors are grateful to Leon French, Paul Pavlidis and Raf Podowski for comments and discussion on the research and Joseph Yamada for help with the website development.


The publication costs for this article were funded by the Centre for Molecular Medicine and Therapeutics (funds awarded to WWW) and the Ontario Institute for Cancer Research (funds awarded to BFFO).

This article has been published as part of BMC Medical Genomics Volume 6 Supplement 2, 2013: Selected articles from the Second Annual Translational Bioinformatics Conference (TBC 2012). The full contents of the supplement are available online.


  1. Ashburn TT, Thor KB: Drug repositioning: identifying and developing new uses for existing drugs. Nature reviews Drug discovery. 2004, 3: 673-83. 10.1038/nrd1468.

  2. Dudley JT, Deshpande T, Butte AJ: Exploiting drug-disease relationships for computational drug repositioning. Brief Bioinform. 2011, 12: 303-311. 10.1093/bib/bbr013.

  3. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL: How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature reviews Drug discovery. 2010, 9: 203-14.

  4. Deftereos SN, Andronis C, Friedla EJ, Persidis A, Persidis A: Drug repurposing and adverse event prediction using high-throughput literature analysis. Wiley interdisciplinary reviews. Systems biology and medicine. 2011, 3: 323-34. 10.1002/wsbm.147.

  5. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte AJ: Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Science Translational Medicine. 2011, 3: 96ra77-96ra77. 10.1126/scitranslmed.3001318.

  6. Dudley JT, Sirota M, Shenoy M, Pai RK, Roedder S, Chiang AP, Morgan AA, Sarwal MM, Pasricha PJ, Butte AJ: Computational Repositioning of the Anticonvulsant Topiramate for Inflammatory Bowel Disease. Science Translational Medicine. 2011, 3: 96ra76-96ra76. 10.1126/scitranslmed.3002648.

  7. Sardana D, Zhu C, Zhang M, Gudivada RC, Yang L, Jegga AG: Drug repositioning for orphan diseases. Brief Bioinform. 2011, 12: 346-356. 10.1093/bib/bbr021.

  8. Sanseau P, Agarwal P, Barnes MR, Pastinen T, Richards JB, Cardon LR, Mooser V: Use of genome-wide association studies for drug repositioning. Nature Biotechnology. 2012, 30: 317-320. 10.1038/nbt.2151.

  9. Kinnings SL, Liu N, Buchmeier N, Tonge PJ, Xie L, Bourne PE: Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS computational biology. 2009, 5: e1000423-10.1371/journal.pcbi.1000423.

  10. Li YY, An J, Jones SJM: A large-scale computational approach to drug repositioning. Genome informatics. International Conference on Genome Informatics. 2006, 17: 239-47.

  11. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte AJ: Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Science Translational Medicine. 2011, 3: 96ra77-96ra77. 10.1126/scitranslmed.3001318.

  12. Yang L, Agarwal P: Systematic drug repositioning based on clinical side-effects. PloS one. 2011, 6: e28025-10.1371/journal.pone.0028025.

  13. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research. 2008, 36: D901-6.

  14. Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, Hewett M, Lin Z, Liu Y, Liu S, Oliver DE, Rubin DL, Shafa F, Stuart JM, Altman RB: Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. The pharmacogenomics journal. 2001, 1: 167-70. 10.1038/sj.tpj.6500035.

  15. Hewett M, Oliver D, Rubin D, Easton K, Stuart J, Altman R, Klein T: PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res. 2002, 30: 163-165. 10.1093/nar/30.1.163.

  16. Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ: The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 2011, 39: D1067-D1072. 10.1093/nar/gkq813.

  17. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: A side effect resource to capture phenotypic effects of drugs. Molecular systems biology. 2010, 6: 343.

  18. Swanson DR: Somatomedin C and arginine: implicit connections between mutually isolated literatures. Perspectives in biology and medicine. 1990, 33: 157-86.

  19. Frijters R, Van Vugt M, Smeets R, Van Schaik R, De Vlieg J, Alkema W: Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases. PLoS Computational Biology. 2010, 6: e1000943-10.1371/journal.pcbi.1000943.

  20. Andronis C, Sharma A, Virvilis V, Deftereos S, Persidis A: Literature mining, ontologies and information visualization for drug repurposing. Briefings in bioinformatics. 2011, 12: 357-68. 10.1093/bib/bbr005.

  21. Plake C, Schroeder M: Computational polypharmacology with text mining and ontologies. Current pharmaceutical biotechnology. 2011, 12: 449-57. 10.2174/138920111794480624.

  22. Chapter 11 Relationships in Medical Subject Headings. []

  23. Cheung WA, Ouellette BF, Wasserman WW: Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs). BMC bioinformatics. 2012, 13: 249-10.1186/1471-2105-13-249.

  24. Cheung WA, Ouellette BF, Wasserman WW: Inferring novel gene-disease associations using medical subject heading over-representation profiles. Genome medicine. 2012, 4: 75-10.1186/gm376.

  25. Fedorov O, Müller S, Knapp S: The (un)targeted cancer kinome. Nature chemical biology. 2010, 6: 166-169.

  26. Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH: Too many roads not taken. Nature. 2011, 470: 163-5. 10.1038/470163a.

  27. Gottlieb A, Stein GY, Ruppin E, Sharan R: PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology. 2011, 7: 496.

  28. Gillis J, Pavlidis P: The Impact of Multifunctional Genes on "Guilt by Association" Analysis. PLoS ONE. 2011, 6: e17258-10.1371/journal.pone.0017258.

  29. Yang L, Xu L, He L: A CitationRank algorithm inheriting Google technology designed to highlight genes responsible for serious adverse drug reaction. Bioinformatics. 2009, 25: 2244-2250. 10.1093/bioinformatics/btp369.


Author information


Corresponding authors

Correspondence to BF Francis Ouellette or Wyeth W Wasserman.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed to the design of the method and the analysis and interpretation of the data. WAC implemented and carried out the study. All authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: Comparison of drug-disease candidates for five disorders. The top 20 drug candidates for gout, cardiac arrhythmia, lupus, jaundice and asthma are provided. We contrast the corrected and uncorrected drug candidate lists for each disorder. The uncorrected list is heavily biased to general compounds such as Monoclonal Antibodies, Norepinephrine and Iron, whereas the corrected drug candidates focus on drugs that are much more specific to the disorder. This file is in Excel format. (XLS 74 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cheung, W.A., Ouellette, B.F. & Wasserman, W.W. Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity. BMC Med Genomics 6 (Suppl 2), S3 (2013).
