We implemented a workflow to identify potential therapeutics for CAD by integrating the two following data sets (Figure 1):
-
1.
A predicted candidate gene data set for CAD, obtained by reanalysing the WTCCC-GWAS data [9], with Gentrepid using five bioinformatic modules: CMP, PPI, CPS, CRT and MIR [15, 16];
-
2.
A drug-gene target data set retrieved from three publically available drug databases namely TTD, DrugBank and PharmGKB [17, 19, 21].
Candidate gene data set
In our previous work, we predicted a total of 647 candidate genes for CAD by careful reanalysis of the WTCCC GWAS data on CAD [9] using the Gentrepid candidate gene prediction system [16].
The WTCCC study used a highly stringent significance threshold (p ≤ 5 × 10-7) to correct for multiple testing in GWAS analysis [9]. While robust, this approach resulted in association of only one genetic locus with CAD [14]. To address the high false negative rate of GWAS studies, we previously proposed a bioinformatics strategy to sift through genes near the implicated loci of a large number of SNPs of slightly lower significance thresholds. We considered four thresholds of decreasing stringency: a highly significant set (HS - p ≤ 5 × 10-7), a Medium highly significant set (MHS - p ≤ 10-5), a Medium weakly significant set (MWS - p ≤ 10-4), and a weakly significant set (WS - p ≤ 10-3). In total, we constructed a series of four SNP sub sets comprising a total of 757 SNPs for CAD [14].
An additional problem arises when mapping these SNPs to nearby genes. Although the causal SNPs are likely to be in linkage disequilibrium with the implicated SNPs, genomic architecture is still not well understood. The implicated SNP may be in a control region distal to the transcribed region of the gene. Six different search spaces - three of fixed-widths and three proximity-based, were created around each SNP-based genetic locus, for analysis by the Gentrepid candidate gene prediction system [14]. Thus, we utilized six gene selection methods around each SNP to construct the gene search spaces, using four SNP sets acquired by incrementally lowering the significance threshold of the data, resulting in a total of 24 search spaces [9, 14].
For each of these 24 search spaces, we used the following five bioinformatic modules to predict and prioritize candidate genes for CAD using the Gentrepid candidate gene prediction tool [16]: Two systems biology approaches - a) Common Pathway Scanning (CPS) and, b) Protein-Protein interaction module (PPI); one domain homology module - c) Common Module Profiling (CMP) and; two nucleic acid based regulatory modules - d) Common Regulatory Targets (CRT) and, e) the Micro-RNA regulatory module (MIR) [16].
The two systems biology modules, CPS and PPI, are based on the principle that common phenotypes are associated with proteins that participate in the same protein complex or biochemical pathway [22]. The domain-homology module, CMP, is a sequence analysis approach based on the assumption that candidate genes are similar in function to disease genes already determined for the phenotype [23]. We have described these methods in detail in our previously published work [14, 15, 20].
The two nucleic acid-based regulatory modules: CRT and MIR are based on the assumption that disruption of regulatory elements controlling gene expression can cause diseases [24]. CRT searches for genes in the susceptibility genetic loci that bind with common transcription factors. Regulatory information for genes of the search space was retrieved from the Open REGulatory ANNOtation (oRegAnno) database, a publically available database of curated known regulatory elements from the scientific literature [25]. The MIR module is based on the assumption that dysfunction of micro-RNAs (miRNAs) plays a key role in the heart, central nervous system, and immune system-related diseases [26]. MIR searches the genetic susceptibility loci for genes which are common miRNA targets and present in regulatory hubs [16]. MicroRNA information for this module was extracted from the mirBase database, an online repository for microRNA sequences and annotations [27].
Drug-gene target data set
We used a drug-gene target data set compiled from three online drug databases: DrugBank [17], PharmGKB [21] and TTD [19], described in detail in our previously published work [20].
DrugBank is a chemical and clinical drug database [18], combining detailed drug data and disease information with comprehensive drug-target associations [17]. Previously, we retrieved 6,711 drug entries active against 3,410 unique drug targets for several species from DrugBank [20]. We used the G-profiler conversion tool to translate human drug target information to official HUGO gene symbols [20, 28], resulting in a dataset comprising 3,910 drugs associated with 2,022 human drug targets [20].
The Pharmacogenomics Knowledgebase (PharmGKB) is a clinical drug database, combining information about drugs, diseases and targeted genes [21]. This database describes around 3,097 drugs and 26,961 human genes, but not all of these genes are associated with drugs. We obtained a licensed PharmGKB annotation dataset, describing a total of 382 drugs associated with 566 human drug targets [20].
The Therapeutic Target Database (TTD) is a chemical drug database, integrating drug data with therapeutic targets [19]. TTD contains 17,816 drugs (approved, clinical and experimental) associated with 2,025 human and non-human drug targets. We replaced the UniProt accession numbers with official HUGO gene symbols using the G-profiler conversion tool [28], extracting 2,960 drugs for 544 unique human drug targets [20].
Pooling the data from DrugBank, TTD and PharmGKB, we obtained a total of 2,494 unique gene targets from all the databases, comprising ~ 8% of the entire human genome [20]. A comparison of the extracted drug-target datasets from the three databases revealed that only 4% of human drug targets were common to all three drug databases [20]. We retrieved the maximum number of unique targets from DrugBank (1,495), followed by TTD (129), and PharmGKB (326) [20]. In pairwise comparisons, DrugBank and TTD share the maximal number of drug targets (398), while TTD and PharmGKB share the fewest (111) [20].
Of the 9,991 unique drugs contained in these three drug databases [20], 50% of them are found only in DrugBank, while the unique contributions from TTD and PharmGKB were 15-18% [20]. In pairwise comparisons, TTD and PharmGKB share 15-19% of their retrieved drugs with DrugBank [20]. DrugBank and PharmGKB share the maximal number of drugs (1620), while TTD and PharmGKB share the fewest (1352) [20]. In total, we retrieved a total of 7,252 unique drugs associated with 2,494 human drug targets from all three drug databases [20].
Identification of novel therapeutics and therapeutic targets
We mapped predicted therapeutic targets from the predicted candidate genes with the extracted drug-gene target association files. A total of 647 predicted candidate genes for CAD were mapped separately with the three drug-target association files, and results were retrieved.
Within this set, we distinguished known and novel therapeutic targets and therapeutics for CAD. If a drug associated with a therapeutic target is not registered as a therapy for CAD, it is designated as a novel therapeutic directed towards a predicted candidate gene target for CAD. Novel therapeutics may be suitable for repositioning towards treatment of CAD.
Comparison with previous studies
We compared therapeutic targets obtained in our previous study with this study to identify therapeutic targets for CAD using the WTCCC-GWAS data. In our previous study, we utilised the CMP and CPS bioinformatic modules to predict candidate genes and therapeutic targets [14, 20]. In this study, we integrated the results from five bioinformatic modules: CMP, PPI, CPS, CRT and MIR. Thus, we compared therapeutic targets obtained from two different bioinformatic studies conducted to reanalyse the same WTCCC-GWAS data [9].
The CARDIoGRAMplusC4D consortium meta-analysis compared 63,746 CAD cases with 130,681 controls identifying 15 genetic loci and 20 candidate genes for CAD [29]. Our previous reanalysis of this data with Gentrepid replicated 11 of the 20 candidate genes and made three novel gene predictions (LRPPRC, GUCY1B3, MAP3K4) [16]. In this study, we identified potential therapeutic targets after mapping the 20 candidate genes obtained from the CARDIoGRAMplusC4D study data and the three novel genes predicted by Gentrepid with the extracted drug-gene target dataset. We also compared the identified therapeutic targets from the CARDIoGRAMplusC4D study with the Gentrepid-predicted therapeutic targets.
Validation of predicted therapeutic targets
We validated the predicted therapeutic targets using two benchmarks as described in our previously published work [20]. The first benchmark tested the ability of Gentrepid to replicate known therapeutics for CAD. However, this benchmark does not give any idea about the validity of the novel predictions for CAD. Therefore, we performed a second benchmark to assess the validity of the candidate gene predictions using text mining of the existing Pubmed literature for CAD.
In the first benchmark, we classified genes present in the six search spaces as "CAD candidates" or "CAD non-candidates". We considered genes which are already known drug targets for CAD as "true positives". Targets which were not predicted by Gentrepid, but present in the search space and targeted by currently registered therapeutics for the CAD, were designated "false negatives". Genes, which were neither predicted for CAD nor targetable by CAD drugs, were designated as "true negatives"; and predicted novel therapeutic targets were selected as "false positives". Finally, we plotted a Receiver Operation Characteristic (ROC) curve considering six thresholds based on the number of targets present in the six search spaces constructed (see candidate gene dataset section in Methods for details). Non-linear regression analysis, was performed to fit the ROC curves (see Validation of predicted therapeutic targets in Results and Discussion).
In the second benchmark, we extracted Pubmed IDs of literature related to CAD from Pubmed in Feb. 2014. We mapped the retrieved Pubmed IDs to the gene citation data downloaded from Entrez Gene (ftp//http://ftp.ncbi.nih.gov/gene/) to calculate the number of article citations for each target, using both the gene name and the phenotype name (CAD). Further, a ROC curve was created considering four thresholds of article citations (one, five, ten and fifteen). Finally, non-linear regression analysis was performed to fit the ROC curve (see Validation of predicted therapeutic targets in Results and Discussion).