Drug-induced gene expression data
CMap (build 02) is a collection of 6,100 gene expression profiles for 13,469 human genes from four cell lines (MCF7, HL60, PC3, and SKMEL5) treated with 1,309 bioactive small compounds. The CEL files of CMap were downloaded from the database website [21]. The CMap annotation file (cmap_instances_02.txt) indicates the distinct instance ID for each treatment-control sample pair, together with the experimental conditions (i.e., concentration, cell line, and batch). A filtering process was applied to this dataset as follows. First, MCF7 instances were selected because MCF7 is the most frequently used of the four cell lines. Next, when the same compound was assigned to multiple instances, the instance with the highest treatment concentration was selected. If instances with the same experimental conditions were present in different batches, the instance with the smaller batch ID was selected. Following this filtering process, 1,294 instances (i.e., compounds) were finally selected. MAS5 normalization was applied to all selected samples [22]. On the GeneChip array (HG-U133A), multiple probes can be assigned to one gene; a unique representative probe was selected for each gene as the probe with the highest average rank in the rank-ordered matrix of expression changes between treatments and controls. The fold change score of each treatment against its corresponding control was calculated and converted to a base-2 logarithm. Finally, a 1,294 × 13,469 gene expression matrix (1,294 compounds in rows and 13,469 genes in columns) was constructed and denoted by X.
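As an illustration, the following is a minimal R sketch of the matrix construction step, assuming `treat` and `ctrl` are MAS5-normalized probe-by-instance intensity matrices and `probe2gene` maps each probe to a gene; all object names are hypothetical and the exact implementation may differ.

```r
# Hypothetical inputs: 'treat' and 'ctrl' are probe x instance intensity matrices
# (MAS5-normalized), and 'probe2gene' maps each probe (row) to a gene symbol.
logfc <- log2(treat / ctrl)                  # base-2 log fold change per probe

# Rank the expression changes within each instance, average the ranks per probe,
# and keep the probe with the highest average rank as the representative per gene.
avg_rank  <- rowMeans(apply(logfc, 2, rank))
rep_probe <- tapply(seq_along(avg_rank), probe2gene,
                    function(idx) idx[which.max(avg_rank[idx])])

X <- t(logfc[unlist(rep_probe), ])           # compounds (instances) in rows, genes in columns
```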
The gene expression similarities of compounds and of proteins (hereafter referred to as “compound expression similarities” and “protein expression similarities,” respectively) were evaluated by using Pearson’s correlation coefficients on the row and column profiles of the gene expression matrix, respectively. Because the expression profile of each compound, and likewise of each protein, is a real-valued feature vector, Pearson’s correlation coefficient is an appropriate measure for both; it is also a standard similarity measure in transcriptomic data analysis.
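A corresponding sketch, assuming `X` is the expression matrix above (compounds in rows, genes in columns):

```r
# cor() computes Pearson correlations between columns, so the compound-compound
# similarities are obtained from the transposed matrix.
S_compound_expr <- cor(t(X))   # compound expression similarities (1,294 x 1,294)
S_protein_expr  <- cor(X)      # protein (gene) expression similarities (13,469 x 13,469)
```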
Compound chemical structures and protein amino acid sequences
The compound structures used in CMap were obtained from ChemBank [23, 24]. The 2D frequency chemical descriptor of each compound was calculated by using the DragonX program [25]; the descriptor comprises counts of 780 chemical substructures. The chemical structure similarity scores of compound pairs were evaluated with generalized Jaccard coefficients. Protein amino acid sequences were downloaded from UniProt [26, 27]. The sequence similarity scores of protein pairs were calculated by using the Smith-Waterman algorithm in the EMBOSS water program (parameters: gap open penalty = 10.0, gap extension penalty = 0.5), and they were normalized via a cosine operation such that the maximal value was 1 and the minimal value was 0 [28, 29].
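A hedged sketch of both similarity computations, assuming `D` is the compound-by-descriptor matrix of DragonX substructure counts and `SW` is a matrix of raw Smith-Waterman scores; the exact normalization used in the cited work may differ from the cosine-style scaling shown here.

```r
# Generalized Jaccard (Tanimoto) coefficient for real-valued descriptor vectors.
gen_jaccard <- function(a, b) sum(pmin(a, b)) / sum(pmax(a, b))
n <- nrow(D)
S_chem <- outer(seq_len(n), seq_len(n),
                Vectorize(function(i, j) gen_jaccard(D[i, ], D[j, ])))

# One common cosine-style normalization of raw Smith-Waterman scores so that
# self-similarities equal 1 (an assumption about the original scaling).
S_seq <- SW / sqrt(outer(diag(SW), diag(SW)))
```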
Compound–target interactions
The information about compound–target interactions was obtained from DrugBank [30, 31] and ChEMBL [32, 33]. For the target information in ChEMBL, we used only target proteins labeled “active” in the comments section. In total, there were 4,870 unique compound–protein interactions in the merged dataset, involving 756 compounds and 584 proteins. All remaining compound–protein pairs (i.e., those not in the positive set) were used as negative examples; thus, there were 4,870 positive examples and 436,634 negative examples. We used this dataset as the gold standard to evaluate the performance of compound target prediction. To investigate the validity of new predictions, we used experimental data from ChEMBL on compound–target interactions (e.g., binding affinity with IC50 < 30 μM), as well as independent information about compound–target interactions stored in KEGG [34], Matador [35], and PDSP Ki [36].
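The gold standard can be represented as a binary interaction matrix; a minimal sketch, assuming `interactions` is a two-column data frame of known pairs and `compounds`/`proteins` are the 756 compound and 584 protein identifiers (all names are hypothetical):

```r
# Binary label matrix Y: 1 for known (positive) pairs, 0 for all remaining (negative) pairs.
Y <- matrix(0, nrow = length(compounds), ncol = length(proteins),
            dimnames = list(compounds, proteins))
Y[cbind(interactions$compound, interactions$protein)] <- 1
sum(Y == 1)   # 4,870 positive examples
sum(Y == 0)   # 436,634 negative examples
```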
Direct method for compound target prediction
The most straightforward method for compound target prediction involves directly using the fold change values of the associated compound–protein pairs in the gene expression matrix, X. The rationale for this method is that genes affected by a drug tend to show large changes in the observed gene expression. Thus, high-scoring compound–protein pairs are predicted to be candidates for interacting compound–protein pairs. In this study, this method is referred to as the “direct method.”
We tested three possible scores for compound–protein pairs: i) X, ii) –X, and iii) abs(X). For the X score, the original fold change values of X are used, and higher scores correspond to up-regulated genes. For the –X score, the sign-inverted values of X are used, and higher scores correspond to down-regulated genes. For the abs(X) score, the absolute values of X are used, and higher scores correspond to up- or down-regulated genes. In each case, compound–protein pairs with high scores are predicted to be candidates for interacting pairs.
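In R, the three direct-method scores are simply transformations of the matrix X:

```r
score_up   <- X        # i)   X: high scores correspond to up-regulated genes
score_down <- -X       # ii) -X: high scores correspond to down-regulated genes
score_abs  <- abs(X)   # iii) abs(X): high scores correspond to up- or down-regulated genes
```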
Classification method for compound target prediction
Another method for compound target prediction involves comparing the gene expression profiles of query compounds with those of other compounds of known target proteins. For this method, the rationale is that drugs with similar gene expression patterns are likely to share common target proteins.
We attempted to solve the problem of compound target prediction by supervised classification, which consists of the following two steps. First, a classifier for discriminating interacting compound–protein pairs from the other pairs is learned based on partially known compound–protein interactions. Second, the learned classifier is applied to new compound–protein pairs, and high-scoring compound–protein pairs are predicted to be candidates for interacting pairs. In this study, this method is referred to as the “classification method.” Related classification algorithms or their variants have also been used in the context of chemogenomics in several previous studies [2–7].
Given a training set of $n_x \times n_y$ compound–protein pairs $(\mathrm{x}_i, \mathrm{y}_j)$ ($i = 1, \cdots, n_x$; $j = 1, \cdots, n_y$), the function $f(\mathrm{x}', \mathrm{y}')$ should be estimated to predict whether a compound $\mathrm{x}'$ interacts with a protein $\mathrm{y}'$. Among the pair classification algorithms, we used pairwise kernel regression (PKR) [9] because of its computational efficiency. We define a similarity function $k_{compound}$ for compounds and a similarity function $k_{protein}$ for proteins; thus, the statistical model of PKR for a given compound–protein pair $(\mathrm{x}', \mathrm{y}')$ is defined as follows:
$$ f(\mathrm{x}',\mathrm{y}') = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \beta_{ij}\, k_{compound}(\mathrm{x}_i,\mathrm{x}')\, k_{protein}(\mathrm{y}_j,\mathrm{y}') $$
where $\beta_{ij}$ is a weight parameter to be optimized based on the training set. In practice, high-scoring compound–protein pairs are predicted as the interacting pairs. The inputs of the PKR model are similarity scores for both compound pairs and protein pairs; therefore, the performance depends heavily on the compound and protein similarity measures.
In the transcriptomic approach, we used compound expression similarity (Pearson’s correlation) for “compound similarity” and protein expression similarity (Pearson’s correlation) for “protein similarity”. In the chemogenomic approach, we used chemical structure similarity (generalized Jaccard coefficient) for “compound similarity” and protein sequence similarity (normalized Smith-Waterman score) for “protein similarity”. In the integrative approach, we used the average of compound expression similarity and compound structure similarity for “compound similarity” and the average of protein expression similarity and protein sequence similarity for “protein similarity”.
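In code, the three settings differ only in which similarity matrices are passed to the PKR model (matrix names follow the sketches above):

```r
Kc_trans <- S_compound_expr                    # transcriptomic approach
Kp_trans <- S_protein_expr
Kc_chem  <- S_chem                             # chemogenomic approach
Kp_chem  <- S_seq
Kc_integ <- (S_compound_expr + S_chem) / 2     # integrative approach (simple averages)
Kp_integ <- (S_protein_expr + S_seq) / 2
```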
Predictive performance measures
We evaluated the predictive performance by using the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. The ROC curve plots the true positive rate against the false positive rate, and the PR curve plots precision (positive predictive value) against recall (sensitivity), at various thresholds on the prediction score. We computed the area under the ROC curve (i.e., the AUC score), which is 1 for a perfect inference and 0.5 for a random inference, and the area under the PR curve (i.e., the AUPR score), which is 1 for a perfect inference and equals the ratio of positive examples to all examples in the gold standard data for a random inference. We used the ROCR package in the R language. In the program, many threshold values (e.g., 1, 0.9, 0.8, 0.7, …, −0.8, −0.9, −1) were applied to the prediction scores, the true positive rate was plotted against the false positive rate at each threshold, and the ROC curve was drawn by connecting the resulting points. The AUC and AUPR scores were computed for each compound, and the average scores were used to summarize the results.
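A sketch of the per-compound evaluation with ROCR, assuming `scores` and `labels` hold the prediction scores and the 0/1 gold-standard labels of one compound against all proteins; the AUPR is approximated here by trapezoidal integration of the PR curve.

```r
library(ROCR)

pred <- prediction(scores, labels)
roc  <- performance(pred, "tpr", "fpr")            # ROC curve (TPR vs. FPR)
pr   <- performance(pred, "prec", "rec")           # PR curve (precision vs. recall)
auc  <- performance(pred, "auc")@y.values[[1]]     # area under the ROC curve

# Trapezoidal approximation of the area under the PR curve.
rec  <- pr@x.values[[1]]
prec <- pr@y.values[[1]]
ok   <- !is.na(prec)
aupr <- sum(diff(rec[ok]) * (head(prec[ok], -1) + tail(prec[ok], -1)) / 2)
```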
We also evaluated the performance of top-ranked predictions by determining whether known target proteins appeared among the top 10 or top 50 high-scoring predictions. The accuracy of top-ranked predictions is important in practice because experimental validation will preferentially be conducted on high-confidence predictions rather than low-confidence predictions. We computed the ratio of the number of compounds with at least one known target protein correctly predicted in the top 10 or top 50 to the number of all compounds with known targets in the gold standard data. We denote these ratio scores as the “Top10 ratio” and “Top50 ratio,” respectively. AUC/AUPR represents the global accuracy over all compounds in each benchmark dataset, whereas Top10/Top50 represents the local accuracy of the top-ranked predictions; thus, the prediction accuracy was evaluated from different viewpoints.
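A sketch of the Top10/Top50 ratio, assuming `score_mat` is a compound-by-protein matrix of prediction scores and `Y` is the gold-standard label matrix:

```r
# A compound counts as correctly predicted if at least one known target protein
# appears among its top-k highest-scoring predictions.
topk_ratio <- function(score_mat, Y, k = 10) {
  hit <- vapply(seq_len(nrow(score_mat)), function(i) {
    topk <- order(score_mat[i, ], decreasing = TRUE)[1:k]
    any(Y[i, topk] == 1)
  }, logical(1))
  sum(hit) / nrow(score_mat)
}

top10_ratio <- topk_ratio(score_mat, Y, k = 10)
top50_ratio <- topk_ratio(score_mat, Y, k = 50)
```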
Experimental protocol for the classification method
The classification method requires supervised learning with pre-existing knowledge about known compound–protein interactions. To evaluate the predictive performance of the classification method, we performed a cross-validation experiment using the gold standard compound–protein interaction data. This experiment was designed to simulate the practical situation in which researchers are required to predict the potential target proteins of new drug candidate compounds.
We performed the following 5-fold cross-validation. (i) We randomly split compounds into five compound subsets. (ii) We took each compound subset as test compounds, and constructed a test set of compound–protein pairs. (iii) We took the remaining compound subsets as training compounds, and constructed a training set of compound–protein pairs. (iv) We trained a classifier using the training set, and applied it to the test set. (v) We computed the prediction scores for compound–protein pairs in the test set, and evaluated the prediction accuracy over the 5-fold process. It should be noted that only compounds were split into the training and test sets, and the list of proteins was common across both sets.
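A minimal sketch of the compound-wise split, reusing the `pkr_fit` sketch above; `Kc`, `Kp`, and `Y` are the assumed compound similarity, protein similarity, and label matrices restricted to the gold-standard compounds and proteins.

```r
set.seed(1)
fold <- sample(rep(1:5, length.out = nrow(Y)))       # assign each compound to one of 5 folds

for (f in 1:5) {
  test_idx  <- which(fold == f)                      # test compounds
  train_idx <- which(fold != f)                      # training compounds

  beta   <- pkr_fit(Kc[train_idx, train_idx], Kp, Y[train_idx, ])
  scores <- Kc[test_idx, train_idx] %*% beta %*% Kp  # prediction scores for test pairs
  # ...accumulate 'scores' and evaluate against Y[test_idx, ] after the 5-fold loop
}
```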
Construction of several benchmark datasets containing compounds with diverse chemical structures
The gold standard dataset contained many drugs that are structurally almost identical because some of them are derivatives of the same lead compound. If such nearly identical drugs were split between the training set and the test set, the prediction task in the cross-validation experiment would be trivial. To avoid overestimation of the prediction accuracy, we therefore filtered similar drugs based on their chemical structures and used only drugs that were structurally diverse to some extent.
The process proceeded as follows. First, we calculated the generalized Jaccard coefficient (the Tanimoto coefficient for real-valued feature vectors) of the chemical descriptors for all compound pairs. Second, we identified compounds that shared high Jaccard coefficients and selected one representative compound based on a threshold. Third, we constructed a set of representative compounds that shared only low Jaccard coefficients. Finally, we prepared seven benchmark datasets consisting of representative drugs by gradually varying the chemical similarity threshold on the dendrogram (from 0.4 to 1.0 in increments of 0.1). When the threshold is below 0.4, the number of drug clusters becomes very small; thus, it was not possible to test thresholds of 0.1–0.3 in the 5-fold cross-validation.
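A sketch of this filtering step via hierarchical clustering, assuming `S_chem` is the compound-by-compound generalized Jaccard similarity matrix; the average-linkage clustering is an assumption, not necessarily the procedure used in the original work.

```r
d  <- as.dist(1 - S_chem)                    # convert similarities to distances
hc <- hclust(d, method = "average")          # build the compound dendrogram

for (thr in seq(0.4, 1.0, by = 0.1)) {
  cl <- cutree(hc, h = 1 - thr)              # compounds more similar than 'thr' fall in one cluster
  rep_idx <- vapply(split(seq_along(cl), cl), function(idx) idx[1], integer(1))
  # 'rep_idx' indexes one representative compound per cluster for this benchmark dataset
}
```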