Integrating PPI datasets with the PPI data from biomedical literature for protein complex detection

Background: Protein complexes are important for understanding principles of cellular organization and function. High-throughput experimental techniques have produced a large amount of protein-protein interactions (PPIs), making it possible to predict protein complexes from protein-protein interaction networks. On the other hand, the rapidly growing biomedical literature provides a significantly large and readily available source of interaction data, which can be integrated into the protein network for better complex detection performance.

Methods: We present an approach of integrating PPI datasets with the PPI data from biomedical literature for protein complex detection. The approach applies a sophisticated natural language processing system, PPIExtractor, to extract PPI data from biomedical literature. These data are then integrated into the PPI datasets for complex detection.

Results: The experimental results of the state-of-the-art complex detection method, ClusterONE, on five yeast PPI datasets verify our method's effectiveness: compared with the original PPI datasets, the average improvements of 3.976 and 5.416 percentage units in the maximum matching ratio (MMR) are achieved on the new networks using the MIPS and SGD gold standards, respectively. In addition, our approach also proves to be effective for three other complex detection algorithms proposed in recent years, i.e. CMC, COACH and RRW.

Conclusions: The rapidly growing biomedical literature provides a significantly large, readily available and relatively accurate source of interaction data, which can be integrated into the protein network for better protein complex detection performance.


Background
Protein complexes are molecular aggregations of proteins assembled by multiple protein-protein interactions. Many proteins are functional only after they are assembled into a protein complex and interact with other proteins in this complex. These protein complexes can help us to understand the principles of cellular organization and function. High-throughput experimental techniques have produced a large amount of protein interactions, which makes it possible to uncover protein complexes from protein interaction networks. A protein interaction network can be modeled as an undirected graph, where vertices represent proteins and edges represent interactions between proteins.
Protein complexes are groups of proteins that interact with one another, so they are usually dense sub-graphs in PPI networks. Various algorithms based on graph theory have been applied to identify protein complexes and functional modules from protein interaction networks, including CFinder [1], CMC [2], COACH [3], MCL [4], RRW [5] and ClusterONE [6].
At the same time, a number of databases, such as Gavin [7], Krogan [8], Collins [9], DIP [10], and BioGRID [11], have been created to store protein interaction information in structured and standard formats. These datasets were usually derived with different experimental techniques: the Collins, Krogan and Gavin datasets include the results of TAP tagging experiments only; the DIP dataset includes the results of Y2H experiments; the BioGRID dataset contains a mixture of TAP tagging, Y2H and low-throughput experimental results. However, even for model species, only a fraction of true physical interactions are known [12,13] and experimental verification of all remaining potential interactions is unlikely in the near future [14]. On the other hand, the rapidly growing biomedical literature provides a significantly large and readily available supplemental source of PPI data for complex detection methods. Moreover, since these data are contributed by biologists and are therefore relatively accurate, integrating them into the existing PPI datasets is a promising way to improve complex detection performance.
Our work aims to quantify the contribution of PPI data from biomedical literature as a supplement to the existing PPI datasets. In this paper, we present an approach of integrating PPI datasets with the PPI data from biomedical literature for protein complex detection. The approach applies a sophisticated natural language processing system, PPIExtractor [15], to extract new interactions from biomedical literature. These data are then integrated into the PPI datasets for protein complex detection. The experimental results on several PPI datasets show that, in most cases, the performance of several state-of-the-art protein complex detection methods is improved by integrating the PPI datasets with the PPI data extracted from literature.

Extracting PPIs with PPIExtractor
In this work, we apply the PPIExtractor system to extract PPI data from biomedical literature, which are then integrated into the protein network for protein complex detection.
Among the popular machine learning approaches to extracting PPIs from biomedical literature, kernel-based methods including tree kernels [16], shortest path kernels [17], and graph kernels [18] have been proposed for PPI extraction. Kernel-based methods retain the original representation of objects and use the objects in algorithms only via computing a kernel function between a pair of objects. However, each kernel uses only a portion of the structures to calculate a useful similarity, so it cannot capture other important information that other kernels may capture.
In previous work, we presented PPIExtractor to automatically extract protein-protein interactions from biomedical literature. PPIExtractor is a multiple kernel learning based system which combines the feature-based, convolution tree and graph kernels to extract PPIs. The combined kernel can reduce the risk of missing important features, yielding new useful similarity measures. More specifically, using a weighted linear combination of the individual kernels, instead of assigning the same weight to each kernel, has been shown experimentally to improve performance. Experimental evaluations show that PPIExtractor can achieve state-of-the-art performance on a DIP subset with respect to comparable evaluations. More complete details are presented in [15].
PPIExtractor contains four modules: (i) Named Entity Recognition (NER) module which aims to identify the protein names in the biomedical literature; (ii) Normalization module which determines the unique identifier of proteins identified in NER module; (iii) PPI extraction module which extracts the PPI information in the biomedical literature and (iv) PPI visualization module which displays the extracted PPI information in the form of a graph. Figure 1 shows the architecture of PPIExtractor.
The biomedical literature data we used consist of 127,217 PubMed abstracts downloaded from the PubMed website (http://www.ncbi.nlm.nih.gov/pubmed) with the query string "((Saccharomyces cerevisiae) OR yeast) AND protein". From these abstracts, 126,165 protein interactions were extracted by the PPIExtractor system.
Most of the protein names in the PPI databases are systematic names for nuclear-encoded ORFs, which begin with the letter 'Y' (for 'Yeast'), while those in PubMed abstracts are not. Therefore, we built a yeast protein alias name list with about 6,000 entries from the UniProt website (http://www.uniprot.org/uniprot/?query=yeast&sort=score). The list is used to convert the protein names in PubMed abstracts to systematic names for nuclear-encoded ORFs. In our method, a PPI can be added into a PPI dataset only if the two proteins in the PPI already exist in the PPI dataset.
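The name conversion described above can be illustrated with a minimal Python sketch. The function name `normalize_pair`, the case-insensitive lookup and the three-entry alias table are assumptions made for illustration only; the actual list has about 6,000 entries and its format is not specified here.

```python
def normalize_pair(pair, alias_to_systematic):
    """Map both protein names of an extracted PPI to systematic ORF names.

    Returns None when either name is missing from the alias table.
    Names are upper-cased before lookup (an assumption of this sketch).
    """
    a, b = (alias_to_systematic.get(name.upper()) for name in pair)
    if a is None or b is None:
        return None
    return (a, b)


# Hypothetical excerpt of an alias table, for illustration only.
alias_to_systematic = {
    "CDC28": "YBR160W",
    "CLN2": "YPL256C",
    "YBR160W": "YBR160W",  # systematic names map to themselves
}

normalized = normalize_pair(("Cdc28", "Cln2"), alias_to_systematic)
unknown = normalize_pair(("Cdc28", "NotInList"), alias_to_systematic)
```

A pair with an unresolvable name is dropped here rather than guessed, which matches the requirement that both proteins must be identifiable in the target dataset.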

Yeast PPI datasets
As in [6], five different yeast PPI datasets were used in our experiments to verify the effectiveness of our method, including three high-throughput experimental datasets (Gavin, Krogan-core and Krogan-extended), a computationally derived network that integrates the results of these studies (Collins), and a compendium of all known yeast protein-protein interactions (BioGRID). The Gavin data set was obtained by considering all PPIs with a socio-affinity index larger than five, as proposed by the original authors. The Krogan data set was used in two variants: the core data set and the extended data set. The core data set contained only highly reliable interactions, with probability > 0.273. The extended data set contained more interactions with lower reliability, with probability > 0.101. For the Collins data set, the top 9,074 interactions according to their purification enrichment score were retained, as suggested in the original paper. The BioGRID data set was downloaded from version 3.1.77 and contained all physical interactions that involve yeast proteins only. The details of the interaction datasets are shown in Table 1. Self-interactions and isolated proteins were filtered from all the datasets.
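As a rough illustration of this preprocessing, the following Python sketch applies a score cut-off and removes self-interactions. The function names, the edge-list representation and the toy scores are assumptions; only the two Krogan cut-offs (0.273 and 0.101) come from the text.

```python
def filter_network(scored_edges, min_score):
    """Keep interactions scoring above min_score and drop self-interactions.

    scored_edges: iterable of (protein_a, protein_b, score) triples.
    Returns a dict mapping an unordered protein pair to its score.
    """
    kept = {}
    for a, b, score in scored_edges:
        if a != b and score > min_score:
            kept[frozenset((a, b))] = score
    return kept


def proteins_of(network):
    """Proteins that still take part in at least one interaction."""
    return set().union(*network) if network else set()


# Toy scored interactions (hypothetical values).
raw = [("P1", "P2", 0.90), ("P2", "P3", 0.20), ("P3", "P3", 0.95), ("P4", "P5", 0.05)]

core = filter_network(raw, 0.273)      # Krogan-core style cut-off
extended = filter_network(raw, 0.101)  # Krogan-extended style cut-off
```

Because the network is stored as an edge list, proteins left without any interaction (P4 and P5 in the toy data) disappear without a separate pass.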

Integration of the extracted PPIs into the PPI datasets
Each extracted PPI is assigned a weight by PPIExtractor, which represents the reliability of the PPI. In our method, the PPIs whose weights are equal to or higher than a threshold are integrated into the PPI datasets. Since BioGRID is an unweighted dataset, the weights of these PPIs are discarded. For the weighted datasets (Gavin, Krogan-core, Krogan-extended and Collins), the weights of these PPIs are adjusted proportionately to those in the PPI datasets, which are usually calculated using complicated machine learning approaches that operate on the original noisy experimental datasets to reflect the reliability of each PPI [6]. In addition, a PPI with a weight equal to or higher than the threshold is integrated into a PPI dataset only if both proteins in the PPI already exist in that dataset. As shown in Figure 2, since the BioGRID dataset has the most proteins (5,460), the most PPIs are integrated into it: with the threshold -0.6, 6,025 PPIs are added. The numbers of PPIs added into the PPI datasets with different thresholds are shown in Table 2.
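The integration rule for an unweighted dataset such as BioGRID can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `integrate_ppis`, the edge representation and the toy weights are assumptions. The weight adjustment used for the weighted datasets is not described in enough detail to sketch and is left out.

```python
def integrate_ppis(edges, extracted, threshold):
    """Add literature-derived PPIs to an unweighted edge set.

    edges: set of frozenset({a, b}) from the original dataset.
    extracted: iterable of (a, b, weight) triples from the text-mining step.
    A PPI is added only if weight >= threshold and both proteins already
    occur in the original dataset; its weight is then discarded.
    """
    proteins = set().union(*edges)
    merged = set(edges)
    for a, b, weight in extracted:
        if a == b or weight < threshold:
            continue
        if a in proteins and b in proteins:
            merged.add(frozenset((a, b)))
    return merged


# Toy data (hypothetical proteins and weights).
edges = {frozenset(("A", "B")), frozenset(("B", "C"))}
extracted = [
    ("A", "C", -0.2),  # added: reliable enough, both proteins known
    ("A", "D", 0.5),   # rejected: D is not in the original dataset
    ("A", "B", -0.9),  # below threshold, and the edge exists anyway
]
merged = integrate_ppis(edges, extracted, threshold=-0.6)
```

Restricting additions to proteins already in the dataset keeps the set of vertices fixed, so any change in detection performance is due to the new edges alone.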

Protein complex detection methods
In our experiments, a state-of-the-art complex detection method, ClusterONE [6], was used to evaluate our method's effectiveness on PPI datasets for protein complex detection. ClusterONE is a method for detecting potentially overlapping protein complexes from a protein interaction network. The algorithm uses a greedy growth process to find groups in a protein interaction network. The main algorithm consists of three steps: first, it grows groups with high cohesiveness from selected seed proteins. Second, it merges highly overlapping pairs of locally optimal cohesive groups. Last, the complex candidates that contain fewer than three proteins or whose densities are below a given threshold are discarded. Experimental results show that ClusterONE outperforms the other approaches on both weighted and unweighted PPI networks, matching more complexes with a higher accuracy and providing a better one-to-one mapping with reference complexes in almost all the data sets.
In addition, we also evaluated the effectiveness of our method on three other complex detection algorithms proposed in recent years, i.e. CMC, COACH and RRW. CMC is a clique-based method that uses a protein-protein interaction iteration method to update the network [2]. COACH is based on the core-attachment architecture developed by Gavin et al. [7]: it first selects subgraphs as core structures and then adds attachments to each core to construct a complex. The RRW algorithm derives complexes from the results of repeated restarted random walks on the graph of protein-protein interactions [5]. For each algorithm, the parameters are set as described in [6], where they were optimized to yield the best possible results as measured by the maximum matching ratio on the gold standards.

Results and discussion
Gold standard protein complexes
Like [6], the MIPS catalog of protein complexes [19] (18 May 2006) and the Gene Ontology (GO)-based protein complex annotations from SGD [20] (11 Aug 2010) were used as our gold standards. To avoid selection bias, all MIPS categories containing at least three and at most 100 proteins were considered as protein complexes. MIPS category 550 and all its descendants were excluded, as these categories correspond to unconfirmed protein complexes that were predicted by computational methods.
For SGD, GO annotations are maintained [21] for all yeast proteins. The complexes were derived from proteins annotated by descendant terms of the Gene Ontology term 'protein complex' (GO:0043234). Annotations with modifiers such as 'NOT' or 'colocalizes_with' and annotations supported by 'IEA' evidence code only were ignored. The details of the gold standard protein complex datasets are shown in Table 3.

Evaluation metrics
Like [6], we used three independent quality measures to assess the similarity between a set of predicted complexes and a set of reference complexes. The first measure is the fraction of pairs between predicted and reference complexes with an overlap score ω larger than 0.25. The overlap score between two protein sets A and B is defined as follows:

$$\omega(A, B) = \frac{|A \cap B|^2}{|A|\,|B|}$$

The threshold of 0.25 is chosen because it represents the case when the intersection is at least half of the complex size if the two complexes being compared are equally large.
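The choice of 0.25 can be checked with a short Python sketch. It assumes the overlap score is the squared size of the intersection divided by the product of the two set sizes, which is the definition used in [6]; the function name and the protein labels are illustrative.

```python
def overlap_score(a, b):
    """Overlap score of two protein sets: |A & B|^2 / (|A| * |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared ** 2 / (len(a) * len(b))


# Two equally large complexes sharing exactly half of their proteins.
reference = {"P1", "P2", "P3", "P4"}
predicted = {"P3", "P4", "P5", "P6"}
score = overlap_score(reference, predicted)
```

Here `score` is exactly 0.25, so two equally large complexes must share more than half of their proteins for the score to exceed the threshold.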
The second measure we used is the geometric accuracy as introduced by Brohée and van Helden [22], which is the geometric mean of two other measures, namely the clustering-wise sensitivity (Sn) and the clustering-wise positive predictive value (PPV). Let n be the number of the benchmark complexes and m be the number of the predicted complexes. Construct a confusion matrix T, and let $T_{ij}$ denote the number of proteins that are found both in reference complex i and predicted complex j. Sn and PPV are defined as follows:

$$Sn = \frac{\sum_{i=1}^{n} \max_{j} T_{ij}}{\sum_{i=1}^{n} N_i}, \qquad PPV = \frac{\sum_{j=1}^{m} \max_{i} T_{ij}}{\sum_{j=1}^{m} T_{\cdot j}}$$

Here, $N_i$ is the number of proteins in the benchmark complex i, and $T_{\cdot j}$ is defined as:

$$T_{\cdot j} = \sum_{i=1}^{n} T_{ij}$$

Generally, a high Sn value indicates that the prediction has a good coverage of the proteins in the true complexes, whereas a high PPV value indicates that the predicted complexes are likely to be true complexes. So it is necessary to balance the two measures by introducing the geometric accuracy (Acc), which is simply the geometric mean of the clustering-wise sensitivity and the positive predictive value:

$$Acc = \sqrt{Sn \times PPV}$$

Figure 2 The amounts of the PPIs added into the original PPI datasets.

The third measure we used is the maximum matching ratio (MMR), which was introduced in [6]. This measure is based on a maximal one-to-one mapping between predicted and standard complexes. Let R denote the set of standard complexes and P the set of predicted complexes. An edge connects a standard complex and a predicted complex if their neighborhood affinity score NA is larger than zero. Given n standard complexes and m predicted complexes, let $P_{j(i)}$ be the predicted complex matched to the standard complex $R_i$ in the maximum matching (an unmatched $R_i$ contributes zero). MMR is then defined as follows:

$$MMR = \frac{1}{n} \sum_{i=1}^{n} NA\left(R_i, P_{j(i)}\right)$$

The geometric accuracy measure explicitly penalizes predicted complexes that do not match any of the reference complexes. However, gold standard sets of protein complexes are often incomplete [23]. As a consequence, predicted complexes not matching any known reference complexes may still exhibit high functional similarity or be highly co-localized, and therefore they could still be prospective candidates for further in-depth analysis. In other words, a predicted complex that does not match a reference complex is not necessarily an undesired result, and optimizing for the geometric accuracy measure might prevent us from detecting novel complexes from a PPI dataset. The maximum matching ratio sidesteps this problem by dividing the total weight of the maximum matching by the number of reference complexes. Therefore, in the performance comparison, the MMR is used as the main metric, and the Acc is only used as an auxiliary one.
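The two remaining measures can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the evaluation code used in the experiments: Sn and PPV follow Brohée and van Helden [22], the neighborhood affinity score is assumed to be the overlap score defined for the first measure, and the maximum matching is found by brute force, which is only feasible for a handful of complexes. All function names and the toy complexes are hypothetical.

```python
from itertools import permutations
from math import sqrt


def overlap(a, b):
    """Assumed neighborhood affinity: |A & B|^2 / (|A| * |B|)."""
    shared = len(a & b)
    return shared * shared / (len(a) * len(b)) if a and b else 0.0


def geometric_accuracy(reference, predicted):
    """Acc = sqrt(Sn * PPV) computed from the confusion matrix T."""
    # t[i][j]: proteins shared by reference complex i and predicted complex j
    t = [[len(r & p) for p in predicted] for r in reference]
    columns = list(zip(*t))
    sn = sum(max(row) for row in t) / sum(len(r) for r in reference)
    total = sum(sum(col) for col in columns)
    ppv = sum(max(col) for col in columns) / total if total else 0.0
    return sqrt(sn * ppv)


def max_matching_ratio(reference, predicted):
    """Weight of the best one-to-one matching divided by the number of references."""
    n, m = len(reference), len(predicted)
    if n == 0:
        return 0.0
    w = [[overlap(r, p) for p in predicted] for r in reference]
    if n <= m:
        best = max(sum(w[i][perm[i]] for i in range(n))
                   for perm in permutations(range(m), n))
    else:
        best = max(sum(w[perm[j]][j] for j in range(m))
                   for perm in permutations(range(n), m))
    return best / n


reference = [{"A", "B", "C"}, {"D", "E", "F", "G"}]
predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y", "Z"}]
acc = geometric_accuracy(reference, predicted)
mmr = max_matching_ratio(reference, predicted)
```

In this toy case the third predicted complex matches no reference complex and leaves the MMR unaffected. For realistic numbers of complexes, a polynomial-time assignment solver would replace the brute-force search.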

The performances of ClusterONE on PPI datasets
First, we tested ClusterONE on the Collins, Gavin, Krogan-core, Krogan-extended and BioGRID datasets. Tables 4, 5 and 6 contain the results of Accuracy, MMR and fraction of matched complexes, respectively, when the MIPS dataset was used as the gold standard. Figure 3 depicts the MMR performances of ClusterONE on PPI datasets using the MIPS gold standard, which show that, in most cases, better performance of ClusterONE can be achieved when the PPIs extracted from literature are added into the original PPI datasets. When the PPIs with weights larger than or equal to the threshold -0.6 are added, ClusterONE achieves the highest average MMR improvement on all five PPI datasets: average improvements of 2.938 and 3.976 percentage units in Accuracy and MMR, respectively, over the original datasets are achieved on the new datasets. With lower thresholds (-0.7 to -0.9), the MMR performance begins to decline. The reason is that a lower threshold admits more PPIs of lower reliability, which deteriorates the performance of complex detection algorithms.
Similar results were obtained when the SGD dataset was used as the gold standard, as shown in Figure 4 and Tables 7, 8

The performances of other algorithms on PPI datasets
The performances of three other complex detection algorithms proposed since 2009 (i.e. COACH, CMC and

Figure 3 The MMR performances of ClusterONE on PPI datasets using the MIPS gold standard.

Figure 4 The MMR performances of ClusterONE on PPI datasets using the SGD gold standard.

On the BioGRID dataset, the performances of these algorithms decrease with the threshold -0.6: in terms of MMR, there is an 8.41 percentage unit decrease in the performance of the RRW algorithm using the MIPS gold standard, while there are 11.15 and 4.89 percentage unit decreases in the performance of the CMC and RRW algorithms using the SGD gold standard, respectively. Through the analysis of the results, we found that these algorithms obtain more clusters on BioGRID with the threshold -0.6 than on the original BioGRID. However, many of these clusters are unmatched, i.e. they do not match any complex in the gold standards, which deteriorates the performances of the complex detection algorithms.
The reason behind it is that, in our method, a PPI with a weight equal to or higher than the threshold is integrated into a PPI dataset only if both proteins in the PPI already exist in that dataset. Since the BioGRID dataset includes the most proteins (5,460), the most PPIs are integrated into it, as shown in Figure 2: with the threshold -0.6, 6,025 PPIs are integrated into it, while the numbers are 926, 1,324, 2,457 and 3,962 for Collins, Gavin, Krogan-core and Krogan-extended, respectively. In fact, according to [6], the BioGRID network is structurally very different from the other four datasets, and in particular it shows an unexpectedly high fraction of star-like structures. If many candidate complexes with star-like structures are predicted, the effectiveness of the complex detection algorithms may be hampered. The reason is that these complexes usually have low density values, where the density of a complex with n proteins is defined as the total weight of its internal edges divided by n * (n − 1)/2; in the unweighted BioGRID dataset, the total weight of a complex is the number of its internal edges (an example is shown in Figure 5a). In contrast, a considerable number of real complexes form a clique in the interaction graph and have high density values, although there are many other topological structures that may represent a complex on a PPI graph [24]. For example, the experimental results in [6] show that the performance of various protein complex detection algorithms on BioGRID is the worst among all PPI databases. In these cases the authors of [6] recommended using a higher value for the density threshold in order to discard trivial clusters. Given an unweighted network, ClusterONE automatically tests the value of the transitivity and sets the density threshold to either 0.5 or 0.6 (for the BioGRID dataset it uses 0.6).
On a dataset like BioGRID, many candidate complexes with star-like structures and low density values would normally be discarded by complex detection algorithms based on the density threshold. However, when the PPI data from literature are integrated, many such candidate complexes are retained, since their density values increase with the inclusion of the new PPI data. As shown in the example of Figure 5, a candidate complex with a star-like structure (Figure 5a) will be discarded since its density is 0.5 while the density threshold is 0.6. However, when the edge between proteins A and C is added (Figure 5b), the complex's density increases to 0.67 and it will be retained by ClusterONE (density threshold 0.6).
This assumption can be supported by the following fact: with the threshold -0.6, a total of 6,025 PPIs are

Table 7 The Accuracy performances of ClusterONE on PPI datasets using the SGD gold standard

MMR(-0.6) denotes the MMR value with the threshold -0.6; Δ(-0.6) denotes the MMR improvement with the threshold -0.6 over that on the original datasets. MMR(0) denotes the MMR value with the threshold 0; Δ(0) denotes the MMR improvement with the threshold 0 over that on the original datasets.

On the other hand, we found that if the threshold is set to 0 and fewer PPIs (1,210) are integrated into BioGRID, much better performance can be achieved using either gold standard (MIPS or SGD), as shown in Figures 9 and 10. Therefore, for databases with low transitivity like BioGRID, the threshold should be set higher to ensure that fewer PPIs are integrated into the databases, and, in other cases, the threshold can be set to -0.6. In this way, the performances of protein complex detection algorithms can be improved through the integration of PPI datasets and the PPI data extracted from literature.

The performance comparison of various protein complex detection algorithms on BioGRID between the threshold -0.6 and 0 using MIPS as gold standard.
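The arithmetic of this example can be reproduced with a small Python sketch. It assumes that the candidate complex of Figure 5 has four proteins with B as the hub of the star; this layout reproduces the reported densities of 0.5 and 0.67, but the exact topology of the figure is an assumption, as are the function names. The minimum size of three proteins and the threshold of 0.6 are taken from the description of ClusterONE.

```python
def density(proteins, edges):
    """Density of an unweighted candidate complex: internal edges / (n*(n-1)/2)."""
    members = set(proteins)
    n = len(members)
    if n < 2:
        return 0.0
    internal = {frozenset(e) for e in edges
                if len(set(e)) == 2 and set(e) <= members}
    return len(internal) / (n * (n - 1) / 2)


def keep_candidate(proteins, edges, min_density=0.6, min_size=3):
    """Final filtering step: drop small or sparse candidate complexes."""
    return len(set(proteins)) >= min_size and density(proteins, edges) >= min_density


proteins = ["A", "B", "C", "D"]
star = [("A", "B"), ("B", "C"), ("B", "D")]  # assumed layout of Figure 5a
with_literature_edge = star + [("A", "C")]   # edge A-C added, as in Figure 5b

before = density(proteins, star)
after = density(proteins, with_literature_edge)
```

A single added edge moves the candidate from 3/6 to 4/6 and across the 0.6 threshold, which is how sparse star-like candidates can survive filtering once literature-derived PPIs are integrated.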

Conclusions
Protein complexes are important for understanding principles of cellular organization and function. High-throughput experimental techniques have produced a large amount of protein interactions, making it possible to predict protein complexes from protein-protein interaction networks. On the other hand, the rapidly growing biomedical literature provides a significantly large, readily available and relatively accurate source of interaction data, which can be integrated into the protein network for better protein complex detection performance. In this paper, we present an approach of improving protein complex detection methods with integrated PPI data from biomedical literature. The approach applies PPIExtractor to extract PPI data from biomedical literature, which are then integrated into the protein network for protein complex detection. The experimental results of ClusterONE on five yeast PPI datasets show the effectiveness of our method: compared with the original networks, the average improvements of 3.976 and 5.416 percentage units in MMR are achieved on the new networks using the MIPS and SGD gold standards, respectively. In addition, our method also proves to be effective for three other algorithms proposed in recent years, CMC, COACH and RRW.
Through the analysis of the experimental results, we found that the threshold can usually be set to -0.6. However, for databases with low transitivity like BioGRID, the threshold should be set higher. In this way, the performances of the state-of-the-art protein complex detection algorithms can be improved through the integration of the existing PPI datasets and the PPI data extracted from literature.
A rapidly growing literature corpus ensures that PPI data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. PPI data provides a significantly large and readily available source of interaction data which, together with the guidelines and results reported here, will prove valuable especially for organisms in which protein-protein interaction data is sparse.