iOPTICS-GSO for identifying protein complexes from dynamic PPI networks

Background: Identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms. Because much evidence indicates that dense sub-networks in a dynamic protein-protein interaction network (DPIN) usually correspond to protein complexes, identifying protein complexes is formulated as a density-based clustering problem.
Methods: In this paper, a new approach named iOPTICS-GSO is developed. It improves the Ordering Points to Identify the Clustering Structure (OPTICS) algorithm by using the Glowworm swarm optimization (GSO) algorithm to optimize the parameters of OPTICS when finding dense sub-networks. In iOPTICS-GSO, the concept of core node is redefined, the Euclidean distance in OPTICS is replaced with an improved similarity between the nodes in the PPI network according to their interaction strength, and dense sub-networks are considered as protein complexes.
Results: The experimental results show that iOPTICS-GSO outperforms algorithms such as DBSCAN, CFinder, MCODE, CMC, COACH, ClusterOne, MCL and OPTICS_PSO in terms of f-measure and p-value on four DPINs, which are constructed from the DIP, Krogan, MIPS and Gavin datasets. In addition, our predicted protein complexes have small p-values and thus are highly likely to be true protein complexes.
Conclusion: The proposed iOPTICS-GSO obtains optimal clustering results by adopting the GSO algorithm to optimize the parameters in OPTICS, and the results on four datasets show superior performance. Moreover, the results provide clues for biologists to verify and find new protein complexes.


Background
Proteins are indispensable components of various types of cells and tissues, and the executors of biological functions. A protein in the cell does not exist in isolation, and every life process involves more than one protein [1]. Protein complexes are not only the basis of normal biological processes, but also play an important role in pathological processes [2]. Therefore, identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms [3]. As a variety of protein interaction databases have been produced, it has become possible to identify protein complexes from protein-protein interaction (PPI) networks. Living organisms are always changing, and so are PPIs in living cells [4]. The interactions between proteins change over time not only with the presence and degradation of proteins, but also with the environment. In [5], the authors incorporated the "time" factor for proteins, in the form of cell-cycle phases, into the analysis of complexes and studied the dynamic phenomena of complex assembly and disassembly across various cell cycles. To express these dynamics, many kinds of dynamic data, including gene expression profiles [6], have been used to construct dynamic PPI networks (DPINs).
The discovery of protein complexes is equivalent to finding subsets of function-related proteins in a data set. Clustering is an effective method for finding subsets that share common attributes in a database [7]. Therefore, the development of improved clustering algorithms has received a lot of attention in the last few years. Density-based clustering is an important type of clustering analysis, and one of its main advantages is that it can detect clusters of any shape while being insensitive to noise [8]. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [9], proposed by Ester et al., is a density-based clustering algorithm. DBSCAN is applicable to datasets of any shape and size. It is noise-tolerant and independent of the ordering of data objects. However, it has two initial parameters: the neighborhood radius and the minimum number of points within that radius. DBSCAN requires the user to input these two parameters manually, and the clustering results are very sensitive to their values.
In order to overcome these shortcomings of DBSCAN, Ankerst et al. [10] proposed a new algorithm called Ordering Points to Identify the Clustering Structure (OPTICS). Its basic idea for identifying clusters is similar to that of DBSCAN: both search for high-density regions.
In real life, many optimization problems require not only computing an extremum, but also obtaining the corresponding optimal values. This kind of problem is a serious challenge to traditional algorithms. Consequently, a growing number of swarm intelligence algorithms have been put forward, such as the Genetic Algorithm (GA) [11] and Particle Swarm Optimization (PSO) [12]. Glowworm swarm optimization (GSO) [13], proposed by Krishnanand and Ghose in 2005, is a bionic swarm intelligence algorithm. GSO simulates a group of glowworms that use fluorescence to attract other glowworms or to forage: the greater the fluorescein value, the brighter the glowworm is, and the more attractive it is.
The OPTICS algorithm does not explicitly produce clusters for a data set; instead, it creates an augmented ordering queue that represents the density-based clustering structure. The cluster-ordering must then be processed to obtain clustering results. For each network, different parameter settings produce different results. In this study, we put forward an algorithm named iOPTICS-GSO, which improves OPTICS by using GSO to optimize its parameters. In order to investigate its performance, iOPTICS-GSO is compared with eight other methods: DBSCAN [9], CFinder [14], MCODE [15], CMC [16], COACH [17], ClusterOne [18], MCL [19] and OPTICS_PSO [20]. We also use the p-value for function enrichment analysis. The experimental results illustrate that iOPTICS-GSO achieves better performance than the competing algorithms.
The outline of this paper is as follows. Section 2 reviews the GSO algorithm and basic OPTICS, and then presents our iOPTICS-GSO. Section 3 describes and discusses the experimental results, and Section 4 gives the conclusions.

GSO algorithm
In the GSO algorithm, glowworms with higher fluorescein are more attractive to other glowworms, and thus a group of glowworms moves towards the glowworms with high fluorescein. Within its dynamic decision domain radius, each glowworm chooses a glowworm whose fluorescein value is higher than its own and moves towards it. The glowworm to follow is selected from the dynamic decision domain according to a probability, and the position is updated accordingly. Finally, the decision domain is updated. The GSO algorithm has two important phases, as follows.
The phase for updating the fluorescein.
The fluorescein value of each glowworm depends on its fluorescein value in the previous generation and on the current fitness function. Let x_i(t) represent the location of the i-th glowworm in the t-th generation, and let J(x_i(t)) represent the fitness function of the i-th glowworm in the t-th generation. The fluorescein value l_i(t) of the i-th glowworm in the t-th generation is calculated as follows:
$$l_i(t) = (1-\rho)\, l_i(t-1) + \gamma\, J(x_i(t))$$
where ρ and γ are two parameters with values between 0 and 1.
The phase of updating the position.
Each new position of a glowworm is a small movement from its original position, which is calculated as follows: where S is the update step length of the glowworms, S_0 is the initial step length, and t_max is the largest number of iterations. Here, we adopt a linearly decreasing step length instead of a fixed step length [21], in order to improve the optimization ability of the algorithm when updating the population.
In GSO, each glowworm looks for neighbors within its field of vision and then moves towards a brighter glowworm. The moving direction at each step depends on which neighbor is selected. In addition, the decision domain radius of a glowworm is influenced by the number of glowworms in its neighborhood: when the number of neighboring glowworms is too small, the glowworm increases its decision radius in order to find more glowworms; otherwise, it reduces its decision radius. In the end, GSO gathers most of the glowworms at better positions.
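The individual GSO steps described above can be sketched in Python. This is a minimal illustration that follows the commonly published form of GSO rather than the authors' exact implementation; the function names, the default values of `rho`, `gamma`, `beta`, `r_max` and `n_desired`, and the proportional-to-brightness-difference selection rule are assumptions made for the example.

```python
import math
import random


def update_fluorescein(prev, fitness, rho=0.4, gamma=0.6):
    """Decay the previous fluorescein and add a share of the current fitness."""
    return (1.0 - rho) * prev + gamma * fitness


def brighter_neighbors(i, positions, fluorescein, radii):
    """Indices of glowworms inside i's decision radius that are brighter than i."""
    return [
        j for j in range(len(positions))
        if j != i
        and math.dist(positions[i], positions[j]) < radii[i]
        and fluorescein[j] > fluorescein[i]
    ]


def choose_neighbor(i, neighbors, fluorescein, rng=random):
    """Pick one neighbor with probability proportional to its brightness advantage."""
    weights = [fluorescein[j] - fluorescein[i] for j in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


def move_towards(xi, xj, step):
    """Move xi a distance `step` along the unit vector pointing to xj."""
    d = math.dist(xi, xj)
    if d == 0.0:
        return tuple(xi)
    return tuple(a + step * (b - a) / d for a, b in zip(xi, xj))


def update_radius(radius, n_neighbors, r_max=3.0, beta=0.08, n_desired=5):
    """Grow the radius when neighbors are scarce, shrink it when they are many."""
    return min(r_max, max(0.0, radius + beta * (n_desired - n_neighbors)))
```

For instance, `update_radius(1.0, 2)` enlarges the radius because only two neighbors were found, whereas `update_radius(1.0, 8)` shrinks it.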

OPTICS
The key idea of density-based clustering such as OPTICS is that, for each object in a cluster, the neighborhood within a given radius ε has to contain at least a minimum number of objects (MinPts); that is, the cardinality of the neighborhood has to reach MinPts. The condition Card(N_ε(p)) ≥ MinPts is called the "core object condition". If this condition holds for an object p, then we call p a "core object". Other objects can be directly density-reachable only from core objects.
In PPI networks, the node degrees obey a power-law distribution. We therefore select all nodes as core nodes, so that nodes with a small degree can also be considered. As a result, we redefine two definitions as follows.
Definition 1: (Core distance of node p). Let p be a protein in a PPI network, and let Distance_MinPts(p) be the MinPts-th maximum distance from node p to all the other nodes. Then, the core distance of p is defined as follows:
Definition 2: (Distance reachability of node p). Let nodes p and o be two proteins in a PPI network, and let N(o) be the set of neighbors of node o. Then, the distance reachability is defined as follows: where d_op is the distance from node p to node o.
As can be seen above, the reachability distance of a node cannot be smaller than the core distance of node o. Thus OPTICS creates an ordering queue of all nodes, and stores the core distance as well as a suitable reachability distance for each node.
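To make the two quantities concrete, the following Python sketch uses the standard OPTICS definitions of Ankerst et al. rather than the redefined versions used in this paper. The function names are hypothetical, and taking the MinPts-th smallest distance as the core distance is an assumption borrowed from the standard formulation.

```python
def core_distance(dist_row, min_pts):
    """MinPts-th smallest distance from a node to the other nodes.

    dist_row holds the distances from the node to every other node
    (the node itself is excluded). Standard OPTICS definition, used
    here as an assumption.
    """
    if len(dist_row) < min_pts:
        return float("inf")
    return sorted(dist_row)[min_pts - 1]


def reachability_distance(core_dist_o, d_op):
    """Reachability of p from o: never smaller than the core distance of o."""
    return max(core_dist_o, d_op)


# Example: distances from node o to four other nodes, with MinPts = 3.
cd = core_distance([0.9, 0.2, 0.64, 0.5], 3)
near = reachability_distance(cd, 0.3)   # lifted to the core distance of o
far = reachability_distance(cd, 0.9)    # the actual distance dominates
```

The `max` in `reachability_distance` is what guarantees the property stated above: a reachability distance can never fall below the core distance of node o.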

The proposed iOPTICS-GSO
In this section, we elaborate on how the proposed iOPTICS-GSO identifies protein complexes. The following subsections describe the calculation of the distance between proteins, the clustering of PPI networks, and the iOPTICS-GSO algorithm together with its time complexity analysis.

Calculating the distance in a PPI network
In a PPI network, we use the similarity between two proteins to measure their distance. The fewer neighbors two proteins share, the lower their similarity and the smaller the probability that they belong to the same protein complex. Conversely, the higher the similarity of two proteins, the more likely they are to belong to the same protein complex [22]. Therefore, the similarity is determined by the number of neighbors the two nodes share in the PPI network. Consider a PPI network PN with adjacency matrix A, where the binary vector X_i = (A_i1, A_i2, …, A_in) indicates the interactions between protein i and the other proteins. The number of common neighbors (CN) of proteins i and j is calculated by the equation CN_ij = |N_i ∩ N_j|, where N_i and N_j are the neighbor sets of proteins i and j, respectively. If CN_ij ≠ 0, the similarity between proteins i and j is calculated as follows [23]: Considering that, in a PPI network, two nodes that have no common neighbor can still be connected, and that the standard complexes include multiple protein complexes containing only two proteins, we redefine the similarity S as follows: The greater the similarity between two proteins, the smaller the distance between them. The distance is then calculated as follows: We use D_ij to replace the Euclidean distance in OPTICS for measuring the distance between two proteins in a PPI network.

Clustering PPI network
Fig. 1 shows a PPI network with the distances between node o and other nodes. In this study, we set MinPts to 4 and, from Fig. 1, first select node o as the core. To obtain the core distance of o, we calculate all distances between core o and its neighbors according to Eq. (8). From the definition, we get the value Distance_reachability(d, o) = 0.64. In the same manner, we obtain a sequence of values for all nodes.
The algorithm keeps track of all the reachability distance values and uses them to avoid the expensive operations identified above. We obtain an augmented ordering queue from OPTICS and convert it into a reachability-plot. Fig. 2 shows such a reachability-plot and an example of a cluster. Each sunken part in Fig. 2a can be viewed as a cluster. That is, a new cluster starts at a steep down region and ends at the next steep down region. As a result, the algorithm can find all clusters from the reachability-plot.
For example, in Fig. 2b we can see a cluster starting at object #1 and ending at object #15. Note that object #1, which is the last object with a high reachability value, is part of the cluster: its high reachability indicates that it is far away from the previous cluster, but it has to be close to object #2. Object #3 has a low reachability value, indicating that it is close to one of the objects #1 or #2. Because of the way OPTICS chooses the next object in the cluster-ordering, it has to be close to #2 (if it were close to object #1, it would have been assigned index 1 and not index 2). A similar argument holds for object #15, which is the last object with a low reachability value and is therefore also a member of the cluster.
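The common-neighbor count and a neighbor-based distance can be illustrated with a short Python sketch. Only CN_ij = |N_i ∩ N_j| is taken from the text. The similarity used here, a Jaccard index over closed neighborhoods (each node counted as its own neighbor), and the distance 1 − S are stand-ins chosen for illustration and are not the formulas of this paper; all function names are hypothetical.

```python
def neighbors(adj, i):
    """Neighbor set N_i of node i in a symmetric 0/1 adjacency matrix."""
    return {k for k, a in enumerate(adj[i]) if a and k != i}


def common_neighbors(adj, i, j):
    """CN_ij = |N_i ∩ N_j|."""
    return len(neighbors(adj, i) & neighbors(adj, j))


def standin_similarity(adj, i, j):
    """Stand-in similarity (assumption): Jaccard index of closed neighborhoods.

    Including each node in its own neighborhood keeps the similarity of two
    interacting proteins above zero even when they share no common neighbor.
    """
    ni = neighbors(adj, i) | {i}
    nj = neighbors(adj, j) | {j}
    return len(ni & nj) / len(ni | nj)


def standin_distance(adj, i, j):
    """Stand-in distance (assumption): higher similarity means smaller distance."""
    return 1.0 - standin_similarity(adj, i, j)


# Toy network with edges 0-1, 0-2, 1-2 and 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
cn_01 = common_neighbors(adj, 0, 1)   # proteins 0 and 1 share neighbor 2
cn_23 = common_neighbors(adj, 2, 3)   # interacting pair without common neighbors
d_23 = standin_distance(adj, 2, 3)
```

In the toy network, proteins 2 and 3 interact but share no neighbor, which is exactly the case that motivates the redefined similarity: the stand-in still assigns them a distance below 1.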
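Reading clusters off a reachability-plot can be sketched as a simple threshold cut. The Python function below is an illustrative simplification, not the authors' extraction procedure: it assumes every node may start a cluster (in line with treating all nodes as core nodes), uses a single threshold `eps`, and drops clusters smaller than a hypothetical `min_size`.

```python
def extract_clusters(ordering, reachability, eps, min_size=2):
    """Cut a reachability-plot at height eps.

    ordering     -- node identifiers in OPTICS cluster-order
    reachability -- reachability value of each node, in the same order
    A node whose reachability exceeds eps is far from everything before it,
    so it closes the current valley and becomes the first member of the next.
    """
    clusters, current = [], []
    for node, reach in zip(ordering, reachability):
        if reach > eps:
            if current:
                clusters.append(current)
            current = [node]
        else:
            current.append(node)
    if current:
        clusters.append(current)
    # Discard clusters that are too small, e.g. single nodes.
    return [c for c in clusters if len(c) >= min_size]


order = ["a", "b", "c", "d", "e", "f", "g"]
reach = [float("inf"), 0.2, 0.3, 0.9, 0.25, 0.2, 0.95]
found = extract_clusters(order, reach, eps=0.5)
```

With these toy values the plot has two valleys, giving the clusters `a, b, c` and `d, e, f`; node `g` starts a new cluster but remains alone and is filtered out. Different values of `eps` cut the same plot into different clusters, which is why the choice of this parameter matters.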

iOPTICS-GSO algorithm
Although the OPTICS algorithm can find all clusters, a dynamic PPI network has more than one sub-network, and the sizes and topological structures of these sub-networks are quite different. For example, when we apply OPTICS to a dynamic PPI network with 12 sub-networks, 12 reachability-plots are obtained, each different from the others. The optimal parameters and the corresponding performance of each sub-network are shown in Table 1. It is evident that each sub-network has its own optimal parameters and that the performance of the clustering results differs. It can also be seen that OPTICS with global density parameters is not suitable for datasets with different densities.
It is well known that the GSO algorithm has few parameters, simple operation and good stability. GSO simulates the behavior of glowworms in nature, which communicate by comparing fluorescein values, so as to solve the optimization problem. We therefore introduce the GSO algorithm to optimize the parameters of OPTICS in order to obtain optimal results. Algorithm 2 describes the details of iOPTICS-GSO. Over a number of iterations, a glowworm constantly updates its position and approaches the best position. At last, the glowworm finds the best position.
The correspondence between GSO and OPTICS is shown in Fig. 3. When we adopt the GSO algorithm to optimize the parameter ε in OPTICS, the position of a glowworm in GSO corresponds to a value of the parameter ε. A glowworm moving its position, guided by its dynamic decision domain radius, corresponds to searching for the optimal value of ε. When the fitness function achieves its maximum value in GSO after a number of position updates, OPTICS finds the best clustering result.
In the iOPTICS-GSO algorithm, the fluorescein values, the decision domain radii and the positions of the glowworms are first initialized. Secondly, the GSO algorithm is used to optimize the parameter ε in OPTICS. In this part, one position of a glowworm is one parameter value. OPTICS is then run with this parameter value, so that a corresponding clustering result is obtained for each value (position). Next, the clustering performance is evaluated for each value (position), the fluorescein values are updated, and the glowworms move accordingly. After the iterations, the new positions of the glowworms are found, and the position with the maximum fitness value is selected as the optimal position. In summary, the time complexity of iOPTICS-GSO is O(maxiter * (num^2 + PopSize * num^2 + PopSize^2 + PopSize)), which reduces to O(maxiter * PopSize * num^2).
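The overall loop can be sketched as a one-dimensional GSO search over ε. This Python example is a simplified illustration under several assumptions and is not the authors' Algorithm 2: the `fitness` callable stands in for "run OPTICS with this ε and score the resulting clusters", the step length is assumed to shrink linearly as `s0 * (1 - t / max_iter)`, and all parameter defaults are hypothetical.

```python
import random


def gso_optimize_eps(fitness, lo, hi, pop_size=30, max_iter=50, rho=0.4,
                     gamma=0.6, s0=0.05, beta=0.08, n_desired=5, seed=0):
    """Search [lo, hi] for the eps value that maximizes fitness(eps)."""
    rng = random.Random(seed)
    r_max = hi - lo
    pos = [rng.uniform(lo, hi) for _ in range(pop_size)]   # one eps per glowworm
    fluo = [5.0] * pop_size                                 # initial fluorescein
    radius = [r_max] * pop_size                             # decision radii
    best_eps, best_fit = None, float("-inf")
    for t in range(max_iter):
        # Phase 1: evaluate every candidate eps and update its fluorescein.
        for i in range(pop_size):
            fit = fitness(pos[i])
            fluo[i] = (1.0 - rho) * fluo[i] + gamma * fit
            if fit > best_fit:
                best_eps, best_fit = pos[i], fit
        # Phase 2: move each glowworm towards a brighter neighbor.
        step = s0 * (1.0 - t / max_iter)   # assumed linearly decreasing step
        new_pos = list(pos)
        for i in range(pop_size):
            nbrs = [j for j in range(pop_size)
                    if j != i and abs(pos[j] - pos[i]) < radius[i]
                    and fluo[j] > fluo[i]]
            if nbrs:
                weights = [fluo[j] - fluo[i] for j in nbrs]
                j = rng.choices(nbrs, weights=weights, k=1)[0]
                if pos[j] != pos[i]:
                    direction = 1.0 if pos[j] > pos[i] else -1.0
                    new_pos[i] = min(hi, max(lo, pos[i] + step * direction))
            # Fewer neighbors than desired widens the radius, more narrows it.
            radius[i] = min(r_max, max(0.0, radius[i] + beta * (n_desired - len(nbrs))))
        pos = new_pos
    return best_eps, best_fit


# Toy fitness with a single peak at eps = 0.6, used only for demonstration.
def toy_fitness(eps):
    return 1.0 - (eps - 0.6) ** 2


best_eps, best_fit = gso_optimize_eps(toy_fitness, 0.0, 1.0)
```

In a real run, each call to `fitness` would cluster one sub-network, so the fitness evaluations dominate the cost, while the neighbor search contributes a term that is quadratic in the population size.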

Experimental datasets
In this study, we used four static PPI networks for yeast, namely DIP [24], Krogan [25], MIPS [26] and Gavin [27], to evaluate the proposed iOPTICS-GSO. The DIP data consists of 4995 proteins and 21,554 interactions. The gene expression data [30] has accession number GSE3431 and contains 9336 genes at 36 time points in 3 cell cycles. DPINs are constructed from a static PPI network and the gene expression data. We use the three-sigma principle to judge whether a gene is expressed at a particular timestamp. For example, given a preset threshold value, if the expression value of a protein is greater than the threshold at a certain timestamp t, this protein is judged to be active at timestamp t. Each sub-network consists of the active proteins and the interactions between them, and these sub-networks together form the DPIN. As a result, we get four DPINs from DIP, Krogan, MIPS and Gavin, respectively. Table 2 shows the scales of the sub-networks obtained from these four static PPI networks.
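The construction of a DPIN from a static network and expression data can be sketched as follows. The exact three-sigma formula is not given in this text, so the per-gene threshold `mean + k * std` is an assumption, as are the function names and the data layout (a list of edges and a dictionary mapping each protein to its expression values over time).

```python
from statistics import mean, pstdev


def active_thresholds(expression, k=3.0):
    """Hypothetical per-gene threshold: mean + k * standard deviation."""
    return {g: mean(v) + k * pstdev(v) for g, v in expression.items()}


def build_dpin(edges, expression, thresholds):
    """Return one sub-network (list of edges) per timestamp.

    A protein is active at timestamp t when its expression value exceeds
    its threshold; an interaction is kept at t only if both ends are active.
    """
    n_times = len(next(iter(expression.values())))
    subnetworks = []
    for t in range(n_times):
        active = {g for g, v in expression.items() if v[t] > thresholds[g]}
        subnetworks.append([(u, w) for u, w in edges if u in active and w in active])
    return subnetworks


# Toy data: three proteins, two timestamps, and a fixed threshold of 3.
edges = [("A", "B"), ("B", "C")]
expression = {"A": [5.0, 1.0], "B": [5.0, 5.0], "C": [1.0, 5.0]}
dpin = build_dpin(edges, expression, {"A": 3.0, "B": 3.0, "C": 3.0})
```

Here protein A is active only at the first timestamp and protein C only at the second, so the interaction A-B appears in the first sub-network and B-C in the second.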

Performance evaluation
In order to evaluate the clustering results, we adopted three commonly used statistical metrics: precision, recall and f-measure [31]. Precision measures how accurately the protein complexes identified by an algorithm match the known protein complexes in the standard dataset, and recall measures how accurately the known protein complexes match the identified protein complexes. The f-measure evaluates the closeness between the known and the identified protein complexes. Precision, recall and f-measure are calculated as follows: where X is the set of proteins in the identified protein complexes and F is the set of known complexes in the standard dataset. |pc| is the number of proteins in the identified protein complex and |kc| is the number of proteins in the known protein complex. The overlapping score (OS) evaluates how many proteins in the true protein complexes can be recovered by the identified protein complexes [32,33]. Usually, an identified protein complex is considered to match a known protein complex when the OS is equal to or larger than 0.2 [5]. We also use the p-value to evaluate the statistical and biological significance of the identified protein complexes [34]. In detail, given k proteins in a true protein complex C with a biological function shared by an identified protein complex F from a total set V of proteins, the p-value is defined as: which is the probability that an identified protein complex is enriched by a true protein complex only by chance [35].

Table 2 Scales of different sub-networks
Timestamps       1     2     3     4     5     6     7     8     9    10    11    12
Proteins       797   941   796   623   610   530   493   944  1090   591   661   461
Interactions   981  1444  1188   745   750   646   573  1705  2185   856   974   526
Krogan data
Timestamps       1     2     3     4     5     6     7     8     9    10    11    12
Proteins       336   379   320   256   206   189   202   580   626   304   330   250
Interactions   334   464   331   234   210   184   213  1025  1081   314   373

A low p-value of an identified protein complex means that the co-occurrence of these proteins in the same complex is not due to chance and has high statistical significance. That is to say, the lower the p-value of a protein complex, the stronger its biological significance, while a protein complex with a p-value greater than 0.01 is considered insignificant. In the experiments, the p-value was calculated on biological process ontologies.
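The evaluation measures can be sketched in Python using the forms commonly found in the protein complex literature. These are assumptions, since the formulas themselves are not reproduced in this text: OS(pc, kc) = |pc ∩ kc|² / (|pc| · |kc|), precision and recall as the fractions of identified and known complexes that have a match with OS ≥ 0.2, and the p-value as a hypergeometric tail probability. The function names are hypothetical, and complexes are assumed to be non-empty sets.

```python
from math import comb


def overlap_score(pc, kc):
    """OS between an identified complex pc and a known complex kc."""
    inter = len(pc & kc)
    return inter * inter / (len(pc) * len(kc))


def evaluate(identified, known, omega=0.2):
    """Precision, recall and f-measure with matching threshold omega."""
    hit_pc = sum(1 for pc in identified
                 if any(overlap_score(pc, kc) >= omega for kc in known))
    hit_kc = sum(1 for kc in known
                 if any(overlap_score(pc, kc) >= omega for pc in identified))
    precision = hit_pc / len(identified) if identified else 0.0
    recall = hit_kc / len(known) if known else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def enrichment_p_value(total, annotated, complex_size, hits):
    """Probability of at least `hits` annotated proteins in a random complex.

    total        -- number of proteins in the whole set
    annotated    -- proteins in the whole set that carry the function
    complex_size -- number of proteins in the identified complex
    hits         -- proteins in the complex that carry the function
    """
    upper = min(complex_size, annotated)
    tail = sum(comb(annotated, i) * comb(total - annotated, complex_size - i)
               for i in range(hits, upper + 1))
    return tail / comb(total, complex_size)


identified = [{"a", "b", "c"}, {"x", "y"}]
known = [{"b", "c", "d", "e"}, {"p", "q"}, {"m", "n"}]
precision, recall, f_measure = evaluate(identified, known)
p_val = enrichment_p_value(total=10, annotated=4, complex_size=3, hits=2)
```

In the toy data, one of the two identified complexes matches a known complex (OS = 1/3), and one of the three known complexes is recovered, so precision is 0.5 and recall is 1/3.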

The effect of parameter
In the iOPTICS-GSO algorithm, there is one parameter to be preset, namely the value of MinPts. According to the topological properties of PPI networks, if the value of MinPts is too large, no meaningful cluster can be identified by the algorithm. For example, when we set MinPts to 10, no meaningful cluster can be identified from the DPIN. On the contrary, if the value of MinPts is too small, too many proteins fall into the same cluster and few protein complexes are identified. In this study, the value of MinPts is set according to Fig. 4 for the four datasets. The x-axis represents the value of MinPts, which ranges from 2 to 8, and the y-axis represents the f-measure. Each value of MinPts corresponds to a value of the f-measure, and the set of values forms the line chart shown in Fig. 4. The blue line represents the result on the DIP data, the orange line the result on the Krogan data, the green line the result on the MIPS data, and the yellow line the result on the Gavin data. In Fig. 4, the effect of different values of MinPts on the f-measure is small, which confirms that the reachability-plot is rather insensitive to the input parameter of the method. We observe that the f-measure initially increases as MinPts increases and decreases after reaching its maximum. We therefore chose the value of MinPts at which the f-measure reaches its maximum in iOPTICS-GSO. As a result, the optimal values of MinPts are 3, 2, 2 and 4 for DIP, Krogan, MIPS and Gavin, respectively.

Clustering comparisons
In order to directly validate its performance, iOPTICS-GSO is compared with eight competing algorithms: DBSCAN [9], CFinder [14], MCODE [15], CMC [16], COACH [17], ClusterOne [18], MCL [19] and OPTICS_PSO [20]. iOPTICS-GSO is also compared with the basic OPTICS. All comparisons are on the DIP, Krogan, MIPS and Gavin datasets. Each algorithm uses its best parameters in the comparison, and it was found that these algorithms obtain their best results under the default parameter settings. The performances of all clustering algorithms are reported in Table 3, which contains the category of each algorithm, the number of identified protein complexes, and the average size of the protein complexes. From Table 3, we can see that the numbers of clusters obtained by the proposed algorithm on the four datasets are smaller than those of the compared methods. The reason is that the interactions in most sub-networks are sparse, so the distance between these nodes calculated by Eq. (7) can be as large as 1, and each of these nodes was regarded as a class of its own. In the final phase, we filtered the results from the clustering of each sub-network and deleted the clustering modules whose density was low or that had only one node. Fig. 5 depicts the precision, recall and f-measure of each algorithm on the four datasets. From Fig. 5, we can see that the proposed algorithm obtains higher precision and f-measure than the other competing algorithms. After combining OPTICS with the GSO algorithm, iOPTICS-GSO can produce clustering results based on the optimal parameters. Therefore, it obtains a much better performance than the OPTICS algorithm, as the last green and blue columns in Fig. 5 show clearly.
To evaluate the biological significance and functional enrichment of the complexes identified by our algorithm, we calculated the p-values of the identified protein complexes on Biological Process ontologies for the four datasets by using SGD's GO::TermFinder tool (http://www.yeastgenome.org/cgi-bin/GO/goTermFinder.pl). We also calculated the p-values of the protein complexes of size greater than or equal to 3 identified by six other algorithms: COACH, MCL, MCODE, ClusterOne, OPTICS and OPTICS_PSO. The comparison results are shown in Table 4. From Table 4, it is clear that the proposed algorithm achieves better performance on the DIP, Krogan, MIPS and Gavin data, while MCL and ClusterOne perform poorly on the four datasets. Only a few protein complexes identified by iOPTICS-GSO are insignificant. On the Krogan data in particular, no protein complex is insignificant; that is, all protein complexes identified by iOPTICS-GSO on the Krogan data are significant. In detail, on the DIP, Krogan and Gavin data, the percentage of complexes with p-value < E-15 among the complexes predicted by iOPTICS-GSO was the highest, accounting for 8.70%, 12.22% and 26.25%, respectively. On the MIPS data, the percentage of complexes with p-value < E-15 among the protein complexes identified by iOPTICS-GSO was also the highest, accounting for 20.00%. Compared with OPTICS_PSO, the percentage of significant complexes identified by iOPTICS-GSO was higher on the DIP and Krogan data. On the MIPS and Gavin data, the percentage of complexes with p-value < E-10 among the protein complexes identified by iOPTICS-GSO was higher. In general, the statistical results in Table 4 indicate that the complexes identified by iOPTICS-GSO are more biologically meaningful than those identified by the other algorithms.
Table 5 lists some protein complexes identified in the Gavin data. These protein complexes are not well matched with the benchmark dataset (their OS values are low), but they all have low p-values for GO terms. The p-values of these identified protein complexes are calculated on Molecular Function. In each row, the proteins in bold match some known protein complex in the benchmark complex dataset well, and the additional proteins probably share similar functions with the other proteins. For example, in the first predicted protein complex, 5 proteins do not match the known protein complex, but 4 of them (namely YNL248C, YJR063W, YOR340C and YIL021W) share a similar annotation, DNA-directed 5′-3′ RNA polymerase activity, with the true protein complex. This protein complex is visualized in Fig. 6. Fig. 6a describes the interactions among the 16 proteins, and Fig. 6b shows the common GO slim terms between every two proteins. We can see clearly that there are far fewer interactions in (a) than links in network (b). This shows that even if some proteins do not interact, they still share GO slim terms, meaning that they are likely to carry out some functions together as a complex. Given the incompleteness of the benchmark protein complex set, predicted protein complexes that have low OS values but small p-values are highly likely to be true protein complexes. Therefore, the results provide clues for biologists to verify and find new protein complexes.

Conclusions
Protein complexes are not only the basis of normal biological processes, but also play an important role in pathological processes. Therefore, identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms. In this study, we have put forward an algorithm named iOPTICS-GSO, which improves the OPTICS algorithm by using GSO to optimize the parameter in OPTICS. We also changed the concept of core node and redefined the similarity so that it better reflects the actual situation of PPI networks. As different parameter settings give different results on each sub-network of a DPIN, we used the GSO algorithm to optimize these parameters, and finally checked the quality of every cluster and obtained the optimal clustering results. The experimental results show that iOPTICS-GSO outperforms the competing algorithms in terms of f-measure and p-value. This means that the results from iOPTICS-GSO are more biologically meaningful than those of the other algorithms for identifying significant protein complexes. However, we also found that the number of clustering modules is relatively small and that the recall of iOPTICS-GSO is lower than that of other algorithms. The reason may be that each protein can belong to only one cluster in iOPTICS-GSO, which makes the other clustering modules small. Therefore, our future work will focus on finding an effective strategy to improve the results and detect more protein complexes.