iOPTICS-GSO for identifying protein complexes from dynamic PPI networks
© The Author(s). 2017
Published: 28 December 2017
Identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms. Because considerable evidence indicates that dense sub-networks in a dynamic protein-protein interaction network (DPIN) usually correspond to protein complexes, identifying protein complexes is formulated as a density-based clustering problem.
In this paper, a new approach named iOPTICS-GSO is developed. It is an improved Ordering Points to Identify the Clustering Structure (OPTICS) algorithm that uses the glowworm swarm optimization (GSO) algorithm to optimize the parameters of OPTICS when finding dense sub-networks. In iOPTICS-GSO, the concept of a core node is redefined, the Euclidean distance in OPTICS is replaced with an improved similarity between nodes in the PPI network based on their interaction strength, and dense sub-networks are considered to be protein complexes.
The experimental results show that iOPTICS-GSO outperforms algorithms such as DBSCAN, CFinder, MCODE, CMC, COACH, ClusterOne, MCL and OPTICS_PSO in terms of f-measure and p-value on four DPINs, which are constructed from the DIP, Krogan, MIPS and Gavin datasets. In addition, the predicted protein complexes have small p-values and are thus highly likely to be true protein complexes.
The proposed iOPTICS-GSO obtains its best clustering results by using the GSO algorithm to optimize the parameters of OPTICS, and the results on four datasets show superior performance. Moreover, the results provide clues for biologists to verify and find new protein complexes.
Proteins are indispensable components of all types of cells and tissues and are the executors of biological functions. No protein in the cell exists in isolation, and every life process involves more than one protein. Protein complexes are not only the basis of normal biological processes but also play an important role in pathological processes. Therefore, identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms. As a variety of protein interaction databases have been produced, it has become possible to identify protein complexes from protein-protein interaction (PPI) networks. Living organisms are always changing, and so are the PPIs in living cells. The interactions between proteins change over time not only with the presence and degradation of proteins but also with the environment. Previous work incorporated the “time” factor for proteins, in the form of cell-cycle phases, into the analysis of complexes and studied the dynamic assembly and disassembly of complexes across various cell cycles. To express these dynamics, many kinds of dynamic data, including gene expression profiles, have been used to construct dynamic PPI networks (DPINs).
The discovery of protein complexes is equivalent to finding subsets of functionally related proteins in a data set. Clustering is an effective method for finding subsets that share common attributes. Therefore, the development of improved clustering algorithms has received a lot of attention in the last few years. Density-based clustering is an important family of clustering methods; one of its main advantages is that it can detect clusters of arbitrary shape while being insensitive to noise. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), proposed by Ester et al., is a density-based clustering algorithm. DBSCAN is applicable to datasets of any shape and size, is noise-tolerant, and is independent of the ordering of the data objects. However, it has two initial parameters: the neighborhood radius and the minimum number of points within that radius. The user must supply both parameters manually, and the clustering results are very sensitive to their values. To overcome these shortcomings of DBSCAN, Ankerst et al. proposed a new algorithm called Ordering Points to Identify the Clustering Structure (OPTICS). Its basic idea is similar to that of DBSCAN: both identify clusters by searching for high-density regions.
In real life, many optimization problems require not only calculating the extremum but also obtaining the corresponding optimal values. This kind of problem is a serious challenge for traditional algorithms. As a result, a growing number of swarm intelligence algorithms have been put forward, such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Glowworm swarm optimization (GSO), proposed by Krishnanand and Ghose in 2005, is a bionic swarm intelligence algorithm. GSO simulates a group of glowworms whose fluorescence attracts other glowworms during movement or foraging: the greater a glowworm's fluorescein value, the brighter it is and the more attractive it is to others.
The OPTICS algorithm does not produce clusters for a data set explicitly; instead, it creates an augmented ordering of the data that represents its density-based clustering structure. This cluster ordering must then be processed to obtain the clustering results. For each network, different parameter settings produce different results. In this study, we put forward an algorithm named iOPTICS-GSO, an improved OPTICS algorithm that uses GSO to optimize the parameters of OPTICS. To investigate its performance, iOPTICS-GSO is compared with eight other methods: DBSCAN, CFinder, MCODE, CMC, COACH, ClusterOne, MCL and OPTICS_PSO. We also use the p-value for functional enrichment analysis. The experimental results show that iOPTICS-GSO achieves better performance than the competing algorithms.
The outline of this paper is as follows. Section 2 reviews the GSO algorithm and basic OPTICS and then presents our iOPTICS-GSO. Section 3 describes and discusses the experimental results, and Section 4 gives the conclusions.
In the GSO algorithm, glowworms with higher fluorescein are more attractive to other glowworms, so groups of glowworms move towards the glowworms with high fluorescein. Within its dynamic decision domain, each glowworm identifies the glowworms whose fluorescein values are higher than its own. It then selects one of them according to a probability and moves towards it, updating its position. Finally, the decision domain is updated. The GSO algorithm has two important phases, as follows.
The phase for updating the fluorescein.
The phase of updating the position.
In GSO, each glowworm looks for neighbors within its field of vision and then moves towards a brighter one. The direction of each move depends on which neighbor is selected. In addition, the decision domain radius of a glowworm is influenced by the number of glowworms in its neighborhood: when there are too few neighbors, the glowworm increases its decision radius in order to find more glowworms; otherwise, it reduces its decision radius. In the end, GSO gathers most of the glowworms at better positions.
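The two phases and the radius adaptation described above can be sketched as follows. This is a minimal illustration that assumes the standard GSO update rules from the general GSO literature, since the update equations are not reproduced in this text. The function name `gso_maximize`, the parameter names (`rho`, `gamma`, `beta`, `n_t`, `step`, `l0`) and their default values are illustrative assumptions, not settings taken from this paper. Tracking the best position seen so far is a convenience added here and is not part of the basic algorithm.

```python
import math
import random


def gso_maximize(objective, bounds, pop_size=30, iters=100, rho=0.4,
                 gamma=0.6, beta=0.08, n_t=5, step=0.03, l0=5.0, seed=0):
    """Maximize `objective` over a box with a basic glowworm swarm (illustrative)."""
    rng = random.Random(seed)
    r_s = max(hi - lo for lo, hi in bounds)       # assumed maximum sensing radius
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    luc = [l0] * pop_size                         # fluorescein (luciferin) levels
    rad = [r_s] * pop_size                        # dynamic decision radii
    best_pos, best_val = None, -math.inf

    for _ in range(iters):
        fit = [objective(p) for p in pos]
        for p, f in zip(pos, fit):                # remember the best point seen so far
            if f > best_val:
                best_pos, best_val = p[:], f

        # Phase 1: fluorescein update (decay plus reward from the objective).
        luc = [(1 - rho) * l + gamma * f for l, f in zip(luc, fit)]

        # Phase 2: position update towards a brighter neighbor.
        new_pos = [p[:] for p in pos]
        for i in range(pop_size):
            nbrs = [j for j in range(pop_size)
                    if j != i and luc[j] > luc[i]
                    and math.dist(pos[i], pos[j]) < rad[i]]
            if nbrs:
                # Brighter neighbors are chosen with proportionally higher probability.
                weights = [luc[j] - luc[i] for j in nbrs]
                j = rng.choices(nbrs, weights=weights)[0]
                d = math.dist(pos[i], pos[j])
                if d > 0:
                    new_pos[i] = [min(max(a + step * (b - a) / d, lo), hi)
                                  for a, b, (lo, hi) in zip(pos[i], pos[j], bounds)]
            # Few neighbors: grow the radius; many neighbors: shrink it.
            rad[i] = min(r_s, max(0.0, rad[i] + beta * (n_t - len(nbrs))))
        pos = new_pos

    return best_pos, best_val
```

For example, `gso_maximize(lambda p: -((p[0] - 1) ** 2 + (p[1] + 2) ** 2), [(-5.0, 5.0), (-5.0, 5.0)])` searches a two-dimensional box for the maximum of a toy objective. In the setting of this paper, the objective would instead score a clustering produced with the candidate parameter values.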
The key idea of density-based clustering such as OPTICS is that, for each object in a cluster, the neighborhood within a given radius ε has to contain at least a minimum number of objects (MinPts); that is, the cardinality of the neighborhood must reach this threshold. The condition Card(N_ε(p)) ≥ MinPts is called the “core object condition”. If this condition holds for an object p, then we call p a “core object”. Other objects can be directly density-reachable only from core objects.
In PPI networks, node degrees follow a power-law distribution, so we select all nodes as core nodes so that nodes with small degree are also considered. Accordingly, we redefine two concepts as follows.
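As a small illustration of the core object condition, the sketch below implements the classical OPTICS notions for points in Euclidean space. It assumes that the ε-neighborhood of a point includes the point itself, and the function names are hypothetical. It shows the original definitions only, not the redefined versions that iOPTICS-GSO uses on PPI networks.

```python
import math


def eps_neighborhood(points, p, eps):
    """All points within distance eps of p (p itself included, by assumption)."""
    return [q for q in points if math.dist(p, q) <= eps]


def is_core(points, p, eps, min_pts):
    """Core object condition: Card(N_eps(p)) >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts


def core_distance(points, p, eps, min_pts):
    """Distance from p to its MinPts-th closest neighbor, or None if p is not core."""
    dists = sorted(math.dist(p, q) for q in eps_neighborhood(points, p, eps))
    return dists[min_pts - 1] if len(dists) >= min_pts else None


def reachability_distance(points, p, o, eps, min_pts):
    """Reachability distance of p with respect to o; None if o is not a core object."""
    cd = core_distance(points, o, eps, min_pts)
    return None if cd is None else max(cd, math.dist(o, p))
```

With `points = [(0, 0), (0, 1), (1, 0), (5, 5)]`, `eps = 1.5` and `min_pts = 3`, the point `(0, 0)` satisfies the core object condition, while the isolated point `(5, 5)` does not.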
Definition 1: (Distance core of node p).
Definition 2: (Distance reachability of node p).
The proposed iOPTICS-GSO
Calculating the distance in a PPI network
Clustering PPI network.
Parameter values corresponding to the best result in each sub-network on DIP
It is well known that the GSO algorithm has few parameters, is simple to operate and has good stability. GSO simulates the way glowworms glow in nature: glowworms communicate by comparing their fluorescein values, and this drives the optimization of the problem. We therefore introduce GSO to optimize the parameters of OPTICS in order to obtain the best results. Algorithm 2 describes the details of iOPTICS-GSO. Over several iterations, each glowworm repeatedly updates its position and approaches the best position, until the best position is found.
Time complexity analysis of iOPTICS-GSO algorithm
The time complexity of the OPTICS algorithm is O(num²).
The time complexity of computing the fitness of the glowworms is O(PopSize × num²).
The time complexity of the glowworm moving process is O(PopSize²).
The time complexity of updating the position is O(PopSize).
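The four costs listed above can be combined into a single bound. The derivation below is a sketch under two assumptions that are not stated in the text: each step runs once per GSO iteration, and the number of iterations is denoted by a symbol introduced here, $t_{\max}$.

```latex
\begin{align*}
C_{\text{iter}} &= O(\mathit{PopSize}\cdot\mathit{num}^{2}) + O(\mathit{PopSize}^{2}) + O(\mathit{PopSize}) \\
                &= O(\mathit{PopSize}\cdot\mathit{num}^{2} + \mathit{PopSize}^{2}), \\
C_{\text{total}} &= O\bigl(t_{\max}\,(\mathit{PopSize}\cdot\mathit{num}^{2} + \mathit{PopSize}^{2})\bigr).
\end{align*}
```

Under these assumptions, the fitness evaluation dominates whenever $\mathit{PopSize} \le \mathit{num}^{2}$, and the bound simplifies to $O(t_{\max}\cdot\mathit{PopSize}\cdot\mathit{num}^{2})$.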
Results and discussion
In this study, we used four static PPI networks for yeast, namely DIP, Krogan, MIPS and Gavin, to evaluate the proposed iOPTICS-GSO. The DIP data consist of 4995 proteins and 21,554 interactions; the Krogan data consist of 2674 proteins and 7075 interactions; the MIPS data consist of 4546 proteins and 12,319 interactions; and the Gavin data consist of 1430 proteins and 6531 interactions. To verify the protein complexes identified by our method, the set of protein complexes derived from CYC2008, which includes 408 protein complexes and covers 1492 proteins, is selected as the gold standard dataset in this study.
The number of proteins and interactions in each sub-network of the four datasets
The effect of parameter
Fig. 4 shows that different values of MinPts have only a modest effect on the f-measure, which confirms that the reachability plot is rather insensitive to this input parameter. The f-measure first increases as MinPts increases and then decreases after reaching its maximum. For iOPTICS-GSO, we therefore chose the value of MinPts at which the f-measure is highest. The resulting optimal values of MinPts are 3, 2, 2 and 4 for DIP, Krogan, MIPS and Gavin, respectively.
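To make the evaluation criterion concrete, the following sketch computes precision, recall and f-measure for a set of predicted complexes against a gold standard. The matching rule is an assumption: the exact criterion used in this study is not reproduced in this text, so the sketch uses the neighborhood affinity score with a threshold of 0.2, a common choice in the protein complex literature. The function names and the placeholder protein labels are hypothetical.

```python
def affinity(a, b):
    """Neighborhood affinity score |a ∩ b|^2 / (|a| * |b|) between two complexes."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def f_measure(predicted, gold, threshold=0.2):
    """Return (precision, recall, f-measure) under the assumed matching rule."""
    # Predicted complexes that match at least one gold-standard complex.
    matched_pred = sum(
        1 for p in predicted if any(affinity(p, g) >= threshold for g in gold))
    # Gold-standard complexes that are matched by at least one prediction.
    matched_gold = sum(
        1 for g in gold if any(affinity(p, g) >= threshold for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


predicted = [{"A", "B", "C"}, {"X", "Y"}]
gold = [{"A", "B", "C", "D"}, {"M", "N"}, {"P", "Q"}]
precision, recall, f = f_measure(predicted, gold)
```

In this toy example, one of the two predictions matches a gold-standard complex and one of the three gold-standard complexes is recovered, so precision is 0.5, recall is 1/3, and the f-measure is 0.4.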
Description of clusters predicted by several clustering algorithms
From Table 3, we can see that the numbers of clusters obtained by the proposed algorithm on the four datasets are smaller than those obtained by the compared methods. The reason is that most sub-networks contain few interactions, so the distance between many nodes calculated by Eq. (7) is as large as 1, and each of these nodes is regarded as a class of its own. In the final phase, we filtered the results from each sub-network clustering and deleted clustering modules whose density was low or that contained only one node.
Comparison of the functional enrichment of protein complexes with other algorithms on four datasets
Some examples of the predicted complexes with small p-value on Gavin data
Predicted protein complex
Gene Ontology term
YKL144C YNR003C YPR110C YPR190C YDL150W YKR025W YNL151C YBR154C YJL011C YNL113W YDR045C YNL248C YJR063W YOR340C YIL021W YML010W
YJL069C YLR409C YLR222C YLR129W YDR449C YCR057C YGL171W YDR365C YKR060W YDR299W YGR145W YDL213C YNL075W YHR148W YLR186W YLL011W YJR002W YPL217C YGR128C YNL132W YMR093W YCL059C YPR144C YER082C YPR137W YBR247C YPL126W YDR324C YHR196W YOR078W YDL148C YJL109C YMR128W YOL010W YNL308C YHR169W YPR112C YDL166C YLR003C YGR081C YOR056C YGR054W YKL143W YNL207W YPL204W YCL011C YJL033W YKL059C YLR115W YAL043C YLR277C YNL317W YKL018W YJR093C
YML114C YCR042C YPL011C YDR167W YMR236W YBR198C YGL112C YMR005W YML015C YDR145W YMR227C YBR081C YLR055C YDR448W YGR252W YDR392W YPL254W
YCR042C YML114C YMR005W YML015C YPL011C YMR236W YGR274C YBR198C YGL112C YLR055C YCL010C YDR448W YPL254W
YLR129W YLR409C YDR449C YCR057C YPL266W YPR112C YDR299W YGR128C YPL126W YJR002W YDR324C YNL132W YPL217C YBL004W YDL148C YER082C YHR196W YGR090W YCL059C YLR003C YCL011C YCL031C YDL213C
YLR418C YGL244W YOL145C YBR279W YOR123C YGL019W YOR039W YMR309C YPL181W
YHL025W YBR289W YPL016W YPR034W YJL176C YFL049W YHR023W YPL082C YNL059C YNL272C YML114C YPL011C YDR176W YBR198C YDR392W YGL066W YOL148C YDR145W YER164W YKR001C YDR073W YML069W YKL088W YMR172W
YHR156C YHR165C YER172C YPR082C YDL087C YGR013W YDR283C YJL203W YDR416 YGL128C YLR117C YAL032C YPR178W YBL104C YGL100W YIL061C
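The p-value of a predicted complex with respect to a Gene Ontology term is commonly computed with a hypergeometric test. The sketch below assumes that standard formulation, since the formula used in this study is not reproduced in this text; the function and argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_annotated, complex_size, n_hits):
    """Probability of seeing at least n_hits annotated proteins by chance.

    n_total:      number of proteins in the background network
    n_annotated:  number of background proteins annotated with the term
    complex_size: number of proteins in the predicted complex
    n_hits:       number of complex members annotated with the term
    """
    upper = min(complex_size, n_annotated)
    # Sum the upper tail of the hypergeometric distribution with exact integers.
    tail = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, complex_size - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / comb(n_total, complex_size)
```

For instance, in a background of 10 proteins of which 3 carry a term, a complex of 3 proteins that all carry the term has a p-value of 1/120, about 0.0083. The smaller the p-value, the less likely it is that the shared annotation arose by chance.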
Protein complexes are not only the basis of normal biological processes but also play an important role in pathological processes. Therefore, identifying protein complexes plays an important role in understanding cellular organization and functional mechanisms. In this study, we have put forward an algorithm named iOPTICS-GSO, an improved OPTICS algorithm that uses GSO to optimize the parameters of OPTICS. We changed the concept of the core node and redefined the similarity so that it better reflects the actual situation of PPI networks. Because different parameter settings give different results on each sub-network of a DPIN, we used the GSO algorithm to optimize these parameters, and finally checked the quality of every cluster to obtain the best clustering results. The experimental results show that iOPTICS-GSO outperforms the competing algorithms in terms of f-measure and p-value, which suggests that its results are more biologically meaningful for identifying significant protein complexes. However, we also found that the number of clustering modules produced by iOPTICS-GSO is relatively small and that its recall is lower than that of other algorithms. The reason may be that each protein can belong to only one cluster in iOPTICS-GSO, which causes other clustering modules to be small. Therefore, our future work will focus on finding an effective strategy to improve the results and detect more protein complexes.
We are grateful for the support of the National Natural Science Foundation of China. We appreciate the experimental conditions provided by our college. In particular, we thank our laboratory members for useful discussions and comments.
This paper is supported by the National Natural Science Foundation of China (61672334, 61502290, and 61401263).
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 10 Supplement 5, 2017: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2016: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-10-supplement-5.
X.L. conceived the study and guided the design of the method and the algorithm. H.L. designed and performed the experiments and analyzed the data. X.L. and H.L. drafted the manuscript. A.ZH. and F.X.W. revised the manuscript and polished the English expression. All the authors read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick M, Michon AM, Cruciat CM, Remor M, Höfert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141–7.
- Kazemipour A, Goliaei B, Pezeshk H. Protein complex discovery by interaction filtering from protein interaction networks using mutual rank coexpression and sequence similarity. Biomed Res Int. 2015;2015. Article ID 165186:1–7.
- Lage K, Karlberg EO, Størling ZM, Ólason PÍ, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N, Moreau Y, Brunak S. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007;25(3):309–16.
- Yang ZH, Yu FY, Lin HF, Wang J. Integrating PPI datasets with the PPI data from biomedical literature for protein complex detection. BMC Med Genet. 2014;7(2):S3.
- Srihari S, Leong HW. Temporal dynamics of protein complexes in PPI networks: a case study using yeast cell cycle dynamics. BMC Bioinform. 2012;13(17):824–34.
- Li M, Zheng RQ, Zhang HH, Wang JX, Pan Y. Effective identification of essential proteins based on priori knowledge, network topology and gene expressions. Methods. 2014;67:325–33.
- Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. [M] DBLP, 1990.
- Pilevar AH, Sukumar M. A grid-clustering algorithm for high-dimensional very large spatial data bases. Pattern Recogn Lett. 2005;26(7):999–1010.
- Ester M, Kriegel HP, Sander J, Xu XW. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining. Menlo Park: The AAAI Press; 1996. p. 226–31.
- Ankerst M, Breunig M, Kriegel H, Sander J. OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Rec. 1999;28(2):49–60.
- Holland JH. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Quarterly Review of Biology. 1975;6(2):126–137.
- Kennedy J, Eberhart R. Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks; 1995. p. 1942–8.
- Krishnanand KN, Ghose D. Detection of multiple source locations using a glowworm metaphor with applications to collective robotics. Pasadena: IEEE Swarm Intelligence Symposium; 2005. p. 84–91.
- Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22(8):1021–3.
- Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 2003;4:1–27.
- Liu G, Wong L, Chua H. Complex discovery from weighted PPI networks. Bioinformatics. 2009;25(15):1891–7.
- Wu M, Li X, Kwoh C, Ng SK. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinform. 2009;10(1):1–16.
- Nepusz T, Yu H, Paccanaro H. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012;9(5):471–2.
- Dongen BSV. Graph clustering by flow simulation. Dissertation for doctoral degree, Center for Math and Computer Science (CWI). Utrecht: University of Utrecht; 2000.
- Lei XJ, Li H, Wu FX. Detecting protein complexes from DPINs by OPTICS based on particle swarm optimization. 2016 IEEE International Conference on Bioinformatics and Biomedicine. Shenzhen, China. 2016;1814–21.
- Shi BY, Eberhart R. A modified particle swarm optimizer. Proceedings of the IEEE Congress on Evolutionary Computation. Anchorage: IEEE; 1998:303–8.
- Yedidia J, Freeman WT, Weiss Y. Understanding belief propagation and its generalizations. Int Joint Conf Artif Intell (IJCAI). 2001;54(1):276–86.
- Letovsky S, Kasif S. Predicting protein function from protein-protein interaction data: a probabilistic approach. BMC Bioinform. 2003;19(6):197–204.
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30(1):303–5.
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440(7084):637–43.
- Güldener U, Münsterkötter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stümpflen V. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006;34:D436–41.
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Furga GS. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440(7084):631–6.
- Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
- Lei XJ, Wang F, Wu FX, Zhang AD, Pedrycz W. Protein complex identification through Markov clustering with firefly algorithm on dynamic protein-protein interaction networks. Inf Sci. 2016;329:303–16.
- Tu BP, Kudlicki A, Rowicka M, McKnight SL. Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science. 2005;310:1152–8.
- Zhang AD. Protein interaction networks: computational analysis. New York: Cambridge University Press; 2009.
- Brohée S, Helden JV. Evaluation of clustering algorithms for protein–protein interaction network. BMC Bioinform. 2006;7(1):1–19.
- Friedel CC, Krumsiek J, Zimmer R. Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. In: Vingron M, Wong L, editors. Proceedings of the 12th annual conference on research in computational molecular biology (RECOMB); 2008. p. 3–16.
- Sadeque A, Serão NV, Southey BR, Delfino KR, Rodriguez-Zas SL. Identification and characterization of alternative exon usage linked glioblastoma multiforme survival. BMC Med Genet. 2012;5(1):59.
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006;7:207–19.