Similarity between sequences can be assessed by local or global alignment score. A score is a numerical value that describes the overall quality of an alignment. The protein sequence classification algorithm which relies solely on an optimal alignment score is not practical for a large protein database. Protein sequence similarity can be readily found by suboptimal alignment using protein-protein BLAST [2]. Bitscore is a rescaled version of the raw alignment score that is independent of the size of the search space. P-value is the probability that the match is random. For multiple testing, P-value is corrected by multiplying it with size of the search space to get the E-value. The lower the Evalue, the more significant the score is. Thus E-values and Bit-scores carry slightly different information. So there should be an unified measure which combines the properties of both. To do this, the EB-score between sequences s1 and s2 is defined as follows,
(1)
where H(·) is the Heaviside-step function,
(2)
The performance comparison given in the section 'Results', shows importance of the EB-score over the Bitscore.
Graph construction
The proposed algorithm uses the graphical modeling of protein sequences from each protein family/class. Consider the large protein database having M classes. Let the set of class labels be given as . Consider one of the protein classes, , then let be the ith training sequence in that class. Similarly is the new query or test sequence and its class label c
q
is what we have to find.
Let the protein family or the class c has N training sequences (N may be different for different ), then the set of vertices for class c is defined in the similarity space as, . Strength of an edge between the vertices and is given as,
(3)
These edges form a set . Now the graph of a protein class is given by G(Vc, Ec). This is a weighted and an undirected graph. An edge weight is nothing but the strength of protein sequence similarity.
To construct the graph of each protein class, we just need to consider interaction (EB-score) among all protein sequences within that class. Number of proteins in a class is far smaller than that of the entire database. So graphs of all protein classes can be easily and independently constructed by using protein-protein BLAST within the corresponding classes.
Graph analysis
In protein similarity graphs, modularity, local clustering and scale-free topology [19] coexist. To explain this phenomenon we need the hierarchy, so graphs are analyzed in the hierarchical manner. At each hierarchical level, the edges with weights lower than a certain threshold are pruned. Now surviving edges are considered to be weightless. So the graph structure changes along hierarchical levels, as well as the graph becomes unweighted and remains undirected. This hierarchical analysis helps to extract different graph features for weakly similar hits (sequence matches) and thus captures the complex relationship between sequence similarity and protein function.
For any set , consider , and i as an indicator variable, and let ∅ be the null (empty) set, ∅ = {} then 'set element' is formed as
(4)
Cardinality of a set (), is the number of elements present in it. For the graph of c, at certain hierarchy (i.e. at threshold t), the edge set is given as,
(5)
The corresponding graph is given as . For notation simplicity lets represent it as G(Vc, Ec, t), instead of and note that G(Vc, Ec) = G(Vc, Ec, 0).
For any sets and , define the set addition function (like a multiset sum) as,
(6)
and note that , thus added set contains repetitive elements when . For example, let and then . This operation obeys the associative and the commutative laws like numerical addition. As defined earlier, is the query sequence from the unknown class c
q
. Now , and edges among the vertices in the set is given by an edge set . After adding to the original graph G(Vc, Ec), we will get the new graph .
Graph Structured Features (GSF)
Most of the real world and biological (scale-free) networks communicate via a few highly connected nodes known as Hubs. These hubs determine the properties of networks [19]. In real world networks like airline route maps, the important cities form hubs. Proteins with high degrees of connectedness are more likely to be essential for survival than proteins with lesser degrees [22]. Gene duplication leads to growth and preferential attachment in biological networks [23]. This leads to translating the proteins having high similarity. This shows the possibility of hub formation in the protein family graphs, G(Vc, Ec), in the similarity space.
Figure 2(b) shows building of the PPS network with a varying threshold for one of the COG [6] protein families. We can see that as the threshold is lowered, trivially more edges are formed but most of them are associated with only particular nodes (hubs). Thus hubs are getting stronger and becoming more evident in the PPS network. We are not interested in the detailed assessment of whether the network is scale-free (a power-law degree distribution [19]) or not. But the above analysis helps to guide us for finding proper features which take graph structure (i.e. complex relationships among the protein sequences) into account. Also, the different protein families have different characteristics. Thus use of a single graph feature may not be effective. Features are selected such that they could extract different but vital network information.
Average Clustering coefficient (AC)
For a node n, the clustering coefficient C
n
, measures the extent to which neighbors of n are also the neighbors of each other [19]. Thus it is nothing but the density of sub-graph induced by the neighborhood of n. Consider the graph G(V, E) with n ∈ V . Let ℕ
n
be the number of neighbors of n and be the number of edges between them, then the AC is given by,
(7)
A clique is a maximal complete sub-graph where all the vertices are connected. C
n
quantifies how close the neighbors of a node are, to form a clique among themselves. It represents the potential modularity of a network and C
n
of the most real networks is much larger than that of a random networks. AC distribution is found to be effective for an identification of a modular organization of the metabolic networks [24]. Consider the example shown in Figure 3(a). Node c has 4 neighbors , having 2 connected edges among them (a to b and d to e), which forms and then . When is attached to G(Vc, Ec), it may change its AC. For a given the change in AC is given as,
(8)
Rich Club coefficient (RC)
The 'rich-club' phenomenon refers to the tendency of nodes with high centrality to form tightly interconnected communities. Degree (d) of a node is the number of directly connected neighbors. High degree nodes (rich nodes) are much more likely to form tight and well interconnected sub-graphs than low degree nodes [25]. Thus hubs are generated through 'rich-gets-richer' mechanism. A quantitative definition of the rich-club phenomenon is given by the rich-club coefficient (φ). Let be the number of vertices having degree greater than r and be the number of edges among those vertices then,
(9)
For the example shown in Figure 3(a), . In a complex network, φ is a novel probe for finding topological correlations and it yields vital information about the underlying architecture of the network [25]. Similarly as explained earlier, the change in RC is given as,
(10)
Star Motifs (SM)
Previous analysis showed the existence of hubs in the PPS network. Features like mean path length and degree distribution [19], nicely quantifies hub like properties of the network. But mean path length is computationally very expensive for large protein database, so simple and elegant features are needed. Hubs indicate existence of star shaped patterns (star-motifs ). Let power (p) of SM be the degree of the central (center of the 'star' pattern) node. Figure 3(b) shows the SM with various powers. Each node in the training graph is already assigned with the node degree, so that the SM can be easily computed. This simple feature answers the questions like, 'does the give rise to the new hubs' and 'what is the increased strength of the hubs'. Let SM(·, p) finds the number of SM of power p then change in SM is given as,
(11)
Triangles (TR)
In a graph, the smallest clique with three nodes is a triangle. The number of triangles gives an important information about structure of the PPS network. Let the graph G(Vc, Ec) be represented by N × N adjacency matrix [Ec], whose (i, j)th entry is given as, , then the number of triangles in the graph are given as,
(12)
Either the graph is dense or sparse, TR can be computed readily within the same time as the computation is only associated with the matrix of the same size. TR inherently captures different network properties than SM. Figure 3(c) illustrates this difference more elaborately. Query node q interacts with the nodes a, b, c and d. So they constitute an 'interacting neighborhood' to the query. TR has capability to simultaneously assess the interactions within interacting neighborhood. After the query interaction, one extra triangle is formed since only the nodes b and c were previously interacting. Whereas at node a SM power has increased from 3 to 4. Formation of new triangles in a graph indicates the fact that 'query interacts simultaneously to the already interacting nodes'. Newly formed number of triangles due to query interaction ΔTR(c, t), are found similarly as equation (8).
Graph Energy (GE)
The original graph structure changes when the query interacts with it. Figure 3(d) shows two different type of query interaction with the same graph, where an edge length is proportional to its weight. When only q1 interacts with a graph then its spread remains almost unaffected but in case of q2 interaction, the spread of a graph alters drastically. GE is defined as the sum of the absolute values of the eigenvalues of the adjacency matrix [26], and given as follows,
(13)
Since GE depends only on the adjacency matrix, the density of the graph does not affect its computation time. The effect of query interaction on the original graph structure is captured by change in GE (i.e. ΔGE(c, t)), given similarly like equation (8). We are dealing with graphs whose edge strength is the similarity between connecting nodes, which is inverse of the usual edge length definition. So we need to look for the maximum ΔGE(c, t).
Algorithm
Graphs are analyzed hierarchically (as discussed in section 'Graph analysis') and threshold plays an important role in making hierarchical graph structure. If the query can interact at the higher layer of GP (see Figure 2(a)), then it means it is a strong interaction and it accounts for the global feature. Because in GP as the level rises, the threshold also increases and at every level, the graph edges can only be formed if their strength is greater than the given threshold. Here interactions are in the sequence similarity space, so a sstrong interaction means the highest similarity between corresponding sequences, which occurs when both sequences match at most of the nucleic acids (i.e. match globally). Hence strong interaction accounts for the global feature and similarly weak interactions account for local features.
Let , be a set of sets, defining few threshold levels corresponding to each GTF. Consider, t
AC
= {t1, t2, t3} and t1 < t2 < t3. Let us consider there are only 3 protein classes i.e. . Figure 4 explains the hierarchical query classification subroutine (algorithm 1) for the feature AC. Each q
i
is analyzed first at the highest level (t3) where we look for class , having ΔAC(c, t) > T
AC
and collect them in . For query q1 we can not find any such classes c at t3, so we descend the GP to t2 and discover with threshold t∗ = t2. The secondary threshold (T
AC
) is necessary, otherwise there will be many spurious classes i.e. false positives (FP), having nonzero change in average clustering coefficient .
In the subroutine given in an algorithm 1, t∗ is the maximum threshold (also defines maximum GP level) at which the query starts interacting with the graph of some class. Let us represent the subroutine by abusing the notation for simplicity and readability as,
(14)
Maximum value of H(·) is 1, so all the arguments (classes c) are assigned to whenever it produces output 1. For reducing FP, the subroutine for SM and TR slightly changes (see algorithm 2). Here we look for k maximally influenced classes from by the query. Thus classes from only are assessed (voted) again by the features SM and TR. Subroutine for GE finds the class from for which ΔGE(·) is maximum. Thus this produces a hierarchical class voting scheme which helps to improve the classification accuracy and to reduce the computational load.
Each GSF has an ability to extract different information from different levels of the protein family GP. However, applying each GSF to entire GP, is computationally inefficient when dealing with large numbers of protein families. In addition, it may add up the FP, when a decision is being made in the much lower GP level than the level defined by t∗. To avoid these problems, the GP based hierarchical voting scheme is necessary. And the rational behind placing different GSF at different GP levels, is explained in the next 'Results' section.
Algorithm 1 Graph pyramid search subroutine
input: ; secondary threshold T
AC
; primary threshold , where t
i
>t
j
for i >j
output:
1:
2: for t ∈ t
AC
from t
n
to t1 do
3: if then
4: for all classes do
5: I
AC
= H (ΔAC(c,t)−(T
AC
))
6:
7: t* = t
8: end for
9: end if
10: end for