 Research article
 Open Access
 Published:
Clustering analysis of microRNA and mRNA expression data from TCGA using maximum edgeweighted matching algorithms
BMC Medical Genomics volume 12, Article number: 117 (2019)
Abstract
Background
microRNA (miRNA) is a short RNA (~ 22 nt) that regulates gene expression at the posttranscriptional level. Aberration of miRNA expressions could affect their targeting mRNAs involved in cancerrelated signaling pathways. We conduct clustering analysis of miRNA and mRNA using expression data from the Cancer Genome Atlas (TCGA). We combine the Hungarian algorithm and blossom algorithm in graph theory. Data analysis is done using programming language R and Python.
Methods
We first quantify edgeweights of the miRNAmRNA pairs by combining their expression correlation coefficient in tumor (T_CC) and correlation coefficient in normal (N_CC). We thereby introduce a bipartite graph partition procedure to identify cluster candidates. Specifically, we propose six weight formulas to quantify the change of miRNAmRNA expression T_CC relative to N_CC, and apply the traditional hierarchical clustering to subjectively evaluate the different weight formulas of miRNAmRNA pairs. Among these six different weight formulas, we choose the optimal one, which we define as the integrated mean value weights, to represent the connections between miRNA and mRNAs. Then the Hungarian algorithm and the blossom algorithm are employed on the miRNAmRNA bipartite graph to passively determine the clusters. The combination of Hungarian and the blossom algorithms is dubbed maximum weighted merger method (MWMM).
Results
MWMM identifies clusters of different sizes that meet the mathematical criterion that internal connections inside a cluster are relatively denser than external connections outside the cluster and biological criterion that the intracluster Gene Ontology (GO) term similarities are larger than the intercluster GO term similarities. MWMM is developed using breast invasive carcinoma (BRCA) as training data set, but can also applies to other cancer type data sets. MWMM shows advantage in GO term similarity in most cancer types, when compared to other algorithms.
Conclusions
miRNAs and mRNAs that are likely to be affected by common underlying causal factors in cancer can be clustered by MWMM approach and potentially be used as candidate biomarkers for different cancer types and provide clues for targets of precision medicine in cancer treatment.
Background
Cancer is an abnormal growth of cells, which divide without control and spread into surrounding tissue. According to the website of the Cancer Statistics Center from the American Cancer Society (https://cancerstatisticscenter.cancer.org/#!/), in 2017 in the U.S., there were an estimated 1,688,780 new cancer cases and 600,920 cancer deaths. Cancer is a genetic disease caused by alterations of genes that control the cell behavior, like cell growth and division. Genetic, transcriptional as well as other alterations can be comprehensively identified from next generation sequencing (NGS) data of samples collected from tumorous tissue and normal adjacent tissue in the same patients suffering from a specific type of cancer. Those data are accumulated and organized by different projects such as International Cancer Genome Consortium (ICGC) [1], Encyclopedia of DNA Elements (ENCODE) [2], and the Cancer Genome Atlas (TCGA) [3]. The TCGA project was initiated in 2006 to develop a publiclyaccessible infrastructure of data. TCGA finalized tissue collection with matched tumor and normal tissues from 11,000 patients with 33 cancer types and subtypes, including 10 rare types of cancer. TCGA data has been used to characterize key genomic changes, find novel mutations, define intrinsic tumor types, discover similarities and differences across cancer types, reveal therapy resistance mechanisms, and collect tumor evolution evidence [3].
microRNA (miRNA) is a very short RNA (~ 22 nt) that can regulate gene expression at the posttranscriptional level [4]. Mainly from either intronic or intergenic regions of noncoding or coding genes [5, 6], miRNAs are transcribed primarily by RNA polymerase II to be parts of longer primary miRNA (pri–miRNA) transcripts that are capped, spliced, and polyadenylated [7, 8]. In the nucleus, pri–miRNA is processed, by the Microprocessor complex that consists of the RNase III enzyme Drosha and its cofactor DGCR8, to produce precursor miRNA hairpin (pre–miRNA). The resulting pre–miRNA is then exported to the cytoplasm and cleaved by Dicer to produce miRNA:miRNA duplex. Then the functional miRNA strand and Argonaute (AGO2) proteins are incorporated into the RNA–induced silencing complex (RISC) [9]. The base pairing between miRNA and mRNA relies on the seed region, about 2–8 nt in an miRNA, which functions as a part in the RISC, bound to the complementary region in the 3′ UTR of its target mRNA [10]. The miRNA guides RISC to silence the target mRNAs by means of mRNA cleavage, translational repression, or deadenylation [11].
Regulation of the miRNA and mRNA network is complex. A single miRNA can target many mRNAs, while many miRNAs are able to cooperatively target a single mRNA. This allows for finetuned gene expression regulation [12]. The cooperativity within some miRNA families or genomic clusters that target the same mRNAs is likely to be mainly additive [10]. miRNA also has sponge function for mRNA. When one of the mRNAs targeted by a specific miRNA change its expression level, the specific miRNA will redistribute and alter the protein translation of several transcripts. [13]. Thereby, considering these complexity, the observed expression correlation coefficient of a particular miRNAmRNA targeting pair can range from − 1 to 1, not always negative, even if the miRNAmRNA has predicted or experimentally validated targeting relationships. The aberration of miRNA expression could affect a large number of mRNAs and cancerrelated signaling pathways [14]. Some previous studies discovering and explaining this complexity in cancer are summarized as follows:
In a breast cancer study, miR183 was experimentally proven to directly target the 3′UTR of its target gene RAB21, by cotransfecting the luciferase reporters with 33 bp of the predicted target regions. The miR183/− 96–182 genomic cluster also has transcription factors HSF2 and ZEB1 that are experimentally validated to bind to the upstream of the TSS region of the miR183/− 96–182. Nevertheless, analyzing the 508 clinical samples from TCGA data, the correlations between miR183/− 96/− 182 cluster miRNAs and their target/regulators do not exhibit simply positive or negative correlations. Experimentally verified direct correlation between miR183 and the expressions of RAB21 could not be found based on the TCGA data analysis. But some interesting correlations between them in different subtypes were found [15], indicating the clue of solving the miRNAmRNA network complexity by grouping the subtypes of the cancer types.
In a study of ovarian cancer, it was found that the miRNAmRNA pair hsamiR1403p/RAD51AP1, was negatively correlated in both normal and tumorous samples with downregulated miRNA and upregulated mRNA expression values in tumor relative to normal samples, suggesting the expressional dysregulation of a direct miRNAmRNA interaction mechanism. However, some miRNAmRNA pairs were positively or negatively correlated in the tumor samples, but not in the normal samples, implying that the miRNAs de novo gain functions in tumor. Some miRNAmRNA pairs show correlations in normal samples but not in tumor samples, implying that the miRNA de novo lose functions in tumors. Intriguingly, the miRNAmRNA pairs that are positively correlated in both tumor and normal samples were identified, suggesting potential indirect pathways or intermediate regulatory mechanisms in tumorigenesis [16].
There are bioinformatics tools clustering the miRNAmRNA interaction network [17]. Some tools are based on miRNAmRNA expressional correlation coefficients calculated from NGS expression data [18]. Clustering results can be enhanced or filtered by integrated analysis of known or predicted miRNAmRNA targeting information [19]. For example, MAGIA2 utilizes negative expressional correlation coefficients between miRNA and mRNA across many matched or unmatched samples [19]. However, MAGIA2 neglects the situation that the miRNAs that have positive correlation coefficients also play a role in the regulatory network. miRMAP studies both significant negative and positive correlations between miRNA and mRNA; its bicluster analysis of miRNAmRNA bipartite graph provides insights into how modules of miRNAs regulate groups of functionally related mRNAs [10]. However, miRMAP only considered tumor condition. Thereby it lacks the view of the correlation coefficient changes between normal and tumor tissues. MMiRNAViewer visualizes altered expressional correlation coefficients of miRNA and mRNA in both tumor and normal; the correlation coefficient of a miRNAmRNA pair could be the same or inverted in sign in tumor compared to normal [18]. However, the connections between miRNA and mRNA are not combined together to quantify the miRNAmRNA correlation coefficient changes from normal to tumor.
Although Jansson and Lund explained the potential mechanisms of how a target mRNA may become uncoupled from its targeted miRNA [14], the factors inverting the miRNAmRNA expression correlation coefficient from normal to tumor are still unclear, indicating the complex direct and indirect regulation of the miRNA and mRNA network. In this study, we proposed six weight formulas to quantify the change of miRNAmRNA expressional correlation coefficients in tumor relative to in normal. We used the traditional hierarchical clustering algorithm to evaluate different formula weights of miRNAmRNA pairs and chose the integrated mean value weight. Then, we developed a novel bioinformatics pipeline called maximum weighted merger method (MWMM) based on objective optimization algorithms, namely the Hungarian and blossom algorithm, to cluster the miRNA–mRNA pairs. We hypothesized that the miRNAmRNA pairs with higher weights, if properly clustered together, are more likely to be intensely affected by common causal factors in the complex direct and indirect network.
Methods
This study focused on the expression correlation coefficient changes of miRNAmRNA pairs that were inverse from in normal to in tumor. Six edge weight formulas, which were proposed to simultaneously quantify the changes, were evaluated using the subjective traditional hierarchical clustering algorithm. After evaluation, integrated mean value weight was used to quantify the changes. Then, a maximum weighted merger method (MWMM) pipeline that consists of continuous iterations of Hungarian algorithm and several rounds of blossom algorithm was used to passively cluster the miRNAmRNA pairs based on the maximum weighted edge matching in the bipartite graph and general graph. The clustered miRNAmRNA pairs were validated mathematically by the clustering criteria that the inner weights of the clusters are larger than the outer weights of the clusters and biologically by the criteria that intracluster’s average GO term similarity distance scores are larger than the intercluster’s. Then the genes in clusters were enriched via functional analysis like KEGG pathway or GO term. Finally, the effectiveness of MWMM was tested by applying the MWMM approach to other 14 cancer types and comparing to other six clustering algorithms in terms of GO term similarity distance scores. The methodology is illustrated in Fig. 1.
Source data of BRCA from TCGA
The invasive breast carcinoma (BRCA) NGS expression data of miRNA and mRNA in tumor and in normal were downloaded from the TCGA data portal. BRCA data set has 863 samples consisting of 87 normal samples and 776 tumor samples. The number of samples in BRCA and other TCGA data sets used in this study are described in the first table in [18]. We processed the downloaded BRCA data set following the same procedure described in [20], and got four expression matrices across samples, including miRNA expression in tumor, miRNA expression in normal, mRNA expression in tumor, and mRNA expression in normal. The expression matrices involve 1046 miRNA and 20,531 mRNA in both tumor and normal samples. The row names of the expression matrices are miRNA names or gene names. The column names of the expression matrices are sample names.
Then we used MMiRNATar [20] that takes four expression matrices as input to calculate the expression correlation coefficient (Pearson correlation coefficient) across samples for each possible miRNAmRNA pair combination. The resultant miRNAmRNA pairs in tumor and normal are filtered following three cutoff criteria: false discovery rates (FDR) are ≤ 0.1, the correlation coefficients in tumor are opposite to in normal samples, and target predictions are supported by at least one of three target prediction databases: TargetProfiler [21], TargetScan [22], and miRanda [23]. Eventually we end up with 20,661 pairs of miRNA and mRNAs with their expression correlation coefficients in tumor (T_CC) and correlation coefficients in normal (N_CC) calculated and organized into a table for downstream analysis. The table is exemplified in Table 1.
Edgeweighted bipartite graph model
A simple graph is defined as G = (V, E), where V(G) or V denotes a set of vertices, and E(G) or E denotes a set of edges. E is 2element subsets of V. An edge is associated with two vertices. w(e) is defined as edge weight for each edge. In this study, the miRNAmRNA interaction network is visualized as an edgeweighted bipartite graph G = (V, E), where V consists of vertices of mRNAs (V_{L}) and miRNAs (V_{R}), i.e., V = (V_{L} + V_{R}) and E represents the weighted edges between the mRNA and miRNA vertices. Let i be the vertex subscript in the V_{L} and j be the vertex subscript in the V_{R}. Then v_{i}v_{j} is the connection between v_{i} and v_{j}, namely the edge connecting a vertex in V_{L} to a vertex in V_{R}. An example of a miRNAmRNA bipartite graph is shown in Fig. 2. The edge list denoting the bipartite graph is shown in Table 2.
Edge weight calculation
We combine T_CC and N_CC simultaneously to quantify the connections or weighted edges between miRNA and mRNA vertices. The connections reflect the intensity of inversion of miRNAmRNA expression correlation coefficients from in normal to in tumor. The formulas of calculating edge weights are described as follows.
We propose 6 types of edge weights that consider information of T_CC and N_CC simultaneously to measure the connection of the edge v_{i}v_{j}, in the case of BRCA, for 1 ≤ i ≤ 312 and 1 ≤ j ≤ 7874. Based on the foregoing three cutoff criteria, the number of selected miRNAs is 312 and the number of selected mRNAs is 7874.
The miRNAmRNA expression correlation coefficients are separated into two classes based on their parity change, as shown in Fig. 3. One class has N_CC > 0 and T_CC < 0, i.e., the correlation coefficients are converted from positive in normal to negative in tumor. The other class has N_CC < 0 and T_CC > 0, i.e., the correlation coefficients are converted from negative in normal to positive in tumor.
Intuitively, the arithmetic mean of absolute values (ie., T_CC and N_CC) is an option to quantify the inversion of their expressional correlation coefficient, namely, inversion of N_CC to T_CC for a miRNAmRNA pair as shown in Fig. 3. However, to increase the contrast of values in the data, a coefficient can be generated for each value by dividing that value by the arithmetic mean of the data. The value is then multiplied by its coefficient, so that values larger than the arithmetic mean of the data will become larger, and values smaller than the arithmetic mean of the data will become smaller. Using notation, let the T_CC values have the arithmetic mean, m_{T _ CC}. A specific T_CC value is denoted x, while \( \frac{x}{m_{T\_ CC}} \) is that value’s coefficient. The new value is given by \( \frac{x}{m_{T\_ CC}}\times x \), which enhances the importance of the value x if x is greater than the average m_{T _ CC}, and weakens the importance of the value x if x is smaller than the average m_{T _ CC}.
In our expression correlation coefficient data, the T_CC values consist of two groups: positive values and negative values. We calculate the arithmetic mean of the positive values of T_CC as \( {\boldsymbol{m}}_{\boldsymbol{T}\_\boldsymbol{CC}}^{+} \) and arithmetic mean of absolute value of the negative values of T_CC as \( {\boldsymbol{m}}_{\boldsymbol{T}\_\boldsymbol{CC}}^{} \) . Similarly, we calculate the \( {\boldsymbol{m}}_{\boldsymbol{N}\_\boldsymbol{CC}}^{+} \) and \( {\boldsymbol{m}}_{\boldsymbol{N}\_\boldsymbol{CC}}^{} \) for N_CC’s groups, as shown in Fig. 3. Then, the integrated mean value weight is calculated by assignments of coefficients λ_{1} and λ_{2}, shown in Table 3. In such a way, we can quantify the inversion of the correlation coefficients from the positive values of N_CC, N_CC^{+}, to the negative values of T_CC, T_CC^{−} (λ_{1}), and likewise, from N_CC^{−} to T_CC^{+} (λ_{2}) because these two sets represent different correlation change directions, as shown in Fig. 3.
Based on the abovementioned reasoning, we propose the formula of the integrated mean value weight to combine the T_CC and N_CC information simultaneously, as shown in Table 3 and Table 4. For the sake of comparison, we also propose other formulas that are common in basic mathematics, listed in Table 4. For example, all negative value weight cannot reflect the correlation coefficient changes of a miRNAmRNA pair in tumor relative to normal, but it can act as a random formula as a comparison to see if the proposed integrated mean value weight is random. Thereby, more possibilities exist beyond these six formulas.
For each miRNAmRNA pair in every row of the input table exemplified in Table 1, we calculated their six different weights. For example, the first row of the input table, the OBFC1 and hsamir383 pair has TCC value − 0.092 and N_CC value 0.271. The six weights of the pair are 0.225, 0.092, 0.271, 0.182, 0.158 and 0.271, respectively using weight formulas in Table 4.
Evaluation of six edge weight formulas by traditional hierarchical clustering algorithm
The traditional hierarchical clustering algorithm [24] was performed to evaluate six edge weight formulas. A particular clustering is not defined in the traditional hierarchical clustering algorithm. Instead, a sequence of clusters is given for researchers to interpret. To run the traditional hierarchical clustering algorithm on our bipartite graph edge list, the original pseudocode in [24] is adapted and shown in Fig. 4.
To cluster the miRNAmRNA pairs using traditional hierarchical clustering algorithm, we subjectively run n steps, namely add topn largest weighted edges to the empty graph to obtain a new graph with n edges. First, we initialize an empty bipartite graph. In each step, from the input table of edge list, we choose a miRNAmRNA pair if its edge weight is currently the maximum edge weights, and then add the pair to the bipartite graph; the chosen pair is removed from the input table of edge list. The process is repeated n times. n is subjectively determined by the user, rather than determined by a criterion inside the algorithm. Finally, there are n edges of miRNAmRNA pair in the bipartite graph. The miRNA and mRNA vertices with weighed edges are visualized using igraph [25] and ggnet2 (https://briatte.github.io/ggnet/) packages in R programming language.
The traditional hierarchical clustering algorithm can be used to evaluate six edge weight formulas. Given a specific threshold of number of edges, ie., steps, in the traditional hierarchical clustering, an edge weight formula that produces smaller number of disjoint clusters suggest that more highweighted miRNAmRNA interactions are clustered based on this formula, so this edge weight formula are considered better than other weight formulas. Thereby, the correlation between the connected cluster number and different edges/steps using different weight formulas is studied following the workflow diagram in Fig. 5. Based on the result part, the integrated mean value weight is now adopted in the sequel.
The traditional hierarchical clustering algorithm can actively, not passively, cluster the miRNAmRNA pairs and also can filter the topweighted edges in the graph, because only the top weighted, namely important, edges are added to the graph. In the meantime, the smaller weight edges, which might also have biological roles are ignored. To solve this issue, we proposed an objective maximum weighted merger method (MWMM) approach that also clusters smaller weight edges and tries to achieve the global optimum instead of only clustering topweighted edges. Thereby, traditional hierarchical clustering algorithm was only used to evaluate six edge weight formulas in this study.
Graph partitioning of the bipartite graphs
Partitioning the graph G consists of dividing the vertices into clusters, such that the total weight of the edges whose end points are in different clusters is minimized. The objective of this kind of partitioning is to minimize the cut, i.e. the total weight of the edges crossing the clusters. This is equivalent to maximizing the total weight of the edges that are inside the clusters [26].
In general, a graph’s vertex set V(G) may be partitioned into c disjoint parts, V_{1}, V_{2}, …, V_{c}, such that V = V_{1}∪V_{2}∪V_{3} … ∪V_{c}. Such parts may be referred to as subgraphs, partitions, or communities, but they shall be referred to as clusters in this discussion. A cluster, with more weighted connections inside and fewer weighted connections to other clusters, indicates that the members of a cluster are more similar or linked to each other than those in the portions of the graph outside that cluster [27]. The partitioning is illustrated in Fig. 6.
Hungarian and blossom algorithm matching in graph theory
A matching in graph theory is defined as a subset of edges such that none of the edges in the subset shares a common vertex. A maximum edgeweighted matching is a matching where the weight sum of the matched edges is as large as possible. In other words, we seek a perfect matching M to maximize the total weight w(M) = ∑_{e ∈ M}w(e).
The Hungarian algorithm is a combinatorial optimization algorithm used to solve the assignment problem. For example, if the performance of each of n people on each of n jobs is scored numerically, the assignment problem tries to assign people to jobs to make the sum of the scores as large as possible [28]. A tiny example of Hungarian algorithm is drawn in Fig. 7.
The Blossom Algorithm is an algorithm for finding the maximum matching in a general graph through shrinking cycles in the graph to reveal augmenting paths. The Blossom Algorithm is used to solve assignment problem, traveling salesman problem, etc. Given a general graph G = (V, E), the Blossom algorithm finds a matching M such that each vertex in V is incident with at most one edge in M and the edge weight w(M) is maximized [29]. A tiny example of blossom algorithm is drawn in Fig. 7.
MWMM procedure
The maximum weighted merger method consists of two major stages. First, MWMM implements Hungarian algorithm to find maximum edgeweighted matchings in bipartite graphs. We iteratively constructed and combined maximum edge–weighted matchings via the Hungarian algorithm to produce disjointed star graphs, labeled from K_{1,1} to K_{1,k}. Second, MWMM implements the Blossom algorithm to find maximum edgeweighted matchings in general graphs. We iteratively merge the initialized disjointed stars derived from the continuous iteration of Hungarian algorithm to form new edge–weighted clusters. The pseudocode of the MWMM pipeline is described in Fig. 8. The workflow of MWMM pipeline is depicted in Fig. 9.
Taken together, taking an edge list of edgeweighted bipartite graph, MWMM approach partition it into clusters that have higher internal connection density inside a cluster and lower external connection density outside the cluster. In other words, inner weights of a valid cluster should be greater than or equal to its outer weights. This clustering criterion is passive and objective to evaluate the quality of resulting clusters. This passive evaluation approach is different from, and better than, the subjective judgement of the cluster in the traditional hierarchical clustering approach: “clusters are in the eyes of the beholder”.
Application of Hungarian algorithm
The Hungarian algorithm takes input of a bipartite graph matrix that has miRNAs as row names, mRNAs as column names, and edge weights as entries. This raw bipartite graph matrix is converted from the raw edge list exemplified in Table 2.
After applying each round of the Hungarian algorithm, we get an edge list of matched miRNAs and mRNAs. We remove the matched zero edge weight miRNAmRNA pairs from the matched pairs so that the miRNAs or mRNAs in the zero edge weight matched pairs can participate in the next round of Hungarian algorithm to match their nonzero edge weight miRNA or mRNA mates, instead of being discarded. In other words, the matched zero edge weight miRNAmRNA pairs are still in the remaining bipartite matrix for the next round of Hungarian algorithm. The matched, nonzero edge weight mRNA and miRNA pairs are used to construct star graphs shown in Fig. 10.
Before the next round of Hungarian algorithm application, the columns of matched mRNAs are removed from the remaining bipartite graph matrix, whereas the rows of matched miRNAs are usually not removed from the remaining bipartite graph matrix. However, miRNAs that have zero edge weight with all mRNAs are removed from the remaining bipartite graph matrix, when each miRNA row of the remaining bipartite graph matrix is checked before the next round of Hungarian algorithm. Since these miRNAs has been matched and stored in the internal nodes of star graphs, keeping these used miRNAs in the remaining bipartite graph matrix make the Hungarian algorithm hard to match perfectly. Finishing these processing, the next round of the Hungarian algorithm was applied to the updated remaining bipartite graph matrix.
This Hungarian algorithm was repeated until all the miRNAs and mRNAs are removed from the remaining bipartite graph matrix. Eventually, the Hungarian algorithm merging process yields 312 K_{1,k} (S_{k}) star graphs. The Hungarian algorithm implementation is provided by clue [30] package in R programming language.
Star graph construction by the Hungarian algorithm
In graph theory, a star (graph) is a complete bipartite graph that has 1 internal node and k leaves, and accordingly the star graph is named K_{1,k} star or S_{k}. Note that there are 312 miRNAs and 7874 mRNAs in the raw bipartite graph. Since one miRNA can target multiple mRNAs, after continuous iterations of the Hungarian algorithm, we derived 312 merged star graphs. To facilitate programming, disjoint K_{1,k} star graphs are stored in communities object in igraph objects in R programming language. The star graph construction process is illustrated in Fig. 10.
Cross weight of vertices denoting clusters in the auxiliary graph
The 312 star graphs constructed by Hungarian algorithm are initial clusters that will be merged to form new clusters. Then the blossom algorithm is used to combine these star graphs or clusters. An edgeweighted auxiliary graph with 312 vertices denoting star graphs or clusters is formed by contracting each star graphs or (merged) clusters of miRNAs and mRNAs to a vertex in the auxiliary graph. For instance, we contract clusters C_{i1}, C_{i}, C_{j}, and C_{j + 1} in Fig. 6 to vertices C_{i1}, C_{i}, C_{j}, and C_{j + 1}. in the auxiliary graph. The auxiliary graph is illustrated in Fig. 11.
Cross weight is defined as the sum of the weights of the connections between two star graphs or clusters averaged by the number of vertices in the two star graphs or clusters. Averaging prevents larger clusters to be merged preferentially only because they are large. The connections consist of two scenarios. First, the miRNA(s) in a star graph/cluster has existing connections to the mRNA(s) in the other star graph/cluster. Second, the mRNA(s) in a star graph/cluster has existing connections to the miRNA(s) in the other star graph/cluster. The mathematical meaning of the cross weight is to detect the compounded connections between every two cluster candidates. An example diagram of a cross weight calculation for two disjoint K_{1,25} (S_{25}) star graphs is shown in Fig. 12.
Then the cross weight of vertices denoting clusters or star graphs in the auxiliary graph are calculated before each round of the blossom algorithm. The calculated cross weights are assembled into an edge list with vertex names of star graphs or clusters as first two column names and cross weights as the third column name. If a row of cross weight edge list has zero value cross weight, the row of the two star graphs or clusters is discarded. The edge list of nonzero cross weight is the input of each iteration of the blossom algorithm.
The blossom algorithm for merging partitions
Taking the edge list of cross weight of 312 K_{1,k} (S_{k}) star graphs as initial input, the blossom algorithm is repeatedly applied to match and merge clusters. After applying each round of the the blossom algorithm to the cross weight edge list, the maximum edgeweighted matching of vertices of clusters in the auxiliary graph is found. If there is no match for some star gaphs or clusters, those star graphs or clusters are put aside and not used in the next round of the blossom algorithm. Then, every two matched star graphs or clusters are merged to form new clusters. Cross weights of every two newly merged clusters are calculated for the next round of blossom algorithm. Then the blossom algorithm is repeatedly applied to the edge list of cross weight of vertices of clusters in newly formed auxiliary graph. The output of each round of blossom is a communities object in R programming language containing merged star graphs or clusters. The blossom algorithm was implemented using NetworkX package [31] in Python programming language.
Evaluation of six edge weight formulas by MWMM
As for the six different edge weight formulas, it would be interesting to check how different the obtained final partitions are. If the traditional clustering algorithm is used to see the final partitions, all the edges will be added to the graph, and thereby, the final partitions of six edge weights formula would be identical. Furthermore, the final partition will have 20,661 edges in the case of BRCA such that the graph would be indistinguishable. Instead, certain number of edges/steps, say, 38 edges/steps, can be used to compare the resultant partitions of traditional hierarchical clustering algorithm. However, the partitions from six different edge weight formula might have different number of nodes. Thereby it is hard to use the global evaluation metrics such as the adjusted rand index to compare the similarity of the partitions. The adjusted Rand index (ARI) can be used to measure the similarity of the two communities of clusters. ARI needs the knowledge of the ground truth classes, which is not available in real data set or requires manual annotation such as in the supervised learning (https://scikitlearn.org/stable/modules/clustering.html). The ARI has a value close to 0.0 for random labeling independently on the number of clusters and samples and has a value exactly 1.0 when the clusters are identical (https://scikitlearn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) [32]. So in this study the ARI cannot be used to tell whether the predicted clusters are similar to the true clusters. But ARI can be used to compare the similarity of resultant clusters of six weight edge formulas produced by the MWMM pipeline.
Results
Evaluation of the six kinds of edge weights by traditional hierarchical clustering algorithm
The traditional hierarchical clustering algorithm can subjectively cluster the miRNAmRNA pairs by filtering the topweighted edges in the graph. It can also be used to evaluate the proposed six edge weight formulas. We calculated the six kinds of edge weights and output the results as the edge list with miRNA node, mRNA node, and their edge weight, as is shown in Fig. 2 and Table 2. Then we ran the traditional clustering algorithm to cluster the miRNAmRNA pairs based on six proposed edge weights.
Given a specific number of steps in the traditional hierarchical clustering, smaller number of disjoint clusters suggest that more highweighted miRNAmRNA interactions are clustered. If the miRNAmRNA pairs with large edge weights fall into more disjoint small clusters, there will be a larger number of disjoint clusters, which suggest that there is no coordinated interaction within the clusters. From Fig. 13, we can see that under most of the step values, the integrated mean value weight has the fewest disjoint clusters and thereby is the preferable formula in this study. Although we chose the integrated mean value weight formula, researchers facing different data can still propose other formulas. These formulas should simultaneously combine the miRNAmRNA expressional correlation coefficient changes from in normal to in tumor.
Traditional hierarchical clustering algorithm on top integrated mean value weight edges
Since we chose integrated mean value weight to quantify the correlation change as edge weight of the miRNAmRNA bipartite graph thereafter, we wanted to see how the clusters derived from traditional hierarchical clustering algorithm look like. Since traditional hierarchical clustering algorithm selects the top weighted edges subjectively by users’ setting, it is sensible to get a threshold of edge number. Therefore, we plot a histogram to show the distribution of integrated mean value weight in Fig. 14. From Fig. 14, we can see that there are 38 edge weights greater than 0.7, so we subjectively ran 38 steps of traditional hierarchical clustering algorithm on the miRNAmRNA pairs with the integrated mean value weights. In other words, we selected the top 38 edgeweighted miRNAmRNA pairs to form a new bipartite graph that is shown in Fig. 15. As a comparison, top 38 edgeweighted miRNAmRNA pairs of all six edge weight formulas clustered by traditional hierarchical clustering algorithm are provided in Additional file 1
From Fig. 15, we can see that using the integrated mean value weight, the top 38 weighted edges are all green color, which means that in tumor all the three miRNAs inhibit the target mRNAs whereas in normal all these miRNAs are positively correlated with their target mRNAs. Larger weights are supposed to represent the bigger correlation inversion from normal to tumor. The same correlation change direction suggests that these miRNA and mRNAs are likely to be affected by the common causal factors and that the miRNAs playing suppressive roles to their target mRNAs is a characteristics of the cancer development.
Literature spotchecks of the most enriched miRNAs
To study function of the three most enriched miRNAs based on the integrated mean weight edge shown in Fig. 15, we looked up these miRNAs in the literature. We found that all three miRNAs are functionally related to cancer [33,34,35]. Their functions are listed in Table 5
Apply MWMM pipeline in BRCA
The whole MWMM pipeline first calculates edge weights from table of edge list of miRNAmRNA pairs with their expression T_CC and N_CC, exemplified in Table 1. The calculated edge weights of integrated mean value weight are exemplified in Table 2. The raw edge list contains 20,661 pairs of miRNA and mRNAs, including 312 unique miRNAs and 7874 unique mRNAs. This raw edge list is converted to raw bipartite graph matrix for the Hungarian algorithm to run on. The number of Hungarian algorithm iterations is 202 rounds. The merging process yields 312 K_{1,k} (S_{k}) star graphs, from which the blossom algorithm is repeatedly applied to match and merge clusters. The blossom algorithm run 8 times to eventually merge 312 starting star graphs or clusters into a single cluster. One single cluster doesn’t make sense for the purpose of clustering, but the clusters of last iteration of Hungarian algorithm and each round of blossom algorithm are output to communities objects in R programming language. Users can use mathematical and biological metrics to select clusters derived from Hungarian algorithm or from first several rounds of blossom algorithm to achieve the tradeoff between cluster size and cluster number.
Evaluation of six edge weight formulas by MWMM
To see the effects of six different edge weight formulas, ARI was used to compare the similarity of resultant clusters based on six weight edge formulas produced by MWMM pipeline. MWMM started with six different edge weight formulas and produced six communities structures of clusters, respectively. Communities is a structure in igraph package in R programming language to represent clusters. We compared similarity of two communities structures of clusters derived from every two different edge weight formulas using ARI. The communities structures of clusters produced by Hungarian algorithm and blossom 01 of the MWMM approach were shown as examples in Tables 6 and 7, respectively, to tell whether different edge weight formulas lead to different communities structures. From Tables 6 and 7, we found that the communities structure of clusters derived from every two edge weight formulas using Hungarian or blossom algorithm 01 were very similar, because their ARI values are in the same order of magnitude The overall similarity using blossom 01 is higher than that using Hungarian algorithm, perhaps because the blossom algorithm merge the clusters generated from Hungarian algorithm, and thereby the communities structures of clusters are more similar using blossom algorithm than these using Hungarian algorithm. Since ARI only compares clusters based on their topological structures, ignoring their edge weights, ARI is not a suitable metrics to select a good edge weight formula, whereas traditional hierarchical clustering algorithm did the job.
Mathematical metrics of MWMMderived clusters
A wellpartitioned cluster should have more weighted connections inside the cluster and fewer weighted connections to any other outside clusters, so that the members of the cluster are more similar or linked to each other than to the members of other outside clusters. This characteristics of clusters leads to the clustering metrics used in this study. We define three weight notations to describe the connections between the microRNA (emitter) and mRNA (receiver) within and across clusters to measure the inner weight and outer weight of clusters generated by the Hungarian algorithm and the blossom algorithm. The three definitions are listed in Table 8. The diagram of the calculation of inner and outer weights are portrayed in Fig. 16.
Accordingly, we propose two conditions to validate the mathematical significance of a candidate cluster C_{i}. Condition one specifies IW > E2MROW. Condition two specifies 2 × IW > E2MROW + R2MEOW. It is noteworthy that in the condition two the inner weight should be doubled, because the condition two’s outer weights measure the connections from miRNAs inside a cluster to mRNAs outside that cluster as well as the mRNAs inside that cluster to the miRNAs outside that cluster; correspondingly, the inner weight of a cluster should also be measured twice to characterize the connections from miRNAs inside a cluster to mRNAs inside that cluster as well as from mRNAs inside that cluster to miRNAs inside that cluster. The difference of meeting condition one and condition two may result from that condition one is based on only the miRNA side, whereas condition two is based on both miRNA and mRNA sides. Condition one is less stringent than condition two, because condition one only compares the inner and the outer connections to the miRNAs inside a cluster, whereas condition two compares the inner and the outer connections to both miRNAs and mRNAs inside a cluster. Therefore, condition one is easily met by fewer rounds of merging algorithm.
The whole MWMM procedure consists of continuous iterations of the Hungarian algorithm and several rounds of the Blossom algorithm. The MWMM procedure tries to merge existing clusters to generate new clusters that have greater inner weight than the outer weight, seen in Fig. 17. If we keep merging, eventually there will be one or several very dense clusters, which have small outer weight. Thereby, we need to make a trade–off between the sizes and density of the clusters. In other words, we want denser clusters with proper sizes.
From Fig. 17, we can see that continuous application of the Hungarian algorithm produces 312 star graphs. Afterwards, using eight rounds of application of Blossom algorithm. The 312 star graphs/clusters are merged to one cluster round by round. The merger effects are evaluated by the abovementioned condition one and condition two. We can see that as more rounds of merging algorithm are applied, the number of clusters first decreases dramatically and then tends to be stable; the clusters of different sizes satisfying condition one and/or condition two are produced by the MWMM procedures and the percent of clusters that meets the condition one (IW > E2MROW) and condition two (2 × IW > E2MROW + R2MEOW) gradually increases to 100% and becomes stable. The detailed metrics of the final clusters defined in Table 8 are provided in Additional file 2.
Kyoto encyclopedia of genes and genomes (KEGG) analysis of the clustering results
KEGG function analysis shows the biological significance of genes that are potentially regulated by miRNAs in the derived clusters. The biological factors enriched in the clusters provide a new viewpoint on how mRNA–miRNA pairs contribute to cancers. Functional analysis of genes in clusters is implemented using clusterProfiler, an R package for comparing biological themes among gene clusters [36]. For example, in Fig. 18, we analyze 312 clusters derived from the Hungarian algorithm result and get 22 clusters enriched in KEGG pathways with pvalueCutoff = 0.01 and qvalueCutoff = 0.05. For example, genes in 189th cluster are enriched in cell cycle, p53 signaling pathway, progesterone−mediated oocyte maturation, and oocyte meiosis, suggesting the theme of the genes in the cluster related to cancer. The members of 189th cluster (star graph) are visualized in Fig. 19, where the internal node miRNA hsamir379 is reported to be a tumor suppressor playing a role in inhibiting cell proliferation, migration, and invasion in breast cancer [38], cervical cancer [39], glioma [40], nonsmall cell lung cancer [41], bladder cancer [42], osteosarcoma [43], hepatocellular carcinoma [44], gastric cancer [45]. The genes interacting with hsamir379 in the cluster is worth further experimental exploration, for example, CCNB1, MCM4, CCNB2, and CDK1 that are involved in cell cycle.
Clustering method comparison
To see if our MWMM approach surpasses existing clustering methods, we need to conduct performance comparisons. Which clustering approaches are suitable for comparison? First, the MWMM method is a downstream analysis approach taking certain input format, an edge list with miRNA vertex name, mRNA vertex name, and their edge weight. The integrated mean value weight characterizes the correlation change in two conditions. Therefore, the clustering approaches that only consider one condition like MAGIA2 or miRMAP are not comparable to the MWMM. Second, other known clustering algorithm might not fit the data structure of bipartite graph in a form of edge list. For example, in a study using time course mRNA microarray data, a nonlinear primary component analysis (PCA) neural network was used to extract the feature vector that was afterwards fed into a probabilistic principal surfaces (PPS) model to find and visualize latent variables or clusters of genes that were afterwards merged by an agglomerative clustering algorithm based on negentropy information. This negentropy clustering (NEC) algorithm can automatically determine the cluster of numbers [46], so it is better than the traditional hierarchical clustering algorithm that needs subjective determination of the cluster number. However, this study concentrates in the miRNAmRNA interactions, in which a bipartite graph is constructed, so the clustering approaches like PPSNEC [46], kmeans [47], or WGCNA [48] that have been used to find gene expression “modules” or clusters are unsuitable for comparison. Third, the miRNAmRNA interaction bipartite graph is not a connected graph, and thereby, some clustering algorithms like minimum spanning tree cannot be applied. Considering the abovementioned constraints, we choose louvain, fast_greedy, walktrap, leading_eigen, label_propagation, and edge_betweenness to compare with the MWMM approach. Implementations of these clustering approaches are derived from igraph package in R programming language.
The biological validation would benefit from a systematic methodology in addition to literature spotchecks. Thereby, we biologically validate the derived clusters by calculating their average Gene Ontology (GO) term similarity distance scores. The GO similarity scores would give an idea of how the genes within a cluster or across clusters are functionally related or similar. Based on the definition of clustering, elements within a cluster are more similar or linked than the elements among clusters in some traits, for example, GO term similarity. Thereby, clusters identified by a good clustering algorithm should have higher intracluster GO similarity distance scores and lower intercluster GO similarity distance scores. In other words, the difference between intracluster GO similarity score and intercluster GO similarity score should be higher for a good clustering algorithm.
To compare and evaluate clusters generated by different clustering algorithms in BRCA, the GO similarity distance scores of genes in the clusters are calculated using GOSemSim, an R package for measuring semantic similarity among GO terms and gene products [49]. GO similarity distance score is calculated in three categories of GO terms: molecular function (MF) describing molecular activities of gene products, cellular component (CC) describing where gene products are active, and biological process (BP) describing pathways and larger processes made up of the activities of multiple gene products. From Figs. 20, 21, and 22, we can see that compared to other algorithms, the Hungarians or Blossom algorithm 01 have relatively higher intracluster similarity and relatively lower intercluster similarity in all three GO term categories. This result shows the advantage of MWMM approach over other approaches in biological meaning.
Besides comparison with different methods like Louvain, fast_greedy, walktrap, leading_eigen, label_propagation, and edge_betweenness algorithms in terms of GO terms, a mathematical comparison in term of strength of the connection inside the cluster and outside the clusters is also meaningful. So we calculated the inner weight and outer weight and conditions defined in Table 8 for the MWMM pipeline and other compared algorithms. Different algorithms produced different number of clusters in the resultant communities structures. To summarize, the IW, E2MROW, and R2MEOW of clusters produced using each algorithm were averaged, and the percent of how many clusters produced using each algorithm meet condition 01 or 02 were also calculated, respectively. The result summary is listed in Table 9. From Table 9, we can see that louvain, fast_greedy, and leading_eigen algorithms yielded clusters with the larger inner weights relative to outer weights and high percent of condition 01 and 02 satisfaction. By comparison, the Hungarian and blossom algorithms in the MWMM approach at the beginning did not produce clusters with the larger inner weights relative to outer weights and high percent of condition 01 and 02 satisfaction, however, as the merger process went on in the MWMM, the Hungarian and blossom algorithms in the MWMM approach gradually generated clusters with the larger inner weights relative to outer weights and high percent of condition 01 and 02 satisfaction. These phenomena comply with expectations, because all the clustering algorithms try to make clusters based on mathematical criteria, while clusters are defined as inner connections or similarities greater than the outer connections or similarities.
The running speed of different algorithms were compared by running on the same data set: the edge list of miRNA and mRNAs with integrated mean value edge weight from BRCA. The running time was recorded respectively and listed in Table 10. From Table 10, we can see that the labelpropagation, Louvain, fast_greedy, leading_eigen, and walkstrap algorithms are fast. Our hungagrian_blossom (MWMM) approach is acceptable. Edge_betweenness algorithm is slow.
Clustering algorithm validation on test data sets
The MWMM approach is developed using BRCA as training data set. Can this approach also applied to some test data sets and achieve similar clustering results in terms of mathematical cluster traits and biological meaning? To answer this question, we ran MWMM approach and other six foregoing clustering algorithms on other 14 cancer types: Bladder Urothelial Carcinoma (BLCA), Colon adenocarcinoma (COAD), Esophageal carcinoma (ESCA), Head and Neck squamous cell carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney renal clear cell carcinoma (KIRC), Kidney renal papillary cell carcinoma (KIRP), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Prostate adenocarcinoma (PRAD), Stomach adenocarcinoma (STAD), Thyroid carcinoma (THCA), and Uterine Corpus Endometrial Carcinoma (UCEC). Similar to BRCA, the input table of the 14 cancer types exemplified in Table 1 were derived from results of our previous study [18].
We find that similar to in BRCA the MWMM can also detect clusters that has internal weights greater than or equal to external weights in the test data sets of 14 cancer types. Graph of inner weights, outer weights, and cluster sizes of KIRP is drawn in Fig. 23 as an example. Graphs of other 13 cancer types are supplied in Additional file 3.
We also tried to find out whether the MWMM approach can cluster miRNAs and mRNAs such that the difference between intracluster and intercluster average GO similarity distance score is relatively larger compared to other algorithm results. The clustering algorithms that obtain the highest differences between intracluster and intercluster average GO similarity distance score in each GO term category in each cancer type are summarized in Table 11. We can see that in MWMM has the best GO metrics in terms of BP in 11 out of 15 cancer types, CC in 13 out of cancer types, and MF in 14 out of 15 cancer types. The results suggest that the MWMM are also effective in other cancer types, though it is not always the best. The supporting materials for Table 11 are provided in Additional file 4.
Discussion
There are some miRNAmRNA clustering studies, however, these studies did not focus on the expression correlation coefficient changes of miRNAmRNA pairs that are inverse from in normal to in tumor. The miRNAs and mRNAs can be clustered based on their expression correlation coefficient changes under the assumption that the changes are not random but caused by factors involved in cancer development. Hence, we tried to capture and cluster these miRNAmRNA interactions.
To simultaneously quantify the changes, we proposed integrated mean value weight that increases the contrast of values in the data as well as other five edge weight formula as comparison or control. Then the subjective traditional hierarchical clustering algorithm was used to evaluate the advantages of different edge weight formulas. After evaluation, integrated mean value weight was favored because it can produce more connected clusters at certain steps. We did not just use the traditional hierarchical clustering algorithm only to cluster miRNA and mRNA pairs in this study; instead, we only use it as a tool to evaluate the edge weight formulas. This is because traditional hierarchical clustering algorithm is subjective and thereby makes the researchers feel difficult to determine the cluster number. Furthermore, traditional hierarchical clustering algorithm only cluster the top miRNAmRNA pairs, and thereby it doesn’t reach a global optimal clustering that should also involve the low edge weight miRNAmRNA pairs. To get around these limitations, we proposed the maximum weighted merger method (MWMM) pipeline.
The MWMM pipeline includes continuous iterations of Hungarian algorithm and several rounds of blossom algorithm. MWMM pipeline passively clusters miRNAmRNA pairs using maximum weighted edge matching in the bipartite graph and general graph. Based on the GO similarity results, the Hungarian algorithm or blossom 01 can produce clusters that have a good tradeoff between the cluster size and GO similarity, compared to the other algorithms that produce several hugesized clusters along with some smallsized clusters. Functional enrichment analysis such as KEGG pathway and GO terms was performed to find out the underlying factors or themes from genes in each derived cluster. For example, genes involved in p53 signaling pathway and cell cycle pathways were successfully identified.
The effectiveness of MWMM was validated both mathematically and biologically. Mathematically, the MWMMderived clusters were analyzed with respect to their inner weights and outer weight. The percent of clusters that meet the condition one and two gradually increases as the MWMM merger process goes on. Eventually, all MWMMderived clusters have inner weights greater than their outer weight, namely, greater inside connection than outside connection. Biologically, MWMMderived clusters have intracluster’s average GO term similarity distance scores much larger than the intercluster’s, compared to other six algorithms. MWMM approach was also applied to other 14 cancer types and it can merge initial clusters to yield clusters that mostly keep the inner weights larger than or equal to the outer weight in other 14 cancer types. Biologically, the MWMM approach yields clusters that has relatively higher intracluster and relatively lower intercluster average GO term similarity distance scores compared to other six clustering algorithms in most of cancer types that are tested. This shows that the MWMM can also be applied to data sets other than BRCA.
In the future, more information could be integrated into MWMM pipeline. First, the expression fold change of miRNAs and mRNAs could also be considered into the edge weights of the miRNAmRNA interactions to see the relationship between the expression fold change and correlation coefficient change. Second, the current study is configured such that it only considers the inverse correlation coefficient change, namely from positive to negative or from negative to positive. It would be interesting to see whether from high positive to low positive or from high negative to low negative matters. Third, more filters could be applied to the clustering algorithm such as filtering out the smallest weight edges of miRNAmRNA pairs. Fourth, more underlying factors or themes of each derived clusters would be easier to be unraveled by considering other factors like gene mutations, transcription factors, long noncoding RNAs, other regulatory elements, etc. This needs incorporating literature studies and other formats of omics data.
Conclusions
In this study, the expression correlation coefficient changes of miRNAmRNA pairs that are inverse from in normal to in tumor were quantified by integrated mean value weight out of proposed six edge weight formulas. The integrated mean value weight was favored based on the evaluation of the subjective traditional hierarchical clustering algorithm. Then, a maximum weighted merger method (MWMM) approach combining the Hungarian algorithm and blossom algorithm was used to passively cluster the miRNAmRNA pairs using the maximum weighted edge matching in the bipartite graph and general graph. The resultant clusters can effectively capture and enrich cancerassociated miRNAmRNA pair candidates in different cancer types and achieve more biologically significant clusters than other existing, available algorithms such as Louvain, fast greedy, walktrap, leading eigen, label propagation, and edge betweenness algorithms. In the future study, it is worthwhile to investigate how to use the clustered miRNAs and mRNAs as candidate biomarkers for different cancer types, identify cancer driver genes, provide clues for targets of precision medicine in cancer, and develop new treatment strategies.
Availability of data and materials
The source codes supporting the conclusions of this article are available in the GitHub at https://github.com/BaiLab/MWMM.
Abbreviations
 ARI:

Adjusted Rand Index
 BLCA:

Bladder Urothelial Carcinoma
 BRCA:

Breast invasive carcinoma
 COAD:

Colon adenocarcinoma
 E2MROW:

Emitter to matched receiver outer weight
 ENCODE:

Encyclopedia of DNA Elements
 ESCA:

Esophageal carcinoma
 GO:

Gene ontology
 HNSC:

Head and Neck squamous cell carcinoma
 ICGC:

International Cancer Genome Consortium
 IW:

Inner weight
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
 KICH:

Kidney Chromophobe
 KIRC:

Kidney renal clear cell carcinoma
 KIRP:

Kidney renal papillary cell carcinoma
 LIHC:

Liver hepatocellular carcinoma
 LUAD:

Lung adenocarcinoma
 LUSC:

Lung squamous cell carcinoma
 MWMM:

Maximum weighted merger method
 N_CC:

The miRNAmRNA expressional correlation coefficients in normal.
 NEC:

Negentropy clustering
 NGS:

Next generation sequencing
 PPS:

Probabilistic principal surfaces
 PRAD:

Prostate adenocarcinoma
 R2MEOW:

Receiver to matched emitter outer weight
 STAD:

Stomach adenocarcinoma
 T_CC:

The miRNAmRNA expressional correlation coefficients in tumor
 TCGA:

The Cancer Genome Atlas
 THCA:

Thyroid carcinoma
 UCEC:

Uterine Corpus Endometrial Carcinoma
References
 1.
Zhang J, Baran J, Cros A, Guberman JM, Haider S, Hsu J, Liang Y, Rivkin E, Wang J, Whitty B, et al. International Cancer genome consortium data portal—a onestop shop for cancer genomics data. Database. 2011;2011:bar026.
 2.
ENCODE_Project_Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
 3.
Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1a):A68–77.
 4.
Shahab SW, Matyunina LV, Mezencev R, Walker LD, Bowen NJ, Benigno BB, McDonald JF. Evidence for the complexity of microRNAmediated regulation in ovarian cancer: a systems approach. PLoS One. 2011;6(7):e22508.
 5.
Saini HK, GriffithsJones S, Enright AJ. Genomic analysis of human microRNA transcripts. Proc Natl Acad Sci U S A. 2007;104(45):17719–24.
 6.
Rodriguez A, GriffithsJones S, Ashurst JL, Bradley A. Identification of mammalian microRNA host genes and transcription units. Genome Res. 2004;14(10a):1902–10.
 7.
Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN. MicroRNA genes are transcribed by RNA polymerase II. EMBO J. 2004;23(20):4051–60.
 8.
Cai X, Hagedorn CH, Cullen BR. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA. 2004;10(12):1957–66.
 9.
Melamed Z, Levy A, AshwalFluss R, LevMaor G, Mekahel K, Atias N, Gilad S, Sharan R, Levy C, Kadener S, et al. Alternative splicing regulates biogenesis of miRNAs located across exonintron junctions. Mol Cell. 2013;50(6):869–81.
 10.
Bryan K, Terrile M, Bray IM, DomingoFernandéz R, Watters KM, Koster J, Versteeg R, Stallings RL. Discovery and visualization of miRNA–mRNA functional modules within integrated data using bicluster analysis. Nucleic Acids Res. 2014;42(3):e17.
 11.
Winter J, Jung S, Keller S, Gregory RI, Diederichs S. Many roads to maturity: microRNA biogenesis pathways and their regulation. Nat Cell Biol. 2009;11(3):228–34.
 12.
Nunez YO, Truitt JM, Gorini G, Ponomareva ON, Blednov YA, Harris RA, Mayfield RD. Positively correlated miRNAmRNA regulatory networks in mouse frontal cortex during early stages of alcohol dependence. BMC Genomics. 2013;14(1):1–21.
 13.
Karapetyan AR, Buiting C, Kuiper RA, Coolen MW. Regulatory roles for long ncRNA and mRNA. Cancers (Basel). 2013;5(2):462–90.
 14.
Jansson MD, Lund AH. MicroRNA and cancer. Mol Oncol. 2012;6(6):590–610.
 15.
Li P, Sheng C, Huang L, Zhang H, Huang L, Cheng Z, Zhu Q. MiR183/−96/−182 cluster is upregulated in most breast cancers and increases cell proliferation and migration. Breast Cancer Res. 2014;16(6):1–17.
 16.
Miles GD, Seiler M, Rodriguez L, Rajagopal G, Bhanot G. Identifying microRNA/mRNA dysregulations in ovarian cancer. BMC Res Notes. 2012;5(1):1–10.
 17.
da Silveira W, Renaud L, Simpson J, Glen W, Hazard E, Chung D, Hardiman G. miRmapper: a tool for interpretation of miRNA–mRNA interaction networks. Genes. 2018;9(9):458.
 18.
Bai Y, Ding L, Baker S, Bai JM, Rath E, Jiang F, Wu J, Jiang H, Stuart G. Dissecting the biological relationship between TCGA miRNA and mRNA sequencing data using MMiRNAviewer. BMC Bioinformatics. 2016;17(13):336.
 19.
Bisognin A, Sales G, Coppe A, Bortoluzzi S, Romualdi C. MAGIA (2): from miRNA and genes expression data integrative analysis to microRNA–transcription factor mixed regulatory circuits (2012 update). Nucleic Acids Res. 2012;40(Web Server issue:W13–21.
 20.
Liu Y, Baker S, Jiang H, Stuart G, Bai Y. Correlating bladder cancer risk genes with their targeting microRNAs using MMiRNAtar. Genomics Proteomics Bioinformatics. 2015;13(3):177–82.
 21.
Oulas A, Karathanasis N, Louloupi A, Iliopoulos I, Kalantidis K, Poirazi P. A new microRNA target prediction tool identifies a novel interaction of a putative miRNA with CCND2. RNA Biol. 2012;9(9):1196–207.
 22.
Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120(1):15–20.
 23.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. Human MicroRNA targets. PLoS Biol. 2004;2(11):e363.
 24.
Hunter DJ. Thinking through applications. In: Essentials of Discrete Mathematics, vol. 396. 3rd ed. Burlington: Jones & Bartlett Learning, LLC; 2015.
 25.
Csárdi G, Nepusz T. The igraph software package for complex network research. InterJournal Complex Syst. 2006;5:1695.
 26.
Boulle M. Compact mathematical formulation for graph partitioning. Optim Eng. 2004;5(3):315–33.
 27.
Qi X, Tang W, Wu Y, Guo G, Fuller E, Zhang CQ. Optimal local community detection in social networks based on density drop of subgraphs. Pattern Recog Lett. 2014;36:46–53.
 28.
Kuhn HW. The Hungarian method for the assignment problem. Nav Res Logist. 1955;2(1–2):83–97.
 29.
Edmonds J. Paths, trees, and flowers. Can J Math. 1965;17(3):449–67.
 30.
Hornik K. A CLUE for CLUster Ensembles. J Stat Softw. 2005;14(12):1–25.
 31.
Hagberg A, Schult D, Swart P. Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th Python in Science Conference (SciPy 2008). Los Alamos: Los Alamos National Lab.(LANL); 2008. p. 11–5.
 32.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikitlearn: machine learning in Python. J Mach Learn Res. 2011;12(Oct):2825–30.
 33.
Zeng M, Zhu L, Li L, Kang C. miR378 suppresses the proliferation, migration and invasion of colon cancer cells by inhibiting SDAD1. Cell Mol Biol Lett. 2017;22(1):12.
 34.
He H, Tian W, Chen H, Jiang K. MiR944 functions as a novel oncogene and regulates the chemoresistance in breast cancer. Tumour Biol. 2016;37(2):1599–607.
 35.
Ma W, Ma CN, Li XD, Zhang YJ. Examining the effect of gene reduction in miR95 and enhanced radiosensitivity in nonsmall cell lung cancer. Cancer Gene Ther. 2016;23(2–3):66–71.
 36.
Yu G, Wang L, Han Y, He Q. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.
 37.
Yu G, Wang L, Yan G, He Q. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics. 2015;31(4):608–9.
 38.
Khan S, Brougham CL, Ryan J, Sahrudin A, O'Neill G, Wall D, Curran C, Newell J, Kerin MJ, Dwyer RM. miR379 regulates cyclin B1 expression and is decreased in breast cancer. PLoS One. 2013;8(7):e68753.
 39.
Shi X, Xiao X, Yuan N, Zhang S, Yuan F, Wang X. MicroRNA379 suppresses cervical Cancer cell proliferation and invasion by directly targeting Vcrk avian sarcoma virus CT10 oncogene homologlike (CRKL). Oncol Res. 2018;26(7):987–96.
 40.
Li L, Zhang H. MicroRNA379 inhibits cell proliferation and invasion in glioma via targeting metadherin and regulating PTEN/AKT pathway. Mol Med Report. 2018;17(3):4049–56.
 41.
Zhou F, Nie L, Feng D, Guo S, Luo R. MicroRNA379 acts as a tumor suppressor in nonsmall cell lung cancer by targeting the IGF1Rmediated AKT and ERK pathways. Oncol Rep. 2017;38(3):1857–66.
 42.
Wu D, Niu X, Tao J, Li P, Lu Q, Xu A, Chen W, Wang Z. MicroRNA3795p plays a tumorsuppressive role in human bladder cancer growth and metastasis by directly targeting MDM2. Oncol Rep. 2017;37(6):3502–8.
 43.
Xie X, Li YS, Xiao WF, Deng ZH, He HB, Liu Q, Luo W. MicroRNA379 inhibits the proliferation, migration and invasion of human osteosarcoma cells by targetting EIF4G2. Biosci Rep. 2017;37(3):BSR20160542.
 44.
Chen JS, Li HS, Huang JQ, Dong SH, Huang ZJ, Yi W, Zhan GF, Feng JT, Sun JC, Huang XH. MicroRNA3795p inhibits tumor invasion and metastasis by targeting FAK/AKT signaling in hepatocellular carcinoma. Cancer Lett. 2016;375(1):73–83.
 45.
Xu M, Qin S, Cao F, Ding S, Li M. MicroRNA379 inhibits metastasis and epithelialmesenchymal transition via targeting FAK/AKT signaling in gastric cancer. Int J Oncol. 2017;51(3):867–76.
 46.
Amato R, Ciaramella A, Deniskina N, Mondo CD, di Bernardo D, Donalek C, Longo G, Mangano G, Miele G, Raiconi G, et al. A multistep approach to time series analysis and gene expression clustering. Bioinformatics. 2006;22(5):589–96.
 47.
MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics: 1967. Berkeley: University of California Press; 1967. p. 281–97.
 48.
Zhang B, Horvath S. A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol. 2005;8. https://doi.org/10.2202/15446115.1128.
 49.
Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8.
Acknowledgements
The results published here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/. We thank Cheng Zhao and Peng Zhao for his input for the initial design of main algorithms, Vincenzo Isaia for technical writing assistance, Hegui Zhu and Qingsong Tang for the Matlab codes, Hui Jiang for statistical advice, Xinqing Dai for assisting to write some scripts, Tao Chen for helping plotting graphs. We also thank other members in the Bai lab at Indiana State University. The authors would like to thank Department of Internal Medicine and Health Information Technology & Services at University of Michigan for their support.
Funding
This research was supported by senior research grant funds from the Indiana Academy of Sciences to YB, startup funds from Indiana State University to YB, and Department of Internal Medicine at University of Michigan Medical School. The authors thank The Center for Genomic Advocacy (TCGA) and the Department of Mathematics and Computer Science at Indiana State University for computing servers. The funders had no role in the study design, data collection and analysis and interpretation, decision to publish, or preparation of the manuscript.
Author information
Affiliations
Contributions
LD wrote and revised the manuscript, wrote the codes, and drew diagrams and graphs. ZF wrote codes and plotted graphs and revised the manuscript. YB designed and guided the project and finalized the revision. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not available
Consent for publication
Not applicable.
Competing interests
Author Yongsheng Bai is the editorial board member for BMC Medical Genomics. All other authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional files 1:
Description of data: top 38 edgeweighted miRNAmRNA pairs of all six edge weight formulas clustered by traditional hierarchical clustering algorithm are shown in the graphs. (ZIP 39 kb)
Additional files 2:
Description of data: inner weight, emitter to matched receiver outer weight, receiver to matched emitter outer weight, condition 01, and condition 02 of each cluster derived from a specific algorithm in the MWMM approach. (ZIP 18 kb)
Additional files 4:
Description of data: average GO similarity distance scores of intracluster, intercluster, and difference between intracluster and intercluster in each algorithm in each cancer type. (ZIP 8 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ding, L., Feng, Z. & Bai, Y. Clustering analysis of microRNA and mRNA expression data from TCGA using maximum edgeweighted matching algorithms. BMC Med Genomics 12, 117 (2019). https://doi.org/10.1186/s129200190562z
Received:
Accepted:
Published:
Keywords
 Cancer
 miRNAs and mRNAs
 Gene regulation
 BRCA
 TCGA
 Bipartite graph
 Graph partitioning
 Hungarian algorithm
 Blossom algorithm
 Clustering