Stable solution to ℓ _{2,1}-based robust inductive matrix completion and its application in linking long noncoding RNAs to human diseases
BMC Medical Genomics volume 10, Article number: 77 (2017)
Abstract
Background
A large number of long intergenic noncoding RNAs (lincRNAs) are linked to a broad spectrum of human diseases, but the disease associations of many other lincRNAs remain a puzzle. Validating such links through biological experiments is expensive. However, a plethora of lincRNA data is now available, thanks to High Throughput Sequencing (HTS) platforms, Genome Wide Association Studies (GWAS), etc., which opens the opportunity for cutting-edge machine learning and data mining approaches to extract meaningful relationships between lincRNAs and diseases. Nevertheless, only a few in silico lincRNA-disease association inference tools are available to date, and none of them utilizes side information about both entities simultaneously in a single framework.
Methods
The recently developed Inductive Matrix Completion (IMC) technique provides a recommendation platform between two entities that takes side information about each of them into account. However, the standard IMC formulation cannot handle noise and outliers that may be present in the datasets, and it does not account for data sparsity. Thus, a robust version of IMC is needed that can resolve both issues. As a remedy, in this paper we propose Stable Robust Inductive Matrix Completion (SRIMC), which uses ℓ _{2,1} norm based regularization to optimize the objective function with a unique two-step stable solution approach.
Results
We applied SRIMC to the available association data between human lincRNAs and OMIM disease phenotypes, together with a diverse set of side information about the lincRNAs and the diseases. The method performs better than the state-of-the-art methods in terms of precision@k and recall@k for top-k disease prioritization for the subject lincRNAs. We also demonstrate that SRIMC is equally effective for queries about novel lincRNAs, as well as for predicting the rank of a newly known disease for a set of well-characterized lincRNAs.
Conclusions
With the experimental results and computational evaluation, we show that SRIMC is robust in handling datasets with noise and outliers as well as dealing with novel lincRNAs and disease phenotypes.
Background
LincRNA-disease association inference problem
Surprisingly, only 2% of the entire human genome codes for proteins [1]. In recent years, it has become evident that the non-protein-coding portion of the genome is of critical functional importance, especially the long intergenic noncoding RNAs (lincRNAs), which are more than 200 bases long and do not overlap any annotated protein-coding region. These lincRNAs demonstrate diverse molecular mechanisms and are implicated in various human diseases [2]. With the advent of high-throughput genomic technologies, a large number of lincRNAs have been cataloged [3]. However, fully annotating the functions of the lincRNAs and their involvement in human diseases remains a challenge for researchers. A machine learning algorithm that ranks the disease implications of a given lincRNA based on prior knowledge would help the community tackle this challenge.
Limitations of existing algorithms
Several long noncoding RNA (lncRNA)-disease association inference tools have been developed in the past few years, but only a small number of them actually solve the lincRNA-disease inference problem. Due to the complexity of the relationship and of the available datasets, only a small number of experimentally validated associations have been reported in the lncRNA-disease database [4]. Therefore, using multiple complementary data sources in the algorithm is important for predicting potential lincRNA-disease associations. For example, LRLSLDA [5], KRWRH [6], and TslncRNAdisease [7] belong to a family of network based association identification methods. Each of these algorithms uses biological networks, such as a lincRNA similarity network and a disease similarity network, to develop its prediction model. Using the model, they infer lincRNA-disease connections either by a random walk procedure on a derived biological network or by computing a similarity measure between nodes with known disease implications. The association inference problem can also be tackled with matrix completion based algorithms; Nonnegative Matrix Factorization (NMF) belongs to this family of solution strategies. However, it suffers from the cold-start problem: it cannot predict diseases for novel lincRNAs, or lincRNAs for novel diseases. Furthermore, these methods were presented on a very small set of associations and developed without considering scalability (e.g., they cover around 200 lncRNAs, whereas the more than 8000 lincRNAs cataloged to date by [3] remain overlooked). Moreover, the methods that use lincRNA expression profiles to build similarity networks deal with only a small number of disease classes, so they fall short in generalizing their predictions to identify novel disease-lincRNA connections.
A plethora of side information about lincRNAs and disease phenotypes is available, and the data is growing every day. Inductive Matrix Completion (IMC) based algorithms utilize side information about both the lincRNAs and the diseases, along with the known association evidence, to predict missing associations [8]. However, the standard IMC uses the least squares error function, which is known to be unstable in the presence of noise and outliers [9]. A stable and robust version of IMC is thus needed for this problem.
Outline of our proposed approach
We propose a novel stable robust formulation of IMC using an ℓ _{2,1} norm based error function as well as an ℓ _{2,1} based regularizer. We call the proposed method “robust” because it can handle noise better than the standard IMC. We call the method “stable” because it uses a two-step stable strategy to solve the problem.
Summary of contributions

We first describe a Robust IMC (RIMC) approach that introduces ℓ _{2,1} norms in both its penalty function and its regularization. We then propose Stable Robust IMC (SRIMC), which can handle outliers and noise in the dataset as well as joint sparsity. The solution strategy breaks the problem into two separate and independent subproblems, each of which has a stable solution and is easy to compute. Hence, in terms of computational complexity and reliability, SRIMC should be the better option.

We apply our RIMC and SRIMC methods to the lincRNA-disease association inference problem. We show that RIMC and SRIMC can perform induction to decipher associations between a novel disease and a novel lincRNA that were not provided during the learning phase, based on the side information we have about them. This is unlike the traditional matrix factorization methods and network-based inference methods discussed earlier, which are transductive in nature.

We demonstrate that integrating diverse features of the lincRNAs and the diseases, available through public data servers, can overcome the poor predictive performance of inference tools that is caused by the extreme sparsity inherent in the lincRNA-disease association dataset.

We present a comparison of our proposed RIMC and SRIMC with standard IMC as well as the state-of-the-art lincRNA-disease association methods.
The rest of the paper is organized as follows. In the “Methods” section we propose the robust IMC formulation using the ℓ _{2,1} norm and highlight the advantages of the proposed algorithm compared with the standard IMC and standard NMF approaches. There we also show the correctness of the proposed algorithm. In the “Results” section we present the experimental setup and the dataset. In the “Discussions” section we present the results of the experiments and a comparative study of the performance of the proposed algorithm against the baseline methods. Finally, in the “Conclusions” section we conclude the paper. A preliminary version of this work has been reported in [10].
Methods
Stable robust inductive matrix completion (SRIMC) strategy
In this section we review the standard Inductive Matrix Completion method, then present our robust IMC (RIMC) formulation, and finally present the Stable Robust IMC (SRIMC) algorithm. We then provide a computational algorithm for the proposed method along with a proof of its correctness.
Review on standard IMC
The Inductive Matrix Completion approach [8] includes side information about both the row and the column entities. The formulation addresses the cold-start problem that arises in transductive setups (e.g., standard NMF). Therefore, we can predict associations between new entities that are not included in the data matrix available at training time. Consider an association matrix \(A\in \mathbb {R}^{M\times N}\) denoting the links between M row entities and N column entities. The side information about the row and column entities is encapsulated in two matrices \(X\in \mathbb {R}^{M\times m}\) and \(Y\in \mathbb {R}^{N\times n}\), containing m features of the M row entities and n features of the N column entities, respectively. Equation 1 defines the objective function of the standard IMC.
where λ _{1} and λ _{2} are the regularization parameters that balance the accrued loss on the observed entries against the trace norm regularization constraints. Here, an entry A _{ i j } is modeled as \(\mathbf {x}^{T}_{i}Z\mathbf {y}_{j}\), where \(Z\in \mathbb {R}^{m\times n}\) is a low-rank matrix to be recovered by solving Eq. 1. It is solved in such a way that Z becomes the product of two factor matrices W and H, that is, W H ^{T}, where \(W\in \mathbb {R}^{m\times r}\) and \(H\in \mathbb {R}^{n\times r}\). Equation 1 can be easily solved using Algorithm 1.
After Algorithm 1 returns, we obtain the two factor matrices W and H. These two matrices can be used to compute missing association scores between the row and the column entities. They can also provide a prediction score for an association between a known row entity and a new column entity, between a known column entity and a new row entity, or between a new row entity and a new column entity.
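As an illustration of how the factor matrices are used for prediction, the following minimal NumPy sketch scores known and previously unseen entities with the bilinear model \(\mathbf {x}^{T}_{i}WH^{T}\mathbf {y}_{j}\). The function name `imc_scores`, the random factors, and the dimensions are assumptions made for the example; in practice W and H would come from Algorithm 1.

```python
import numpy as np

def imc_scores(X, W, H, Y):
    """Score every (row entity, column entity) pair as x_i^T W H^T y_j."""
    return X @ W @ H.T @ Y.T

rng = np.random.default_rng(0)
m, n, r = 4, 3, 2                      # feature dimensions and rank (illustrative)
W = rng.random((m, r))                 # stand-in for the learned row factor
H = rng.random((n, r))                 # stand-in for the learned column factor

X_known = rng.random((5, m))           # features of 5 known row entities
Y_known = rng.random((6, n))           # features of 6 known column entities
scores = imc_scores(X_known, W, H, Y_known)    # shape (5, 6)

# Induction: entities unseen during training only need feature vectors.
x_new = rng.random((1, m))
y_new = rng.random((1, n))
new_score = imc_scores(x_new, W, H, y_new)     # shape (1, 1)
```

Because the score depends on an entity only through its feature vector, no retraining is needed to rank a new row or column entity.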
Robust IMC (RIMC) formulation
One limitation of the standard IMC is that it is sensitive to outliers in the given dataset. Given \(A\in \mathbb {R}^{M\times N}, X\in \mathbb {R}^{M\times m}, Y\in \mathbb {R}^{N\times n}\), the loss function of the standard IMC is:
Here, a squared residual error accumulates in each iteration of the optimization, meaning that even a few outliers may result in a large error. Another shortcoming of the standard IMC is that it cannot handle joint sparsity across the feature data matrices X and Y. Therefore, a solution to each of these limitations is needed. The initial hypothesis of RIMC was presented in [10]. Instead of the ℓ _{2} norm based loss function, the robust IMC uses the ℓ _{2,1} norm to define the loss function, which is:
Because the errors are not squared, this approach handles outliers much better than standard IMC based approaches. The generalized objective function of the RIMC can be stated as:
Here, we have several options for the regularization function R(·), such as \(R_{1}(B) = \|B\|^{2}_{F}\), \(R_{2}(B) = \sum _{i=1}^{M}\|B_{i,:}\|_{1}\), \(R_{3}(B) = \sum _{i=1}^{M}\|B_{i,:}\|^{0}_{2}\) and \(R_{4}(B) = \sum _{i=1}^{M}\|B_{i,:}\|_{2}\). Here, R _{1}(·) is the ridge regularization and is adopted in the standard IMC formulation. R _{2}(·) is the LASSO regularization, which is convex but nonsmooth. R _{3}(·) involves the ℓ _{0} norm and is the most desirable [11], but it is nonconvex and difficult to optimize. R _{4}(·) employs the ℓ _{2,1} norm. R _{4}(·) was chosen because the function is convex and an objective function involving this kind of regularizer can be optimized easily [12].
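The contrast between the squared ℓ _{2} loss and the ℓ _{2,1} loss discussed above can be illustrated numerically. The sketch below is only an example on a made-up residual matrix; the helper names `l21_norm` and `squared_frobenius` are hypothetical.

```python
import numpy as np

def l21_norm(B):
    """l2,1 norm: sum over rows of the Euclidean norm of each row."""
    return np.sqrt((B ** 2).sum(axis=1)).sum()

def squared_frobenius(B):
    """Squared Frobenius norm: sum of all squared entries."""
    return (B ** 2).sum()

residual = np.ones((4, 4))       # a residual matrix without outliers
outlier = residual.copy()
outlier[0, :] = 10.0             # corrupt one row with large errors

clean_l21 = l21_norm(residual)            # 4 rows * sqrt(4) = 8
clean_fro = squared_frobenius(residual)   # 16 entries * 1 = 16
out_l21 = l21_norm(outlier)               # sqrt(400) + 3 * 2 = 26
out_fro = squared_frobenius(outlier)      # 400 + 12 = 412

# Growth caused by the single outlying row under each loss.
l21_growth = out_l21 / clean_l21          # 3.25
fro_growth = out_fro / clean_fro          # 25.75
```

In this toy case the corrupted row inflates the squared loss by a factor of 25.75 but the ℓ _{2,1} loss by only 3.25, which is the sense in which unsquared row errors are less dominated by outliers.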
Thus given the data matrices A,X,Y, in this paper we optimize the following robust IMC formulation:
Algorithm for RIMC (version 1)
In order to solve Eq. 7, Algorithm 2 can be adapted [10].
Correctness of the RIMC Algorithm (version 1)
Theorem 1
At convergence, the converged solution W ^{∗} of the updating rule in Algorithm 2 satisfies the KKT condition.
Proof
The KKT condition for W with constraints W _{ i k }≥0, with i=1,⋯,m; k=1,⋯,r is:
Now, the partial derivative is
where \(e_{s} = (1,\cdots, 1)^{T}\in \mathbb {R}^{s}\) is a vector with all 1s. Also, \(D, P \in \mathbb {R}^{m\times m}\) are the two diagonal matrices with the diagonal elements given by:
Now, let us continue from Eq. 9:
Thus, the KKT condition for W is:
But, once W converges (according to Algorithm 2), the converged solution (W ^{∗}) satisfies
which can be written as
This is identical to the KKT condition for W derived above. Thus, the converged solution W ^{∗} satisfies the KKT condition. □
Theorem 2
At convergence, the converged solution H ^{∗} of the updating rule in Algorithm 2 satisfies the KKT condition.
Proof
The KKT condition for H with constraints H _{ j k }≥0, with j=1,⋯,n; k=1,⋯,r is:
Now, the partial derivative is
where D is already defined in Eq. 10, and \(Q \in \mathbb {R}^{n\times n}\) is a diagonal matrix with the diagonal elements given by:
Now, let us continue from Eq. 15:
Thus, the KKT condition for H is:
But, once H converges (according to Algorithm 2), the converged solution (H ^{∗}) satisfies
which can be written as
This is identical to the KKT condition for H derived above. Thus, the converged solution H ^{∗} satisfies the KKT condition. □
Algorithm for RIMC (version 2)
We can also solve the robust IMC optimization problem (Eq. 7) without the use of the e vectors. It is demonstrated in Algorithm 3.
Convergence of the RIMC Algorithm (version 2)
Here, we present the proof of the convergence of Algorithm 3.
Theorem 3
Algorithm 3 will monotonically decrease the objective function of the problem (Eq. 7) in each iteration and converge to the global optimum of the problem.
The theorem can be restated as the following two statements:

(A) When H is updated using the H update equation in Algorithm 3 while W is fixed, the objective function of the problem (Eq. 7) decreases monotonically.

(B) When W is updated using the W update equation in Algorithm 3 while H is fixed, the objective function of the problem (Eq. 7) decreases monotonically.
Proof
We prove statements (A) and (B) of Theorem 3 separately in the following two sections. □
Proof of Theorem 3(A): Updating of H
Proof
We now focus on proving Theorem 3(A). The proof requires the following two lemmas (Lemmas 4 and 5). □
Lemma 4
Let H ^{(t)} be the value of H at the t ^{th} iteration, and let H ^{(t+1)} be the value obtained from the next iteration. Then, under the H update rule in Algorithm 3, the following inequality holds.
where \(D_{ii} = 1 \bigg /\sqrt {\sum _{j=1}^{N} \left (A - XW{H^{(t)}}^{T}Y^{T}\right)^{2}_{ij}}\) and \(Q_{ii} = 1 \bigg /\sqrt {\sum _{j=1}^{r} \left (H^{(t)}_{ij}\right)^{2}}\).
The proof of Lemma 4 is given in section Proof of Lemma 4.
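The diagonal matrices D and Q in Lemma 4 are row-wise reweighting matrices: each diagonal entry is the reciprocal of the Euclidean norm of one row. The following sketch, on random data, computes them and checks the identity that makes them useful, namely that the ℓ _{2,1} norm of a matrix R equals tr(R ^{T} D R) when D is built from R itself. The helper `row_reweighting` and the small `eps` guard against zero rows are assumptions of this example, not part of Algorithm 3.

```python
import numpy as np

def row_reweighting(B, eps=1e-12):
    """Diagonal matrix with entries 1 / ||B_{i,:}||_2 (eps is an assumed guard)."""
    norms = np.sqrt((B ** 2).sum(axis=1))
    return np.diag(1.0 / np.maximum(norms, eps))

rng = np.random.default_rng(1)
M, N, m, n, r = 6, 5, 4, 3, 2          # illustrative sizes
A = rng.random((M, N))
X = rng.random((M, m))
Y = rng.random((N, n))
W = rng.random((m, r))
H = rng.random((n, r))

R = A - X @ W @ H.T @ Y.T              # residual matrix, one row per row entity
D = row_reweighting(R)                 # weights from the residual rows
Q = row_reweighting(H)                 # weights from the rows of H

l21_of_R = np.sqrt((R ** 2).sum(axis=1)).sum()
trace_form = np.trace(R.T @ D @ R)     # equals the l2,1 norm of R
```

With D held fixed at the current iterate, the nonsmooth ℓ _{2,1} terms become weighted quadratic terms, which is what the trace expressions in the proofs exploit.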
Lemma 5
Under the H update rule in Algorithm 3, the following inequality holds:
where D,P,Q matrices are defined earlier.
The proof of Lemma 5 is given in section Proof of Lemma 5.
Now consider the right-hand side of the inequality in Eq. 21: according to Lemma 4, its value is negative or zero. This completes the proof that the objective function of Eq. 7 decreases monotonically.
Proof of Theorem 3(B): updating of W
Proof
We now focus on proving Theorem 3(B). The proof requires the following two lemmas (Lemmas 6 and 7). □
Lemma 6
Let W ^{(t)} be the value of W at the t ^{th} iteration, and let W ^{(t+1)} be the value obtained from the next iteration. Then, under the W update rule in Algorithm 3, the following inequality holds.
where D, P, Q are defined earlier.
Proof of Lemma 6 is provided in section Proof of Lemma 6.
Lemma 7
Under the W update rule in Algorithm 3, the following inequality holds:
where D,P,Q matrices are defined earlier.
Proof of Lemma 7 is provided in section Proof of Lemma 7.
Now consider the right-hand side of the inequality in Eq. 23: according to Lemma 6, its value is negative or zero. This completes the proof that the objective function of Eq. 7 decreases monotonically.
Proof of Lemma 4
Proof
We can rewrite Eq. 20 as follows:
where
According to the statement of Lemma 4, J(H) decreases monotonically under the H update rule in Algorithm 3. To prove the statement, we follow the approaches based on auxiliary functions [13, 14]. □
Definition 1
G(H,H ^{′}) is an auxiliary function for the function J(H) if G(H,H ^{′})≥J(H) for all H ^{′} and G(H,H)=J(H).
Now, we define:
So, we have
This proves that J(H ^{(t)}) is monotonically decreasing.
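In its usual form [13], the auxiliary-function argument defines the next iterate as the minimizer of G and then chains the two properties of Definition 1. The block below is a sketch of that standard pattern, written under the assumption that G is an auxiliary function for J; it is not necessarily the exact notation of the equations in this section.

```latex
H^{(t+1)} = \arg\min_{H} G\bigl(H, H^{(t)}\bigr),
\qquad
J\bigl(H^{(t)}\bigr)
  = G\bigl(H^{(t)}, H^{(t)}\bigr)
  \ge G\bigl(H^{(t+1)}, H^{(t)}\bigr)
  \ge J\bigl(H^{(t+1)}\bigr).
```

The equality uses G(H,H)=J(H), the first inequality uses the fact that H ^{(t+1)} minimizes G(·,H ^{(t)}), and the last inequality uses G(H,H ^{′})≥J(H).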
The important steps in the remainder of the proof are: (a) determine a proper auxiliary function, and (b) find the global minimum of the auxiliary function.
Lemma 8
The function
is an auxiliary function for J.
Proof
Now J(H) of Eq. 25 can be rewritten as:
Now we will be applying the following inequality of matrices according to the investigations by [14, 15]:
where, Λ,B,H are nonnegative matrices, and Λ,B are symmetric matrices. And obviously the equality holds in Eq. 28 when H=H ^{′}.
If we make the substitutions Λ=Y ^{T} Y, B=W ^{T} X ^{T} D X W, H=H, H ^{′}=H ^{′} in Eq. 28, we see that the fifth term of Eq. 27 is smaller than or equal to the fifth term of Eq. 26, with equality when H=H ^{′}. Thus G(H,H ^{′}) in Eq. 26 is an auxiliary function of J(H). □
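The matrix inequality used here can be checked numerically. The sketch below assumes that it has the form given in the cited work on convex and semi-nonnegative matrix factorization, that is, \(\sum _{i,k} (\Lambda H' B)_{ik} H_{ik}^{2} / H'_{ik} \ge \mathrm {tr}(H^{T}\Lambda HB)\) for nonnegative matrices with Λ and B symmetric. The random matrices and the function names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 5, 3

S = rng.random((n, n))
Lam = S + S.T                        # symmetric, nonnegative (plays the role of Y^T Y)
T = rng.random((r, r))
B = T + T.T                          # symmetric, nonnegative (plays the role of W^T X^T D X W)
H = rng.random((n, r)) + 0.1         # nonnegative current variable
H_prime = rng.random((n, r)) + 0.1   # strictly positive reference point

def upper_bound(H, H_prime):
    """Sum over entries of (Lam H' B)_ik * H_ik^2 / H'_ik."""
    return ((Lam @ H_prime @ B) * H ** 2 / H_prime).sum()

def trace_term(H):
    """tr(H^T Lam H B), the quadratic term being majorized."""
    return np.trace(H.T @ Lam @ H @ B)

gap = upper_bound(H, H_prime) - trace_term(H)     # nonnegative
gap_at_equal = upper_bound(H, H) - trace_term(H)  # zero up to rounding
```

The gap vanishes when the two arguments coincide, which is exactly the property G(H,H)=J(H) required of an auxiliary function.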
Now, we need to find the global minimum of Eq. 26. Let f(H)=G(H,H ^{′}). The gradient of f(H) is
However, the second order derivative (i.e., the Hessian matrix) would be
The Hessian matrix (Eq. 30) is positive semidefinite, implying that f(H)=G(H,H ^{′}) is a convex function. Thus, there exists a unique global minimum for f(H). The global minimum can be obtained by setting the gradient of f(H) to zero and solving for H. Thus from Eq. 29 we get
By setting H ^{(t+1)}=H and H ^{(t)}=H ^{′}, we obtain the H update rule in Algorithm 3. Therefore, under this rule, the objective function J(H) of Eq. 25 decreases monotonically, which completes the proof.
Proof of Lemma 5
Proof
We know that,
Similarly, we can see that
Then, the righthand side (r.h.s) of Eq. 21 becomes
And, the lefthand side (l.h.s) of Eq. 21 becomes
Now, we compute the difference between the l.h.s and r.h.s,
The above inequality holds because D and Q are nonnegative matrices, and the sum of nonpositive numbers is always nonpositive. This completes the proof. □
Proof of Lemma 6
Proof
We can rewrite Eq. 22 as follows:
where
According to the statement of Lemma 6, J(W) decreases monotonically under the W update rule in Algorithm 3. To prove the statement, we follow the approaches based on auxiliary functions [13, 14]. □
Definition 2
G(W,W ^{′}) is an auxiliary function for the function J(W) if G(W,W ^{′})≥J(W) for all W ^{′} and G(W,W)=J(W).
Now, we define:
So, we have
This proves that J(W ^{(t)}) is monotonically decreasing.
The important steps in the remainder of the proof are: (a) determine a proper auxiliary function, and (b) find the global minimum of the auxiliary function.
Lemma 9
The function
is an auxiliary function for J.
Proof
Now J(W) of Eq. 41 can be rewritten as:
Now we will be applying the following inequality of matrices according to the investigations by [14, 15]:
where, Λ,B,W are nonnegative matrices, and Λ,B are symmetric matrices. And obviously the equality holds in Eq. 36 when W=W ^{′}.
If we make the substitutions Λ=X ^{T} D X, B=H ^{T} Y ^{T} Y H, W=W, W ^{′}=W ^{′} in Eq. 36, we see that the fifth term of Eq. 35 is smaller than or equal to the fifth term of Eq. 34, with equality when W=W ^{′}. Thus G(W,W ^{′}) in Eq. 34 is an auxiliary function of J(W). □
Now, we need to find the global minimum of Eq. 34. Let f(W)=G(W,W ^{′}). The gradient of f(W) is
However, the second order derivative (i.e., the Hessian matrix) would be
The Hessian matrix (Eq. 38) is positive semidefinite, implying that f(W)=G(W,W ^{′}) is a convex function. Thus, there exists a unique global minimum for f(W). The global minimum can be obtained by setting the gradient of f(W) to zero and solving for W. Thus from Eq. 37 we get
By setting W ^{(t+1)}=W and W ^{(t)}=W ^{′}, we obtain the W update rule in Algorithm 3. Therefore, under this rule, the objective function J(W) of Eq. 41 decreases monotonically, which completes the proof.
Proof of Lemma 7
Proof
We know that,
Similarly, we can see that
Then, the righthand side (r.h.s) of Eq. 23 becomes
And, the lefthand side (l.h.s) of Eq. 23 becomes
Now, we compute the difference between the l.h.s and r.h.s,
The above inequality holds because D and P are nonnegative matrices, and the sum of nonpositive numbers is always nonpositive. This completes the proof. □
Correctness of the RIMC Algorithm (version 2)
In this section we prove that the converged solution of Algorithm 3 is the correct optimal solution. Specifically, we show that the converged solution satisfies the Karush-Kuhn-Tucker (KKT) condition of constrained optimization theory. Theorem 10 proves the correctness of the algorithm with respect to W, and Theorem 11 proves the correctness of the algorithm with respect to H.
Theorem 10
At convergence, the converged solution W ^{∗} of the updating rule in Algorithm 3 satisfies the KKT condition.
Proof
The KKT condition for W with constraints W _{ α β }≥0, with α=1,⋯,m;β=1,⋯,r is:
Similar to Eq. 25, the J(W) can be written as:
Now, the partial derivative of J(W) can be expressed as:
Thus, the KKT condition for W is:
But, once W converges (according to Algorithm 3), the converged solution W ^{∗} satisfies the following:
which can be written as
This is identical to Eq. 43. Thus, the converged solution W ^{∗} satisfies the KKT condition. □
Theorem 11
At convergence, the converged solution H ^{∗} of the updating rule in Algorithm 3 satisfies the KKT condition.
Proof
The KKT condition for H with constraints H _{ γ ψ }≥0, with γ=1,⋯,n,ψ=1,⋯,r is:
Now, the partial derivative of J(H) from Eq. 25 is
Thus, the KKT condition for H is:
But, once H converges (according to Algorithm 3), the converged solution, H ^{∗} satisfies the following:
which can be written as
This is identical to Eq. 47. Thus, the converged solution H ^{∗} satisfies the KKT condition. □
Stable robust IMC (SRIMC) formulation
Instead of solving the RIMC objective function (Eq. 7) directly, here we propose a two-step solution strategy for the RIMC formulation, and we call this new algorithm SRIMC.
Step 1: solving matrix Z from a matrix equation
In this step, we consider the following matrix equation
where Z is an m×n matrix of unknowns, X is the M×m feature matrix of the row entities, Y is the N×n feature matrix of the column entities, and A is the M×N binary association matrix between the row and column entities.
Now, if we left-multiply Eq. 48 by X ^{T} and right-multiply it by Y, we get the following equation
If both X and Y have full column rank, then both X ^{T} X and Y ^{T} Y are invertible. Therefore, we can solve for Z as \(Z = (X^{T}X)^{-1}X^{T}AY(Y^{T}Y)^{-1}\).
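A minimal NumPy sketch of Step 1 follows. It assumes that the matrix equation has the form X Z Y ^{T} = A, consistent with modeling A _{ i j } as \(\mathbf {x}^{T}_{i}Z\mathbf {y}_{j}\), and that X ^{T} X and Y ^{T} Y are invertible. The function name `solve_z` and the synthetic data are assumptions of the example; linear solves are used instead of explicit inverses for numerical stability.

```python
import numpy as np

def solve_z(A, X, Y):
    """Solve (X^T X) Z (Y^T Y) = X^T A Y for Z, assuming both Gram matrices are invertible."""
    left = np.linalg.solve(X.T @ X, X.T @ A @ Y)   # (X^T X)^{-1} X^T A Y
    # Right-multiplying by (Y^T Y)^{-1} equals solving with the symmetric
    # matrix Y^T Y on the transposed system.
    return np.linalg.solve(Y.T @ Y, left.T).T

rng = np.random.default_rng(3)
M, N, m, n = 8, 7, 3, 2              # illustrative sizes
X = rng.random((M, m))               # row-entity features
Y = rng.random((N, n))               # column-entity features
Z_true = rng.random((m, n))
A = X @ Z_true @ Y.T                 # noiseless synthetic associations

Z_hat = solve_z(A, X, Y)             # recovers Z_true up to rounding
```

When A is noisy or binary, the same expression gives the least squares solution of X Z Y ^{T} ≈ A rather than an exact one.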
Step 2: robust NMF on matrix Z
This is a modified nonnegative matrix factorization (NMF) problem; the only difference is the use of ℓ _{2,1} norms instead of ℓ _{2} norms in the loss function and the regularizers.
Algorithm for SRIMC
We can also solve the Stable Robust IMC optimization problem by solving the two problems mentioned above. It is demonstrated in Algorithm 4.
Results
Disease-lincRNA association datasets
We prepared a sparse association matrix, with sparsity index 0.22%, by extracting the lincRNA-disease association dataset from LncRNADisease [4]. The lincRNA expression dataset was obtained from the co-expression based association study [7]. In total, we cataloged 8194 lincRNAs and 2148 human disease phenotypes, and the resulting association matrix contains 46,934 associations between these two entities. We followed the standard naming of disease phenotypes by OMIM identification numbers, extracting the top-5 OMIM phenotypes matching each human disease name using the OMIM API [16].
LincRNA feature datasets
The features of the lincRNAs consist of four groups of information: (i) expression profiles, (ii) transcription factor binding sites (TFBS), (iii) functional annotations and (iv) single nucleotide polymorphism (SNP) information. The RNA-seq expression profiles of the 8194 lincRNAs on 22 human tissues were collected from the Human BodyMap Project 2.0 [3]. The expression scores were measured in FPKM (Fragments Per Kilobase of exons per Million Fragments mapped) units. TFBS information about the lincRNAs in our study for 120 transcription factors was obtained from the ChIPbase dataset [17]. Linc2GO is a public data repository containing functional annotations of lincRNAs [18]. Three different types of functions are cataloged in the Linc2GO dataset: gene ontology biological process (GO BP), gene ontology molecular function (GO MF) and KEGG pathways. The 8194 lincRNAs with their functional annotations together make a sparse matrix with sparsity index 0.11%. We performed singular value decomposition on this matrix and used the leading 100 singular vectors as part of the features of the lincRNAs. We extracted links between 368,494 SNPs and the lincRNAs in our study from the lncRNASNP dataset [19]. Again, the SNP-lincRNA association matrix turned out to be sparse, with sparsity index 0.0077%. Therefore, we performed singular value decomposition on the matrix and used the leading 100 singular vectors. Finally, we filtered all four groups of lincRNA features. We found that 6540 of the initial 8194 lincRNAs have data from all four groups of features. Therefore, our final lincRNA feature matrix (X in our study) has 6540 rows (lincRNAs) and 342 columns (features).
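The dimensionality reduction step described above (keeping the leading singular vectors of a sparse annotation matrix) can be sketched as follows. This is an illustration on a small random binary matrix; the function name `leading_singular_features`, the matrix sizes, and the choice to use the unscaled left singular vectors are assumptions, since the exact scaling is not specified in the text.

```python
import numpy as np

def leading_singular_features(B, k):
    """Return the k leading left singular vectors of B, one row per entity."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)   # singular values sorted descending
    return U[:, :k]

rng = np.random.default_rng(4)
# Toy stand-in for a sparse lincRNA-by-annotation matrix (about 5% nonzeros).
annotations = (rng.random((50, 200)) < 0.05).astype(float)

features = leading_singular_features(annotations, k=10)   # shape (50, 10)
```

For matrices of the size used in the study, a truncated solver for sparse matrices would be the more practical choice than the dense decomposition shown here.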
Disease feature datasets
The disease feature dataset consists of two groups of information: (i) term frequency-inverse document frequency (TF-IDF) scores and (ii) phenotype similarity scores. The TF-IDF scores were prepared by mining the OMIM text corpus on the 2661 OMIM phenotypes, resulting in 20491 term scores for each of the 2148 phenotypes in our study. We took the leading 100 singular vectors as part of the disease features. The phenotype-phenotype similarity scores were retrieved from a study conducted by [20]. The similarity profiles, encapsulated in a square matrix of dimension 2148 by 2148, went through a singular value decomposition to extract the leading 100 singular vectors, which constitute the other part of the feature matrix of the diseases in our study. Finally, our disease feature matrix contains 200 features for the 2148 diseases.
Baseline algorithms
We conducted a comparative study of our proposed algorithms against five baseline methods: (i) NMF [13], (ii) LRLSLDA [5], (iii) TsLincRNADisease [7], (iv) KRWRH [6] and (v) standard IMC [21]. The NMF based approach finds the two factors W and H by working only on the lincRNA-disease association matrix A. LRLSLDA ranks the lincRNAs for a disease using a classifier trained on two similarity feature matrices. The method has eight parameters that must be tuned before it gives good prediction results. TsLincRNADisease applies a series of statistical significance tests to a co-expression network obtained from tissue-specific and non-tissue-specific lincRNA expression information. Apart from the expression data, this method does not integrate other types of information available about the lincRNAs and the diseases. KRWRH is a stochastic algorithm built on a random walk over three heterogeneous networks. The method is very complex, and it is hard to obtain a steady-state distribution for the dataset in our study.
Evaluation metrics
We define two metrics for evaluating our proposed algorithm and the baseline algorithms. These metrics are popular for evaluating recommender-style systems, as in [22].
precision@k: The ratio of the number of recovered disease phenotypes to the k recommended phenotypes for a target lincRNA. We take the average of the ratios over all lincRNAs in our study. The metric is defined as follows:
\(\text {precision}@k = \frac {1}{N_{l}}\sum _{l}\frac {|P_{l}(k)\cap D_{l}|}{k}\)
where P _{ l }(k) is the set of top-k ranked diseases for a lincRNA l, D _{ l } is the set of diseases related to the lincRNA l that were deleted during the training phase, and N _{ l } is the total number of lincRNAs in the test set.
recall@k: The ratio of recovered disease phenotypes to the set of hidden phenotypes in the test dataset. Again, we take the average of the ratios over all lincRNAs in the study. The metric is defined as follows:
\(\text {recall}@k = \frac {1}{N_{l}}\sum _{l}\frac {|P_{l}(k)\cap D_{l}|}{|D_{l}|}\)
We repeated the experiments for various values of k, from 5 to 100. We conducted 10-fold cross-validation in each of the experiments listed in the following sections.
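The two metrics can be computed along the following lines. This sketch follows the verbal definitions above; the function name `precision_recall_at_k`, the toy scores, and the decision to skip lincRNAs with no hidden diseases are assumptions of the example, and it does not mask associations seen during training.

```python
import numpy as np

def precision_recall_at_k(scores, hidden, k):
    """scores: (lincRNAs x diseases) array of predicted scores.
    hidden: one set of held-out disease indices per lincRNA."""
    precisions, recalls = [], []
    for row, truth in zip(scores, hidden):
        if not truth:                     # assumed: skip lincRNAs with nothing hidden
            continue
        top_k = np.argsort(-row)[:k]      # indices of the k highest-scoring diseases
        hits = len(truth.intersection(top_k.tolist()))
        precisions.append(hits / k)
        recalls.append(hits / len(truth))
    return float(np.mean(precisions)), float(np.mean(recalls))

scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
hidden = [{0, 3}, {1}]                    # held-out diseases for each lincRNA

precision, recall = precision_recall_at_k(scores, hidden, k=2)
# lincRNA 0: top-2 = {0, 2}, 1 hit -> precision 0.5, recall 0.5
# lincRNA 1: top-2 = {1, 3}, 1 hit -> precision 0.5, recall 1.0
```

On this toy input the averaged precision@2 is 0.5 and the averaged recall@2 is 0.75.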
Discussions
True lincRNA-disease association retrieval
Figure 1 shows the performance of RIMC along with the baseline algorithms in predicting true lincRNA-disease associations. A 10-fold cross-validation was conducted on the 2418 OMIM phenotypes. We find that our RIMC method outperforms all the baseline algorithms in identifying true associations for all k values. The NMF based algorithm is better than the three other baseline algorithms. LRLSLDA’s association retrieval was the worst, because it relies only on the known association matrix and the expression profiles of the lincRNAs, which seem insufficient to build a predictive model.
Induction on new associations
Here we conducted a thorough comparative study of three algorithms, including two of ours (RIMC and SRIMC), on predicting associations between novel lincRNAs and/or diseases. We assume that all the features of the novel lincRNAs and/or diseases brought into our prediction framework can be computed or are available. Note that all the baseline algorithms except the standard inductive matrix completion based approach (standard IMC) are omitted from the experiments in this section, because none of them is capable of induction on novel associations.
Induction experiments on new LincRNAs
From the dataset in our study we randomly selected 10% of the lincRNAs and deleted all the entries of these lincRNAs from the three training matrices A, X and Y. The deleted entries serve as the test set during evaluation. Then RIMC, SRIMC and the standard IMC were trained with the modified training matrices. Once training was done on the reduced dataset, each of the three resulting models was evaluated on the test set extracted at the beginning of this step. We repeated the entire training and test procedure 10 times and report the average performance score of all three methods. Figure 2 illustrates the performance comparison of the three methods for predicting associations between a new lincRNA and an existing set of diseases. RIMC and SRIMC show better precision@k than the standard IMC based approach for predicting up to the top-50 disease associations with the new lincRNAs. For higher values of k in the top-k predictions, RIMC and the standard IMC show similar performance, although numerically RIMC still exceeds the standard IMC. In terms of recall@k, both SRIMC and RIMC perform better than the standard IMC method.
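The hold-out procedure for novel lincRNAs can be sketched as follows. The function name `hold_out_rows`, the toy matrix sizes, and the rounding of the test-set size are assumptions of the example; only the rows of A and X are split here, since a held-out lincRNA corresponds to a row of those two matrices.

```python
import numpy as np

def hold_out_rows(A, X, fraction, rng):
    """Randomly hold out a fraction of row entities (lincRNAs) from A and X."""
    n_rows = A.shape[0]
    n_test = max(1, int(round(fraction * n_rows)))
    test_idx = rng.choice(n_rows, size=n_test, replace=False)
    train_mask = np.ones(n_rows, dtype=bool)
    train_mask[test_idx] = False
    train = (A[train_mask], X[train_mask])   # used to fit the model
    test = (A[test_idx], X[test_idx])        # features in, hidden associations out
    return train, test

rng = np.random.default_rng(5)
A = (rng.random((20, 8)) < 0.2).astype(float)   # toy association matrix
X = rng.random((20, 4))                         # toy lincRNA features

(train_A, train_X), (test_A, test_X) = hold_out_rows(A, X, fraction=0.1, rng=rng)
```

Repeating this split with different random seeds and averaging the metrics mirrors the 10 repetitions described above.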
Induction experiments on new diseases
Following the approach of the previous section, we randomly selected 10% of the disease phenotypes from the dataset of the study and deleted all the entries related to those diseases. The deleted entries form our test set, and the reduced dataset serves as the training dataset. RIMC, SRIMC and the standard IMC were trained on the reduced training dataset and evaluated against the test set. The entire training and evaluation procedure was repeated 10 times and the average performance scores are reported. Figure 3 illustrates the performance comparison of the three methods for predicting associations between a known list of lincRNAs and a novel disease. Here, both RIMC and SRIMC demonstrate better induction performance in terms of the precision@k and recall@k values.
Induction experiments on both new lincRNAs and new diseases
Finally, in this batch of induction experiments, we randomly picked 5% of the disease entries and 5% of the lincRNA entries and deleted the respective connections between the two entities from the three data matrices A, X and Y. The deleted connections and feature sets are treated as the test set, while the reduced data matrices are used to train the three algorithms. We repeated these steps 10 times and computed the average performance scores. Figure 4 compares our proposed RIMC and SRIMC with the standard IMC, the only baseline algorithm applicable here, for predicting the association between a new lincRNA and a new disease using a model trained on a limited set of lincRNAs and disease phenotypes that excludes both the new lincRNA and the new disease. The precision@k plots show that RIMC and SRIMC perform better than the standard IMC-based approach for both lower and higher values of k in the top-k association ranking with the novel diseases. The recall@k curves, however, show that RIMC and the standard IMC perform similarly in the top-k association prediction problem, whereas SRIMC outperforms both.
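Scoring a pair in which neither the lincRNA nor the disease was seen during training is possible because inductive matrix completion predicts from side features alone. The sketch below uses the standard bilinear IMC form, in which the score of a pair is x^T W H^T y. The factor names W and H and the function name are assumptions made for illustration and may not match the notation used elsewhere in this paper.

```python
import numpy as np


def inductive_scores(X_new, Y_new, W, H):
    """Score every (new lincRNA, new disease) pair from features only.

    X_new : (n_new_lincRNAs, d_x) lincRNA feature rows
    Y_new : (n_new_diseases, d_y) disease feature rows
    W     : (d_x, r) and H : (d_y, r), low-rank factors assumed to have
            been learned on the reduced training matrices
    Returns the matrix X_new @ W @ H.T @ Y_new.T.
    """
    # Project both entity types into the shared r-dimensional space,
    # then take inner products between the projections.
    return (X_new @ W) @ (Y_new @ H).T
```

Methods that rely only on the observed association matrix have no row or column for an unseen entity, which is consistent with the standard IMC being the only applicable baseline in this setting.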
Conclusions
In this article, we propose the theoretical foundations of a robust inductive matrix completion method using the ℓ_{2,1} norm. We provide three algorithms to solve our robust inductive matrix completion objective function. The first two algorithms are equivalent, whereas the third, which we call Stable Robust Inductive Matrix Completion (SRIMC), breaks the problem into two subproblems and turns out to be a simple, stable and better solution strategy. We applied the proposed methods to identify missing links between putative lincRNAs and human disease phenotypes. All three variants of robust inductive matrix completion are well suited to noisy datasets. Besides the standard IMC formulation, our proposed method also outperformed four other lincRNA-disease association solutions. The proposed methods are applicable to predicting associations between well-studied lincRNAs and novel diseases, between novel lincRNAs and well-studied diseases, or between a set of novel lincRNAs and novel diseases.
Abbreviations
FPKM: Fragments per kilobase of exons per million fragments
GWAS: Genome wide association studies
HTS: High throughput sequencing
IMC: Inductive matrix completion
lincRNAs: Long intergenic noncoding RNAs
lncRNAs: Long noncoding RNAs
NMF: Nonnegative matrix factorization
RIMC: Robust inductive matrix completion
SRIMC: Stable robust inductive matrix completion
SNP: Single nucleotide polymorphism
TFBS: Transcription factor binding sites
TFIDF: Term frequency inverse document frequency
Acknowledgements
We express our deepest gratitude to Dr. Kytai Nguyen of the Bioengineering Department at the University of Texas at Arlington for comments and feedback on our lincRNA discovery results.
Funding
Not applicable.
Availability of data and materials
The ChIPbase dataset is available at https://omictools.com/chipbasetool. The Linc2GO dataset is available at: https://omictools.com/linc2gotool. The SNPlincRNA data can be found at: http://bioinfo.life.hust.edu.cn/lncRNASNP/.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 10 Supplement 5, 2017: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2016: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume10supplement5.
Authors’ contributions
AKB conceived the package and wrote the manuscript. DK and MK contributed to data analysis and programming for the experiment. CD contributed to the two-step algorithm to solve the proposed RIMC algorithm. JG provided overall supervision. All authors reviewed, edited and approved the final manuscript.
Ethics approval and consent to participate
All the datasets used in this study are from publicly available data repositories. No patient samples were used or collected in this study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Biswas, A.K., Kim, D., Kang, M. et al. Stable solution to l _{2,1}based robust inductive matrix completion and its application in linking long noncoding RNAs to human diseases. BMC Med Genomics 10 (Suppl 5), 77 (2017). https://doi.org/10.1186/s1292001703101
DOI: https://doi.org/10.1186/s1292001703101