Data preprocessing
The HTSeq-FPKM expression data and the isoform expression data of human breast invasive carcinoma (BRCA) were obtained from Genomic Data Commons (GDC) data portal (Data Release 10, downloaded on December 25, 2017; https://portal.gdc.cancer.gov/). We obtained gene annotation and mature miRNA information from Ensembl BioMart (Ensemble Genes 90, GRCh38.p10, downloaded on November 7, 2017; https://asia.ensembl.org/index.html) and the miRBase Sequence Database (Release 21, downloaded on June 27, 2017; http://www.mirbase.org/), respectively. For the two expression data sets, we kept only biological features with at least 80% of non-zero values over all samples and performed the upper quartile normalization. To allow log transformation, we identified the minimum non-zero element in each expression data set before adding it element-wise to the original expression data set. The RNA-Seq expression data were then \({log}_{2}\) transformed. We further removed samples that are outliers (beyond two standard deviations away from the mean value) from the data set. For HTSeq-FPKM expression data set, we separated protein coding genes and lncRNAs (including lincRNAs and antisense RNAs) based on human gene annotation from Ensembl. In total, we analyzed 1,061 patients with 4,359 lncRNAs, 1,061 patients with 16,517 mRNAs and 1,034 patients with 534 miRNAs for BRCA data sets.
Identification of BRCA subtypes
For BRCA cohort subtypes, we identified each sample based on Ashouri et al. [8]. The BRCA cohort they identified were subdivided into the four established PAM507 subtypes. Here, our BRCA cohort included 167 samples classified as Basal-like, 115 samples classified as Her2 type, 405 samples classified as Luminal A, and 287 samples classified as Luminal B. To filter mRNAs and lncRNAs with low expression across most samples in each subtype, we removed mRNAs and lncRNAs that were median FPKM values < = 0.7 for downstream analyses (Table 1).
Gene co-expression analysis
We selected the highly-correlated lncRNA/mRNA pairs, lncRNA/miRNA pairs and mRNA/miRNA pairs in four subtypes. We computed z-score which is the Fisher transformation of Spearman's rank correlation coefficients among lncRNAs, mRNAs and miRNAs for the cancer subtypes, respectively. The Spearman correlation coefficient is a statistical measure of the strength of a monotonic relationship between paired data. In a sample it is denoted by rs and is by design constrained as
$$-1\le {r}_{s}\le 1$$
(1)
And it is defined as the Pearson correlation coefficient between the ranked variables. Furthermore, unlike Pearson correlation coefficient, there is no requirement of normality and hence it is a nonparametric statistic. Now
$${r}_{s}=1-\frac{6\sum {{d}_{i}}^{2}}{n\left({n}^{2}-1\right)}$$
(2)
where \(n\) is a sample of size, the \(n\) raw scores \({X}_{i}, {Y}_{i}\) are converted to ranks \({x}_{i}, {y}_{i}\), and \({d}_{i}={x}_{i}-{y}_{i}\) is the difference between ranks.
Then we computed z-score using the Fisher transformation of (\({r}_{s}\)):
$$F\left({r}_{s}\right)=\frac{1}{2}\mathrm{ln}\frac{1+{r}_{s}}{1-{r}_{s}}$$
(3)
where \({r}_{s}\) is the sample Spearman rank correlation coefficient. And then
$$z=\sqrt{\frac{n-3}{1.06}}F\left({r}_{s}\right)$$
(4)
where z is a z-score for \({r}_{s}\) which approximately follows a standard normal distribution, and n is the sample size.
Construction of bipartite co-expression networks
A bipartite network describes the interactions between two different types of nodes (X-type and Y-type), with edges connecting only nodes of different types. We selected the highly-correlated lncRNA/mRNA, lncRNA/miRNA and mRNA/miRNA pairs to construct three bipartite co-expression networks (Additional file 1: Figure S4a) in four subtypes. To choose a cut-off z-score value we used the Scale-free Topology Criterion [9], which applies the linear regression model fitting index (\({R}^{2}\)) to quantify how well a network satisfies a scale-free topology. If parameter values lead to an \({R}^{2}\) value close to 1 may lead to networks with very few connections. Here we only consider those z-score values that lead to a bipartite network satisfying scale-free topology at least approximately: \({R}^{2}\) > 0.8 and the slope of the regression line between \({\mathrm{log}}_{10}\left(p\left(k\right)\right)\) and \({\mathrm{log}}_{10}(k)\) should be greater than -2 but less than -0.5.
Association indices and networks
We calculated the interaction-profile similarities between genes. Using networks to measure similarity between genes and especially focus on bipartite networks that connect X-type nodes to Y-type nodes. In bipartite networks, association indices can be used to measure shared Y-type nodes between two X-type nodes, or vice versa. Here, we calculated the association indices between mRNA-mRNA pairs in mRNA-lncRNA and mRNA-miRNA bipartite network, and the same for lncRNA-lncRNA pairs and miRNA-miRNA pairs.
We focused on the following five measures to quantify interaction-profile similarity [10], including the Jaccard index, the Simpson index, the geometric index, the cosine index, and the Pearson correlation coefficient. For example, we used five association indices to measure interaction-profile similarity between X-type nodes A and B.
The Jaccard index is the proportion of shared nodes between A and B relative to the total number of nodes connected to A or B:
$$\frac{\left|N\left(A\right)\cap N\left(B\right)\right|}{\left|N\left(A\right)\cup N\left(B\right)\right|}$$
(5)
where \(N\left(A\right)\) is defined as the number of nodes with which A interacts (similarly for \(N\left(B\right)\)).
The Simpson index is the proportion of shared nodes relative to the degree of the least-connected node:
$$\frac{\left|N\left(A\right)\cap N\left(B\right)\right|}{\mathrm{min}\left(\left|N\left(A\right)\right|, \left|N\left(B\right)\right|\right)}$$
(6)
The geometric index corresponds to the product of the proportion of shared nodes between A and B:
$$\frac{{\left|N\left(A\right)\cap N\left(B\right)\right|}^{2}}{\left|N\left(A\right)\right|\cdot \left|N\left(B\right)\right|}$$
(7)
The cosine index is the geometric mean of the proportions of shared nodes between A and B:
$$\frac{\left|N\left(A\right)\cap N\left(B\right)\right|}{\sqrt{\left|N\left(A\right)\right|\cdot \left|N\left(B\right)\right|}}$$
(8)
The Pearson correlation coefficient is the correlation between the interaction profiles of A and B:
$$\frac{\left|N\left(A\right)\cap N\left(B\right)\right|\cdot {n}_{y}-\left|N\left(A\right)\right|\cdot \left|N\left(B\right)\right|}{\sqrt{\left|N\left(A\right)\right|\cdot \left|N\left(B\right)\right|\cdot \left({n}_{y}-\left|N\left(A\right)\right|\right)\cdot \left({n}_{y}-\left|N\left(B\right)\right|\right)}}$$
(9)
where \({n}_{y}\) is the total number of Y-type nodes in the network.
We expected to uncover the underlying complex regulatory relationships among lncRNAs, mRNAs and miRNAs. The association network is a network in which two nodes of the same type are connected by an edge with their similarity. It enables the comparison of pairs of nodes across networks using the integration of different types of networks, whether pairs of nodes with similar interaction profiles in one bipartite network are also similar in another bipartite network. An illustration of the analysis workflow is shown in Additional file 1: Figure S4b.
Protein–protein interaction analysis
We listed pairwise genes in four areas and calculated the percentage of pair in PPIs database, respectively. The PPIs information were obtained from ConsensusPathDB-human (downloaded on July 31, 2017; http://cpdb.molgen.mpg.de/). In total, ConsensusPathDB-human contains 272,998 protein interactions. All binary PPIs have an aggregated confidence score that was computed as a consensus score across the six methods for judging interaction confidence [11]. We only used 92,728 high-quality PPI interactions whose consensus score > 0.95.
Functional similarity analysis
The Gene Ontology (GO) is a standard vocabulary of functional terms and gene product attributes across all species. The GO is divided into three orthogonal ontologies, biological process, molecular function, and cellular component. Gene products are functionally similar if they have similar molecular functions and biological processes. GO annotations can be used as a measure of functional similarity between gene products. We used \(funSim\) to assess the functional relationship between two genes [12]. Lengauer and co-workers provide a new measure of similarity between GO terms. This new measure is based on Lin’s [13] and Resnik’s [14, 15] definitions, called \({sim}_{Rel}\). We also measured and compared functional similarity between genes in four areas, respectively.