Integrating heterogeneous genomic data to accurately identify disease subtypes

Background High-throughput biotechnologies have been widely used to characterize clinical samples from various perspectives, e.g., epigenomics, genomics and transcriptomics. However, because of the heterogeneity of these technologies and their outputs, analyzing each data type individually rarely yields a comprehensive view of disease subtypes. Integrative methods are urgently needed. Methods In this study, we evaluated the issues that may hamper integrative analysis of heterogeneous disease data types, and proposed iBFE, an effective and efficient computational method that overcomes those issues from a feature extraction perspective. Results Rigorous experiments on both simulated and real datasets demonstrated that iBFE can overcome issues caused by scale conflicts, noise conflicts, incompleteness of patient relationships, and conflicts between patient relationships, and that iBFE can effectively combine the merits of DNA methylation, mRNA expression and microRNA (miRNA) expression datasets to accurately identify disease subtypes with significantly different prognoses. Conclusions iBFE is an effective and efficient method for integrative analysis of heterogeneous genomic data to accurately identify disease subtypes. The Matlab code of iBFE is freely available from http://zhangroup.aporc.org/iBFE. Electronic supplementary material The online version of this article (doi:10.1186/s12920-015-0154-5) contains supplementary material, which is available to authorized users.


Background
With the development of high-throughput genomic technologies, it has become easy and cost-effective to comprehensively characterize clinical samples by a wide range of genomic data, e.g., depicting cancer samples from epigenomic, genomic and transcriptomic perspectives. Large-scale efforts conducted by The Cancer Genome Atlas (TCGA) have already applied this strategy to study over 20 cancers from thousands of patients, with a large amount of epigenomic, genomic, transcriptomic and clinical data collected from the same patients [1][2][3][4]. While the availability of such a wealth of well-structured data allows the status of patients to be characterized comprehensively and in fine detail, it also presents important challenges for the analysis methodology. Because of the great heterogeneity of technologies and biological data, individual analysis or simple concatenation of all the available datasets often cannot generate the desired results [5]. Although independent analyses of single datasets have commonly been adopted, their inconsistent conclusions underscore the necessity of unbiased integrative methods. Direct concatenation, in turn, may generate worse results because it exacerbates the "curse of dimensionality" [6], i.e., the number of measurements is far larger than the number of patients. The currently developed integrative methods for analysis of multiple genomic data of the same patients can generally be classified into three groups [7,8]. The first group of methods is based on matrix factorization [9][10][11][12][13]. The second group of methods is based on Bayesian models [14][15][16]. A major issue with the factorization and Bayesian approaches is that they generally require proper data preprocessing and normalization techniques. These approaches are also computationally complicated. Recently, Wang et al. proposed a new type of integrative method based on network fusion, which achieves state-of-the-art performance in both accuracy and computational speed, as demonstrated in [5]. However, it is still unknown which factors interfere with integrative analysis and what the pitfalls of the current integrative analytical methodology are. Dissecting the issues that interfere with integrative analysis and identifying alternative methods are essential for translating advances in high-throughput genomic technologies into personalized medicine.
In this study, we explicitly interrogated factors that inhibit integrative analyses of multiple data types for both disease class discovery and classification [17]. By isolating those possible factors, we identified that the scales of measurement, the noise types and sizes, and the completeness and concordance of patient relationships in different data types are important issues that hinder integrative analysis, and that the currently available methods cannot overcome all of these issues. Motivated by the great power of feature extraction methods for unbiased and unsupervised analyses of single datasets [18], we proposed a novel integrative approach Based on Feature Extraction (referred to as iBFE below). Simulations suggested that iBFE can overcome all the issues identified in this study. Applications of iBFE to integrating the DNA methylation, mRNA expression and miRNA expression datasets of lung and kidney cancers produced by TCGA suggest that iBFE not only can successfully integrate the diverse data types but also can identify disease subtypes that have distinct survival profiles. Because iBFE is simple, flexible, unsupervised and unbiased, it can readily be extended to integrate more types of genomic datasets to improve disease diagnosis and prognosis.

Overview of the iBFE method
The iBFE method is motivated by the observation that the accuracy of disease class discovery and classification can be significantly improved in the feature space extracted from the original data [18][19][20]. The pipeline of iBFE consists of three steps: i) extract features from each individual type of dataset; ii) concatenate the extracted features; iii) extract new features from the concatenated features. Once the three steps are finished, the newly constructed features of patients can be used as inputs for disease class discovery and classification by other algorithms, e.g., k-means [21,22] and support vector machines [23,24].
First, iBFE uses Pearson and Spearman correlations to extract features from individual data types. Given a single dataset $X^{(1)}_{M \times N}$, in which $x^{(1)}_{ij}$ represents the j-th variable of the i-th patient ($i = 1, \dots, M$ and $j = 1, \dots, N$), two matrices $P^{(1)}_{M \times M}$ and $S^{(1)}_{M \times M}$ are constructed from $X^{(1)}$. $P^{(1)}$ is the similarity matrix of patients constructed by Pearson correlation coefficients [25,26], i.e., $p^{(1)}_{ab}$ is the Pearson correlation coefficient of $x^{(1)}_{a-}$ and $x^{(1)}_{b-}$. Here $x^{(1)}_{a-}$ and $x^{(1)}_{b-}$ represent the values of all the variables of the a-th and b-th patients, respectively. Similar to $P^{(1)}$, $S^{(1)}$ is the similarity matrix of patients constructed by Spearman correlation coefficients [27,28]. The advantage of Pearson correlation coefficients in feature extraction has been demonstrated and validated previously [18]. Spearman correlation coefficients are introduced here for their distribution-independent property, which is important for handling issues caused by scale and noise during integration. Both Pearson and Spearman correlation coefficients take values ranging from −1 to 1, which provides consistent scales for different data types.
Given K types of datasets, in the second step, $P^{(k)}$ and $S^{(k)}$, $k = 1, \dots, K$, are concatenated into $Y_{M \times 2MK}$, i.e., $Y_{M \times 2MK} = [P^{(1)}\ S^{(1)}\ \dots\ P^{(k)}\ S^{(k)}\ \dots\ P^{(K)}\ S^{(K)}]$, where the rows of $Y$ represent patients while the columns of $Y$ are the features extracted by Pearson and Spearman correlation coefficients. Because $P^{(k)}$ and $S^{(k)}$ are naturally normalized to the range from −1 to 1, concatenation at this step does not suffer from the issues encountered during direct concatenation of the original datasets.
In the third step, a new similarity matrix of patients $Z_{M \times M}$ is constructed by calculating the Pearson correlation coefficients of the rows of $Y$, i.e., $z_{ij}$ is the Pearson correlation coefficient of $y_{i-}$ and $y_{j-}$, where $y_{i-}$ and $y_{j-}$ represent the i-th and j-th rows of $Y$, respectively. $Z_{M \times M}$ contains the final features extracted by iBFE from the K types of original datasets. In practice, the original datasets generally consist of thousands of variables because thousands of genes are measured at the epigenomic, genomic and transcriptomic levels by high-throughput biotechnologies. By mapping the original datasets into the feature space spanned by profiles of patient similarities, iBFE extracts the patterns embedded within patient relationships. Further, the computational expense is also greatly reduced.
In summary, the algorithm of iBFE can be outlined as follows:
Step I: calculate $P^{(k)}$ and $S^{(k)}$ for $X^{(k)}$, $k = 1, \dots, K$;
Step II: construct $Y = [P^{(1)}\ S^{(1)}\ \dots\ P^{(k)}\ S^{(k)}\ \dots\ P^{(K)}\ S^{(K)}]$;
Step III: construct $Z$ by calculating the Pearson correlation coefficients of the rows of $Y$.
We refer to the iBFE variant that uses both Pearson and Spearman correlation coefficients as iBFE1. To evaluate the performance of iBFE when only Pearson or only Spearman correlation coefficients are employed, we also constructed iBFE2, which uses only Pearson correlation coefficients, and iBFE3, which uses only Spearman correlation coefficients.
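The three steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names (`pearson`, `ranks`, `similarity`, `ibfe`) and the `use_pearson`/`use_spearman` switches are hypothetical, ties in the Spearman ranks are assumed to share their average rank, and a constant profile is assumed to receive a correlation of 0.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den if den > 0 else 0.0  # assumption: constant profile -> 0

def ranks(u):
    """1-based ranks of u; tied values share their average rank (assumption)."""
    order = sorted(range(len(u)), key=lambda i: u[i])
    r = [0.0] * len(u)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and u[order[j + 1]] == u[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def similarity(rows):
    """M x M matrix of Pearson correlations between all pairs of rows."""
    return [[pearson(a, b) for b in rows] for a in rows]

def ibfe(datasets, use_pearson=True, use_spearman=True):
    """datasets: list of K matrices (lists of rows), each with the same M patients."""
    m = len(datasets[0])
    y = [[] for _ in range(m)]
    for x in datasets:
        blocks = []
        if use_pearson:                       # Step I: P(k)
            blocks.append(similarity(x))
        if use_spearman:                      # Step I: S(k) = Pearson on ranks
            blocks.append(similarity([ranks(row) for row in x]))
        for block in blocks:                  # Step II: Y = [P(1) S(1) ... P(K) S(K)]
            for i in range(m):
                y[i].extend(block[i])
    return similarity(y)                      # Step III: Z from the rows of Y
```

Under these assumptions, calling `ibfe(datasets)` corresponds to the variant that uses both coefficients, while setting `use_spearman=False` or `use_pearson=False` gives the Pearson-only and Spearman-only variants. The resulting matrix `Z` can then be passed to a clustering or classification routine.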

Simulating datasets that dissect possible issues interfering with integration
We evaluated the factors that may affect integration of different types of datasets for disease class discovery and classification by simulation. Because simulation can highlight one possible factor while controlling the influence of other factors, it provides an ideal tool to evaluate the impacts of single factors on integration, although some simulations may not be fully realistic. Based on our experience, we hypothesized that the following factors may affect integrative analyses: i) scales of measurements in different datasets; ii) noise types of different datasets; iii) noise sizes; iv) completeness of the patient relationships revealed by single datasets; v) concordance of the patient relationships revealed by each dataset. To evaluate their roles during integrative analyses, we constructed five simulated datasets (Fig. 1).
By simulated dataset 1 (SD1), we evaluated the impacts of scale conflicts of measurements on integrative analyses. We simulated 100 patients that are characterized by 100 variables for simplicity. The first 50 patients belong to cluster 1, with the first 50 variables all one and the other 50 variables all zero. The second 50 patients belong to cluster 2, with the first 50 variables all zero and the other 50 variables all one. All the 100x100 measurements are perturbed by noise sampled from a standard normal distribution. We named this prototype data 0 (SD1-D0), the hidden real data. Two types of observed data are generated from SD1-D0. Type 1 of SD1 (SD1-T1) is constructed by raising SD1-D0 to its q-th power, i.e., $x^{(SD1\text{-}T1)}_{ij} = (x^{(SD1\text{-}D0)}_{ij})^{q}$. Type 2 of SD1 (SD1-T2) is constructed by $x^{(SD1\text{-}T2)}_{ij} = q^{x^{(SD1\text{-}D0)}_{ij}}$. Here q is a parameter that controls the scale difference between the two data types. The power-law and exponential functions are used to simulate the issues caused by different measurement scales.
By simulated dataset 2 (SD2), we evaluated the impacts of different noise types on integrative analyses. The prototype SD2-D0 is the same as SD1-D0 except that no noise is added. The observed SD2-T1 and SD2-T2 are constructed from SD2-D0 by adding noise sampled, respectively, from a normal distribution with mean zero and standard deviation q and from a uniform distribution from zero to q, where q is the parameter that controls the size of the noise.
By simulated dataset 3 (SD3), the impacts of different noise sizes are evaluated. The prototype SD3-D0 is the same as SD2-D0. The observed SD3-T1 and SD3-T2 are constructed from SD3-D0 by adding noise sampled from normal distributions with mean zero and different standard deviations.
Fig. 1 Graphic representations of the simulation process. A total of five simulated datasets were generated. Each dataset was simulated by first constructing two prototype data types and then adding noise (represented by the shadows in the figure, with different shadow types representing different noise types). Simulated dataset 1 (SD1) simulated the impacts of different scales on the integrative analyses by scaling, in different ways, the same prototype dataset with the same type and size of noise added. SD2 simulated the impacts of different types of noise on the integrative analyses by adding different types of noise to the same prototype dataset. SD3 simulated the impacts of different sizes of noise by adding different sizes but the same type of noise to the same prototype dataset. SD4 simulated the impacts of incompleteness of patient relationships by constructing partially clustered prototype datasets. SD5 simulated the impacts of conflicting patient relationships by constructing conflicting clustered prototype datasets.
By simulated dataset 4 (SD4), we evaluated the impacts of incomplete patient relationships embedded in single data types on integrative analyses. Two prototype datasets, i.e., SD4-D0-1 and SD4-D0-2, are constructed. SD4-D0-1 simulates 100 patients by 100 variables for simplicity, in which the first 50 patients form a cluster with the first 50 variables all one and the other 50 variables all zero. The relationships of the other 50 patients are not defined in SD4-D0-1 and the corresponding variables are all zero. In SD4-D0-2, the relationships of the first 50 patients are not defined (with all the corresponding variables zero) but the other 50 patients are defined as another cluster (with the first 50 variables all zero and the other 50 variables all one). SD4-D0-1 and SD4-D0-2 together define the complete relationships of the 100 patients. SD4-T1 and SD4-T2 are constructed from SD4-D0-1 and SD4-D0-2, respectively, by adding noise sampled from a normal distribution with mean zero and standard deviation q.
By simulated dataset 5 (SD5), the impacts of conflicting patient relationships embedded in different data types are examined. Two prototype datasets, i.e., SD5-D0-1 and SD5-D0-2, are constructed. SD5-D0-1 simulates 100 patients by 100 variables for simplicity, in which the first 50 patients form cluster 1 with the first 50 variables all one and the other 50 variables all zero, whereas the other 50 patients form cluster 2 with the first 50 variables all zero and the other 50 variables all one. In SD5-D0-2, the first 30 patients and the last 30 patients form one cluster and the middle 40 patients form another cluster. SD5-D0-1 and SD5-D0-2 each define two clusters individually, but together they define four clusters of the 100 patients. SD5-T1 and SD5-T2 are constructed from SD5-D0-1 and SD5-D0-2, respectively, by adding noise sampled from a normal distribution with mean zero and standard deviation q.
Real datasets generally have many noisy features that do not help to identify disease subtypes and many patients that cannot be definitively assigned to a particular disease subtype. Different disease subtypes also have different sizes. We therefore constructed another five realistic simulation datasets by adding these properties to SD1-SD5. Based on SD1-SD5, the size of the second disease subtype was doubled, 50 unclassified patients were added, and additional features (10 times the number of informative features) sampled from normal distributions were added to each simulated dataset.
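The construction of SD1 can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function name `simulate_sd1`, its parameters and the seeding scheme are hypothetical, and q is assumed to be a positive integer so that raising negative noisy values to the q-th power stays real-valued.

```python
import random

def simulate_sd1(q, n_patients=100, n_vars=100, seed=0):
    """Return (SD1-D0, SD1-T1, SD1-T2, labels) as lists of rows.

    Assumption: q is a positive integer, so x ** q is real even when x < 0.
    """
    rng = random.Random(seed)
    half_p, half_v = n_patients // 2, n_vars // 2
    d0 = []
    for i in range(n_patients):
        row = []
        for j in range(n_vars):
            # Cluster 1 has the first half of the variables set to one,
            # cluster 2 has the second half set to one.
            signal = 1.0 if (i < half_p) == (j < half_v) else 0.0
            row.append(signal + rng.gauss(0.0, 1.0))  # standard normal noise
        d0.append(row)
    t1 = [[x ** q for x in row] for row in d0]   # power transform
    t2 = [[q ** x for x in row] for row in d0]   # exponential transform
    labels = [1 if i < half_p else 2 for i in range(n_patients)]
    return d0, t1, t2, labels
```

For example, `simulate_sd1(20)` would produce two observed data types whose value ranges differ by many orders of magnitude while sharing the same hidden cluster structure.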
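To see why the two SD5 prototypes jointly imply four clusters, the sketch below builds the noise-free prototypes and pairs up the two labelings. It is only an illustration: the function names are hypothetical, and the variable pattern used for SD5-D0-2 (the same half-on, half-off pattern as SD5-D0-1) is an assumption, since the text specifies only which patients belong to which cluster.

```python
from collections import Counter

def sd5_prototypes(n_vars=100):
    """Noise-free SD5 prototypes for 100 patients, plus their cluster labels."""
    half = n_vars // 2
    pattern_a = [1.0] * half + [0.0] * (n_vars - half)  # first half of variables on
    pattern_b = [0.0] * half + [1.0] * (n_vars - half)  # second half of variables on
    # SD5-D0-1: patients 1-50 vs. patients 51-100.
    labels1 = [1 if i < 50 else 2 for i in range(100)]
    # SD5-D0-2: first 30 and last 30 patients vs. the middle 40 patients.
    labels2 = [1 if (i < 30 or i >= 70) else 2 for i in range(100)]
    d0_1 = [list(pattern_a if c == 1 else pattern_b) for c in labels1]
    d0_2 = [list(pattern_a if c == 1 else pattern_b) for c in labels2]  # assumed pattern
    combined = list(zip(labels1, labels2))  # joint label of each patient
    return d0_1, d0_2, labels1, labels2, combined

def cluster_sizes(combined):
    """Number of patients carrying each joint label."""
    return dict(Counter(combined))
```

With these definitions the joint labels take four distinct values, with 30, 20, 20 and 30 patients, which is the four-cluster structure that an integrative method is expected to recover when k = 4.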

Evaluating iBFE and other integrative methods on simulated datasets
We use three types of metrics to evaluate the factors interfering with integrative analyses and the ability of various integrative methods to overcome those factors in different situations. The first type of metric examines the intra-class consistency and inter-class discrimination of patients based on the features constructed by each integrative method. Two measures are employed: Pearson correlation coefficients and the Gaussian kernel constructed from the Euclidean distance of the extracted features. The second type of metric examines the performance of each integrative method for disease class discovery, i.e., clustering patients into subtypes. The widely used k-means algorithm (implemented in Matlab 8.1) is applied 1000 times to the features extracted by each integrative method, with k = 2 on SD1-4 and k = 4 on SD5. The clustering scheme with the minimum sum of point-to-centroid distances is selected as the final clustering for evaluation. The normalized mutual information between the true clusters and the clustering scheme generated by each integrative method is calculated to measure performance [5]. The third type of metric evaluates the performance of each integrative method for predicting the disease classes of patients when the disease subtypes of some patients are known. The widely used random forest algorithm [29] is used as the classifier because random forest is robust and accurate and can be applied to both linearly and nonlinearly separable situations. To reduce biases caused by over-fitting, the leave-one-out cross-validation scheme is used [30].
Three integrative analysis methods are included in the evaluation, i.e., direct concatenation [5], similarity network fusion (SNF) [5] and the iBFE variants. Direct concatenation is included because it is the most intuitive method to integrate various types of datasets to comprehensively characterize diseases. Including direct concatenation clearly illustrates the impacts of the suspected factors on integrative analyses. SNF is the state-of-the-art algorithm recently proposed for integrative analyses [5], which demonstrates excellent performance in combining multiple genomic datasets to predict subtypes and survival of various cancer patients. In particular, SNF has been demonstrated to outperform other integrative methods such as iCluster [31], which is based on pre-selection of genes. Direct concatenation was implemented by the matrix concatenation operation in Matlab. The Matlab code of SNF was downloaded from http://compbio.cs.toronto.edu/SNF/SNF/Software.html.

Evaluating iBFE on the DNA methylation, mRNA expression and miRNA expression datasets of lung and kidney cancers produced by TCGA
The DNA methylation, mRNA expression and miRNA expression datasets of lung squamous cell carcinoma (106 patients) and kidney renal clear cell carcinoma (122 patients) produced by TCGA are included to evaluate the performance of iBFE on real datasets [1,4]. These two TCGA datasets were also used in the evaluation of SNF and other integrative methods [5]. Because the TCGA repository contains multiple platforms for each data type, the platform corresponding to the largest number of available individuals and describing both tumor samples and controls whenever possible was selected for data building. For expression data, the Broad Institute HT-HG-U133A platform was included in the lung cancer dataset, and the UNC-Illumina-Hiseq-RNASeq platform was included in the kidney cancer dataset. For miRNA expression data, the BCGSC-Illumina-GA-miRNAseq platform was included in the lung and kidney cancer datasets. For the methylation data, the JHU-USC-Human-Methylation-27 platform was included in both datasets. Patients' clinical information was also included to evaluate the prognostic power of the proposed integrative analysis method.
Three types of metrics are used to evaluate the performance of iBFE. As on the simulated datasets, the first type of metric examines the intra-class consistency and inter-class discrimination of patients, using Pearson correlation coefficients and Euclidean distances. Because the true clustering schemes are not available for these two real datasets, the second and third types of metrics used on the simulated datasets cannot be applied here. We therefore proposed an alternative measure to evaluate the performance of iBFE for disease class discovery and prediction. First, k-means is applied 1000 times to obtain the clustering scheme on each cancer dataset, with k ranging from 2 to 10. Then the most stable k-means clustering scheme is taken as the true subtypes of patients to calculate the leave-one-out accuracy of the iBFE features, which serves as the second type of evaluation metric. The third type of metric examines whether the integrative analyses can identify disease subtypes that have significantly different survival probabilities. Although factors beyond the genomic measurements may also affect survival probability, prognosis prediction based on genomic data may be helpful for clinicians.
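The clustering metric can be illustrated with a short sketch of normalized mutual information (NMI) between two labelings. The text does not state which normalization is used, so this example assumes the common form that divides the mutual information by the geometric mean of the two entropies; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
               for (x, y), c in cab.items())

def nmi(a, b):
    """NMI = I(a; b) / sqrt(H(a) * H(b)); assumed normalization, see lead-in."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 0.0  # assumption: a single-cluster labeling carries no information
    return mutual_information(a, b) / math.sqrt(ha * hb)
```

Under this definition, NMI does not depend on how the clusters are named: two labelings that induce the same partition score 1, and statistically independent labelings score 0.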

Factors interfering with integrative analyses highlighted by simulations
We evaluated the performance of the intuitive direct concatenation method and the state-of-the-art method SNF on each type of simulated dataset. For each setting of the controlling parameters, the simulations were repeated 100 times, and the averages of the evaluation metrics were recorded for comparison. We observed that all five factors can interfere with integrative analyses, influencing all the metrics including intra-class consistency, inter-class discrimination and accuracy of clustering and classification.
The different scales of two data types interfere with integrative analyses significantly when the controlling parameter q becomes large. When q is small, the scales of the two data types are close to each other, and the two data types can be treated as two replicates of the same dataset. Thus, both direct concatenation and SNF can clearly identify the true patient relationships and demonstrate good performance for both class discovery and classification. However, when q is large, although direct concatenation and SNF still yield higher intra-class than inter-class patient similarity, the accuracy of k-means clustering based on either the concatenated features or the features constructed by SNF is significantly reduced. For example, when q = 20 (Fig. 2a and Table 1), the normalized mutual information between the clustering scheme produced by direct concatenation and the true patient clustering scheme is only 0.0354, whereas the normalized mutual information between the clustering scheme produced by SNF and the true scheme is 0.00519. Therefore, scale issues significantly impair the accuracy of clustering based on multiple data types. For disease class prediction, direct concatenation demonstrates good performance (94 % accuracy) when q = 20, while SNF shows unsatisfactory performance (52 % accuracy).
The noise types and sizes also influence the integration of different data. Direct concatenation generally produces worse clustering and classification results than those based on single data types (Fig. 2b and c and Table 1). Although SNF can sometimes improve the classification accuracy in leave-one-out cross-validation, its clustering accuracy is significantly reduced (Table 1).
When the complete patient relationships are defined only by the combination of different data types and each individual data type reveals only partial information about patient relationships, direct concatenation significantly improves the intra-class consistency, the inter-class discrimination, and the accuracy of clustering and classification (Fig. 2d and Table 1). SNF also performed well in this situation, with a classification accuracy slightly better than that of direct concatenation. However, the clustering accuracy of SNF is much lower than that of direct concatenation (Fig. 2d and Table 1).
When the patient relationships are defined in conflicting ways by the different data types, patients in fact fall into more than the number of classes suggested by any single data type. For example, in SD5 (Fig. 2e), data1 defines two classes and data2 also defines two classes. However, the two clustering schemes conflict, and the patients in fact form four distinct classes. The performance of direct concatenation is affected in this situation, with the accuracy of both clustering and classification reduced significantly (Fig. 2e and Table 1). In particular, the leave-one-out classification accuracy is reduced to an unsatisfactory 63 %. SNF obtains better classification accuracy (93 %), but its clustering accuracy is unsatisfactory: the normalized mutual information between the true clustering scheme and the SNF clustering results is as low as 0.12 (Fig. 2e and Table 1). Therefore, conflicting patient relationships defined by different data types impair the performance of both direct concatenation and SNF.
In summary, the performance of direct concatenation seems to be resistant to the incompleteness of patient relationships of individual data types, but it can be heavily affected by the discrepancy of scales, noise types, noise sizes, and the conflicts of the patient relationships. SNF significantly improves the classification accuracy in the situations of incomplete and conflicting patient relationships, but its clustering performance is heavily affected by these factors.

Performance of iBFE on simulated datasets
We then applied the iBFE variants to the simulated datasets to evaluate whether iBFE can surmount these disturbing factors. On SD1, i.e., datasets that simulate scale issues, iBFE1 achieves better results than direct concatenation and SNF on all the evaluation metrics, including intra-class consistency, inter-class discrimination and accuracy of clustering and classification (Fig. 2a and Table 1). The leave-one-out classification accuracy of iBFE1 is comparable to or better than that of direct concatenation, and the clustering accuracy of iBFE1 also approaches 1, significantly higher than those of direct concatenation and SNF. On SD2 and SD3, i.e., datasets that simulate different noise types and sizes, iBFE1 also outperforms direct concatenation and SNF on almost all the evaluation metrics (Fig. 2b and c and Table 1). On SD4, which simulates incomplete patient relationships, iBFE1 demonstrated better intra-class consistency and inter-class discrimination, but its accuracy of clustering and classification is slightly lower than those of direct concatenation and SNF (Fig. 2d and Table 1). On SD5, which simulates conflicting patient relationships, iBFE1 outperformed direct concatenation and SNF on almost all the metrics (Fig. 2e and Table 1). On the realistic simulation datasets, iBFE1 also demonstrated superior performance (Fig. 3). iBFE2, which uses only Pearson correlation coefficients, and iBFE3, which uses only Spearman correlation coefficients, demonstrated performance similar to that of iBFE1, which uses both (Figs. 2 and 3 and Table 1). Because iBFE1 uses more information than iBFE2 and iBFE3, it is generally more robust and often gives clearer patterns of patient relationships (Table 1).
Therefore, iBFE surmounts all the difficulties caused by the five factors regarding almost all the evaluating metrics, and it significantly outperforms direct concatenation and SNF on situations with discrepancy of scale, noise and subtype definitions.

Performance of iBFE on real lung and kidney cancer datasets
The performance of iBFE was further evaluated on real lung and kidney cancer datasets produced by TCGA. Similar to the results on simulated datasets, iBFE also demonstrated superior intra-class consistency and inter-class discrimination on both the lung and kidney cancer datasets (Fig. 4, Table 2 and Additional file 1 and Additional file 2). Based on individual clustering schemes, direct concatenation, SNF and iBFE all achieved accuracy close to 1 (Table 2). Of the 106 lung cancer patients, 12 patients were identified as forming a single cluster by all three methods (see Additional file 1). Survival analysis demonstrated that these 12 patients showed significantly better prognosis than the other patients (p = 0.00255, log-rank test for Kaplan-Meier survival functions). Within the other 94 patients, no method identified clusters that have significantly different survival probabilities. This observation suggested that the performance of direct concatenation, SNF and iBFE is consistent when the signal/noise ratio in the datasets is adequately high. The discrimination of patients with better prognosis was mainly contributed by the DNA methylation data, because clustering based on only methylation data generated the same result whereas clustering based on mRNA expression or miRNA expression data did not. The normalized mutual information between the clustering schemes generated by individual data types and by integrative methods suggested that iBFE extracted more information from the DNA methylation data than direct concatenation and SNF.
Of the 122 kidney cancer patients, neither direct concatenation nor SNF identified patient clusters that showed significantly different prognosis. However, by clustering all the patients into three classes (as direct concatenation and SNF also did), iBFE identified two classes of patients that had significantly good (p = 0.00892, log-rank test for Kaplan-Meier survival functions) or poor (p = 0.00017, log-rank test for Kaplan-Meier survival functions) prognosis relative to the other patients (Fig. 5). The mRNA expression data contributed mainly to the identification of patient clusters with good or poor prognosis. The mRNA expression data individually suggested the existence of patient clusters with good or poor prognosis, but the p-values (p = 0.02109 for good prognosis and p = 0.00042 for poor prognosis, log-rank test for Kaplan-Meier survival functions) were higher than those of iBFE. The miRNA expression data individually identified a cluster with poor prognosis, with a relatively high p-value (p = 0.03033). The DNA methylation data individually did not identify clusters with significantly different prognosis. The normalized mutual information between the clustering schemes generated by individual data types and by integrative methods suggested that iBFE extracted more information from the mRNA expression data than direct concatenation and SNF. These results suggest that iBFE can identify and merge the signals embedded in diverse data types to accurately identify disease subtypes and predict prognosis.
Table notes: The best performer was highlighted with the darkest color. PCC intraclass: average Pearson correlation coefficient of patients within the same classes; PCC interclass: average Pearson correlation coefficient of patients from different classes; Sim intraclass: average similarity of patients within the same classes measured by the Gaussian kernel; Sim interclass: average similarity of patients from different classes measured by the Gaussian kernel; ACC_rfLOO: accuracy of leave-one-out cross-validation by random forest; NMI_kmeans: normalized mutual information between the true patient relationships and the clustering results by k-means.
Fig. 3 Heatmaps of patient similarity on realistic simulation datasets. Compared to the simplistic simulations, the realistic simulations added many noisy features and unclassified patients, and the class sizes were also unequal. Patient similarity was measured by Pearson correlation coefficients. A, results on SD1 (issue of scales); B, results on SD2 (issue of noise types); C, results on SD3 (issue of noise sizes); D, results on SD4 (issue of incomplete patient relationships); E, results on SD5 (issue of conflicting patient relationships). iBFE1: integration by using both Pearson and Spearman correlation coefficients; iBFE2: integration by using only Pearson correlation coefficients; iBFE3: integration by using only Spearman correlation coefficients.

Discussion
The rapid development of high-throughput biomedical technologies has made it possible and cost-effective to comprehensively characterize patients with various diseases at multiple levels [1,2,4,5,10,14]. This will greatly advance the development of personalized medicine and holds promise for accurate diagnosis and prognosis [5,10,17,31]. However, the heterogeneity of the biological processes involved in the measurements and of the distinct technologies also raises significant challenges for integrative analyses [5]. Although direct concatenation is the simplest and most intuitive method to adopt and some alternative methods have been proposed, the performance of these methods is not satisfactory and the factors that hamper their performance are unclear. In this study, we dissected the possible disturbing factors and evaluated their impacts on integrative analyses by simulation, which clearly illustrates those restricting factors. Inspired by the simulation results and the fact that disease class discovery and prediction can often obtain better results in the feature space extracted from the original data [18][19][20], we proposed a novel method, called iBFE, for integrating diverse genomic data types towards accurate diagnosis and prognosis. Evaluation on both simulated and real datasets suggests that iBFE can overcome those restricting factors successfully. iBFE can identify patient clusters that show significantly different prognosis, which is important for understanding the subtypes of diseases and for improving patients' health.
Fig. 5 Survival curves of kidney cancer subtypes revealed by different data types and integration methods
The principles behind iBFE are simple. Building on the feature extraction concept, iBFE employs Pearson and Spearman correlation coefficients as the atomic operations to overcome the difficulties posed by discrepancies in scales, noise and embedded patient relationships. Because Pearson and Spearman correlation coefficients have no parameters to tune, iBFE is also parameter-free. Furthermore, because of its simplicity, iBFE can flexibly incorporate other feature extraction methods to further improve the integrative analysis. Like direct concatenation and SNF, iBFE is unsupervised: its usage does not require any prior information about the datasets or patients. Moreover, iBFE improves computing efficiency by transforming the original data of thousands of variables into a small number of variables. All these properties of iBFE greatly facilitate its application in practice.

Conclusions
In conclusion, we evaluated those restricting factors that hamper integrative analyses of diverse genomic datasets generated by various biomedical technologies, and proposed a simple, flexible and powerful method to overcome these restricting factors. Examinations on both simulated and real datasets suggest that the new method can effectively and efficiently identify disease subtypes and predict prognosis.

Consent
Written informed consent was obtained from the patient by the TCGA project for the publication of this report and any accompanying images.

Additional files
Additional file 1: Heatmaps of patient similarity for lung cancer. The similarity scores were measured by the Pearson correlation coefficients based on single data types (DNA methylation, mRNA expression and miRNA expression) and the integrated scores (integrated by direct concatenation, SNF and iBFE). (DOCX 802 kb)