Integrating heterogeneous genomic data to accurately identify disease subtypes
BMC Medical Genomics volume 8, Article number: 78 (2015)
Abstract
Background
High-throughput biotechnologies have been widely used to characterize clinical samples from various perspectives, e.g., epigenomics, genomics and transcriptomics. However, because of the heterogeneity of these technologies and their outputs, analyzing each data type individually rarely yields a comprehensive view of disease subtypes. Integrative methods are urgently needed.
Methods
In this study, we evaluated the issues that hamper integrative analysis of heterogeneous disease data types, and proposed iBFE, an effective and efficient computational method that overcomes those issues from a feature extraction perspective.
Results
Rigorous experiments on both simulated and real datasets demonstrated that iBFE can overcome issues caused by scale conflicts, noise conflicts, incompleteness of patient relationships, and conflicts between patient relationships, and that iBFE can effectively combine the merits of DNA methylation, mRNA expression and microRNA (miRNA) expression datasets to accurately identify disease subtypes with significantly different prognoses.
Conclusions
iBFE is an effective and efficient method for integrative analysis of heterogeneous genomic data to accurately identify disease subtypes. The Matlab code of iBFE is freely available from http://zhangroup.aporc.org/iBFE.
Background
With the development of high-throughput genomic technologies, it has become easy and cost-effective to comprehensively characterize clinical samples with a wide range of genomic data, e.g., depicting cancer samples from epigenomic, genomic and transcriptomic perspectives. Large-scale efforts conducted by The Cancer Genome Atlas (TCGA) have already applied this strategy to study over 20 cancers from thousands of patients, with a large amount of epigenomic, genomic, transcriptomic and clinical data collected from the same patients [1–4]. While the availability of such a wealth of well-structured data allows the status of patients to be characterized comprehensively and in fine detail, it also presents important challenges for the analysis methodology. Because of the great heterogeneity of technologies and biological data, individual analysis or simple concatenation of all the available datasets often cannot generate the desired results [5]. Although independent analyses of single datasets have commonly been adopted, their inconsistent conclusions underscore the necessity of unbiased integrative methods. Direct concatenation may generate even worse results because it exacerbates the “curse of dimensionality” [6], i.e., the number of measurements is much larger than the number of patients. The currently developed integrative methods for analysis of multiple genomic data of the same patients can generally be classified into three groups [7, 8]. The first group of methods is based on matrix factorization [9–13]. The second group of methods is based on Bayesian models [14–16]. A major issue with the factorization and Bayesian approaches is that they generally require proper data preprocessing and normalization techniques. These approaches are also computationally complicated. The third group, recently proposed by Wang et al., is based on network fusion and achieves state-of-the-art performance regarding both accuracy and computational speed, as demonstrated in [5].
However, it is still unknown which factors interfere with integrative analysis and what the pitfalls of the current integrative analytical methodology are. Dissecting the issues that interfere with integrative analysis and identifying alternative methods are essential for boosting the translation of advances in high-throughput genomic technologies to personalized medicine.
In this study, we explicitly interrogated factors that inhibit integrative analyses of multiple data types for both disease class discovery and classification [17]. By isolating those possible factors, we identified that the scales of measurement, the noise types and sizes, and the completeness and concordance of patient relationships in different data types are important issues that hinder integrative analysis, and that the currently available methods cannot overcome all of these issues. Motivated by the great power of feature extraction methods for unbiased and unsupervised analyses of single datasets [18], we proposed a novel integrative approach Based on Feature Extraction (referred to as iBFE below). Simulations suggested that iBFE can overcome all the issues identified in this study. Applications of iBFE to integrating the DNA methylation, mRNA expression and miRNA expression datasets of lung and kidney cancers produced by TCGA suggest that iBFE not only can successfully integrate the diverse data types but also can identify disease subtypes that have distinct survival profiles. Because iBFE is simple, flexible, unsupervised and unbiased, it can readily be extended to integrate more types of genomic datasets to improve disease diagnosis and prognosis.
Methods
Overview of the iBFE method
The iBFE method is motivated by the observation that the accuracy of disease class discovery and classification can be significantly improved in the feature space extracted from the original data [18–20]. The pipeline of iBFE consists of three steps: i) extract features from each individual dataset; ii) concatenate the extracted features; iii) extract new features from the concatenated features. Once the three steps are finished, the newly constructed features of patients can be used as inputs for disease class discovery and classification by other algorithms, e.g., k-means [21, 22] and support vector machines [23, 24].
First, iBFE uses Pearson and Spearman correlations to extract features from individual data types. Given a single dataset X_{MxN}^{(1)}, in which x_{ij}^{(1)} represents the jth variable of the ith patient (i ranges from 1 to M, and j ranges from 1 to N), P_{MxM}^{(1)} and S_{MxM}^{(1)} are constructed from X^{(1)}. P^{(1)} is the similarity matrix of patients constructed by Pearson correlation coefficients [25, 26], i.e., p_{ab}^{(1)} is the Pearson correlation coefficient of x_{a}^{(1)} and x_{b}^{(1)}. Here x_{a}^{(1)} and x_{b}^{(1)} represent the values of all the variables of the ath and bth patients, respectively. Similar to P^{(1)}, S^{(1)} is the similarity matrix of patients constructed by Spearman correlation coefficients [27, 28]. The advantage of Pearson correlation coefficients in feature extraction has been demonstrated and validated previously [18]. Spearman correlation coefficients are introduced here for their distribution-independent property, which is important for handling issues caused by scale and noise during integration. Both Pearson and Spearman correlation coefficients take values from −1 to 1, which provides a consistent scale for different data types.
Given K types of datasets, in the second step, P^{(k)} and S^{(k)}, k = 1,…,K, are concatenated into Y_{Mx2MK}, i.e., Y_{Mx2MK} = [P^{(1)} S^{(1)} … P^{(k)} S^{(k)} … P^{(K)} S^{(K)}], where the rows of Y represent patients while the columns of Y are the features extracted by Pearson and Spearman correlation coefficients. Because P^{(k)} and S^{(k)} are naturally normalized to the range from −1 to 1, concatenation at this step does not suffer from the issues encountered during direct concatenation of the original datasets.
In the third step, a new similarity matrix of patients Z_{MxM} is constructed by calculating the Pearson correlation coefficients of the rows of Y, i.e., z_{ij} is the Pearson correlation coefficient of y_{i} and y_{j}, where y_{i} and y_{j} represent the ith and jth rows of Y, respectively. Z_{MxM} constitutes the final features extracted by iBFE from the K types of original datasets. In practice, the original datasets generally consist of thousands of variables because thousands of genes are measured at the epigenomic, genomic and transcriptomic levels by high-throughput biotechnologies. By mapping the original datasets into the feature space spanned by profiles of patient similarities, iBFE extracts the patterns embedded in patient relationships. Further, the computational cost is greatly reduced, because each patient is described by M similarity values rather than thousands of variables.
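The scale argument for Spearman correlation can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab code: the helper names (`pearson_similarity`, `rank_rows`, `spearman_similarity`) and the toy data are hypothetical, and ties are broken by position rather than averaged, which is an assumption that only matters when a patient has repeated values.

```python
import numpy as np

def pearson_similarity(X):
    """Pearson correlation between every pair of rows (patients) of X."""
    return np.corrcoef(X)

def rank_rows(X):
    """Rank the values within each row (0 = smallest).
    Assumption: ties are broken by position, not averaged."""
    order = np.argsort(X, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(X.shape[0])[:, None]
    ranks[rows, order] = np.arange(X.shape[1])[None, :]
    return ranks.astype(float)

def spearman_similarity(X):
    """Spearman correlation = Pearson correlation of the within-row ranks."""
    return np.corrcoef(rank_rows(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))      # 6 toy patients, 40 toy variables
X_exp = np.exp(3.0 * X)           # strictly increasing change of scale

P_raw, P_exp = pearson_similarity(X), pearson_similarity(X_exp)
S_raw, S_exp = spearman_similarity(X), spearman_similarity(X_exp)

# Spearman similarities are unchanged by the monotone transform,
# whereas Pearson similarities generally are not.
print(np.allclose(S_raw, S_exp), np.allclose(P_raw, P_exp))
```

Both matrices are M x M with entries between −1 and 1, which is what allows them to be placed side by side in the next step.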
In summary, the algorithm of iBFE can be outlined as follows:

Step I: calculate P^{(k)} and S^{(k)} for X^{(k)}, k = 1,…,K;

Step II: construct Y = [P^{(1)} S^{(1)} … P^{(k)} S^{(k)} … P^{(K)} S^{(K)}];

Step III: construct Z by calculating the Pearson correlation coefficients of rows of Y.
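The three steps above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the released Matlab code: the function name `ibfe`, its flags `use_pearson` and `use_spearman`, and the toy data are hypothetical, and the ranking helper breaks ties by position.

```python
import numpy as np

def _rank_rows(X):
    """Within-row ranks (ties broken by position; an assumption)."""
    order = np.argsort(X, axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(X.shape[0])[:, None], order] = np.arange(X.shape[1])[None, :]
    return ranks.astype(float)

def ibfe(datasets, use_pearson=True, use_spearman=True):
    """datasets: list of K arrays, each M patients x N_k variables,
    with patients in the same order. Returns the M x M matrix Z."""
    if not (use_pearson or use_spearman):
        raise ValueError("enable at least one correlation type")
    blocks = []
    for X in datasets:
        X = np.asarray(X, dtype=float)
        if use_pearson:
            blocks.append(np.corrcoef(X))              # Step I: P^(k)
        if use_spearman:
            blocks.append(np.corrcoef(_rank_rows(X)))  # Step I: S^(k)
    Y = np.hstack(blocks)   # Step II: M x 2MK when both types are used
    return np.corrcoef(Y)   # Step III: Pearson correlation of rows of Y

# Toy example: two clusters of 10 patients seen through two data types
# measured on very different scales.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
proto = np.zeros((20, 30))
proto[:10, :15] = 1.0
proto[10:, 15:] = 1.0
X1 = proto + rng.normal(0, 0.5, proto.shape)
X2 = np.exp(3 * (proto + rng.normal(0, 0.5, proto.shape)))

Z = ibfe([X1, X2])
same = labels[:, None] == labels[None, :]
within = Z[same & ~np.eye(20, dtype=bool)].mean()
between = Z[~same].mean()
print(within > between)
```

In this sketch, `use_spearman=False` and `use_pearson=False` would correspond to the Pearson-only and Spearman-only variants, respectively. The rows of Z can then be passed to any clustering or classification routine.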
We refer to the iBFE variant that uses both Pearson and Spearman correlation coefficients as iBFE_{1}. To evaluate the performance of iBFE when only one type of correlation coefficient is employed, we also constructed iBFE_{2}, which uses only Pearson correlation coefficients, and iBFE_{3}, which uses only Spearman correlation coefficients.
Simulating datasets that dissect possible issues interfering with integration
We evaluated the factors that may affect integration of different types of datasets for disease class discovery and classification by simulation. Because simulation can highlight one possible factor while controlling the influence of other factors, it provides an ideal tool to evaluate the impact of single factors on integration, although some simulations may not be fully realistic. Based on our experience, we hypothesized that the following factors may affect integrative analyses: i) scales of measurements in different datasets; ii) noise types of different datasets; iii) noise sizes; iv) completeness of the patient relationships revealed by single datasets; v) concordance of the patient relationships revealed by each dataset. To evaluate their roles during integrative analyses, we constructed five simulated datasets (Fig. 1).
With simulated dataset 1 (SD1), we evaluated the impact of scale conflicts of measurements on integrative analyses. We simulated 100 patients characterized by 100 variables for simplicity. The first 50 patients belong to cluster 1, with the first 50 variables all one and the other 50 variables all zero. The second 50 patients belong to cluster 2, with the first 50 variables all zero and the other 50 variables all one. All the 100x100 measurements are perturbed by noise sampled from a standard normal distribution. We named this prototype data 0 (SD1D0), the hidden real data. Two types of observed data are generated from SD1D0. Type 1 of SD1 (SD1T1) is constructed by raising SD1D0 to its q^{th} power, i.e., x_{ij}^{(SD1T1)} = (x_{ij}^{(SD1D0)})^{q}. Type 2 of SD1 (SD1T2) is constructed by x_{ij}^{(SD1T2)} = q^(x_{ij}^{(SD1D0)}). Here q is a parameter that controls the scale difference between the two data types. The power-law and exponential functions are used to simulate the issues caused by the scales of different measurements.
With simulated dataset 2 (SD2), we evaluated the impact of different noise types on integrative analyses. The prototype SD2D0 is the same as SD1D0 except that no noise is added. The observed SD2T1 and SD2T2 are constructed from SD2D0 by adding noise sampled, respectively, from a normal distribution with mean zero and standard deviation q and from a uniform distribution from zero to q, where q is the parameter that controls the size of the noise.
With simulated dataset 3 (SD3), the impact of different noise sizes is evaluated. The prototype SD3D0 is the same as SD2D0. The observed SD3T1 and SD3T2 are constructed from SD3D0 by adding noise sampled from normal distributions with mean zero and different standard deviations.
With simulated dataset 4 (SD4), we evaluated the impact of incomplete patient relationships embedded in single data types on integrative analyses. Two prototype datasets, i.e., SD4D01 and SD4D02, are constructed. SD4D01 simulates 100 patients by 100 variables for simplicity, in which the first 50 patients form a cluster with the first 50 variables all one and the other 50 variables all zero. The relationships of the other 50 patients are not defined in SD4D01 and the corresponding variables are all zero. In SD4D02, the relationships of the first 50 patients are not defined (with all the corresponding variables zero) but the other 50 patients are defined as another cluster (with the first 50 variables all zero and the other 50 variables all one). SD4D01 and SD4D02 together define the complete relationships of the 100 patients. SD4T1 and SD4T2 are constructed from SD4D01 and SD4D02, respectively, by adding noise sampled from a normal distribution with mean zero and standard deviation q.
With simulated dataset 5 (SD5), the impact of conflicting patient relationships embedded in different data types is examined. Two prototype datasets, i.e., SD5D01 and SD5D02, are constructed. SD5D01 simulates 100 patients by 100 variables for simplicity, in which the first 50 patients form cluster 1 with the first 50 variables all one and the other 50 variables all zero, whereas the other 50 patients form cluster 2 with the first 50 variables all zero and the other 50 variables all one. In SD5D02, the first 30 patients and the last 30 patients form one cluster and the middle 40 patients form another cluster. SD5D01 and SD5D02 each define two clusters individually, but together they define four clusters of the 100 patients. SD5T1 and SD5T2 are constructed from SD5D01 and SD5D02, respectively, by adding noise sampled from a normal distribution with mean zero and standard deviation q.
Real datasets generally have many noisy features that are uninformative for identifying disease subtypes and many patients that cannot be definitively assigned to a disease subtype. Different disease subtypes also have different sizes. We constructed another five realistic simulation datasets by adding these properties to SD1–SD5: the size of the second disease subtype was doubled, 50 unclassified patients were added, and additional features (10 times the number of informative features) sampled from normal distributions were added to each simulated dataset.
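As a rough illustration of the SD1 construction, the following NumPy sketch generates SD1D0, SD1T1 and SD1T2. The function name `simulate_sd1` and the random seed are hypothetical, and q is assumed to be a positive integer so that the q-th power is defined for negative values.

```python
import numpy as np

def simulate_sd1(q, rng):
    """Sketch of SD1. Assumption: q is a positive integer."""
    n_patients, n_vars = 100, 100
    proto = np.zeros((n_patients, n_vars))
    proto[:50, :50] = 1.0    # cluster 1: first 50 variables are one
    proto[50:, 50:] = 1.0    # cluster 2: last 50 variables are one
    d0 = proto + rng.standard_normal(proto.shape)  # SD1D0, hidden real data
    t1 = d0 ** int(q)        # SD1T1: x^q (power-law scale)
    t2 = float(q) ** d0      # SD1T2: q^x (exponential scale)
    labels = np.repeat([0, 1], 50)
    return d0, t1, t2, labels

rng = np.random.default_rng(0)
sd1_d0, sd1_t1, sd1_t2, sd1_labels = simulate_sd1(q=20, rng=rng)

# The two observed types live on very different numeric scales.
print(np.abs(sd1_t1).max(), np.abs(sd1_t2).max())
```

For an even q the power transform also discards the sign of each measurement, so SD1T1 is not a monotone function of SD1D0; this is a property of the formula as stated, not an addition of the sketch.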
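The SD2 and SD3 constructions differ only in the noise that is added to the shared noise-free prototype, as the following sketch shows. The function names and the standard deviations passed to `simulate_sd3` are hypothetical; the text does not state the specific values used for SD3.

```python
import numpy as np

def two_cluster_prototype(n_patients=100, n_vars=100):
    """Noise-free prototype shared by SD2D0 and SD3D0."""
    proto = np.zeros((n_patients, n_vars))
    hp, hv = n_patients // 2, n_vars // 2
    proto[:hp, :hv] = 1.0
    proto[hp:, hv:] = 1.0
    return proto

def simulate_sd2(q, rng):
    """SD2: same prototype, different noise types controlled by q."""
    d0 = two_cluster_prototype()
    t1 = d0 + rng.normal(0.0, q, size=d0.shape)    # normal, mean 0, sd q
    t2 = d0 + rng.uniform(0.0, q, size=d0.shape)   # uniform on [0, q)
    return d0, t1, t2

def simulate_sd3(sd_a, sd_b, rng):
    """SD3: same noise type, different noise sizes (illustrative values)."""
    d0 = two_cluster_prototype()
    t1 = d0 + rng.normal(0.0, sd_a, size=d0.shape)
    t2 = d0 + rng.normal(0.0, sd_b, size=d0.shape)
    return d0, t1, t2

rng = np.random.default_rng(1)
sd2_d0, sd2_t1, sd2_t2 = simulate_sd2(q=2.0, rng=rng)
sd3_d0, sd3_t1, sd3_t2 = simulate_sd3(sd_a=0.5, sd_b=3.0, rng=rng)
```

Note that the uniform noise in SD2T2 has a nonzero mean of q/2, so it shifts the data as well as spreading it.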
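The SD4 and SD5 prototypes can be written with one helper that maps a per-patient cluster membership to a block pattern. This is a sketch under assumptions: the helper names are hypothetical, and the variable pattern of SD5D02 is not spelled out in the text, so the same half-and-half block pattern as SD5D01 is assumed.

```python
import numpy as np

def block_pattern(membership, n_vars=100):
    """membership 0 -> first half of the variables is one,
    membership 1 -> second half is one,
    membership -1 -> relationship undefined (all zero)."""
    membership = np.asarray(membership)
    half = n_vars // 2
    proto = np.zeros((membership.size, n_vars))
    proto[membership == 0, :half] = 1.0
    proto[membership == 1, half:] = 1.0
    return proto

def add_normal_noise(proto, q, rng):
    return proto + rng.normal(0.0, q, size=proto.shape)

# SD4: each data type defines only one of the two clusters.
sd4_d01 = block_pattern([0] * 50 + [-1] * 50)
sd4_d02 = block_pattern([-1] * 50 + [1] * 50)

# SD5: each data type defines two clusters, but the partitions conflict.
m1 = np.array([0] * 50 + [1] * 50)
m2 = np.array([0] * 30 + [1] * 40 + [0] * 30)  # assumed pattern for SD5D02
sd5_d01 = block_pattern(m1)
sd5_d02 = block_pattern(m2)
sd5_labels = 2 * m1 + m2   # the four clusters defined jointly

rng = np.random.default_rng(2)
q = 1.0   # illustrative noise level
sd4_t1, sd4_t2 = add_normal_noise(sd4_d01, q, rng), add_normal_noise(sd4_d02, q, rng)
sd5_t1, sd5_t2 = add_normal_noise(sd5_d01, q, rng), add_normal_noise(sd5_d02, q, rng)

print(np.bincount(sd5_labels))  # sizes of the four joint clusters
```

Crossing the two SD5 partitions gives joint clusters of 30, 20, 20 and 30 patients in patient order, which is why k = 4 is the appropriate number of clusters for SD5.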
Evaluating iBFE and other integrative methods on simulated datasets
We use three types of metrics to evaluate the factors interfering with integrative analyses and the ability of various integrative methods to overcome these factors in different situations. The first type of metric examines the intraclass consistency and interclass discrimination of patients based on the respective features constructed by individual integrative methods. Two measures are employed: Pearson correlation coefficients and the Gaussian kernel constructed from the Euclidean distance of the extracted features. The second type of metric examines the performance of each integrative method for disease class discovery, i.e., clustering patients into subtypes. The widely used k-means algorithm (implemented in Matlab 8.1) is applied 1000 times to the features extracted by each integrative method, with k = 2 on SD1–SD4 and k = 4 on SD5. The clustering scheme with the minimum sum of point-to-centroid distances is selected as the final clustering for evaluation. The normalized mutual information between the true clusters and each clustering scheme generated by the different integrative methods is calculated to quantify their performance [5]. The third type of metric evaluates the performance of each integrative method for predicting the disease classes of patients when the disease subtypes of some patients are known. The widely used random forest algorithm [29] is used as the classifier because random forest is robust and accurate and can be applied to both linearly and nonlinearly separable situations. To reduce biases caused by overfitting, the leave-one-out cross-validation scheme is used [30].
Three integrative analysis methods are included in the evaluation, i.e., direct concatenation [5], similarity network fusion (SNF) [5] and the iBFE variants. Direct concatenation is included because it is the most intuitive method to integrate various types of datasets to comprehensively characterize diseases. Including direct concatenation clearly illustrates the impact of the suspected factors on integrative analyses. SNF is the state-of-the-art algorithm recently proposed for integrative analyses [5], which demonstrates excellent performance in combining multiple genomic datasets to predict subtypes and survival of various cancer patients. In particular, SNF has been demonstrated to outperform other integrative methods such as iCluster [31], which is based on preselection of genes. Direct concatenation was implemented by the matrix concatenation operation in Matlab. The Matlab code of SNF was downloaded from http://compbio.cs.toronto.edu/SNF/SNF/Software.html.
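For reference, normalized mutual information between two clusterings can be computed from their contingency table. The sketch below normalizes by the geometric mean of the two entropies; the text does not state which normalization was used, so this choice, the function name, and the convention of returning 0 when either clustering has a single cluster are assumptions.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """NMI = I(A;B) / sqrt(H(A) * H(B)), natural logarithms.
    Assumption: returns 0.0 if either labelling has a single cluster."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    ai, bi = ai.ravel(), bi.ravel()
    counts = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(counts, (ai, bi), 1.0)          # contingency table
    pxy = counts / a.size
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    hx = -np.sum(px * np.log(px))
    hy = -np.sum(py * np.log(py))
    if hx == 0.0 or hy == 0.0:
        return 0.0
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    return float(mi / np.sqrt(hx * hy))

# Identical partitions up to relabelling versus independent partitions.
nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_indep = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(nmi_same, nmi_indep)
```

The measure is invariant to how cluster labels are numbered, which is why it suits comparing k-means output against the true clusters.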
Evaluating iBFE on the DNA methylation, mRNA expression and miRNA expression datasets of lung and kidney cancers produced by TCGA
The DNA methylation, mRNA expression and miRNA expression datasets of lung squamous cell carcinoma (106 patients) and kidney renal clear cell carcinoma (122 patients) produced by TCGA are included to evaluate the performance of iBFE on real datasets [1, 4]. These two TCGA datasets were also used in the evaluation of SNF and other integrative methods [5]. Because the TCGA repository contains multiple platforms for each data type, the platform corresponding to the largest number of available individuals and describing both tumor samples and controls whenever possible was used to build the datasets. For expression data, the Broad Institute HT-HG-U133A platform was included in the lung cancer dataset, and the UNC-Illumina-Hiseq-RNASeq platform was included in the kidney cancer dataset. For miRNA expression data, the BCGSC-Illumina-GA-miRNAseq platform was included in the lung and kidney cancer datasets. For the methylation data, the JHU-USC-Human-Methylation-27 platform was included in both datasets. Patients’ clinical information was also included to evaluate the prognostic power of the proposed integrative analysis method.
Three types of metrics are used to evaluate the performance of iBFE. The first type of metric again examines the intraclass consistency and interclass discrimination of patients, using Pearson correlation coefficients and Euclidean distances. Because the true clustering schemes are not available for these two real datasets, the second and third types of metrics used on the simulated datasets cannot be reused. We therefore proposed an alternative measure to evaluate the performance of iBFE for disease class discovery and prediction. First, k-means is applied 1000 times to obtain the clustering scheme on each cancer dataset with k ranging from 2 to 10. Then the most stable k-means clustering scheme is taken as the true subtypes of patients to calculate the leave-one-out accuracy of the iBFE features, which serves as the second type of metric. The third type of metric examines whether the integrative analyses can identify disease subtypes that have significantly different survival probabilities. Although factors beyond the genomic measurements may also affect survival probability, prognosis prediction based on genomic data may be helpful for clinicians.
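The survival comparison between a candidate subtype and the remaining patients relies on the two-group log-rank test. The following is a minimal sketch of the standard test statistic with a chi-square (1 degree of freedom) p-value; the function name and the toy survival times are hypothetical, and it is not the software used in this study.

```python
import math
import numpy as np

def logrank_test(time_a, event_a, time_b, event_b):
    """Two-group log-rank test. event = 1 for an observed death,
    0 for a censored observation. Returns (chi-square statistic, p-value)."""
    time = np.concatenate([time_a, time_b]).astype(float)
    event = np.concatenate([event_a, event_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(time_a), dtype=bool),
                           np.zeros(len(time_b), dtype=bool)])
    obs_minus_exp, variance = 0.0, 0.0
    for t in np.unique(time[event]):          # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                     # patients at risk at t
        n_a = (at_risk & in_a).sum()          # of which in group A
        died = event & (time == t)
        d = died.sum()                        # deaths at t
        d_a = (died & in_a).sum()             # of which in group A
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / variance
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(stat / 2)).
    return float(stat), math.erfc(math.sqrt(stat / 2.0))

# Toy data: group A dies early, group B dies late, no censoring.
stat_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                               [6, 7, 8, 9, 10], [1] * 5)
# Identical groups give a statistic of zero.
stat_eq, p_eq = logrank_test([1, 2, 3, 4], [1, 1, 0, 1],
                             [1, 2, 3, 4], [1, 1, 0, 1])
print(p_sep, p_eq)
```

In the setting described above, group A would be one cluster found by k-means and group B all other patients.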
Results
Factors interfering with integrative analyses highlighted by simulations
We evaluated the performance of the intuitive direct concatenation method and the state-of-the-art method SNF on each type of simulated dataset. For each setting of the controlling parameters, the simulations were repeated 100 times, and the averages of the evaluation metrics were recorded for comparison. We observed that all five factors can interfere with integrative analyses, influencing all the metrics including intraclass consistency, interclass discrimination and accuracy of clustering and classification.
The different scales of the two data types interfere with integrative analyses significantly when the controlling parameter q becomes large. When q is small, the scales of the two data types are close to each other, and the two data types can be treated as two replicates of the same dataset. Thus, both direct concatenation and SNF can clearly identify the true patient relationships and demonstrate good performance for both class discovery and classification. However, when q is large, although direct concatenation and SNF still show higher intraclass than interclass patient similarity, the accuracy of clustering by k-means based on either the concatenated features or the features constructed by SNF is significantly reduced. For example, when q = 20 (Fig. 2a and Table 1), the normalized mutual information between the clustering scheme produced by direct concatenation and the true patient clustering scheme is only 0.0354, whereas the normalized mutual information between the clustering scheme produced by SNF and the true scheme is 0.00519. Therefore, scale issues significantly impair the accuracy of clustering based on multiple data types. For disease class prediction, direct concatenation demonstrates good performance (94 % accuracy) when q = 20 while SNF shows unsatisfactory performance (52 % accuracy).
The noise types and sizes also influence the integration of different data. Direct concatenation generally produces worse clustering and classification results than those based on single data types (Fig. 2b and c and Table 1). Although SNF can sometimes improve the classification accuracy in leave-one-out cross-validation, the accuracy of clustering is significantly reduced (Table 1).
When the complete patient relationships are defined only by the combination of different data types and each individual data type reveals only partial information about patient relationships, direct concatenation significantly improves the intraclass consistency, the interclass discrimination, and the accuracy of clustering and classification (Fig. 2d and Table 1). SNF also performed well in this situation, with classification accuracy slightly better than that of direct concatenation. However, the clustering accuracy of SNF is much lower than that of direct concatenation (Fig. 2d and Table 1).
When the patient relationships are conflictingly defined by the different data types, patients are in fact clustered into more than one class. For example, in SD5 (Fig. 2e), data1 defines two classes and data2 also defines two classes. However, the two clustering schemes conflict, and the patients in fact form four distinct classes. The performance of direct concatenation is affected in this situation, with the accuracy of both clustering and classification reduced significantly (Fig. 2e and Table 1). In particular, the leave-one-out accuracy of classification is reduced to an unsatisfactory 63 %. SNF obtains better classification accuracy (93 %) but its clustering accuracy is unsatisfactory: the normalized mutual information between the true clustering scheme and the SNF clustering results was as low as 0.12 (Fig. 2e and Table 1). Therefore, conflicting patient relationships defined by different data types impair the performance of both direct concatenation and SNF.
In summary, the performance of direct concatenation seems to be resistant to the incompleteness of patient relationships of individual data types, but it can be heavily affected by the discrepancy of scales, noise types, noise sizes, and the conflicts of the patient relationships. SNF significantly improves the classification accuracy in the situations of incomplete and conflicting patient relationships, but its clustering performance is heavily affected by these factors.
Performance of iBFE on simulated datasets
We then applied the iBFE variants to the simulated datasets to evaluate whether iBFE can surmount these disturbing factors. On SD1, i.e., datasets that simulate scale issues, iBFE_{1} achieves better results than direct concatenation and SNF regarding all the evaluation metrics, including intraclass consistency, interclass discrimination and accuracy of clustering and classification (Fig. 2a and Table 1). The leave-one-out classification accuracy of iBFE_{1} is comparable to or better than that of direct concatenation, and the clustering accuracy of iBFE_{1} approaches 1, significantly higher than those of direct concatenation and SNF. On SD2 and SD3, i.e., datasets that simulate different noise types and sizes, iBFE_{1} also outperforms direct concatenation and SNF regarding almost all the evaluation metrics (Fig. 2b and c and Table 1). On SD4, which simulates incomplete patient relationships, iBFE_{1} demonstrated better intraclass consistency and interclass discrimination, but its accuracy of clustering and classification is slightly lower than those of direct concatenation and SNF (Fig. 2d and Table 1). On SD5, which simulates conflicting patient relationships, iBFE_{1} outperformed direct concatenation and SNF regarding almost all the metrics (Fig. 2e and Table 1). On the realistic simulation datasets, iBFE_{1} also demonstrated superior performance (Fig. 3). iBFE_{2}, which uses only Pearson correlation coefficients, and iBFE_{3}, which uses only Spearman correlation coefficients, demonstrated performance similar to that of iBFE_{1}, which uses both (Figs. 2 and 3 and Table 1). Because iBFE_{1} uses more information than iBFE_{2} and iBFE_{3}, it is generally more robust and often yields clearer patterns of patient relationships (Table 1).
Therefore, iBFE surmounts the difficulties caused by the five factors regarding almost all the evaluation metrics, and it significantly outperforms direct concatenation and SNF in situations with discrepancies of scale, noise and subtype definitions.
Performance of iBFE on real lung and kidney cancer datasets
The performance of iBFE was further evaluated on real lung and kidney cancer datasets produced by TCGA. Similar to the results on simulated datasets, iBFE also demonstrated superior intraclass consistency and interclass discrimination on both the lung and kidney cancer datasets (Fig. 4, Table 2 and Additional file 1 and Additional file 2). Based on individual clustering schemes, direct concatenation, SNF and iBFE all achieved accuracy close to 1 (Table 2).
Of the 106 lung cancer patients, 12 patients were identified as forming a single cluster by all three methods (see Additional file 1). Survival analysis demonstrated that these 12 patients showed significantly better prognosis than the other patients (p = 0.00255, log-rank test for Kaplan-Meier survival functions). Within the other 94 patients, none of the methods identified clusters with significantly different survival probabilities. This observation suggests that the performance of direct concatenation, SNF and iBFE is consistent when the signal-to-noise ratio in the datasets is adequately high. The discrimination of patients with better prognosis was mainly contributed by the DNA methylation data, because clustering based on methylation data alone generated the same result whereas clustering based on mRNA expression or miRNA expression data did not. The normalized mutual information between the clustering schemes generated by individual data types and by the integrative methods suggested that iBFE extracted more information from the DNA methylation data than direct concatenation and SNF.
Of the 122 kidney cancer patients, neither direct concatenation nor SNF identified patient clusters that showed significantly different prognosis. However, by clustering all the patients into three classes (as did direct concatenation and SNF), iBFE identified two classes of patients that had significantly better (p = 0.00892, log-rank test for Kaplan-Meier survival functions) or worse (p = 0.00017, log-rank test for Kaplan-Meier survival functions) prognosis than the other patients (Fig. 5). The mRNA expression data contributed most to the identification of patient clusters with good or poor prognosis. The mRNA expression data individually suggested the existence of patient clusters with good or poor prognosis, but the p-values (p = 0.02109 for good prognosis and p = 0.00042 for poor prognosis, log-rank test for Kaplan-Meier survival functions) were higher than those of iBFE. The miRNA expression data individually identified a cluster with poor prognosis, with a higher p-value (p = 0.03033). The DNA methylation data individually did not identify clusters with significantly different prognosis. The normalized mutual information between the clustering schemes generated by individual data types and by the integrative methods suggested that iBFE extracted more information from the mRNA expression data than direct concatenation and SNF. These results suggest that iBFE can identify and merge the signals embedded in diverse data types to accurately identify disease subtypes and predict prognosis.
Discussion
The rapid development of high-throughput biomedical technologies has made it possible and cost-effective to comprehensively characterize patients with various diseases at multiple levels [1, 2, 4, 5, 10, 14]. This will greatly advance the development of personalized medicine and holds promise for accurate diagnosis and prognosis [5, 10, 17, 31]. However, the heterogeneity of the biological processes involved in the measurements and the distinct technologies also raise significant challenges for integrative analyses [5]. Although direct concatenation is the simplest and most intuitive method to adopt and some alternative methods have been proposed, the performance of these methods is not satisfactory and the factors that hamper their performance are unclear. In this study, we dissected the possible disturbing factors and evaluated their impact on integrative analyses by simulation, which clearly illustrates those restricting factors. Inspired by the simulation results and the fact that disease class discovery and prediction can often obtain better results in the feature space extracted from the original data [18–20], we proposed a novel method, called iBFE, for integrating diverse genomic data types towards accurate diagnosis and prognosis. Evaluation on both simulated and real datasets suggests that iBFE can overcome those restricting factors successfully. iBFE can identify patient clusters that show significantly different prognosis, which is important for understanding the subtypes of diseases and for improving patients’ health.
The principles behind iBFE are simple. Building on the feature extraction concept, iBFE employs Pearson and Spearman correlation coefficients as the atomic operations to overcome the difficulties posed by discrepancies of scale, noise and embedded patient relationships. Because Pearson and Spearman correlation coefficients have no parameters to tune, iBFE is also parameter-free. Furthermore, because of its simplicity, iBFE can flexibly incorporate other feature extraction methods to further improve the integrative analysis. Like direct concatenation and SNF, iBFE is unsupervised: its use does not require any prior information about the datasets and patients. Moreover, iBFE improves computational efficiency by transforming the original data of thousands of variables into a small number of variables. All these properties greatly facilitate the application of iBFE in practice.
Conclusions
In conclusion, we evaluated those restricting factors that hamper integrative analyses of diverse genomic datasets generated by various biomedical technologies, and proposed a simple, flexible and powerful method to overcome these restricting factors. Examinations on both simulated and real datasets suggest that the new method can effectively and efficiently identify disease subtypes and predict prognosis.
Consent
Written informed consent was obtained from the patient by the TCGA project for the publication of this report and any accompanying images.
Abbreviations
SD: Simulated dataset
SNF: Similarity network fusion
iBFE: Integration by feature extraction
TCGA: The Cancer Genome Atlas
References
The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012;489(7417):519–25.
The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487(7407):330–7.
Smith LM, Hartmann L, Drewe P, Bohnert R, Kahles A, Lanz C, et al. Multiple insert size paired-end sequencing for deconvolution of complex transcriptomes. RNA Biol. 2012;9(5):596–609.
The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013;499(7456):43–9.
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
Hughes G. On the mean accuracy of statistical pattern recognizers. Information Theory, IEEE Transactions on. 1968;14(1):55–63.
Kristensen VN, Lingjaerde OC, Russnes HG, Vollan HKM, Frigessi A, Borresen-Dale AL. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer. 2014;14(5):299–313.
Wei Y. Integrative Analyses of Cancer Data: A Review from a Statistical Perspective. Cancer Informatics. 2015:173–81.
Hofree M, Shen JP, Carter H, Gross A, Ideker T. Network-based stratification of tumor mutations. Nat Methods. 2013;10(11):1108–15.
Zhang S, Liu CC, Li W, Shen H, Laird PW, Zhou XJ. Discovery of multidimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 2012;40(19):9379–91.
Shen R, Wang S, Mo Q. Sparse integrative clustering of multiple omics data sets. 2013:269–94.
Shen R, Mo Q, Schultz N, Seshan VE, Olshen AB, Huse J, et al. Integrative Subtype Discovery in Glioblastoma Using iCluster. PLoS ONE. 2012;7(4):e35236.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28(24):3290–7.
Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29(20):2610–6.
Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci. 2003;100(14):8348–53.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 1999;286(5439):531–7.
Ren X, Wang Y, Zhang XS, Jin Q. iPcc: a novel feature extraction method for accurate disease class discovery and prediction. Nucleic Acids Res. 2013;41(14):e143.
Ren X, Wang Y, Wang J, Zhang XS. A unified computational model for revealing and predicting subtle subtypes of cancers. BMC Bioinformatics. 2012;13(1):70. doi:10.1186/147121051370.
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41(Database issue):D48–55.
MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, vol. 1. Berkeley, Calif: University of California Press; 1967.
Steinhaus H. Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III. 1956;4:801–4.
Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on Computational learning theory; Pittsburgh, Pennsylvania, USA: ACM; 1992. p. 144–52.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
Stigler SM. Francis Galton’s Account of the Invention of Correlation. 1989;(2):73–9.
Fisher RA. Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population. Biometrika. 1915;10(4):507–21.
Fieller EC, Hartley HO, Pearson ES. Tests for Rank Correlation Coefficients. I. Biometrika. 1957;44(3–4):470–81.
Choi SC. Tests of equality of dependent correlation coefficients. Biometrika. 1977;64(3):645–7.
Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI; 1995.
Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25(22):2906–12.
Acknowledgments
The authors thank members of ZHANGroup of Academy of Mathematics and Systems Science, Chinese Academy of Sciences for their valuable comments and contributions to discussions. This study is supported by projects from National Natural Science Foundation of China [91330114, 31200106, 11131009 and 61171007], by the National Science and Technology Major Project, “China MegaProject for Infectious Disease” [2013ZX10004601] and by Program for Changjiang Scholars and Innovative Research Team in University [IRT13007].
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
XR designed the study. XR and HF implemented the experiments and analysis. XR and QJ wrote the manuscript. All authors read and approved the final manuscript.
Authors’ information
Dr. Ren is an associate professor of the Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College. His interest is translating bioinformatic achievements to clinical applications.
Xianwen Ren and Hua Fu contributed equally to this work.
Additional files
Additional file 1:
Heatmaps of patient similarity for lung cancer. The similarity scores were measured by the Pearson correlation coefficients based on single data types (DNA methylation, mRNA expression and miRNA expression) and the integrated scores (integrated by direct concatenation, SNF and iBFE). (DOCX 802 kb)
Additional file 2:
Heatmaps of patient similarity for kidney cancer. The similarity scores were measured by the Pearson correlation coefficients based on single data types (DNA methylation, mRNA expression and miRNA expression) and the integrated scores (integrated by direct concatenation, SNF and iBFE). (DOCX 875 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ren, X., Fu, H. & Jin, Q. Integrating heterogeneous genomic data to accurately identify disease subtypes. BMC Med Genomics 8, 78 (2015). https://doi.org/10.1186/s1292001501545
DOI: https://doi.org/10.1186/s1292001501545
Keywords
 DNA methylation
 Gene expression
 miRNA expression
 Integration
 Diagnosis
 Prognosis
 Cancer stratification