Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis

Background Feature selection of multi-omics data analysis remains challenging owing to the size of omics datasets, comprising approximately \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^2$$\end{document}102–\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^5$$\end{document}105 features. In particular, appropriate methods to weight individual omics datasets are unclear, and the approach adopted has substantial consequences for feature selection. In this study, we extended a recently proposed kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) method to integrate multi-omics datasets obtained from common samples in a weight-free manner. Method KTD-based unsupervised FE was reformatted as the collection of kernelized tensors sharing common samples, which was applied to synthetic and real datasets. Results The proposed advanced KTD-based unsupervised FE method showed comparative performance to that of the previously proposed KTD method, as well as tensor decomposition-based unsupervised FE, but required reduced memory and central processing unit time. Moreover, this advanced KTD method, specifically designed for multi-omics analysis, attributes P values to features, which is rare for existing multi-omics–oriented methods. Conclusions The sample R code is available at https://github.com/tagtag/MultiR/.


Background
Feature selection with multi-omics datasets has been a long-standing challenge for bioinformatics. Among the numerous proposed methods adapted to multi-omics data analysis [1,2], only few are capable of performing feature selection. Most of these methods fail to implement feature selection because multi-omics data analysis has a strong tendency to involve a small number ( = n ) of samples with a large number ( = p ) of features, commonly referred to as the large p small n problem [3], posing difficulty for accurate feature selection. Features should have a sufficiently small P value to be selected under the null hypothesis. Since the raw P values must be heavily corrected for multiple comparisons when dealing with multi-omics datasets, P values become larger and inevitably less significant; thus, attributing significant P values to individual features, even after correction, is difficult. However, since the number of samples (i.e., conditions) is less than that of features (i.e., variables), labels or values attributed to samples can be accurately predicted by any model (when the number of conditions is less than the number of variables, either the labels or values attributed to samples may be predicted, even if the variables are purely random numbers).
In multi-omics analysis, it is difficult to obtain large sample sizes since multiple observations, each of which corresponds to individual omics approaches, must be performed. In this sense, the required cost and time Taguchi and Turki BMC Medical Genomics (2022) 15:37 are multiplied in proportion to the number of omics approaches considered. This often results in a smaller number of samples to which multi-omics measurements are performed when only limited experimental resources are available.
Principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) [4] have been proposed to be applied to feature selection for addressing the large p small n problem. Thus, these approaches are also suitable for feature selection in multi-omics analysis. Recently, the TD-based method was extended to kernel TD (KTD)-based unsupervised FE [5], which was applied to integrated analysis of N 6 -methyladenosine (m 6 A) and gene expression data [6]. These methods attribute P values to features, which is critical since this enables evaluation of the significance of the selected features, which is rarely possible using other methods applicable to multi-omics datasets [1,2]. In spite of this advantage, there are limitations of PCA, TD, and KTD-based unsupervised FE when applied to multiomics data analysis. PCA is inferior to TD when aiming for an integrated analysis of multiomics data. PCA failed to identify genes whose expression and methylation levels are altered simultaneously, but TD could [7]. Although KTD and TD successfully integrated two omics data, they could not achieve the following two points, 1 Reduction of required memory and long CPU time 2 Integration of more than two types of omics data simultaneously (see Discussions below). We here described a modification of the KTD-based unsupervised FE method to be more suitable for multi-omics data analysis. Although only a small modification was implemented, it nevertheless resulted in more flexibility for multi-omics data analysis, which was verified using synthetic and real data.
Before explaining the results, we briefly discuss the relationship between feature selection and feature extraction, both of which are employed when we are forced to deal with the large p small n problem. The former, feature selection, is more straightforward than the latter; it reduces the number of features less than the number of samples. On the other hand, feature extraction is more indirect, since it generates a limited number of new features from the original large number of features. In this study, we have employed a mixed strategy of these two. We first generated new features using feature extraction and selected features using the generated features.

Extended KTD-based unsupervised FE method
Suppose that we have K multi-omics datasets with N k features formatted as tensors sharing sample indices j 1 , . . . , j m as: refers to the i k th feature of the kth omics dataset. When K types of omics data are measured for each sample, k ∈ [1, K ].
x i k j 1 j 2 ···j m can be kernelized as where K k is an arbitrary kernel applied to x i k j 1 j 2 ···j m . Higher-order singular-value decomposition (HOSVD) [4] is then applied to x j 1 ···j m j ′ 1 ···j ′ m k , resulting in Eq. (4), where ℓ s , (1 ≤ s ≤ 2m) refers to the ℓ s th singular-value vectors attributed to the sth experiment type for 1 ≤ s ≤ m and the (s − 2m) th experiment type for m + 1 ≤ s ≤ 2m , respectively. ℓ 2m+1 refers to the ℓ 2m+1 th singular-value vector attributed to the omics datasets, k. u ℓ s j s ∈ R M s ×M s and u ℓ 2m+1 k ∈ R K ×K are singular-value matrices, which are also orthogonal matrices, is a core tensor that represents the weight of individual terms composed of the products of singular-value vectors. Here, one should note that u ℓ 2m+1 k represents the balance (weight) between multi-omics datasets, which (1) usually must be pre-defined manually in the case of a conventional supervised learning approach for FE.
Next, the u ℓ s j s values that are of interest from the biological point of view (e.g., distinct values between the two classes being compared) must be identified. Using these selected u ℓ s j s values, singular-value vectors are derived and assigned to i k s as where ℓ = (ℓ 1 , . . . , ℓ m ).
Finally, P values are attributed to i k assuming that the u ℓ i k values obey a multivariate Gaussian distribution (null hypothesis) as where P χ 2 [> x] is a cumulative χ 2 distribution in which the argument is larger than x and σ ℓ is the standard deviation. Here, summation is taken over ℓ s selected as being of interest. P values are then computed by the pchisq function in R [8].
The obtained P i k values are corrected using the Benjamini-Hochberg (BH) criterion [4] and the i k values associated with the adjusted P values less than the established threshold (typically 0.01) are selected. Correction by the BH criterion is performed by the p.adjust function in R with the option of method="BH".

Synthetic dataset
A synthetic dataset was derived in the form of a tensor, x ijk ∈ R N ×M×K , as where and ǫ ijk ∈ (0, 1) is a uniform random real number that emulates the residuals. N is the number of variables, j is a variable that adds order to the jth sample, and k is the kth omics data. This synthetic dataset assumes that only the top N 1 features among N features have dependency on i independent of k. x ijk , (kN 1 < i ≤ (k + 1)N 1 ) also has dependence on j, but in an omics(k)-dependent manner. The dependence on j is a linear increase upon j.
One hundred ensembles of x ijk were generated and the performances were averaged. First, a linear kernel was generated as Next, HOSVD was applied, resulting in (for the dimensions of the datasets, N, M, and K, see the legend to Table 1). Since it was observed that u 2j always had the largest correlation with a j and u 1k was always constant, regardless of k, u 2j and u 1k were employed to compute for attributing the P values with ℓ 1 = 2 and ℓ 2 = 1 . The P i values were corrected using the BH criterion and the i values associated with adjusted P i s less than 0.01 were selected.
To demonstrate the difficulty of this task, linear regression was used as an alternative method where α ik and β ik are regression coefficients, and a j is defined in Eq. (8). Since x ijk s with distinct k differ, distinct models were applied to each. After computing BH-corrected P values, the is associated with adjusted P values less than 0.01 were selected.
When the least absolute shrinkage and selection operator (lasso) [9] was applied, the maximum number of features selected were considered, although lasso can select at most as many as M features, which is less than the number of features coincident with a j , 2N 1 .
When random forest (rf ) [10] was applied to synthetic data sets, there were two ways to select features. First, features with nonzero importance, which is an evaluation measure provided by rf, were selected. Next, in order Table 1 Confusion matrix when applying KTD-based unsupervised FE to a synthetic dataset ( N = 1000, N 1 = 10, M = 10, K = 3) to reduce the number of selected features, the top most 2N 1 features having a larger absolute importance were selected.

Multi-omics hepatitis B virus (HBV) vaccine dataset
The real multi-omics HPV vaccine datasets were based on 75 samples measured in 15 individuals at five subsequent time points (i.e., 0, 1, 3, 7, and 14 days) after HBV vaccine treatment. Gene expression was measured using RNA-sequencing technology and methylation profiles were measured using microarray technology. The proteome was measured for whole blood cells (WBCs) as well as plasma. Since this multi-omics dataset is composed of four types of omics data measured for 15 individuals at five time points, it is thus formatted as a tensor. Gene expression and methylation profiles were retrieved from the gene expression omnibus (GEO) database using the GSE155198 and GSE161020 datasets, respectively. For gene expression profiles, the GSE155198_RAW.tar dataset was available in the Supplementary File section of GEO. Individual files that included count number of mapped reads toward genes were collected and integrated as a single file. Individual files are named according to the format "GSMXXXXXXX_GRnn_Vm.count. txt.gz, " where XXXXXXX, nn, and m are integers; nn ∈ {01, 02, 03, 04, 05, 06, 07, 10, 11, 13, 15, 17, 18, 19} identifies the 15 individuals; and m ∈ {3, 4, 5, 6, 7} identifies the five time points. Files were loaded into R as a data frame using the read.csv command. Data frames were bound into a single data frame using the cbind command in R. For the methylation profiles, the GSE161020_series_matrix.txt.gz dataset, available in the GEO Series Matrix File(s) section, was used as is. The file was loaded into R as a data frame using the read.csv command. The first column of the data frame is an identifier in the form of cgyyyyyyyy, where yyyyyyyy is an integer. Since the methylation profile was measured with microarray technology, the identifier can be annotated with the reference to the microarray annotation file, GPL6480-9577.txt.gz, which is available under GEO ID GPL6480. Since the other 75 column names are in the form of GSMXXXXXXX, the columns were reordered with reference to the columns of the data frame generated from the gene expression profiles as described above. The proteome dataset was obtained from ProteomeXchange [11] using ID PXD020474. Two files (GR01,04,09, 10,11,13,15,17,18,19.txt and GR02,03,05,06,07.txt) were downloaded and loaded as data frames into R using the read.csv command. The fourth column of the data frame includes the protein IDs, which were used as identifiers for subsequent analysis. These two data frames were merged with the row names of the union of the protein identifier. Missing observations were filled with zeros. Since the first and second rows have the format "GRnn" and "Visit m, " respectively, the columns can also be reordered with reference to the column names of the data frame generated by gene expression profiles as described above.
x i k j 1 j 2 ∈ R N k ×5×15 ; that is, the number of values of the number of i k features of the kth feature type measured at the j 1 th time point for j 2 individuals. These values are standardized as , 1588 (k = 3 : WBC proteome and k = 4 : plasma proteome). Figure 1 schematically illustrates the analysis method for the HBV vaccination datasets.
A linear kernel was employed as follows: Then, a tensor was added: where 4 stands for four omics data, 5 stands for five time points and 15 stands for fifteen individuals. Applying The singular-value vectors of interest were as follows: • u ℓ 1 j 1 and u ℓ 3 j ′ 1 should be significantly dependent on time points corresponding to j 1 and j ′ 1 . • u ℓ 2 j 2 and u ℓ 4 j ′ 2 should be independent of individuals corresponding to j 2 and j ′ 2 . • u ℓ 5 k should be common between distinct omics measurements.
As a result, ℓ 1 = ℓ 3 = 2, ℓ 2 = ℓ 4 = 1, ℓ 5 = 1 satisfies the required conditions (Fig. 2). Then, The computed P i k values were corrected using the BH criterion and i k s associated with either P i k < 0.01 (for gene expression and methylation) or with P i k < 0.05 (for the two proteomes) were successfully selected (the full (18) lists of selected features are available in Additional file 2, 3, 4, 5: Data S1-S4).

Kidney cancer multi-omics datasets
Full description of the compilation of the kidney cancer multi-omics datasets is available in the related study [12]. In brief, there were two sets of multi-omics kidney cancer data, each of which was composed of messenger RNA (mRNA) and microRNA (miRNA) expression profiles. Fig. 1 Schematic representation of HBV vaccination data analysis. Analysis starts from the center, moves to the right, comes back to the center, and then moves to the left. The cyan rectangle annotated as "methylation" is x i1j1j2 , the yellow rectangle annotated as "gene" is x i2j1j2 , the green rectangle annotated as "WBC" is x i3j1j2 , and the magenta rectangle annotated as "Plasma" is x i4j1j2 . The four tilted cubes to the right of these four rectangles are x kj1j2j ′ 1 j ′ 2 , whose correspondence with x i k j1j2 is indicated by the same color. The tilted cubes colored by layers to the right of the four tilted cubes represent the bundle of x kj1j2j ′ 1 j ′ 2 . The right-most figure with a blue cube annotated as "G" at the center corresponds to TD shown in Eq. (16). The four colored rectangles to the left of the four colored and annotated rectangles represent the singular-value vectors computed by Eq. (17). Genes are selected from these singular-value vectors using P values computed by Eq. (18). For methylation, transcription factors (TFs) are further selected by Enricher using the selected genes ( Table 3). The selected genes and TFs are then uploaded to Enrichr to validate the biological reliability (the left-most figure with color gradation) The first dataset was obtained from The Cancer Genome Atlas (TCGA) and included 253 kidney tumors and 71 normal kidneys. The second dataset was obtained from GEO (GSE16441), and included 17 patients and 17 healthy controls. The method by which these two datasets were preprocessed is described in [12]. The dataset was formatted as x i k j ∈ R N k ×M , which represents The i k expression levels of j subjects ( k = 1 for mRNA and k = 2 for miRNA).
A linear kernel was employed as For the first dataset (i.e., TCGA data), M = 324 , whereas for the second dataset (i.e., GEO data), M = 37 . Then, a tensor was added Applying HOSVD to x kjj ′ results in Then, singular-value vectors were identified such that u ℓ 1 j and u ℓ 2 j ′ were significantly distinct between healthy controls and patients. As a result, ℓ 1 = ℓ 2 = 2 satisfies the required conditions (Fig. 3). Then, is computed and The computed P i k values were corrected using the BH criterion and i k s associated with adjusted were successfully selected. Table 1 shows the confusion matrix obtained by the proposed KTD-based unsupervised FE. Among the 10 features associated with a i , approximately seven features were correctly selected, whereas false positives were almost zero. Thus, KTD-based unsupervised FE successfully selected features correlated with a j . Table 2 shows the confusion matrix obtained by linear regression-based FE. Essentially, no features correlated with a j were selected. Thus, regression analysis did not select any features correlated with a j .  These results demonstrated that an apparently simple and easy problem became difficult when it is a large p small n problem, whereas KTD-based unsupervised FE was able to handle this problem to some extent. These advantages have also been observed in PCA-and TDbased unsupervised FE [4].

Synthetic dataset
Although the methods that did not attribute P values to features were not of interest, since the capability of attributing P values to features is a great advantage of KTD-based unsupervised FE, as emphasized in the Background, lasso was employed as another method for comparison. Although lasso regression does not attribute P values to features, the model was fitted to feature selection in a large p small n problem to demonstrate the difficulty of feature selection in the synthetic dataset. Table 2 shows the confusion matrix, which is clearly inferior to that shown in Table 1. Although the KTDbased unsupervised FE approach correctly selected at least 7 out of 10 features (70 %), which was correlated with a j with essentially no false positives, lasso selected at most 5 out of 20 features (only 25 %), which was correlated with a j with two false positives (approximately half of the true positives). In addition to lasso, we also tested rf as an alternative method that cannot attribute P values (Table 2). First, we have selected features with nonzero importance; although most of (c.a. 17) features among the 20 features coincident with a j are selected, almost half of features not coincident with a j are also wrongly selected. Thus, rf is clearly inferior to KTD-based unsupervised FE. One might wonder if the top 20 features with a larger absolute importance are selected. Only five out of 20 features are selected. Thus, rf is still inferior to TD-based unsupervised FE even if limited and most important features are selected. This suggested that even if methods that could not attribute P values to features can be considered, they would not outperform the KTD-based unsupervised FE method.
Thus, subsequently, we only focused only methods that attributed P values to features.

HBV vaccine dataset
To validate genes selected by KTD-based unsupervised FE, the selected genes were uploaded to Enrcihr [13]. Initially, 1335 genes associated with 2077 methylation probes selected by KTD-based unsupervised FE were uploaded. Many transcription factors (TFs) were significantly predicted to target these 1335 genes (Table 3). These 21 TFs were then uploaded to Enrichr again; Additional file 1: Table S1 shows the top 10 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in the "KEGG 2019 HUMAN" Enrichr category (the full list is available in Additional file 2: Data S1). It was clear that these data included many biologically reasonable KEGG pathways (for details, see the Discussion section below).
Eight genes associated with 11 probes were identified as differentially expressed genes (DEGs) when gene expression profiles were considered (Table 4), which were uploaded to Enrichr. Conversion of probe names to gene symbols was performed by the ID converter tool of DAVID [14]. Additional file 1: Table S2 shows the top 10 GEO profiles enriched in the Enrichr "Disease Perturbations from GEO up/down" category (the full list is available in Additional file 3: Data S2). Although many neurodegenerative profiles not apparently related to the HBV vaccine are listed, the rationalization for their inclusion is provided in the Discussion section.
Finally, two set of proteins identified as DEGs in WBC and plasma profiles (Table 4) were uploaded to Enrichr. Additional file 1: Tables S3 and S4 show the top 10 enriched GEO profiles identified in the Enrichr "Disease Perturbations from GEO up/down" category when proteins for the WBC and plasma sections in Table 5 were uploaded to Enrichr (the full list is available in Additional file 3, 4: Data S3 and S4). Other than the enriched Table 3 TFs enriched in the "ChEA 2016" Enrichr category (adjusted P values < 0.05 ) when 1335 genes associated with 2077 methylation probes selected by KTD-based unsupervised FE were considered (the full list is available in Additional file 2: Data S1)  reasonable hepatitis-related GEO profiles, some neurodegenerative disease-related GEO profiles were also enriched, as shown in Additional file 1: Table S3, which are further explored in the Discussion section.
Since there are many enriched biological processes and pathways listed in Additional file 1: Tables S1-S4, the genes and proteins selected in this section may not be artifacts, but likely have a true biological basis.

Kidney cancer
hsa-mir-200c and hsa-mir-141 were selected from TCGA, and hsa-miR-141, hsa-miR-210, and hsa-miR-200c were selected from GEO. Thus, these miRNAs are highly coincident with each other, even more so than reported in previous works [5,12] where TD-as well as KTD-based unsupervised FE was applied to TCGA and GEO datasets. For mRNA, there were five common genes selected between the TCGA and GEO datasets ( Table 6, P = 6.7 × 10 −5 , odds ratio: 13.13).

Discussion
There are several advantages in the proposed implementation of KTD-based unsupervised FE compared with previously proposed versions of KTD-based unsupervised FE [5], as well as TD-based unsupervised FE [4] in the context of application to the integration of multiomics datasets. For example, 1 The present implementation can reduce the required computational memory. As a result, required computational time can be reduced as well. 2 The present implementation can integrate more than two omics data in a straight manner.
Although the point was achieved in KTD-based unsupervised FE [5], too, the primary advantage of the proposed method is to achieve the above two simultaneously, as discussed below. As for point 1, when the original implementation of TD-based unsupervised FE [4] is applied to the integration of multi-omics data, (e.g., x ij ∈ R N ×M and x hj ∈ R H ×M , which corresponds to the ith or hth omics data of the jth sample), HOSVD is applied to Since H, N ≫ M , this was not an effective implementation. When KTD-based unsupervised FE [5] is applied, HOSVD is applied to This drastically reduced the required memory and central processing unit (CPU) time. As for point 2, nevertheless, it was unclear how more than two omics datasets could be integrated. In the implementation introduced in this paper, which corresponds to the ith measurement of the kth omics data of the jth sample, where K is the total number of omics datasets. Since HOSVD is applied to x kjj ′ , any number of multi-omics datasets may be handled. This slight modification drastically increased the ability of KTD-based unsupervised FE to handle multi-omics datasets. Thus, it is obvious that the present implementation has at least one advantage over the past KTD implementations. Kernels K k were highly correlated between mRNA ( k = 1 ) and miRNA ( k = 2 ) for the kidney cancer data [15] (Fig. 4). Kernels K k for gene expression profiles and the proteome were also highly correlated when HBV vaccine experiments were considered (Table 7). Thus, it was obvious that the current formalism was very effective in identifying the coincidence between individual omics (in this case, mRNA, miRNA, and the proteome).
Although the list of enriched pathways in Additional file 1: Table S1 did not seem to be related to the HBV vaccine directly, there were indirect reasonable relationships. For example, the Hippo signaling pathway has recently (24) Table 5 Proteins identified as DEGs when gene expression profiles in the proteome were considered   WBC  HIST1H2BJ, HIST2H2BF, HIST1H2BG, HIST1H2BB, HIST1H2BD, ACTG1, HIST1H2BL, HIST1H2BN, PFN1, HIST1H2BK, HIST3H2BB, ACTB, HBB, HBA2,  HIST1H2BA, HIST1H2BI, HIST1H2BC, HIST1H2BO, HIST2H2BE, HIST1H2BM, HBA1, HIST1H2BF, HIST1H2BE, HIST1H2BH   Plasma  FGA, HP, GSN, ALB, FGG, IGLL5, APOA1, SERPINA1, ORM1, TF, GC, CP, C4A, CSF3R, A2M, HPX, HRG, A1BG, CFH, APOB, C3, CLEC14A been identified to be related to the immune system [16]. "Hepatitis B" was also significantly enriched (adjusted P value of 7.79 × 10 −3 ), but it did not rank within the top 10 KEGG pathways in Additional file 1: Table S1. In addition, since patients with diabetes have a higher risk of HBV infection [17], it is reasonable that the two KEGG pathways "AGE-RAGE signaling pathway in diabetic complications" and "Maturity onset of diabetes in the young" were enriched. Although multiple cancer types were also enriched in these data, as shown in Additional file 1: Table S1, many cancer types other than liver cancer are known to be related to the risk of HBV infection [18]. Although genes identified as DEGs in relation to HBV vaccination were also enriched in various neurodegenerative diseases other than hepatitis (Additional file 1: Tables S2 and S3), this is a reasonable finding because viral hepatitis was reported to be related to Parkinson's disease [19]. There are also known associations between hepatic functions and plasma amyloid-β levels [20]; cirrhosis patients with HBV infection have higher plasma A β 40 and A β 42 levels than patients with HBV-negative cirrhosis. More directly, Ji et al. [21] reported that the hepatitis B core VLP-based mis-disordered tau vaccine alleviated cognitive deficits and neuropathology progression in a Tau.P301S mouse model of Alzheimer's disease. Thus, enrichment of neurodegenerative disease-related genes among the identified DEGs does not appear to be an artifact, but rather provides possible supportive evidence that KTD-based unsupervised FE detected side effects caused by vaccinations.
Other conventional univariate tools such as limma [22] and sam [23] cannot be used for these tasks since they are designed to handle categorical classes and thus cannot be applied to HBV vaccination data, which are only associated with time points and are not categorical. Although regression analysis was attempted for the synthetic dataset, there were no features correlated with dates. Thus, there were no univariate feature selection methods applicable to the HBV vaccination data that could identify features correlated with date. For the kidney cancer datasets, it has been extensively demonstrated that these conventional univariate tools such as limma [22] and sam [23] cannot compete with the TD-based unsupervised FE approach [5,12]. Thus, no univariate feature selection method was identified that was superior to TD-based unsupervised FE when applied to kidney cancer datasets.
The performance of the proposed method has not been compared to other existing multi-omics-oriented methods [1,2] because no suitable methods were identified for suitable comparison. First, most of the recently proposed cutting-edge methods adapted to multi-omics analysis are specific to a high-throughput sequencing (HTS) architecture. For example, MKpLMM [24] [25] requires a bed file, which is also not available for the present dataset. Since the purpose of the present study was to propose a more flexible method that is not specific to the HTS architecture and the datasets employed in this study were not obtained using HTS, these were considered unsuitable methods for comparison with the proposed implementation of KTD-based unsupervised FE. Second, other methods that are not specific to HTS lack the statistical validation of feature selection (i.e., no ability to attribute P values to features). For example, although MOFA [26] is not specific to HTS, it does not have the ability to select features; thus, we were not able to compare its performance with that of our proposed method. Although DIABLO [27] is also not specific to the HTS architecture and has feature selection ability, there is no functionality to attribute P values to individual features; thus, features cannot be selected based on statistical significance, and therefore, DIABLO was considered to be outside of the scope of this study. FSMKL [28], which is also not specific to the HTS architecture does have the ability to add statistical scores to individual features; however, selecting features based on statistical scores is not effective. In this sense, to our knowledge, there are no other multi-omics-oriented feature selection methods that satisfy the following requirements: • Not specific to the HTS architecture • Attributes P values to individual features to evaluate statistical significance • Can handle more than or equal to three kinds of omics data simultaneously • Applicable to severe large p small n (typically, p/n ∼ 10 2 or more) problems For example, for the large p small n problem, although Subramanian et al. [2] summarized existing machine learning methods for multi-omics analysis, typically they are applied to studies including up to 10 2 samples; thus, they cannot be regarded as a severe large p small n problem. As a result, the performance of the KTD-based unsupervised FE was not directly compared to other existing methods that satisfy all of the above conditions. HBV vaccination data were selected to demonstrate the superior power of advanced KTD-based unsupervised FE because of the difficulty of the problem with this dataset. Since vaccination must be given to healthy people, side effects must be minimized [29]; in fact, since vaccination is essentially an infection with a weaker pathogen, its effect is inevitably weak. As expected, a very limited number of features (genes and proteins) were selected using the advanced KTD-based unsupervised FE proposed in this article, whereas conventional linear regression analysis did not attribute significant P values to any features. This suggests that the proposed advanced KTD-based unsupervised FE method has superior ability to select features when applied to even particularly difficult multi-omics datasets.
One might wonder why we did not employ more advanced feature selection methods other than lasso or rf. Generally, other methods are not fitted to the present situation, i.e., the large p small n problem. Since it is impossible to demonstrate the difficulty of using all the other methods, we consider two methods, class-specific mutual information variation for feature selection [30] and multilabel feature selection with constrained latent structure shared term [31] in order to demonstrate why other advanced methods are not fitted to the large p small n problem. When k features were aimed to be selected, although the complexity of class-specific mutual information variation for feature selection was supposed to be kMN, it excluded the computational time needed for the computation of mutual information among N features, which is as large as N 2 . It is not fitted to the large p small n problem associated with a large number of features, N. For example, in the synthetic example, we tried to compute mututal information among N = 100 features for only one ensemble; it took 100 s. Since we employed N = 1000 , it would take 100 × (1000/100) 2 = 10 4 s for only one ensemble. We employed 100 ensembles; thus, in total, the required computational time for 100 ensembles would be as long as 10 4 × 100 = one million s, which is unrealistic, since other methods, such as linear regression, lasso and rf, require less than 10 min (= 600 s) for computation with 100 ensembles. This means that class-specific mutual information variation for feature selection is not reasonable to be applied to the present synthetic example. As for multilabel feature selection with constrained latent structure shared term, it is not fitted to the large p small n problem as well, since it can select at most M features. Multilabel feature selection with constrained latent structure shared term decomposes the matrix N × M into a product of two small matricies, N × k and k × M , when k features are selected. Nevertheless, in the large p small n problem, since M ≪ N , k < M < N . Thus, it can select at most M features. On the other hand, multilabel feature selection with constrained latent structure shared term was applied to the case where N < M [31]. In our synthetic example, the number of features to be selected, 2N 1 , is larger than M. Thus, multilabel feature selection with constrained latent structure shared term cannot be used for the present synthetic example. Although