Table 2 Feature selection methods

From: A systematic analysis of genomics-based modeling approaches for prediction of drug response to cytotoxic chemotherapies

Selection Method

Description

No feature selection (NO FS)

All 49,386 probes are used.

Differentially Expressed genes (DEGs)

Array probes that have a statistically significant Spearman correlation (P < 0.05) with drug response.

LIMMA

Linear empirical Bayes with a modified t-statistic, as implemented in the LIMMA Bioconductor package in R. Genes were selected by running LIMMA on the 25% most sensitive and the 25% most resistant cell lines. A false discovery rate of 5% was used as the cutoff.

Bonferroni Correction (BC)

Bonferroni correction \( \rho_{BC} = \frac{\alpha}{m} \), where \( \alpha \) is the significance level of 0.05 and \( m \) is the number of features tested, 49,386, giving \( \rho_{BC} \approx 1.0 \times 10^{-6} \).

DEG Bootstrap (BS)

Array probes that have a statistically significant Spearman correlation (P < 0.05) in fifty random subsets, each containing 75% of the training data.

Histotype specific Bootstrap (BS-Hist)

50 subsets of the training data were generated such that each subset contained only one cell line from a specific histotype. Probes that have a significant Spearman correlation (P < 0.05) in 50% of the splits were selected. ** Data not shown; reported in Additional file 2.

Maximum Relevance Minimum Redundancy (MRMR)

1000 probes are chosen such that they have maximal correlation with drug response and minimal cross-correlation with the other chosen probes.

Control 1 (CTR1)

A number of probes equal to the number of DEGs for each model/trial is randomly selected from all 49,386 probes. For example, bleomycin dataset 1 yielded 5377 DEGs under DEG feature selection, so 5377 probes are selected randomly in the control 1 experiments.

Control 2 (CTR2)

The complement of the DEGs. For example, for bleomycin dataset 1, control 2 would include the 38,009 probes excluded from the 5377 probes selected as DEGs.

Random Control (RCTR)

A number, N, of probes equal to the number of DEGs is randomly selected. This gives N vectors, with each entry corresponding to a cell line in the training set. Each vector is then shuffled randomly so that the original value is no longer associated with the same cell line, yielding an arbitrary feature matrix.

Histotype Only (HIST)

Each cell line is associated with a 55-dimensional vector whose nth entry is 1 if the cell line comes from the corresponding histotype and 0 otherwise (one-hot encoding).

  1. A summary and definition of the feature selection methods discussed in the results section. The abbreviations used in the text to refer to these methods are given in parentheses.
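
The DEG, BC and BS rows of the table can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`spearman_pvalues`, `select_degs`, `bonferroni_threshold`, `select_bootstrap`) are hypothetical, the feature matrix `X` is assumed to hold one row per cell line and one column per probe, and `y` is assumed to be the drug response. The table does not state how the 75% subsets are drawn or in how many of them a probe must be significant, so sampling without replacement and the `min_frac` parameter (defaulting to all subsets) are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def spearman_pvalues(X, y):
    """P-value of the Spearman correlation of each probe (column of X) with y."""
    return np.array([spearmanr(X[:, j], y)[1] for j in range(X.shape[1])])


def select_degs(X, y, alpha=0.05):
    """DEG: indices of probes whose Spearman P-value is below alpha."""
    return np.flatnonzero(spearman_pvalues(X, y) < alpha)


def bonferroni_threshold(alpha=0.05, m=49386):
    """BC: per-probe significance threshold alpha / m."""
    return alpha / m


def select_bootstrap(X, y, n_subsets=50, frac=0.75, alpha=0.05,
                     min_frac=1.0, seed=0):
    """BS: probes significant in at least min_frac of the random subsets.

    Assumptions (not stated in the table): subsets are drawn without
    replacement, and min_frac=1.0 requires significance in every subset.
    """
    rng = np.random.default_rng(seed)
    n_lines = X.shape[0]
    subset_size = int(round(frac * n_lines))
    hits = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_subsets):
        idx = rng.choice(n_lines, size=subset_size, replace=False)
        hits += spearman_pvalues(X[idx], y[idx]) < alpha
    return np.flatnonzero(hits >= min_frac * n_subsets)
```

Passing `alpha=bonferroni_threshold()` to `select_degs` would apply the Bonferroni cutoff in place of the unadjusted P < 0.05, again as an illustration only.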
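
The control rows (CTR1, CTR2, RCTR) and the histotype encoding (HIST) can likewise be sketched in a few lines of Python. The function names and argument layout here are hypothetical, the feature matrix `X` is assumed to have one row per cell line and one column per probe, and sampling probes without replacement is an assumption the table does not confirm.

```python
import numpy as np


def control_1(n_probes, n_degs, rng):
    """CTR1: random probe indices, as many as there are DEGs."""
    return rng.choice(n_probes, size=n_degs, replace=False)


def control_2(n_probes, deg_idx):
    """CTR2: every probe index that was not selected as a DEG."""
    return np.setdiff1d(np.arange(n_probes), deg_idx)


def random_control(X, n_degs, rng):
    """RCTR: random probes, each shuffled independently across cell lines."""
    cols = rng.choice(X.shape[1], size=n_degs, replace=False)
    shuffled = X[:, cols].copy()
    for j in range(shuffled.shape[1]):
        # Break the link between each value and its original cell line.
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled


def histotype_one_hot(histotypes, categories):
    """HIST: one row per cell line, one 0/1 column per histotype."""
    column = {h: k for k, h in enumerate(categories)}
    encoded = np.zeros((len(histotypes), len(categories)), dtype=int)
    for i, h in enumerate(histotypes):
        encoded[i, column[h]] = 1
    return encoded
```

With 55 entries in `categories`, `histotype_one_hot` would produce the 55-dimensional vectors described in the HIST row.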