
On the challenges of predicting treatment response in Hodgkin’s Lymphoma using transcriptomic data



Despite advances in multiagent chemotherapy over the past years, up to 10% of Hodgkin’s Lymphoma (HL) cases are refractory to treatment and, after remission, patients experience an elevated risk of death from all causes. These complications are treatment-dependent, so increasing the prognostic accuracy of HL can help improve outcomes and control treatment-related toxicity. Due to the low incidence of this cancer, there is a lack of studies comprehensively assessing the predictability of treatment response, especially ones resorting to machine learning (ML) advances and high-throughput technologies.


We present a methodology for predicting treatment response after two courses of Adriamycin, Bleomycin, Vinblastine and Dacarbazine (ABVD) chemotherapy, through the analysis of gene expression profiles using state-of-the-art ML algorithms. We work with expression levels of tumor samples of Classical Hodgkin’s Lymphoma patients, obtained through the NanoString’s nCounter platform. The presented approach combines dimensionality reduction procedures and hyperparameter optimization of various selected classifiers to retrieve reference predictability levels of refractory response to ABVD treatment using the regulatory profile of diagnostic tumor samples. In addition, we propose a data transformation procedure to map the original data space into a more discriminative one using biclustering, where features correspond to discriminative putative regulatory modules.


Through an ensemble of feature selection procedures, we identify a set of 14 genes highly representative of the result of a fluorodeoxyglucose Positron Emission Tomography (FDG-PET) after two courses of ABVD chemotherapy. The proposed methodology further presents increased performance against reference levels, with the proposed space transformation yielding improvements in the majority of the tested predictive models (e.g. Decision Trees show an improvement of 20pp in both precision and recall).


Taken together, the results reveal improvements for predicting treatment response in HL disease by resorting to sophisticated statistical and ML principles. This work further consolidates the current hypothesis on the structural difficulty of this prognostic task, showing that there is still a considerable gap to be bridged for these technologies to reach the necessary maturity for clinical practice.


Hodgkin’s Lymphoma (HL) is a type of blood cancer that originates in the lymphatic system, more precisely in lymphocytes, with diagnosis peaking among patients in their 20s and 30s. In 2018, HL represented 0.4% of all new tumors (79,990 new cases) and 0.3% of all cancer deaths (26,167 deaths) worldwide [1]. Survival of Hodgkin’s Lymphoma patients has significantly improved over the past years. Still, after initial remission, patients experience an elevated risk of death from all causes [2], such as cardiotoxicity diseases like myocardial infarction and congestive heart failure [3], and secondary cancers [4], diseases that are often treatment-related [5].

The current prognosis for HL is largely based on the International Prognostic Score (IPS) [6], which predicts 5-year freedom from progression. Moccia et al. [7] concluded that this scoring does not identify low- or high-risk groups with certainty, and recommended the use of molecular markers and/or fluorodeoxyglucose Positron Emission Tomography (FDG-PET) scanning for this purpose. Despite the proven relevance of FDG-PET for HL prognosis, this medical exam is: (i) intrusive, requiring the injection of a radioactive tracer; (ii) expensive, estimated at 1020 Eur per exam [8]; and (iii) impossible to perform in remote locations and ambulatory settings as it requires large machinery.

The transcriptional activity of tumor cells is a viable proxy candidate to assess regulatory response to treatment, thus being positioned as a possible alternative to the FDG-PET exam. Nevertheless, the role of differential gene expression in HL has not been exhaustively studied as it is a relatively rare cancer (2.86 cases per 100,000 persons annually [9]). Specific approaches, such as hierarchical clustering [10, 11], Cox regression [12] and sparse multinomial logistic regression [13] have been explored in other works, but some of the state-of-the-art machine learning (ML) approaches successfully applied to more common cancers have not yet been comprehensively employed.

In this context, this work aims to assess the predictability limits of an interim FDG-PET performed after two courses of Adriamycin, Bleomycin, Vinblastine and Dacarbazine (ABVD) chemotherapy treatment using gene expression profiling as a surrogate. To this end, a methodology combining ML advances with subspace-based data transformations is proposed. Transcriptomic data of Hodgkin’s Lymphoma patients’ diagnostic tumor samples acquired by Luminari et al. [14] are used with the objective of better understanding the predictability of a patient’s response to a specific chemotherapy regimen. To this end, we resort to gene expression profiles obtained from Formalin Fixed Paraffin Embedded (FFPE) diagnostic tumor samples [15]. Although there are other datasets available with gene expression profiles of HL patients [16,17,18,19],Footnote 1 these were not collected with the objective of studying the result of a FDG-PET after undergoing a chemotherapy regimen, hence not possessing the fundamental information for our analysis.

Our work advances the current status quo on this task described by Luminari et al. [14], which is grounded on a more traditional statistical analysis of the data, resorting to filtering by Fold-Change (FC) and False Discovery Rate (FDR) values followed by multivariate logistic regression. In contrast, we conduct a thorough optimization and assessment of preprocessing and ML techniques to develop a predictor that can, at the moment of diagnosis, classify patients’ future interim PET after two courses of ABVD chemotherapy according to treatment response.

In addition to the end-to-end assessment of state-of-the-art predictors, our work proposes the use of biclustering principles to transform the original high-dimensional feature space into one consisting of features given by discriminative gene expression patterns, and shows that the new space yields relevant statistical properties. The gathered results show that this novel transformation yields statistically significant improvements on the predictive performance.


To tackle the introduced research problem, we propose the methodology presented in Fig. 1. This methodology starts with essential data preprocessing, followed by a feature analysis stage divided into two phases, resulting in a reduced data space conducive to the subsequent predictive analysis stage. Along the predictive analysis stage, we propose a bicluster-based space transformation that converts the gene-centric space to a pattern-centric one. State-of-the-art ML models can then be applied along the original or pattern-centric data space for the targeted prognostic ends. The methodology is implemented in Python (version 3.8.5) and the code is publicly available in GitHub.Footnote 2

Fig. 1

Schematic workflow of proposed solution for the treatment response prediction in HL disease


The proposed methodology is validated on the cohort study conducted by Luminari et al. [14], available at the National Center for Biotechnology Information Gene Expression Omnibus (GSE132348Footnote 3). It consists of 106 samples of patients diagnosed with Classical Hodgkin’s Lymphoma. Each individual is associated with the normalized expression levels of 765 different genes, obtained using the NanoString’s nCounter platform [15] over the RNA extracted from FFPE diagnostic tumor samples. Gender, age, stage of disease according to the Lugano classification [20], and a binarized Lymphocyte-to-Monocyte Ratio (LMR>2.1) are further provided. Finally, each record contains the result of an interim PET performed after two courses of ABVD chemotherapy (iPET2), which is classified as “positive” or “negative” according to the Deauville 5-point scale [21], with the PET defined as positive when its ordinal value is greater than or equal to 4. More information on the data collection process can be found in the original work [14]. The data is relatively imbalanced, with 84 (80%) iPET2 negatives and 21 (20%) iPET2 positives.

Data preprocessing

The mRNA counts were \(\log _2\) transformed to better handle the variability of expression within and across genes. Samples with missing values were removed, resulting in a new distribution of 82 (79.6%) iPET2 negatives and 21 (20.4%) iPET2 positives. The variable stage was encoded as an ordinal variable, ranging from 1 (corresponding to “I A”) to 8 (corresponding to “IV B”), reflecting that both a larger stage number and the B variant are worse prognostic factors. Since the dataset is considerably imbalanced, with a target distribution of around 80/20, balancing methods were assessed through a comparative analysis of their effects on predictive performance, comparing combinations of oversampling and subsampling techniques. Oversampling using Support Vector Machine Synthetic Minority Oversampling Technique (SVM-SMOTE)Footnote 4 [23] yielded the best results, hence being selected for the target experiments.
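The preprocessing steps above can be sketched as follows; the gene names, raw counts and the mini-cohort itself are illustrative placeholders, not data from the study:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-cohort: raw mRNA counts for two genes plus the
# clinical "stage" variable (names and values are illustrative).
df = pd.DataFrame({
    "GENE_A": [128.0, 512.0, 64.0],
    "GENE_B": [32.0, 256.0, 1024.0],
    "stage": ["I A", "II B", "IV B"],
})

# log2-transform the expression counts to stabilise within- and
# across-gene variability.
gene_cols = ["GENE_A", "GENE_B"]
df[gene_cols] = np.log2(df[gene_cols])

# Encode stage ordinally from 1 ("I A") to 8 ("IV B"): both a higher
# stage number and the B variant are worse prognostic factors.
stage_order = ["I A", "I B", "II A", "II B", "III A", "III B", "IV A", "IV B"]
df["stage"] = df["stage"].map({s: i + 1 for i, s in enumerate(stage_order)})

print(df)
```

The SVM-SMOTE balancing step would follow on the training folds only, so that synthetic minority samples never leak into evaluation data.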

Feature analysis

Given the high-dimensionality of the cohort data (770 transcriptomic features, 103 patients), dimensionality reduction was undertaken to aid the target predictive task. Following the practice suggested by Saeys et al. [24], and supported by previous empirical evidence [25,26,27], we pre-reduced the feature space using the univariate filter methods Wilcoxon Rank Sum TestFootnote 5 [29] and Mutual InformationFootnote 6 [31], and subsequently applied the more complex embedded method Support Vector Machine - Recursive Feature Elimination (SVM-RFE)\(^{6}\) [32].

Initial Feature Selection. As most preprocessed features do not follow a normal distribution according to Shapiro-Wilk test [33], we resort to the non-parametric Wilcoxon Rank Sum Test [29] and Mutual Information (MI) [31] criteria. For each independent variable \(y_j\) and the target response outcome \(z\), the former Wilcoxon statistic tests whether the distributions \(y_j|z=0\) and \(y_j|z=1\) are equal. The latter MI statistic does not test a hypothesis, but a p-value is subsequently generated using a one-sided permutation test. As these approaches are univariate, not taking into account interactions between variables, a less strict than usual significance threshold of \(0.1\) is considered to prevent the removal of potentially relevant genes.
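A minimal sketch of this initial filter using scipy and scikit-learn on synthetic data; the permutation test deriving p-values from the MI scores is omitted for brevity, and the planted signal is purely illustrative:

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
z = rng.integers(0, 2, size=n)
X[:, 0] += 2.0 * z  # plant one genuinely discriminative feature

# Wilcoxon rank-sum: tests whether y_j | z=0 and y_j | z=1 share a distribution.
pvals = np.array([ranksums(X[z == 0, j], X[z == 1, j]).pvalue for j in range(p)])

# Mutual information between each feature and the outcome (raw scores here;
# the study additionally derives one-sided permutation p-values from these).
mi = mutual_info_classif(X, z, random_state=0)

# Lenient alpha = 0.1 to avoid discarding potentially relevant genes.
keep = np.where(pvals < 0.1)[0]
print(keep)
```

Because both criteria are univariate, the lenient threshold deliberately over-retains features, leaving interaction-aware pruning to the subsequent SVM-RFE step.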

Secondary Feature Selection. Grounded on empirical evidence [34], the embedded Support Vector Machine - Recursive Feature Elimination (SVM-RFE) [32], an adaptation of the RFE selection method that replaces an external ranking function with the magnitude of the weights of a Support Vector Machine, is subsequently employed.
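SVM-RFE can be sketched with scikit-learn's RFE wrapper around a linear SVM, here reducing a synthetic feature space to the 14-feature signature size reported later; the data and planted signal are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))
y = rng.integers(0, 2, size=80)
X[:, :3] += 1.5 * y[:, None]  # plant three informative features

# SVM-RFE: rank features by the magnitude of the linear SVM's weights and
# recursively drop the weakest one per round until 14 remain.
selector = RFE(LinearSVC(dual=False, max_iter=5000),
               n_features_to_select=14, step=1)
selector.fit(X, y)
selected = np.where(selector.support_)[0]
print(selected)
```
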

Predictive analysis

The literature on classification tasks in the oncotranscriptomics domain shows a relative predominance of specific machine learning (ML) models [35, 36]. Accordingly, Support Vector Machines (SVM) [37], k-Nearest Neighbors (KNN) [38] and Random Forest [39] are selected as classifier candidates. Given the consistently state-of-the-art performance of XGBoost [40] in other domains, we further assess its predictive performance. Complementarily, Decision Trees [41] are further assessed, as the learnt associations can reveal important insights about the genetic component of HL, and Naive Bayes [42] is included to offer a baseline stance on predictive accuracy.

In order to obtain the best classification possible, all the predictors are subjected to parameter optimization through the Tree Parzen Estimator algorithm [43], in addition to an optimized application of the aforementioned preprocessing techniques.

Bicluster-based space transformation

The various biological mechanisms present in our bodies rarely originate from a single gene’s expression values. Instead, the majority of gene regulation is done in a modular way, in which sets of genes interact with each other to enforce a certain mechanism. With this in mind, it is important not to reduce the analysis of transcriptomic data to individual genes but instead to study these values and their contributions in the context of putative regulatory modules. Furthermore, and in spite of the relevance of the introduced feature selection process, that process relies on univariate methods that do not take gene interactions into consideration and on an embedded method that shows biases towards specific predictors [34].

In this context, we propose a novel transformation procedure in high-dimensional data spaces that can map individual gene expression features with loose discriminative power to discriminative pattern-centric features given by discriminative regulatory modules able to better represent complex interactions between multiple genes. Biclustering has been largely employed for the effective discovery of gene expression patterns [44,45,46], with pattern-based biclustering showing relevant performance indicators in diverse biological data contexts [47, 48]. Biclustering based on PAttern Mining Software (BicPAMS) [49] integrates dispersed state-of-the-art contributions on pattern-based biclustering, allowing for a high level of parametrization while performing efficient searches with guarantees of optimality, statistical significance and discriminative power [50, 51].

In this context, BicPAMS is applied to find discriminative patterns on the training cohort data. Multiple parameterizations are tested in order to obtain the best space transformation for the classification task. The dataset is then transformed so that each feature corresponds to one gene expression pattern encompassing multiple genes. Each value of the transformed dataset will then represent the similarity between the gene expression expectations associated with a given pattern and the actual levels of gene expression observed for an individual. Figure 2 shows an example of the described process.
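A minimal sketch of this transformation, assuming each bicluster is summarized by its column (gene) indices and an expected expression pattern; all values here are illustrative, not biclusters actually found by BicPAMS:

```python
import numpy as np

# Rows are patients, columns are genes (illustrative log-expression values).
X = np.array([[5.1, 2.0, 7.3, 4.9],
              [5.0, 6.1, 7.2, 1.0],
              [1.2, 2.1, 3.0, 4.8]])

# Each bicluster: (gene column indices, expected expression pattern).
biclusters = [(np.array([0, 2]), np.array([5.0, 7.2])),
              (np.array([1, 3]), np.array([2.0, 4.9]))]

def transform(X, biclusters):
    # One feature per bicluster: Euclidean distance between the pattern and
    # the patient's values on the bicluster's genes (0 = perfect match).
    Z = np.empty((X.shape[0], len(biclusters)))
    for k, (cols, pattern) in enumerate(biclusters):
        Z[:, k] = np.linalg.norm(X[:, cols] - pattern, axis=1)
    return Z

Z = transform(X, biclusters)
print(np.round(Z, 2))
```

Patient 1 nearly matches the first pattern (distance ≈ 0.14) and patient 3 nearly matches the second, so each transformed feature directly encodes how strongly a patient exhibits one putative regulatory module.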

Fig. 2

Example of space transformation using biclustering, showing a found bicluster with the corresponding pattern (left), and the resultant values on the transformed dataset using Euclidean distance (right)


The proposed methodology is applied on the cohort study conducted by Luminari et. al [14] to assess the limits to the predictability of the quality of HL patient response to ABVD treatment. The gathered results are presented in two major steps: (i) a comparison of predictive levels from the optimized application of state-of-the-art ML models against the current reference levels; and (ii) an assessment of the improvement yield by the proposed pattern-centric data space transformation.

Evaluation methodology. Given the small population size (103 samples), a nested Cross-Validation (CV) schema is considered using 10 folds. Nested Cross-Validation is used to separate the data used for hyperparameterization from the data used for assessing the model’s performance [52]. Without this separation, the model would learn its parameters on data on which it would later be tested, resulting in an overestimated predictive capability. In a nested CV, inside each CV loop used for model evaluation, another CV is performed on the training data for parameter tuning and preprocessing.
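A nested CV sketch in scikit-learn; GridSearchCV stands in here for the Tree Parzen Estimator search actually used in the study, and the synthetic 103-sample, 80/20 dataset only mirrors the cohort's shape:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the cohort: 103 samples, ~80/20 class imbalance.
X, y = make_classification(n_samples=103, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # tuning
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # evaluation

# The inner search never sees the outer test fold, so the outer AUC
# estimates generalisation on data the tuning never touched.
model = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner, scoring="f1")
scores = cross_val_score(model, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```
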

To promote comparison with the evaluation setting presented in the precursor work by Luminari et al. [14], two evaluation settings are considered. The first setting (setting I) purposefully corresponds to an approach similar to the one followed by Luminari et al. [14], where the preprocessed data using the two configurations of feature selection (Methods section) are subjected to internal bootstrapping for evaluation purposes. The second setting (setting II) guarantees the soundness of the acquired predictability levels by ensuring that the feature selection step is performed inside each fold of the nested cross-validation, as previously described. Direct comparison between the two settings should not be attempted given the diversity of the underlying evaluation principles.

A core contribution of this study is the possibility to create a reliable decision support system that can help decide the intensity of the treatment a patient must undergo, with a positive prediction indicating that a more aggressive regimen is necessary. Attending to this observation, the following evaluation metrics were chosen: (i) AUC as an overall indicator of how the predictor performs when the decision threshold is not optimized; (ii) recall and precision to ensure that the predictor does not skew towards the majority class, especially important due to the imbalanced data nature; and (iii) specificity to guarantee the identification of patients who will react well to the standard treatment, avoiding the prescription of an unnecessarily stronger chemotherapy regimen. Accuracy, although presenting a useful overall indication of the predictor’s performance, is not considered here due to the imbalanced nature of the data. The metric optimized in the hyperparameterization step is the F1 score, so that the classifiers can attain a balanced performance in both precision and recall. Precision-Recall curves, together with explainability models, are further provided to offer a more comprehensive comparison of the predictors.
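Since scikit-learn offers no direct specificity scorer, the chosen metrics can be computed as sketched below on hypothetical predictions (not study results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for a small imbalanced fold.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Specificity = TN / (TN + FP), read off the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

print(precision_score(y_true, y_pred),  # TP / (TP + FP)
      recall_score(y_true, y_pred),     # TP / (TP + FN)
      specificity)
```
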

Predictive performance under state-of-the-art ML

The initial phase of feature selection by the Mutual Information and Wilcoxon Rank Sum algorithms identified a total of 250 features out of the initial 770. These features correspond to 248 genes and the clinical variables stage and \(\textit{LMR}{>} \textit{2.1}\). The work of Luminari et al. [14] found, during a first analysis, a 13-gene signature positively correlated with iPET2 in addition to the variable \(\textit{LMR}{>} \textit{2.1}\). In comparison, we find that our most influential feature set contains 9 out of the 13 genes and the \(\textit{LMR}{>} \textit{2.1}\) variable. The selection of the variable stage by our algorithm points to a relation with the target iPET2 not identified through the multivariate logistic analysis performed by Luminari et al. [14]. The second and final phase of feature selection, performed by the algorithm SVM-RFE, identified a set of 14 genes. Out of these 14 genes, only 2 were also found in the 13-gene signature identified by Luminari et al. [14], indicating a discrepancy between the two gene sets and confirming the difficulty of the task of identifying a concise set of discriminative genes.

In order to provide a reference random baseline for the interpretation of classification results, we performed twenty iterations of training and validation of a random classifier on our dataset. The convergence values of these metrics should be interpreted as the reference minimum for this classification task: namely, a precision of 0.21, and a recall, specificity and AUC of approximately 0.5.
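The random baseline can be reproduced in spirit with a uniform dummy classifier: on an 80/20 label split, precision converges to the positive prevalence (about 0.2) and recall to about 0.5. The sketch below uses a large synthetic sample purely for fast convergence:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels mirroring the cohort's ~80/20 split (illustrative only).
y = rng.random(100_000) < 0.2
X = np.zeros((len(y), 1))  # features are irrelevant to a random classifier

# "uniform" predicts each class with probability 1/2, independently of X.
clf = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
pred = clf.predict(X)

print(round(precision_score(y, pred), 2), round(recall_score(y, pred), 2))
```
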

Results for setting I are presented in Fig. 3, where each color encodes a given performance metric and the horizontal lines correspond to the random classifier’s results for said metric. We can observe that the combination of our two feature selection steps, data balancing using SVM-SMOTE and the SVM classifier can achieve a superior mean result of 0.97 AUC (against an AUC of 0.84 obtained by Luminari et al. [14]). The high predictive power of an SVM in this setting is to be expected due to its recurrently good performance on this type of data [53, 54], and the fact that it is paired with the SVM-SMOTE data balancing technique and the SVM-RFE feature selection algorithm, both of which use an SVM as the basis of their decisions.

The results obtained under setting II are presented in Fig. 4. In this case, the XGBoost algorithm achieves significantly better results than the SVM, the previous best performer, with an AUC of 0.77, a precision of 0.67, a recall of 0.52 and a specificity of 0.94.

Fig. 3

Optimized classifier’s performance in the prediction of iPET2 (setting I)

Fig. 4

Optimized classifier’s performance in the prediction of iPET2 (setting II)

In contrast, KNN has notably poor performance in comparison with the defined baseline, which can be explained by two major factors: (i) a high percentage of outliers in our dataset; and (ii) high gene expression variability, leading to inflated differences between individuals that belong to the same class. Two out of the three associative classifiers, Decision Tree (DT) and Random Forest (RF), also present performance comparable to the random classifier, suggesting that a decision tree is unable to fully take advantage of all the discriminative gene interactions present in the dataset. Since the RF and XGBoost classifiers are both ensembles of DTs, the discrepancy between their results likely originates from the embedded feature engineering capabilities of XGBoost and differences in the pursued bagging and boosting strategies. Since bagging uses multiple independently trained DTs to make predictions, the fact that a DT cannot assimilate the knowledge in the data will lead to an overall lack of performance for the RF. Boosting, on the other hand, trains the trees iteratively, allowing each subsequent model to improve where the previous one failed, resulting in a classifier that can better model more complex interactions.

Overall, it is observable that all the classifiers can attain a high specificity, but at the cost of a reduced precision and recall. In other words, classifiers have an easier time correctly classifying patients with negative iPET2, guaranteeing that the patients that will react well to the ABVD regimen are correctly identified. The main difficulty with this predictive problem is in the correct classification of positive patients, possibly due to the low number of samples of this class. In addition, treatment response is being assessed with regards to iPET2 results, which may not be an optimal representative of the true quality of patient response to ABVD chemotherapy.

The predictive results can be further analyzed through the study of the classifiers’ Precision-Recall curves presented in Fig. 5. KNN is omitted from these graphics due to its dependence on an ineffective method to calculate the appropriate thresholds.

Fig. 5

Precision-Recall curves for configuration one (left) and configuration two (right) for each machine learning model

It is in our interest to better understand what leads a certain patient to be misclassified by our models. We plot some characteristics of the individuals correctly classified by the best predictor in setting II, XGBoost, against those of the misclassified ones. This analysis is provided for the four clinical variables, age, gender, stage and \(\textit{LMR}{>} \textit{2.1}\), in Fig. 6. Starting with the variable gender, no significant trend is noted. \(\textit{LMR}{>} \textit{2.1}\), on the other hand, shows a clearer inclination for correctly classifying positive cases when this variable is “False”, with the percentage of True Positives (TP) being substantially higher than the False Positives (FP) percentage. Regarding the stage variable, the values “I A” and “III B” are omitted due to the low number of samples corresponding to each one. Among the remaining values, only “III A” shows a significant deviation from the others, with all the positive cases correctly predicted but with a low performance on the negative cases. The final plot is dedicated to the variable age and is presented in a stacked view, where the bins encompassing a 10-year period from each class (TN, FN, TP or FP) are stacked to facilitate a comparative analysis between them. We can then recognize that the majority of False Negatives (FN) occur in patients between 30 and 40 years, while the False Positives (FP) are more evenly distributed, with a slightly higher concentration in patients between 20 and 30 years.

Fig. 6

Distribution of predictions for the clinical variables (gender, LMR, stage of the disease and age)

As some of our results indicate, the majority of useful information about how the patient will react to treatment is contained in the gene expression features, and therefore it is imperative that the analysis of the factors inducing wrong classifications be extended towards these features. In Fig. 7 we can see the distributions of the nine most discriminative genes according to their Mutual Information [31] with the response outcome. The red vertical lines highlight the expression levels associated with the six patients that were misclassified by all of the assessed predictive models.

Fig. 7

Distribution of top discriminative genes according to Mutual Information with highlighted values (red vertical lines) corresponding to false positive and negative predictions

We can then observe that, as expected, the majority of the highlighted values are found in the intersection of both distributions, where the classification is harder to perform.

Bicluster-based Space transformation

The results presented until here correspond to the classification task in a feature space reduced by a composition of feature selection procedures. This approach lacks the ability to effectively represent the complex gene interactions responsible for the outcome of the patient. In accordance with the introduced methodology, we capture these interactions through discriminative biclusters using BicPAMS [49]. The found biclusters are then used to create new features representing discriminative and statistically significant gene expression patterns.

With the goal of better understanding the impact that BicPAMS’ parameters have on predictive performance, we performed a comparative analysis of how distinct parameterizations affect the behavior of a baseline Naive Bayes predictor. The evaluated parameters are: (i) number of iterations, indicating how many times the mining process is repeated, masking the found biclusters in each new iteration and forcing the mining process to find other, less trivial biclusters at a greater computational cost; (ii) minimum lift, a threshold placed to determine whether a given bicluster is sufficiently discriminative [51]; (iii) number of labels, corresponding to the number of overlapping gene expression levels [48]; and (iv) maximum number of biclusters, the number of mapped features in the transformed data space after the postprocessing filtering of the least discriminative biclusters according to their lift. The results for each of these parameters are shown in Fig. 8. The plotted results show that, for most parameters, increasing their values yields a gain in performance, but only until a certain threshold is reached, beyond which there is no further advantage. The minimum lift is an exception to this, showing almost no effect in this specific problem and pointing to the importance of less discriminative patterns in the learning process.

Fig. 8

Variation of performance according to multiple parameters

Finally, we further assessed the impact of different dissimilarity functions used to assess how well a given gene expression pattern matches a specific patient, determining how the values present in the transformed dataset are computed. To this end, we consider both the Euclidean distance and a binary value indicating whether a patient possesses a given pattern or not. In the second case, a tolerance threshold can be included to accommodate noise. Figure 9 shows the performance of a Naive Bayes classifier in three different settings: the binary transformation with thresholds of 0.5 and 1, and the transformation using Euclidean distance. As expected, a numeric representation loses less information than a binary one, and consequently achieves better results in the most difficult metrics for this classification task, precision and recall.
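The binary variant with a tolerance threshold can be sketched as follows; `binary_match` is a hypothetical helper written for illustration, not a BicPAMS function, and the values are made up:

```python
import numpy as np

def binary_match(values, pattern, tol=0.5):
    # A patient "possesses" a pattern when every gene lies within `tol`
    # of the pattern's expected level; the tolerance absorbs noise.
    return float(np.all(np.abs(values - pattern) <= tol))

pattern = np.array([5.0, 7.2])
print(binary_match(np.array([5.1, 7.3]), pattern))  # within tolerance
print(binary_match(np.array([5.1, 8.0]), pattern))  # one gene too far off
```

Unlike the Euclidean feature, this encoding collapses near-misses and clear mismatches into the same value, which is consistent with the lower precision and recall observed for the binary settings.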

Fig. 9

Variation of performance according to the distance criterion

To assess the effects of the pattern-centric feature space mapping, we applied the target transformation using the BicPAMS algorithm with the following parameters (chosen according to the previously gathered empirical evidence): number of iterations = 9; minimum lift = 1.3; number of labels = 10; maximum biclusters = 250; and distance criterion = Euclidean distance. Under this configuration, it is possible to carry out a comprehensive exploration of the discriminative biclusters by performing multiple iterations while maintaining a relatively low minimum lift, so that the computational requirements remain manageable. The high number of labels guarantees that the found patterns discriminate fine levels of expression, while the relatively high number of biclusters ensures that the most discriminative and statistically significant biclusters are retrieved. Figure 10 provides a direct comparison of each individual metric between these results and the ones previously obtained under our second setting (Fig. 4).

Fig. 10

Direct comparison of results with (blue) and without (red) bicluster-based space transformation

The gathered results show that the transformed feature space has statistically significant impact on the behavior of the classifiers. SVM and XGBoost, classifiers that already presented good results, are not as significantly affected by this transformation, but all the others benefit from it and present an increase in performance in all the studied metrics.


The ability to better discriminate how a patient will respond to a treatment is essential, especially in the domain of cancer therapy, where the majority of treatments are associated with high toxicity and the prognostic exams can be intrusive and expensive. We comprehensively assess the predictability guarantees achieved by state-of-the-art ML models. These models are carefully optimized through the Tree Parzen Estimator algorithm and evaluated in a controlled manner by resorting to nested cross-validation. Despite the placed optimization principles, the obtained results still fall short of the predictive power necessary to translate into decisions in real-world practice. Transcriptional and iPET2 activity are structurally different, with the former being better positioned to model regulatory responses to treatment, even at the cost of iPET2 discordance.

The high specificity attained by most models (0.94) indicates that the classification models can correctly identify most of the patients that show disease regression after the treatment, but struggle to recognize the positive patients. One possible reason for this bias is the low percentage of positive cases, representing only 20% of the total patients. To correct for this imbalance, we resorted to balancing with SVM-SMOTE and to hyperparameterization of the predictive models according to the F1 score, an evaluation metric sensitive to the positive samples. Despite these efforts, the best classifier achieved a precision of 0.67 and a recall of 0.52, confirming the difficulty of finding a transcriptional exam resembling the nature of iPET2 activity.

The nature of the target cohort data further introduces generalization challenges to the target prognostication, with a low number of samples and high dimensionality. In addition, the transcriptome profiling is susceptible to the infiltration of non-cancer cells and to arbitrarily high variations in the composition of the target cell population, further contributing to generalization difficulties. Finally, the biological mechanisms underlying diseases such as HL are immensely complex and dependent on interactions between many genes at multiple omics levels.

To better represent the complex interactions between genes that originate biological processes, the proposed space transformation offers an elegant way of shifting the learning from individual genes towards patterns of gene expression. By doing this, we group statistically significant and discriminative sets of genes that partake in regulatory modules correlated with a specific outcome of interest, creating more straightforward conditions to guide the learning of predictive models. The efficacy of this transformation is evidenced by the increased performance of the majority of classifiers across all the studied metrics.
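A simplified sketch of this pattern-centric transformation is given below. The discretization thresholds and the two modules are hypothetical (in the actual methodology the modules are biclusters mined with BicPAMS); scoring each sample by its agreement with a module's expression pattern is one plausible instantiation of the mapping.

```python
# Each feature in the new space scores how well a sample matches one putative
# regulatory module (bicluster) instead of reporting a single gene's level.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))        # samples x genes (expression levels)
Xd = np.digitize(X, [-0.5, 0.5])     # 3-symbol discretization: low / mid / high

# Hypothetical modules: (gene indices, expected symbol per gene).
modules = [
    (np.array([0, 3, 7]), np.array([2, 2, 0])),
    (np.array([1, 4, 9, 12]), np.array([0, 1, 2, 1])),
]

def transform(Xd, modules):
    """Map each sample to the fraction of each module's pattern it matches."""
    feats = [(Xd[:, genes] == pattern).mean(axis=1) for genes, pattern in modules]
    return np.column_stack(feats)

Z = transform(Xd, modules)
print(Z.shape)  # one feature per bicluster
```

The transformed matrix `Z` then replaces (or augments) the gene-level matrix as input to the classifiers, so the models learn over module activity rather than over thousands of individual genes.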

Despite the yielded improvements, the gathered results are indicative of the innate difficulty of the target predictive task, calling for further contributions in this domain able to translate high-dimensional regulatory profiles into actionable and reliable results.

To address the challenges presented in our work, we suggest the following directions of research: (i) strengthen this methodology by complementing the current regulatory views with additional omic layers; (ii) further incorporate the transcription of non-coding RNAs, recently shown to play an important role in HL [55]; (iii) assess the potential increase in performance from more specialized classification principles best suited to deal with the inherently overlapping class-conditional distributions of expression per gene (Fig. 7); (iv) place a finer description on the quality of treatment response, translating the classification task into an (ordinal) regression task; and (v) further assess the impact of alternative pattern-based feature space transformations on predictive accuracy, namely by resorting to ensembles of biclusters with different characteristics.


This work introduced a novel methodology to improve the predictive accuracy of HL treatment response after two courses of ABVD chemotherapy against reference predictive levels [14]. This is achieved through a biclustering-based data space transformation that shifts from a gene-centric to a pattern-centric organization of expression data, combined with a thorough optimization of preprocessing procedures and state-of-the-art ML models. The methodology presents increased performance with the proposed space transformation, yielding improvements as large as 20pp in precision and recall, as is the case with Decision Trees. Furthermore, through an ensemble of feature selection procedures, we identify a set of 14 genes highly representative of the result of an FDG-PET after two courses of ABVD chemotherapy. This set overlaps with the one identified by Luminari et al. [14] in only 2 genes, indicating that it can offer a complementary understanding of the HL response to chemotherapy.
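An ensemble feature-selection step of this kind can be sketched as follows, combining two of the procedures used in this work (mutual information ranking and SVM-RFE) and keeping the genes selected by both. The data and the top-k cut-off are illustrative, not the study's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (samples x genes).
X, y = make_classification(n_samples=100, n_features=50, n_informative=8, random_state=2)
k = 15  # illustrative cut-off per selector

# Filter view: top-k genes by mutual information with the outcome.
mi = SelectKBest(lambda X, y: mutual_info_classif(X, y, random_state=0), k=k).fit(X, y)
# Wrapper view: top-k genes by recursive elimination with a linear SVM.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=k).fit(X, y)

# Consensus: genes retained by both procedures.
consensus = np.where(mi.get_support() & rfe.get_support())[0]
print(f"genes selected by both procedures: {consensus}")
```

Requiring agreement between a filter and a wrapper method trades recall of candidate genes for robustness, which is the usual motivation for ensembling feature selectors on small cohorts.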

Availability of data and materials

The datasets generated and/or analysed during the current study are available in the National Center for Biotechnology Information Gene Expression Omnibus repository,


  1. (accessed 3/2022).


  3. (accessed June 28, 2021).

  4. implemented using imbalanced-learn [22]

  5. implemented using SciPy [28]

  6. implemented using scikit-learn [30]



Abbreviations

HL: Hodgkin’s Lymphoma
ABVD: Adriamycin, Bleomycin, Vinblastine and Dacarbazine
ML: Machine learning
FDG-PET: Fluorodeoxyglucose Positron Emission Tomography
IPS: International Prognostic Score
FFPE: Formalin fixed paraffin embedded
FDR: False discovery rate
LMR: Lymphocyte-to-monocyte ratio
SVM-SMOTE: Support vector machine synthetic minority oversampling technique
SVM-RFE: Support vector machine–recursive feature elimination
MI: Mutual information
SVM: Support vector machine
KNN: K-nearest neighbors
BicPAMS: Biclustering based on PAttern Mining Software
DT: Decision tree
RF: Random forest
TP: True positives
FP: False positives
FN: False negatives
TN: True negatives


  1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424.

  2. Aleman BMP, van den Belt-Dusebout AW, Klokman WJ, van’t Veer MB, Bartelink H, van Leeuwen FE. Long-term cause-specific mortality of patients treated for Hodgkin’s Lymphoma. Optimizing treatment of patients with Hodgkin’s Lymphoma. 2007;21:95.

  3. Aleman BMP, van den Belt-Dusebout AW, De Bruin ML, van’t Veer MB, Baaijens MHA, de Boer JP, Hart AAM, Klokman WJ, Kuenen MA, Ouwens GM, et al. Late cardiotoxicity after treatment for hodgkin lymphoma. Blood. 2007;109(5):1878–86.

  4. Dores GM, Metayer C, Curtis RE, Lynch CF, Aileen Clarke E, Glimelius B, Storm H, Pukkala E, Van Leeuwen FE, Holowaty EJ, et al. Second malignant neoplasms among long-term survivors of hodgkin’s disease: a population-based evaluation over 25 years. J Clin Oncol. 2002;20(16):3484–94.

  5. Hoppe RT. Hodgkin’s disease: complications of therapy and excess mortality. Annals Oncol. 1997;8:S115–8.

  6. Hasenclever D, Diehl V, Armitage JO, Assouline D, Björkholm M, Brusamolino E, Canellos GP, Carde P, Crowther D, Cunningham D, et al. A prognostic score for advanced hodgkin’s disease. New Engl J Med. 1998;339(21):1506–14.

  7. Moccia A, Donaldson J, Chhanabhai M, Hoskins P, Klasa R, Savage KJ, Shenkier T, Skinnider B, Gascoyne RD, Connors JM, et al. The international prognostic factor project score (IPS) in advanced stage hodgkin lymphoma has limited utility in patients treated in the modern era. 2009.

  8. Verboom P, van Tinteren H, Hoekstra OS, Smit EF, Van Den Bergh JHAM, Schreurs AJM, Stallaert RALM, van Velthoven PCM, Comans EFI, Diepenhorst FW, et al. Cost-effectiveness of FDG-pet in staging non-small cell lung cancer: the plus study. Eur J Nucl Med Mol Imaging. 2003;30(11):1444–9.

  9. Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR et al. Seer cancer statistics review, 1975–2017. National Cancer Institute, 2020.

  10. Devilard E, Bertucci F, Trempat P, Bouabdallah R, Loriod B, Giaconia A, Brousset P, Granjeaud S, Nguyen C, Birnbaum D, et al. Gene expression profiling defines molecular subtypes of classical hodgkin’s disease. Oncogene. 2002;21(19):3095–102.

  11. Küppers R, Klein U, Schwering I, Distler V, Bräuninger A, Cattoretti G, Yuhai T, Stolovitzky GA, Califano A, Hansmann M-L, et al. Identification of hodgkin and reed-sternberg cell-specific genes by gene expression profiling. J Clin Investig. 2003;111(4):529–37.

  12. Scott DW, Chan FC, Hong F, Rogic S, Tan KL, Meissner B, Ben-Neriah S, Boyle M, Kridel R, Telenius A, et al. Gene expression-based model using formalin-fixed paraffin-embedded biopsies predicts overall survival in advanced-stage classical hodgkin lymphoma. J Clin Oncol. 2013;31(6):692.

  13. Steidl C, Lee T, Shah SP, Farinha P, Han G, Nayar T, Delaney A, Jones SJ, Iqbal J, Weisenburger DD, et al. Tumor-associated macrophages and survival in classic Hodgkin’s Lymphoma. New Engl J Med. 2010;362(10):875–85.

  14. Luminari S, Donati B, Casali M, Valli R, Santi R, Puccini B, Kovalchuk S, Ruffini A, Fama A, Berti V, et al. A gene expression-based model to predict metabolic response after two courses of abvd in hodgkin lymphoma patients. Clin Cancer Res. 2020;26(2):373–83.

  15. Geiss GK, Bumgarner RE, Birditt B, Dahl T, Dowidar N, Dunaway DL, Perry Fell H, Ferree S, George RD, Grogan T, et al. Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat Biotechnol. 2008;26(3):317–25.

  16. Schnitter A, Kohler CW, Reddemann K, Reinke S, Thorns C, Fend F, Federmann B, Möller P, Szczepanowski M, Spang R, et al. Therapeutic targets and microenvironment in sequential biopsies of classical hodgkin lymphoma at diagnosis and relapse. J Hematopathol. 2019;12(1):11–7.

  17. Ravi D, Beheshti A, Abermil N, Passero F, Sharma J, Coyle M, Kritharis A, Kandela I, Hlatky L, Sitkovsky MV, et al. Proteasomal inhibition by ixazomib induces chk1 and MYC-dependent cell death in t-cell and hodgkin lymphoma. Cancer Res. 2016;76(11):3319–31.

  18. Gamboa-Cedeño AM, Castillo M, Xiao W, Waldmann TA, Ranuncolo SM. Alternative and canonical NF-KB pathways DNA-binding hierarchies networks define hodgkin lymphoma and non-hodgkin diffuse large b cell lymphoma respectively. J Cancer Res Clin Oncol. 2019;145(6):1437–48.

  19. de Oliveira KAP, Kaergel E, Heinig M, Fontaine J-F, Patone G, Muro EM, Mathas S, Hummel M, Andrade-Navarro MA, Hübner N, et al. A roadmap of constitutive NF-\(\kappa\)b activity in hodgkin lymphoma: dominant roles of p50 and p52 revealed by genome-wide analyses. Genome Med. 2016;8(1):1–16.

  20. Cheson BD, Fisher RI, Barrington SF, Cavalli F, Schwartz LH, Zucca E, Lister TA. Recommendations for initial evaluation, staging, and response assessment of hodgkin and non-hodgkin lymphoma: the lugano classification. J Clin Oncol. 2014;32(27):3059.

  21. Meignan M, Gallamini A, Meignan M, Gallamini A, Haioun C. Report on the first international workshop on interim-pet scan in lymphoma. Leukemia Lymphoma. 2009;50(8):1257–60.

  22. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1–5.

  23. Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradigms. 2011;3(1):4–21.

  24. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.

  25. Glaab E, Bacardit J, Garibaldi JM, Krasnogor N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PloS One. 2012;7(7):e39932.

  26. Pandey G, Pandey OP, Rogers AJ, Ahsen ME, Hoffman GE, Raby BA, Weiss ST, Schadt EE, Bunyavanich S. A nasal brush-based classifier of asthma identified by machine learning analysis of nasal RNA sequence data. Sci Rep. 2018;8(1):1–15.

  27. Wang L, Xi Y, Sung S, Qiao H. Rna-seq assistant: machine learning based methods to identify more transcriptional regulated genes. BMC Genomics. 2018;19(1):546.

  28. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–72.

  29. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947:50–60.

  30. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

  31. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev. 2001;5(1):3–55.

  32. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422.

  33. Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52(3/4):591–611.

  34. Pirooznia M, Yang JY, Yang MQ, Deng Y. A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008;9(1):1–13.

  35. Podolsky MD, Barchuk AA, Kuznetcov VI, Gusarova NF, Gaidukov VS, Tarakanov SA. Evaluation of machine learning algorithm utilization for lung cancer classification based on gene expression levels. Asian Pacific J Cancer Prev. 2016;17(2):835–8.

  36. Tabares-Soto R, Orozco-Arias S, Romero-Cano V, Bucheli VS, Rodríguez-Sotelo JL, Jiménez-Varón CF. A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Comput Sci. 2020;6:e270.

  37. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. 1992. p. 144–52.

  38. Fix E. Discriminatory analysis: nonparametric discrimination, consistency properties. USAF School of Aviation Medicine; 1951.

  39. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  40. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. p. 785–94.

  41. Ross Quinlan J. Induction of decision trees. Mach Learn. 1986;1(1):81–106.

  42. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. In: European conference on machine learning. Springer; 1998. p. 4–15.

  43. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems (NIPS 2011), vol. 24. 2011.

  44. Cheng Y, Church GM. Biclustering of expression data. ISMB. 2000;8:93–103.

  45. Bentham RB, Bryson K, Szabadkai G. Mcbiclust: a novel algorithm to discover large-scale functionally related gene sets from massive transcriptomics data collections. Nucleic Acids Res. 2017;45(15):8712–30.

  46. Williams A, Halappanavar S. Application of biclustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials. Beilstein J Nanotechnol. 2015;6(1):2438–48.

  47. Henriques R, Antunes C, Madeira SC. A structured view on pattern mining-based biclustering. Pattern Recogn. 2015;48(12):3941–58.

  48. Henriques R, Madeira SC. Bicpam: Pattern-based biclustering for biomedical data analysis. Algor Mol Biol. 2014;9(1):1–30.

  49. Henriques R, Ferreira FL, Madeira SC. Bicpams: software for biological data analysis with pattern-based biclustering. BMC Bioinform. 2017;18(1):1–16.

  50. Henriques R, Madeira SC. BSIG: evaluating the statistical significance of biclustering solutions. Data Mining Knowl Discov. 2018;32(1):124–61.

  51. Alexandre L, Costa RS, Santos LL, Henriques R. Mining pre-surgical patterns able to discriminate post-surgical outcomes in the oncological domain. IEEE J Biomed Health Inform. 2021.

  52. Patrício A, Costa RS, Henriques R, et al. Predictability of covid-19 hospitalizations, intensive care unit admissions, and respiratory assistance in portugal: Longitudinal cohort study. J Med Internet Res. 2021;23(4): e26075.

  53. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci. 2000;97(1):262–7.

  54. Lee Y, Lee C-K. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 2003;19(9):1132–9.

  55. Cordeiro A, Monzó M, Navarro A. Non-coding RNAs in Hodgkin lymphoma. Int J Mol Sci. 2017;18(6):1154.


Not applicable.

About this supplement

This article has been published as part of BMC Medical Genomics Volume 16 Supplement 1, 2023. Selected articles from the 20th Asia Pacific Bioinformatics Conference (APBC 2022): medical genomics. The full contents of the supplement are available online at


This work was supported by Fundação para a Ciência e a Tecnologia (FCT), through IDMEC under LAETA project (UIDB/50022/2020), IPOscore with reference (DSAIPA/DS/0042/2018), LAIfeBlood (DSAIPA/AI/0033/2019) and MONET (PTDC/CCI-BIO/4180/2020). This work was further supported by INESC-ID plurianual (UIDB/50021/2020) and the Associate Laboratory for Green Chemistry (LAQV), financed by national funds from FCT/MCTES (UIDB/50006/2020 and UIDP/50006/2020). RSC acknowledges the contract CEECIND/01399/2017. The funding bodies had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations



All authors contributed to the design of the methodology. AP implemented the methodology, conducted the experimental analysis, and produced the first draft of the manuscript. RH and RSC validated the work and revised the manuscript. All authors declare that they read and approved the final manuscript.

Corresponding author

Correspondence to André Patrício.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Patrício, A., Costa, R.S. & Henriques, R. On the challenges of predicting treatment response in Hodgkin’s Lymphoma using transcriptomic data. BMC Med Genomics 16 (Suppl 1), 170 (2023).



  • Hodgkin’s lymphoma
  • Cancer
  • Machine learning
  • Gene expression
  • Data modeling
  • Discriminative patterns
  • Biclustering
  • Computational biology