HDNA methylation data-based molecular subtype classification related to the prognosis of patients with hepatocellular carcinoma

Background DNA methylation is a common chemical modification of DNA in the carcinogenesis of hepatocellular carcinoma (HCC). Methods In this bioinformatics analysis, 348 liver cancer samples were collected from the Cancer Genome Atlas (TCGA) database to analyse specific DNA methylation sites that affect the prognosis of HCC patients. Results 10,699 CpG sites (CpGs) that were significantly related to the prognosis of patients were clustered into 7 subgroups, and the samples of each subgroup were significantly different in various clinical pathological data. In addition, by calculating the level of methylation sites in each subgroup, 119 methylation sites (corresponding to 105 genes) were selected as specific methylation sites within the subgroups. Moreover, genes in the corresponding promoter regions in which the above specific methylation sites were located were subjected to signalling pathway enrichment analysis, and it was discovered that these genes were enriched in the biological pathways that were reported to be closely correlated with HCC. Additionally, the transcription factor enrichment analysis revealed that these genes were mainly enriched in the transcription factor KROX. A naive Bayesian classification model was used to construct a prognostic model for HCC, and the training and test data sets were used for independent verification and testing. Conclusion This classification method can well reflect the heterogeneity of HCC samples and help to develop personalized treatment and accurately predict the prognosis of patients.


Background
The liver is the greatest digestive and endocrine organ in the human body, and primary HCC, which is characterized by easy metastasis and strong invasion, is one of the most commonly seen malignant tumours in the clinic, with a low 5-year survival rate [1]. HCC has become the second major malignant tumour that threatens human life [2]. In China, the leading aetiology of HCC is viral hepatitis, among which hepatitis B and hepatitis C are the most common, while alcoholic liver disease (ALD) is dominant in western countries [3]. In recent years, an increasing number of HCC cases with unknown reasons has been diagnosed, showing a younger trend [4]. In addition, with the advancement of HCC research, it has been discovered that epigenetic changes [5], including DNA methylation, histone modification and aberrant miRNA expression, play core roles in HCC genesis, wherein HCC genesis is related to abnormal DNA methylation [6]. A large amount of literature has indicated that DNA methylation represents a series of early and most frequent molecular behaviours during the HCC genesis process [7]. DNA methylation accumulates with disease progression and is currently regarded as an indication of malignant HCC [8]. Therefore, obtaining samples from HCC patients after diagnosis, carrying out specific analyses, and mining the specific biomarkers are the key to prognostic assessment, classification determination, recurrence judgement, and early selection of appropriate therapeutics and treatments.
DNA methylation refers to the process in which methyl groups are transferred onto the 5′-position carbon atom in the cytosine of the DNA CpG sequence to form 5-methylcytosine, with S-adenosylmethionine as the donor under the catalysis of DNA methyltransferases (DNMTs include three types, namely, DNMT1, DNMT2 and DNMT3) [9]. Hypermethylation of CpG islands is related to the silencing of tumour suppressor gene expression, which plays a key role during the early tumourigenesis process [10]. Additionally, the universally low methylation level of the tumour genome is closely associated with the activation of oncogenes, changes in chromatin structure, and the loss of basic groups; moreover, it is also closely correlated with tumour invasion, metastasis and prognosis [11]. Currently, research has verified that genome-wide low DNA methylation is an important mechanism responsible for early HCC formation and is also an important cause of early chromatin structural changes in non-cirrhosis HCC. The low methylation of genes such as AKT3, CD147, LINE1 and SFRP1 has been proven to be significantly correlated with the survival, tumour volume and malignant grade of HCC patients [12][13][14]. Moreover, the silencing of tumour suppressor genes and the loss of DNA repair function in HCC are closely correlated with the high methylation of promoter CpG islands, which prevents the timely repair of DNA damage, thus resulting in abnormal cell proliferation; for instance, the CpG islands of genes such as SOX1, VIM, BMP4 and CDKN2A show high methylation levels [15][16][17]. However, it is still unknown whether specific methylation patterns in these gene promoter regions display clinical significance compared with tumour classification, survival and prognosis in large-sample HCC patient data.
Therefore, this study aimed to establish a classification method that integrates several DNA methylation markers in order to help clinicians effectively assess HCC patient prognosis and select therapeutic strategies.

Methods
Pre-processing of preliminary sample data and initial screening of HCC DNA methylation sites The latest clinical follow-up information was downloaded using the TCGA GDC API on July 31st, 2018 (as shown in Table S1.txt) and contained a total of 377 samples (as shown in Table S11.txt); the RNA-Seq data (as shown in Table S2.txt) included 424 samples. Illumina Infinium HumanMethylation450 data [18] were downloaded using the UCSC Cancer Browser [19] and contained 429 samples.
Subsequently, the samples with a follow-up period of over 30 days were screened from the clinical data for further methylation profile matching, and a total of 348 samples detected for methylation were screened. Moreover, the CpG sites with an NA ratio of over 70% were removed from all samples, and the cross-reactive CpG sites in the genome were also removed according to the cross-reactive sites. In addition, the missing values of methylation profiles were filled using the KNN method with the impute R package, and the unstable genomic methylation sites were further removed in order to remove the CpG sites and single nucleotide sites from the sex chromosomes. A total of 208,022 methylation sites were finally obtained. Second, the 348 samples were divided into a training set (n = 174) and test set (n = 174). The obtained clinical follow-up information is shown in Table S3.txt and the clinical follow-up information is shown in Table S4.txt.

Univariate analysis and multivariate survival analysis of the training set methylation sites
First, a univariate Cox proportional hazards regression model was applied to analyse each methylation site and the survival data using the coxph function of the survival R package [20], with p < 0.05 as the significant threshold. Subsequently, the univariate Cox proportional hazards regression model was also employed to analyse age, T, N, and M stages, grade, sex and survival data. The results suggested that the T, N and M stages showed significant differences in prognosis prediction, with logrank p-values of 0.002446, 0.034366 and 0.00259, respectively. Then, the univariate Cox model was used to select the significant methylation sites for multivariate Cox proportional hazards regression model analysis, with the T, N and M stages as the covariants in the model and P < 0.05 as the significant threshold. Finally, the methylation sites of the training set samples that were significant in both the univariate and multivariate analyses were selected as characteristic biomarkers for further analysis.
Molecular subtyping of HCC, screening of intra-groupspecific methylation sites, and enrichment analysis of signalling pathways and transcription factors First, the ConsensusClusterPlus R software package [21] was used for the consistent clustering of characteristic methylation sites that were significant in both the univariate and multivariate analyses, and the molecular subtypes were screened for subgroup classification. The similarity distance between samples was calculated by the Euclidean distance [22], K-means was used for clustering, 80% samples were sampled 100 times by adopting the resampling program, and the optimal cluster number was determined by the cumulative distribution function (CDF) [23]. On this basis, methylation expression profile cluster analysis and clinical characteristic analysis of each subgroup were also performed. Subsequently, the EpiDiff software [24] was also employed to identify the specific methylation sites. For each cluster, the average value of each methylation level of the methylation sites that were significant in both the univariate and multivariate analyses was calculated, the obtained matrix was used as the input data of EpiDiff software, the threshold was set at 2.099, and the cluster-specific methylation sites were screened for genomic annotations. Additionally, the correlation of these specific methylation sites with the gene expression in the subgroups was explored. A total of 172 corresponding samples that had been detected with RNA-Seq were identified in the training set samples, gene expression profiles were extracted from these 172 samples to plot the expression profile heat map, and the correlation and consistency of the gene DNA methylation level with gene expression was observed. Finally, the corresponding genes in the promoter regions of these specific methylation sites were subjected to signalling pathway and transcription factor enrichment analysis using cluster Profiler [25] and g:profiler [26], respectively.

Construction and testing of the prognosis prediction model for HCC patients
The Bayesian network classifier was constructed using the above identified specifically expressed methylation sites in each subtype, and the model performance was judged through 10-fold cross-validation. Subsequently, the expression profile data of specific CpG methylation sites were extracted from both the training set and test set and were substituted into the model for calculation. In addition, the prediction results were calculated, and the prediction accuracy of the prognosis classification model as well as the identification stability of the methylation features were verified and analysed. (A study flowchart has been added, see Figure S1.tif).

Results
Mining characteristic DNA methylation sites based on the survival and prognosis data of HCC patients First, a series of data downloaded from TCGA was preprocessed, including filling in the missing values, removing the CpG sites and single nucleotide sites from the sex chromosomes, removing the CpG sites with cross-reactivity from the genome, excluding samples with incomplete data, and randomly dividing the samples into a training set and test set (see Materials and methods). Afterwards, the methylation sites and survival data were subjected to univariate Cox proportional hazards regression model analysis (using the coxph function in the survival R package, see Table S5.txt), with the significance threshold set as P < 0.05. A total of 26,208 significantly different sites regarding prognosis were discovered, and the top 20 are shown in Table 1. Similarly, our findings revealed that the T, N and M stages were also significantly correlated with prognosis, with logrank P values of 0.002446, 0.034366 and 0.00259, respectively. In addition, the univariate Cox model was used to select the significant methylation sites for multivariate Cox proportional hazards regression model analysis, with the T, N and M stages as the covariants in the model. Finally, a total of 10,699 significant methylation sites were obtained (see Table S6.txt). As a result, the methylation sites (n = 10, 699) from the training set samples that were significant in both the univariate and multivariate analyses were selected as characteristic biomarkers for further analysis.
Using the characteristic DNA methylation sites for the consistent clustering of HCC molecular subgroups The Consensus Cluster Plus R software package was used for the consistent clustering of characteristic methylation sites that were significant in both the univariate and multivariate analyses, and the molecular subtypes were screened for subgroup classification. The similarity distance between samples was calculated by the Euclidean distance, K-means was used for clustering, 80% samples were sampled 100 times by adopting the resampling program, and the optimal cluster number was determined by the CDF. As shown in Fig. 1a, stable clustering results could be obtained when the cluster number was 6 or 7. Further observation of the CDF delta area curve (Fig. 1b) suggested that stable clustering results could be acquired when the cluster number was set as 7. Finally, k = 7 was selected to obtain 7 molecular subtypes.

Methylation expression profile cluster analysis and clinical characteristic analysis based on the HCC molecular subgroups
The stable clustering results at k = 7 were selected according to consistent clustering. As presented in Fig. 2a, 174 tumour samples were assigned to these 7 subgroups. Furthermore, 10,699 methylation profiles were used for cluster analysis, and the distance between methylation sites was calculated by the Euclidean distance. Figure 2b shows that most methylation sites had low abundance in each sample, while the methylation expression profiles of these 7 clusters of samples were also greatly different. In particular, the methylation levels of the samples in Cluster5 and Cluster6 were markedly lower than those of the other groups.  The distribution of the samples according to the 7 molecular subtypes regarding prognosis, T, N, and M stage, grade and age was further analysed, as displayed in Fig. 3. Figure 3a shows that the samples in Cluster5 and Clus-ter6 had the highest invasion degrees; Fig. 3b suggested that M1 samples were mainly distributed in Cluster4 and Cluster5; Fig. 3c indicated that the samples in Clus-ter5 and Cluster6 had high stages; Fig. 3d suggested that Cluster6 had the largest number of G4 samples; Fig. 3e presented the great difference in the age distribution among the different clusters; and Fig. 3f revealed a significant difference in prognosis among these 7 clusters of samples, among which the samples in Cluster2 had the best prognosis, while the samples in Cluster5 and Cluster6 had the poorest prognosis.

Screening of intra-group-specific methylation sites and enrichment analysis of transcription factors based on the HCC molecular subtype classification
To identify the methylation-based molecular subtypes of HCC, the EpiDiff software was used to identify specific methylation sites. For each cluster, the average value of the methylation levels of the 10,699 methylation sites was calculated, and the obtained 10,699 × 6 matrix was used as the input data of EpiDiff software, as presented in Table S7.txt. The threshold was set at 2.099, and 119 methylation sites were considered cluster-specific methylation sites, as displayed in Table S8.txt. The heat map is presented in Fig. 4a, which suggested that Cluster6 had the most specific methylation sites, and most of them were lowly methylated sites, while a small number of specific methylation sites also existed in the remaining clusters, which were mostly hypermethylated. In addition, there were no specific methylation sites in Cluster1. The modification level heat map of the specific methylation modification sites is illustrated in Fig. 4b. These 119 methylation sites were then subjected to genomic annotations, which obtained 105 genes near these methylation sites, as shown in Table S9.
Moreover, the correlation of these specific methylation sites with gene expression in the subgroups was explored. A total of 172 corresponding samples that had been detected with RNA-Seq were identified in the training set samples, 105 gene expression profiles were extracted from these 172 samples (as presented in Table  S10), and the expression profile heat map was further plotted. As shown in Fig. 4c, there were different expression patterns in these subgroups at the expression profile level, suggesting partial consistency between the DNA methylation levels of these genes and their expression.
Furthermore, to observe the mechanism of action of these specific methylation sites, KEGG enrichment analysis was carried out on the corresponding genes in the promoter regions of these specific methylation sites using the cluster Profiler R software package, as shown in Fig. 5a. It could be seen from the figure that these Fig. 2 The consensus matrix for DNA methylation classification with the corresponding heat map. a The colour-coded heatmap corresponding to the consensus matrix for k = 7 obtained by applying consensus clustering. The colour gradients from 0 to 1 represent the degree of consensus, with white corresponding to 0 and dark blue corresponding to 1. b The heatmap corresponding to the dendrogram in Figure a, which was generated using the pheatmap function with DNA methylation classifications, TNM stage, clinicopathological stage and histological type as the annotations genes were enriched in multiple cancer-related pathways, especially the HCC pathway. Additionally, g: profiler was further used for transcription factor enrichment analysis, which revealed that these genes were enriched in the transcription factor KROX (TF: M00982), with a significant FDR value of 8.46e-03. The corresponding genes are shown in Fig. 5b, which suggested that 48 genes were enriched in KROX and involved 67 methylation sites.

Construction of the prognosis prediction model for HCC patients and evaluation in the test set
To verify the differential ability of the identified specific methylation sites, the 119 specific methylation sites Fig. 4 The specific hyper/hypomethylated CpG sites for each DNA methylation cluster. a The display of specific CpG sites for each DNA methylation prognostic subtype. The red bars and blue bars represent hypermethylated CpG sites and hypomethylated CpG sites, respectively. b The heat map for the specific sites among the seven DNA methylation clusters. c The heatmap for the expression of specific related genes among the seven DNA methylation clusters Fig. 3 Comparison of prognosis, TNM stage, grade and age between each DNA methylation cluster. a The survival curves of each DNA methylation subtype in the training set. The horizontal axis represents the survival time (days), and the vertical axis represents the probability of survival. The metastasis b, stage score c, grade score d, age e and survival probability f distributions of each DNA methylation subtype in the training set. The horizontal axis represents the DNA methylation clusters identified using EpiDiff software were used to construct a Bayesian network classifier, and the model performance was judged through 10-fold cross-validation. The classification accuracy of the model constructed based on the training set was 70.68%, and the area under the ROC curve (AUC) was 0.923 (Fig. 6a), suggesting that this classifier had a favourable classification performance.
Furthermore, to verify the stability and reliability of the model, the expression profile data of these 119 CpG methylation sites were extracted from the test set and incorporated into the model for model verification. The predicted results are shown in Table 2, which shows that the cluster statistic number of the training set was similar to that of the predicted samples. Furthermore, the methylation patterns of these 7 clusters of samples were plotted using the pheatmap R package (Fig. 6b). The heatmap showed the presence of markedly distinct methylation patterns among these 7 clusters of samples.  The colour-coded heatmap corresponding to the consensus matrix for k = 7 obtained by applying consensus clustering from the test set. c Survival curves of the seven clusters predicted from the test set using the prognosis model. The log-rank test was used to assess the statistical significance of the difference Moreover, the differences in prognosis among these 7 clusters of samples were also analysed (Fig. 6c), and significant differences were found in the prognosis of these 7 clusters of samples (p = 0.00068). In addition, the Cluster2 samples showed markedly better prognoses over the other clusters of samples, which were consistent with the findings in the training set data.
Finally, the clinical features of these 7 clusters of samples in the test dataset were analysed, and Fig. 7 shows that the clinical feature distribution in each cluster was consistent with that in the training set. Overall, the prognosis prediction model constructed using these 119 methylation profiles showed high prediction accuracy and stability in identifying the methylation features.

Discussion
HCC is one of the most commonly seen malignant tumours and is the third leading the global digestive system tumour in terms of its mortality; the morbidity and mortality of HCC also show increasing trends [27,28]. The leading causes of HCC are chronic hepatitis viral infection (such as HBV and HCV), liver cirrhosis, malnutrition, toxin invasion and metabolic disturbance. Currently, great achievements have been attained in the diagnosis and treatment of HCC, but its pathogenesis remains incompletely elucidated.
Studies have shown that tumours are highly heterogeneous [29], and heterogeneity is the main cause of the inconsistent clinical treatment of tumours. Finn et al. showed different clinical outcomes of different molecular subtypes of HCC, and different molecular subtypes in tumours respond differently to clinical therapeutic drugs [30]. At present, epigenetic abnormalities have been considered to be characteristic of many types of cancer [31]. The epigenome map can be used to identify the molecular subtypes of a particular cancer and is related to clinical outcomes [32]. For example, the CpG island methylation phenotype is found in many types of cancer, abnormal DNA methylation patterns can be used to detect the presence of precancerous lesions or malignant cells in tissue sections [33][34][35]. Abnormal DNA methylation can also be used to detect free tumour DNA in the blood of some cancer patients [36]. Seven different molecular subtypes were found in our study. They not only have different epigenetic patterns but also have significant differences in clinical characteristics such as age, tumour stage, histological type, and prognosis. These results indicate that the individuals in these clusters may have different responses to clinical drugs and provide a reference for personalized therapy and drug development. We further identified subtype-specific methylation sites that can be used as markers to predict patient outcomes.
At present, a large amount of literature has reported the role of DNA methylation in HCC diagnosis, treatment and prognostication. First, compared with genetic mutations, DNA methylation changes are earlier events during the cell carcinogenesis process, while the DNA decomposed after tumour tissue necrosis will enter the peripheral blood (plasma or serum) or other body fluids (such as saliva or sputum). Therefore, detecting the methylation status of the isolated tumour-related genes in body fluids may be applied in the early diagnosis of HCC. For instance, the promoter methylation status of two factors, TFPI2 and IGFBP7, in serum is markedly higher than that in normal subjects and hepatitis patients and may thereby become a non-invasive molecular biomarker for the early diagnosis of HCC in the clinic [37,38]. Second, DNA methylation is mediated by methyltransferase, and it has become a novel approach in the field of HCC treatment to explore therapeutic strategies based on changes in DNMT activity and DNA methylation patterns. In addition, preclinical studies have verified that drugs targeting the DNMT action substrate (azacitidine) and the DNMT cofactor SAM (sinefungin) as well as the anti-SAM metabolite neplanosin, and the non-competitive non-nucleotide transmethylase inhibitor procainamide can reverse the abnormal methylation in tumour cells, re-express the inactivated genes, and suppress cancer cell proliferation and invasion [39][40][41]. Finally, DNA methylation also plays an important role in the prognosis of HCC patients. The CpG island methylation phenotype (CIMP) refers to the presence of methylation in the CpG island of multiple gene promoter regions. Some articles have suggested that patients with high tumour CIMP are associated with a markedly lower survival rate than those with low or no CIMP, while patients with various CIMP degrees also have evidently different tumour metastasis and TNM stages [42]. In addition, the correlation between the methylation of multiple gene promoter sequences and patient prognosis has also been verified, such as Tip30, RASSF1A and CD147 [43]. Nevertheless, the specific methylation sequences in the promoter regions of these gene remain unclear, and it is still unknown whether the methylation of these genes displays clinical significance compared with tumour classification, survival and prognosis in large-sample HCC patient data. This study attempted to establish a classification method that integrates several DNA methylation markers based on solving these problems in order to help clinicians effectively assess the prognosis of HCC patients and select the appropriate therapeutic strategies for HCC patients.

Funding
The study (sample collection, analysis, and interpretation of data) was funded by the National Natural Science Foundation of China (NSFC 81702526). The funder had no role in study design, data collection and analysis, decision to publish.

Availability of data and materials
The data was downloaded from http://xena.ucsc.edu/and https://portal.gdc. cancer.gov/. The direct web links to each of the datasets are provided in Table S11.