Analysis of multi-factor regulated network and different clusters in hypertrophic obstructive cardiomyopathy

We herein explored the regulatory network between lncRNA, miRNAs, mRNAs, and TFs in the HOCM and divided HOCM samples into different clusters to identify key genes and regulatory factors. The four gene sets from GEO, GENE and OMIM databases were integrated and WGCNA network was used to obtain two modules and 32 core genes. Through online database, miRNA-lncRNA, miRNA-mRNA, and TF-mRNA interaction pairs were obtained to screen the regulatory factors, and �nally a multi-factor regulatory network was obtained. The �rst 7 regulatory factors were obtained as the core regulatory factors. The unsupervised clustering method was used to divide HOCM samples into 4 clusters. As a result, four genes including COMP, FMOD, AEBP1 and SULF1 showed signi�cant expression in different clusters. Finally, we got 4 genes are considered as important biomarkers for different progressive stages or prognosis of HOCM. These might assist in discovering molecular mechanisms of HOCM in the future.


Introduction
Hypertrophic cardiomyopathy (HCM) is a common inheritable heart disease, with an estimated incidence rate of 0.2% in the general population [1,2].It is regarded as one of the main reasons for sudden cardiac death (SCD) in young people [3].Of note, about 25-70% of HCM patients had left ventricular out ow tract obstruction LVOTO [4][5][6], which is also referred to as hypertrophic obstructive cardiomyopathy (HOCM).
HOCM shows stronger symptoms and requires varied treatment strategies when compared with hypertrophic nonobstructive cardiomyopathy [7].Unfortunately, SCD might be the rst symptom observed in young individuals.However, the underlying molecular mechanisms of HOCM are still poorly understood.Therefore, it is necessary to study the mechanism of HOCM.
Long non-coding RNAs (lncRNAs) have more than 200 nucleotides and have similar structures with that of mRNAs that are not translated into proteins [8].These participate in a variety of biological processes [9].Previous studies have revealed that dysregulation of lncRNAs is involved in the pathogenesis of HCM [10,11].MicroRNAs (miRNAs) are noncoding RNA molecules (about 21 nucleotides) that prevent gene expression through post-transcriptional regulation.Some studies have reported that miRNAs, such as miR-1 [12], miR-451 [13], miR-22 [14], and so on, play an essential role in cardiac hypertrophy.
Further non-coding RNAs (ncRNAs) act as crucial biomarkers and functions as therapeutic targets in cardiovascular diseases.However, the signaling pathways and regulatory networks underlying the pathogenesis of HOCM is largely unknown.So, the discovery of new regulatory targets of HOCM is imperative for future therapeutic development.
In our research, two public datasets (GSE36961 and GSE89714, including HOCM samples and normal samples) and Gene database of NCBI, Online Mendelian Inheritance in Man database (OMIM) were integrated to screen the core genes associated with HOCM.Using the online database, a multi-factor regulatory network was constructed and divided HOCM into 4 clusters according to the core genes.Our research provides useful information in exploring the potential biomarkers for diagnosis and treatment of HOCM.

Materials
The datasets of GSE36961 contributed by Hebl VB et al. and GSE89714 contributed by Li Y et al. were downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/), and the expression pro ling was generated using GPL15389 (Illumina HumanHT-12 V3.0 expression beadchip) and GPL11154 (IlluminaHiSeq 2000).The GSE36961 dataset contains a total of 39 control samples and 106 case samples, and the other dataset -GSE89714 included 5 case samples and 4 control samples.The case samples were of HOCM patients who underwent surgical septal myectomy because of LVOTO, and the control samples were donor cardiac tissues.Statistical analysis with the R (version 3.6.0)was performed.

Data procession
The datasets of R language including the matching probes with genes were processed, and the median value of the expression level of the gene was selected when multiple probes were applied to the same gene.The differentially expressed genes (DEGs) were then obtained by using the R software package "limma", according to log 2 |fold change| (|log 2 FC|>0.58) and adjusted the p-value (adj.P.Val < 0.05).[15,16]

Identi cation of candidate gene set and function analysis
The DEGs from the two datasets were summed up, and then the HCM-associated genes from GENE and OMIM databases were obtained.Next, the redundancy was removed to obtain candidate gene set, and the expression pro le data of candidate gene set in GSE36961 were selected for the weighted co-expression network analysis (WGCNA).The "ClusterPro ler" software package from R language was used to perform enrichment analysis of genes of candidate gene set including KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis and gene ontology (GO) enrichment analysis.

WGCNA
The "WGCNA" software package from R language was used to construct the weighted co-expression network for the candidate gene set.After screening HOCM-related modules, the "ClusterPro ler" software package from R language was used for enrichment analysis of genes from modules.Based on GO analysis and KEGG analysis, the signi cant GO terms and pathways were identi ed (adj P-Val < 0.05).Meanwhile, the co-expressed gene set within the module was constructed into an interaction subnetwork, and the core genes of the module were screened according to the node degree of the network using the Cytoscape3.7.2 to visualize the network.

Construction of the multi-factor regulatory network
To screen the TF gene, the lncRNA, and the miRNA that interact with the core genes, the interaction pairs between ncRNAs and their target genes from the starBase v2.0 (http://starbase.sysu.edu.cn/)database (including miRNA-lncRNA and miRNA-mRNA)and interaction pairs between TFs and their target genes (TF-mRNA) from the TRRUST v2 database (https://www.grnpedia.org/trrust/)were downloaded [17,18].
MiRNA-lncRNA pairs should meet the number of supporting experiments that are greater or equal to high stringency (≥ 3), and the miRNA-mRNA interaction pairs should also be identi ed by at least one of the softwares that consisted of targetScan, picTar, RNA22, PITA and miHCMnda/mirSVR.Finally, interaction pairs of core genes containing TF-mRNA, miRNA-lncRNA and miRNA-mRNA were obtained to construct a multi-factor regulatory network.By calculating the degree of each regulator (miRNA, lncRNA, TF), the regulators that act as core regulators (degree > 5) were screened.

Clusters of HOCM
The key genes that co-expressed to GSE36961 for unsupervised clustering were mapped, and then the HOCM samples (N = 106) were selected to classify the HOCM samples by K-means unsupervised clustering method combined with "tsne" dimension reduction.Firstly, we chose the optimal K value (number of categories) by nding the in ection point of SSE (sum of the squared error, i.e., the sum of the squares of the distances of all points to the center of the cluster to which they belong) to divide HOCM samples into clusters.After that, the expression patterns of these core genes in different subtypes were examined and the genes with signi cant differences in different clusters were analyzed, which might act as potential marker genes in HOCM subtypes.

Identi cation of candidate gene set and function analysis
The steps conducted in our study were shown in Fig. 1A, and the identi ed DEGs of the datasets GSE110226 and GSE89714 in HOCM were presented in Supplementary Table 1.A total of 628 DEGs were identi ed from GSE36961 (|Log 2 FC|>0.58,adj.P.Val < 0.05) between HOCM samples and normal tissues, which included 244 upregulated and 384 downregulated genes, and 1483 DEGs were identi ed from GSE89714 (|Log 2 FC|>0.58,adj.P.Val < 0.05), which included 903 upregulated and 580 downregulated genes (Fig. 1B).The volcano plots (Supplementary Fig. 1) are presented to display the DEGs.
A total of 177 HCM-associated genes were obtained from the GENE database of NCBI and 124 genes from the OMIM database (Supplementary Table 2).The DEGs of the two datasets were combined, and the HCM-associated genes were searched in two databases and removed the redundancy to get 2239 candidate genes, and this was considered as a candidate gene set (Fig. 1D).Furthermore, GO analysis and KEGG analysis were performed for candidate genes.The results of KEGG analysis showed that candidate genes showed signi cant enrichment in HCM and dilated cardiomyopathy pathways, and biological process (BP) of GO enrichment analysis involved in the muscle system process and heart process (Fig. 1E and Supplementary Table 3).

WGCNA
Based on 2239 candidate genes, the gene expression pro le data from GSE36961 was applied to construct the WGCNA network (Supplementary Table 4).The results of the cluster showed no outlier sample, and so further analysis for 145 samples in the GSE36961 dataset was performed.A soft threshold (β) value of 8 was set (Fig. 2A).Next, the expression matrix was transformed into the adjacenct matrix, and then the adjacenct matrix was transformed into the topology matrix (TOM).Based on TOM, the average-linkage hierarchical clustering method was used to cluster the genes, and set a minimum number of genes for each gene network module to be 30.After determining the gene module by dynamic shear method, the eigengenes of each module were calculated at one time, and then clustering analysis was performed on the module.The modules with a relatively close distance were merged into a new module.A height of 0.25 was set, and a total of 5 modules were obtained (Fig. 2B and Supplementary Table 5) The Pearson correlation coe cient between the module eigengenes and HOCM samples was calculated.
The bigger value of the module is considered as more important in HOCM.Subsequently, the value of gene signi cance (GS) of every module was obtained.The bigger value of GS represented a more relative module with HOCM.Finally, the turquoise module (cor = 0.82, p < 1e-200) and blue module (cor = 0.72, p < 4.3e-25) were considered as the most relative modules with HOCM (Fig. 2C, D and Supplementay Fig. 2).Through the BP of GO enrichment analysis of the gene from the two modules, the genes in the turquoise module showed association with the muscle system process and muscle tissue development and the genes from the blue module participated with the muscle system process and the heart process.(Fig. 3 and Supplementary Table 5)

Construction of the multi-factor regulatory network
To identify the association of core genes with HOCM, the genes from the two modules were further analyzed.According to the gene expression relationship in the turquoise module, co-expression pairs with a connection threshold value of no less than 0.2 as edges of the co-expression network were chosen to construct a turquoise module network diagram, and the genes were selected with a node degree greater or equal to 3 (N = 15) as the core genes of the turquoise module.Similarly, a connection threshold value of no less than 0.07 was set and the node degree of greater or equal to 3 (N = 17) was selected as the core gene of the blue module.Also the 32 core genes were used as co-expression key genes for further analysis.(Fig. 4A and Supplementary Table 6) A total of 376 miRNA-lncRNA interactions and 160774 miRNA-mRNA interactions were obtained from the starBase v2.0 database, and a total of human 9396 TF-mRNA interaction pairs from the TRRUST v2 database were obtained (Supplementary Table 7).Firstly, the miRNAs and transcription factors that interact with the co-expressed key genes (N = 32) were screened, followed by selecting the lncRNAs that interact with these miRNAs, and nally obtained a multi-factor regulatory network with 175 interaction pairs.(Fig. 4B and Supplementary Table 8) Using the Mcode plugin of Cytoscape to calculate the degrees of every regulatory factor (miRNA, lncRNA, and TF), the rst 7 regulatory factors were obtained as the core regulatory factors, which included the lncRNAs (XIST, MALAT1, H19), TFs (SPI1 and SP1) and miRNAs (hsa-miR-29b-39 and has-miR-29a-3p).(Fig. 4C and Supplementary Table 8) To explore the functions and pathways that are regulated by these core regulators, a total of 8397 target genes that interact with the core regulators from miRNA-lncRNA, miRNA-mRNA, and TF-mRNA interaction pairs were selected.Enrichment analysis for these target genes was performed, and the results obtained the target genes that are related to biological processes such as positive regulation of catabolic process, histone modi cation, and proteasomal protein catabolic process.(Supplementary Fig. 3 and Supplementary Table 8)

Clusters of HOCM
The 32 co-expressed key genes were mapped to GSE36961 for unsupervised clustering, and an optimal K value of 4 was set.(Fig. 5A and Supplementary Table 9) All HOCM samples were clearly divided into 4 clusters, and the expression of key genes in all HCMs was shown in the heatmap.(Fig. 5B,C and Supplementary Table 9) Furthermore, the differences of 32 co-expressed key genes in different clusters were analyzed, and the average value of each gene in each class was taken as the gene expression value in that class.As a result, four genes including the COMP, FMOD, AEBP1 and SULF1 showed signi cant expression in different clusters (p < 0.01), (Fig. 5D, E).

Discussion
HOCM is regarded as a primary risk factor of SCD caused by HCM.Therefore, exploration of the pathogenic mechanism is of utmost importance.In our recent study, two datasets were integrated from the GEO database on HOCM and associated genes of HCM in the GENE and OMIM databases to obtain the candidate gene set.Then, WGCNA method was used to identify the related modules of HOCM.The integration of high-throughput data, online databases and bioinformatic method for scale-free network have widened the disease spectrum and strengthened the evidence.BP analysis indicated that the candidate gene set and genes in most of the relevant modules were targeted to muscle system process, muscle contraction and heart process.Pathway analysis indicated that candidate gene set was mostly enriched in HCM, focal adhesion and dilated cardiomyopathy.These results showed correlation with HOCM, and so, the co-expressed 32 genes with the highest degree were chosen in the two modules as core genes.
The miRNAs, lncRNAs and TFs that interact with the co-expressed key genes were then screened to obtain a multi-factor regulatory network.To date, several studies have reported the features of ncRNAs in the HOCM [19,20].Nevertheless, the details of RNA crosstalk in HOCM have not been elucidated.In this study, an integrated lncRNA-miRNA-mRNA-TF regulatory network was constructed, expounding the views on gene regulation at the pre-transcriptional and post-transcriptional levels.Moreover, bioinformatics technology was applied to explore the key molecules that are involved in the process of HOCM, which might be considered as optimal candidate markers for future therapy.The rst 7 regulatory factors were found as the core regulatory factors, which included the lncRNAs (XIST, MALAT1, H19), TFs (SPI1 and SP1) and miRNAs (hsa-miR-29b-39 and has-miR-29a-3p).XIST, which is named as lncRNA X-inactive speci c transcript, has been identi ed as a necessary regulator of cardiac hypertrophy by regulating miR-101 [21] and miR-330 [22].XIST also showed association with heart failure in females [23].In vivo experiments revealed that knockdown of XIST can inhibit the myocardial cell apoptosis in acute myocardial infarction rat model by regulating miR-449 [24].Besides, H19 has been identi ed as a regulator that targets PPARα of cardiac hypertrophy [25,26].The results showed that XIST, MALAT1 and H19 possibly regulated other miRNAs involved in cardiac hypertrophy, such as miR-15b [27], miR-19b[28] and miR-29b [29] in our research.
The miRNAs and TFs consistent with other researches were identi ed.MiR-29 is a regulator of cardiomyocyte hypertrophy through wnt and mTOR signaling pathways [30,31].Moreover, SP1 can regulate cardiomyocyte hypertrophy by inducing lncRNA CTBP-AS2 [32] and SP1/GATA4 signaling pathways [33].However, SPI1 has not been reported in cardiac hypertrophy so far.So, the core genes were used to divide the HOCM into 4 clusters.Although no difference was observed in the by the expression of a single gene, the HOCM was divided into 4 clusters clearly by combining the 32 co-expressed key genes.Finally, the 4 genes COMP, FMOD, AEBP1 and SULF1 showed signi cantly different expression in the four clusters.From this point, we speculated that these 32 co-expressed key genes are of great signi cance for HOCM typing and the 4 genes are considered as important biomarkers due to different progressive stages or prognosis of HOCM.However, there are several limitations to this study.Firstly, our study focused on data mining and analysis without any experimental con rmation; and 2) due to lack of relevant prognostic information, clinical classi cation of HOCM related to key genes and survival analysis associated with key genes was not conducted.

Conclusion
In our study, a total of 32 key genes and 7 regulatory factors that might be used as biomarkers in HOCM were identi ed.Classi cation of HOCM samples based on the co-expression of key genes demonstrated a better classi cation effect.Four genes showed signi cant differences in different subclasses and can be used as marker genes for different hypertrophic cardiomyopathy subclasses.The analysis method used an online database for gene expression, and provided a novel method for the diagnosis and prognosis of HOCM.However, based on the limitations of the present study, additional clinical research and molecular mechanisms are warranted to con rm these ndings.Construction of multi-factor regulatory network A) Core genes in the blue module were 17 (left), and core genes in the turquoise module were 15 (right), core genes were labeled in the red color; and B) the construction of multi-factor regulatory network was done by Cytoscape software.The green rhombus represent transcription factor.The red circles represent genes (mRNAs).The orange triangles represent miRNAs.The purple arrows represent lncRNAs.C) Seven key regulatory factors were found in the network.The higher the degree is, the more important the regulatory function in the network is.

Declarations Figures
Clusters of HOCM.A) K = 4 was selected as the optimal number of clusters since the K value is decreased by a negligible amount.B) tSNE algorithm provided each sample with a unique x-and y-coordinate (tSNE1 and tSNE2) according to each sample's gene expression of 32 core genes.All HOCM samples