Identification of gene co-regulatory modules and associated cis-elements involved in degenerative heart disease
© Danko and Pertsov. 2009
Received: 10 November 2008
Accepted: 28 May 2009
Published: 28 May 2009
Skip to main content
© Danko and Pertsov. 2009
Received: 10 November 2008
Accepted: 28 May 2009
Published: 28 May 2009
Cardiomyopathies, degenerative diseases of cardiac muscle, are among the leading causes of death in the developed world. Microarray studies of cardiomyopathies have identified up to several hundred genes that significantly alter their expression patterns as the disease progresses. However, the regulatory mechanisms driving these changes, in particular the networks of transcription factors involved, remain poorly understood. Our goals are (A) to identify modules of co-regulated genes that undergo similar changes in expression in various types of cardiomyopathies, and (B) to reveal the specific pattern of transcription factor binding sites, cis-elements, in the proximal promoter region of genes comprising such modules.
We analyzed 149 microarray samples from human hypertrophic and dilated cardiomyopathies of various etiologies. Hierarchical clustering and Gene Ontology annotations were applied to identify modules enriched in genes with highly correlated expression and a similar physiological function. To discover motifs that may underly changes in expression, we used the promoter regions for genes in three of the most interesting modules as input to motif discovery algorithms. The resulting motifs were used to construct a probabilistic model predictive of changes in expression across different cardiomyopathies.
We found that three modules with the highest degree of functional enrichment contain genes involved in myocardial contraction (n = 9), energy generation (n = 20), or protein translation (n = 20). Using motif discovery tools revealed that genes in the contractile module were found to contain a TATA-box followed by a CACC-box, and are depleted in other GC-rich motifs; whereas genes in the translation module contain a pyrimidine-rich initiator, Elk-1, SP-1, and a novel motif with a GCGC core. Using a naïve Bayes classifier revealed that patterns of motifs are statistically predictive of expression patterns, with odds ratios of 2.7 (contractile), 1.9 (energy generation), and 5.5 (protein translation).
We identified patterns comprised of putative cis-regulatory motifs enriched in the upstream promoter sequence of genes that undergo similar changes in expression secondary to cardiomyopathies of various etiologies. Our analysis is a first step towards understanding transcription factor networks that are active in regulating gene expression during degenerative heart disease.
Heart disease is the leading cause of death in the developed world. Chronic heart disease is usually associated with tissue remodeling that induces maladaptive changes in gene expression and the cellular composition of cardiac tissue. Different forms of the disease are widely believed to progress according to distinct programs of gene expression that converge in end stage heart failure to similar phenotypes . Microarrays have been used to characterize these differences, typically by focusing on changes in gene expression that exceed a statistical threshold [2, 3]. Such methods of gene selection have proven useful for classifying different etiologies [4, 5] and explaining certain aspects of the pathophysiology [6–9]. However, such a strategy is not able to identify the network of regulatory factors that facilitate gene expression in healthy tissue and during cardiac disease. In the present study, we apply a set of basic analytical tools to identify regulatory factors using microarray data and the upstream promoter sequence of each gene. We apply these tools to predict cis-regulatory motifs involved in remodeling cardiac tissue in different types of human cardiomyopathy.
It is well established in yeast  and cultured human cells  that genes involved in a common physiological function tend to be regulated as groups. In such a group, often called a co-regulatory module , genes undergo similar changes in expression that act to roughly preserve their expression ratio over different physiological conditions and intrinsic genetic cues. Our goal is to identify such modules in human cardiomyopathies, under the assumption that these modules can provide information about the regulatory factors that control expression. Our analysis uses publicly available microarray data for human ventricular tissue remodeling due to a variety of cardiac disease states. To identify likely regulatory modules in this data, we applied a hierarchical clustering algorithm to the Pearson correlation between gene expression levels across the different cardiomyopathies. Resulting clusters were visualized and characterized based on Gene Ontology annotations for function. With this analysis, we identified 35 modules, the largest of which are enriched in genes whose primary function is related to energy generation or protein translation.
Next, we addressed the question of what controls the coordinated changes in gene expression that are observed during heart disease. It is well accepted that changes in gene expression are encoded by the combinatorial activity of several different transcription factor proteins working in concert [13–15]. Changes over different physiological conditions presumably involve the activity of different combinations of transcription factors; genes whose expression is controlled by the same set of transcription factors may be expected to undergo similar changes in expression . Transcription factors associated with the regulation of a gene can be identified by the presence of characteristic cis-regulatory motifs in the upstream promoter sequence to which they bind. Therefore, we sought to identify putative regulatory motifs involved in transcriptional regulation of genes composing the different co-regulatory modules. Our motif discovery strategy identified 17 motifs, and we validated their function with additional bioinformatic analysis using other genes.
We verified the effectiveness of batch-effect correction using hierarchical clustering of the 149 different conditions. Whereas prior to the correction of batch effects, hierarchical clustering grouped samples by different labs, after correction samples were not grouped together (not shown). We also ensured that our correction preserved legitimate biological variation by training a naïve Bayes which classifier correctly classified the disease type with 75% accuracy (not shown). The resulting data table was a matrix with rows representing each probe in the Affymetrix U133A platform, and columns representing each of the 149 individual microarray samples.
The mean and variance of each probe was calculated for each of the 16 biological conditions described in Additional file 1, leaving a data table with rows representing each of the Affymetrix U133A probesets and 16 columns corresponding to each biological condition. This data table was used for subsequent analysis, and is referred to below as the "cardiomyopathy data".
To determine the optimal number of clusters, we used the hypothesis that genes that are clustered together will share a biological function. Based around this idea, we introduce a measure of uniformity for the biological function of genes in each cluster. This measure can be expressed as: Qi = Sumj[(Nc ij/Nij) * (Nij-1)/Nm i]/i, where Nc ij is the number of probesets that share the biological function annotation assigned to cluster j using the minimal Fisher's exact p-value assigned using the Bioconductor package "topGO", Nij is the number of probesets in cluster j, Nm i is the mean cluster size (expressed as the number of probesets) when the data is broken into i clusters. The measure was designed such that the contribution of trivial clusters with size 1 is 0, and the contribution from smaller clusters is reduced. The value of this measure for each number of clusters between 2 and 60 is plotted in Additional file 2. At 35 clusters, the measure reaches a global maximum. This number of clusters is used in all subsequent analysis.
Transcription start sites for each gene selected for detailed promoter analysis were obtained using the database of transcription start sites version 6.0 . When multiple transcription start sites were reported, we chose the most frequent start site reported to be expressed in the heart or pericardium. Promoter sequences from -1 kb to +200 bp were obtained from the ENSEMBL genome browser, release 49, March 2008 , as this range allowed us to focus on the strength of IAMMS position filtering algorithm to identify motifs in the core and proximal promoter. A background group of 129 promoter sequences (-1 kb to +200 bp) whose genes show uniform expression across all human tissues was constructed. To build the list of genes, we calculated the ratio of maximum expression to the sum of expression in all tissues across the Gene Atlas. We took the probes with the lowest ratio (indicating uniform expression across tissues). Analysis using DAVID  revealed that this group is not enriched in any biological function. For each of these genes, 1 kb of promoter sequence and the entire 5' UTR were obtained using BioMart. The UTR was then truncated at +200 bp to yield the same -1 kb to +200 bp as the foreground set.
The upstream region of genes in the contractile, energy generation, and protein translation sets were scanned against the background promoter set using IAMMS, as described previously . In the initial stage, IAMMS detected 10,240, 19,665, and 16,719 motifs to test for enrichment in the contractile, energy, and translation promoter groups, respectively. The p-value of enrichment of each motif in each group was calculated using the hypergeometric distribution, and a p-value threshold of 1e-5 was used to select significantly enriched motifs. At this significance threshold, maximal expected false discovery rates of 0.10, 0.20, and 0.17 were estimated using the Bonferroni method for the contractile, energy generation, and translation promoter sets, respectively.
After motif detection, variations that distinguish contractile, energy, and translation genes were discovered by identifying IUPAC degenerate consensus sequences that optimally separate genes in the each module from those in the other modules. For nearly identical motifs detected independently in energy and translation promoter sequences, occurrences were optimized with respect to genes in the contractile module. To avoid over-fitting, only sequences with a Pearson correlation of 0.9 or above were scanned for enrichment (values of 0.7 were used for Initiator and TATA sequences). Similarity between sequences was measured by concatenating column vectors from the position-weight-matrix representation of a motif, and taking the Pearson correlation between vectors.
To complement the de novo search, our next objective was to identify known motifs that are enriched in the promoter. We used the Gene Set Enrichment Analysis (GSEA) software package  to identify phylogenetically conserved motifs that are correlated with the contractile, energy generation, or protein translation module. For each module, the 298 highly expressed probesets were ranked based on their correlation to the mean expression patterns of each module. Gene lists were compared against the curated motif gene list ("all", v. 2.5) using GSEA v.2.0.4, requiring motifs to occur in at least 8 of the genes. All motifs that have a false discovery rate corrected q-value less than 0.1 are reported.
To determine motif intervals, we built a histogram of the number of occurrences in a sliding window. The mean number of occurrences was calculated based on the size of the motif, assuming a GC content of 50%. We took the interval nearest the transcription start site for which the histogram was above the 99% confidence interval of expected occurrences given a Poisson distribution (as illustrated in Figure 5). Intervals were determined for motifs for which we expected to find a significant bias, either because it was detected by IAMMS as position specific, or (for TATA) because previous literature indicated a significant bias .
Our overall strategy is outlined in Figure 1. The first objective was to identify groups of co-regulated genes (called co-regulatory modules). To this end, we use gene expression microarray data from 149 human ventricle samples collected from public microarray sources, including Gene Expression Omnibus and Harvard's CardioGenomics website (see Methods). Samples were separated into 16 disease conditions (Additional file 1). Different conditions include cardiomyopathies of eight different etiologies, including familial, hypertrophic, idiopathic, ischemic, post-partum, viral, and dilated cardiomyopathy. The response of ischemia and myocardial infarction-induced myocardial failure to treatment by a left ventricular assist device (after LVAD) was also included. Control ("normal") ventricle samples from three different labs were included and treated separately to correct microarray samples for batch effects (see Methods for the normalization procedure).
We started with the identification of co-regulated gene pairs. Figure 2 shows the matrix of pairwise correlation for the 222 highest expressed genes in the Gene Atlas heart data. In our visualization scheme, each pixel represents the degree of co-regulation between a pair of genes. Color represents the square of the Pearson correlation (preserving the sign), with red representing a high Pearson correlation of +1 and green a negative correlation of -1 (see the color scale in Figure 2). Each row (or column) represents the correlation between a single gene and all others. Since the order of genes in the matrix is the same from left to right and bottom to top, the visualization is symmetric about the diagonal. The order of genes is chosen using hierarchical clustering (see Methods), locating co-regulated genes near one another in the matrix.
To identify the most biologically relevant gene expression modules, it was important to identify the best possible divisions between groups of genes provided by the clustering algorithm. We used the hypothesis that the most accurate groups inferred from the data should share a biological function. Therefore, we defined a measure to maximize the number of genes classified as the same biological function in each module (uniformity) and maximize module size (see methods). We found that our measure reaches a global maximum when the data is divided into 35 groups (Additional file 2), indicating that this number is the optimal balance between group size and uniformity of biological function. The divisions indicated by this analysis are denoted on the correlation matrix visualization (Figure 2) by the white, teal, and yellow squares along the diagonal. These divisions are also indicated by the red-dotted line along the dendrogram (Figure 2, left).
Three of the larger modules, enriched in genes with functional relevance to cardiac disease, are shown in detail in Figure 3. In the module shown in Figure 3-I, 6/9 genes (bold font) have a primary function related to muscle contraction (Gene Ontology term GO:0006939). Specific functions include two of three genes in the troponin complex (TNNI3 and TNNC1), tropomyosin (TPM1), cardiac alpha actin (ACTC1), and the ventricular myosin light chain isoform (MYL2). One additional muscle-specific gene (CSRP3) is also likely to play a supporting role in contraction. The gene with the lowest correlation to others in the module, ACTA1, is represented by the least-intense red row (r = 0.70, with TNNC1 and CSRP3). This lower correlation is indicative of a relatively large decrease in expression in this gene in one of the dilated cardiomyopathy expression sets (heat maps are shown in Additional file 3). Exceptions to the primary function (names in italics), such as myobglobin (MB), are nonetheless strongly co-regulated with other genes in the module (r = 0.83 with TNNC1).
In addition to the contractile module, other modules with a primary function related to energy generation and translation are shown in Figure 3. In addition to the large modules shown in Figure 3II and 3III, small neighboring modules enriched for genes with the same function are also shown (Figure 3ii and 3iii). Modules shown in Figure 3II and 3ii contain 16/20 unique genes whose primary function is relevant to the generation of precursor metabolites and energy (GO:0006091). One of the exceptions encodes a selenoprotein (SEPW1), an oxygen free-radical scavenger with a role in mitigating the oxidative stress associated with energy generation [28, 29]. Modules shown in Figure 3III and 3iii contain 20/20 unique genes that encode proteins in the 80S ribosome complex (GO:0006412). Previous studies have linked both energy generation and protein translation genes with particular types of cardiomyopathy [4, 7], highlighting the relevance of these modules to cardiac disease.
The next step was to identify motifs enriched in these promoter sequences that may be able to explain the observed patterns of co-regulation (Figure 1B). Our hypothesis was that co-regulated genes should share a common set of transcription factor proteins, which could be detected by the presence of a specific pattern of regulatory motifs in the upstream promoter sequence. The three groups highlighted in Figure 3 were chosen for analysis because of their relevance to expression patterns previously observed in cardiac disease [4, 7]. We used two strategies to analyze the core promoter region of co-regulated genes. First, the motif discovery algorithm Iterative Alignment/Modular Motif Selection (IAMMS)  was used to identify putative regulatory sites de novo. Second, we searched for an enrichment of known regulatory motifs using the gene set enrichment analysis (GSEA) tools . To minimize the possibility of false-positive errors, a stringent p-value cutoff threshold (p < 1e-5) was chosen based on a Bonferroni correction for IAMMS, and a corrected q-value cutoff (q < 0.1) for GSEA (see methods). Sequences are presented both as sequence logos and as IUPAC consensus sequences using the degeneracy codes R (A or G), Y (C or T), W (A or T), S (C or G), and N (A, T, C, or G).
The highest scoring motifs enriched in contractile, energy generation, and translation groups are shown in Figure 4. Among the highest scoring motifs enriched in contractile promoter sequences (Figure 4, top) was found to be a degenerate TATA consensus found in all 12 contractile promoter sequences, but only rarely found in energy generation or translation promoters. Several C-rich sequencers were also discovered by IAMMS to be enriched in contractile promoter sequences. One C-rich variation closely resembles a CACC-box known to be involved in the transcriptional regulation of several cardiac genes [30–33]. Our CACC-box includes all of these experimentally validated occurrences, strongly suggesting a functional role for this motif in the other genes as well. Several additional motifs similar to previously characterized cardiac-muscle specific transcription factors were discovered using the GSEA software, including motifs annotated as MEF2 , SRF , and E-box . GSEA also detected a strong enrichment in a motif annotated as ERR1 (TGACCT), with no previously known role in regulating contractile genes.
Motifs enriched in energy generation and translation promoter sequences are shown in the middle and bottom of Figure 4, respectively. Of the motifs detected as enriched in each set, three nearly identical motifs were detected independently in energy and translation promoters. Similar motifs are GC-rich, including a 4 bp core sequence GCGC, an SP-1-like motif (CCGCC), and a sequence nearly identical to the core of Elk-1 (TTCCGG). Individual motifs that separate energy from translation promoters include a degenerate pyrimidine-rich initiator element enriched near the transcription start site in translation genes (YYCTTTYY), and a GC-rich sequence found to be enriched only in energy generation genes (GCGGA). Energy generation promoters were also found by GSEA to contain the ERR1 motif (TGACCT), which was also discovered independently in the contractile promoter sequences.
For each motif we identified the range of positions in which it is enriched relative to the transcription start site using the approach illustrated in Figure 5. Plots depict the number of occurrences of a given motif found inside a window of fixed size surrounding each point. The top dashed line indicates the 99th percentile of expected occurrences based on a Poisson distribution. We define the range of enrichment to be the interval near the transcription start site in which the number of occurrences of a motif increases above this dashed line, shown by the colored boxes surrounding the plot. The range of any significant bias in distribution discovered using this approach is indicated in Figure 4, and used in subsequent analysis.
Our analysis shows that for each module there is a pattern of regulatory motifs present in nearly all promoter regions that distinguish it from the other modules. Figure 6 shows these patterns plotted to scale in the promoter region of genes in the contractile, energy generation, and protein translation modules. The CACC-box (shown by the red box) occurs in all contractile promoters and only 7/37 energy generation and translation promoter sequences. Similarly, the TATA-like motifs (YAYWWA and TANWWR) can be found in all contractile promoters (dark blue box), but only 4/37 energy generation and translation promoters. Translation promoters are characterized by the presence of a pyrimidine-rich initiator-like element (black box) that appears surrounding the transcription start site in nearly all translation genes (18/19), whereas this motif occurs only twice in energy generation promoters. Occurrences of the Elk-1 motif (green box) and the GC-rich motif (dark yellow box) are in enriched energy generation and translation promoters, but occur rarely in contractile sequences.
The pattern of motifs in the core promoter has a significant association with gene expression in a group of holdout genes not used in motif discovery analysis.
2.66 (1.46 – 5.28)
1.93 (1.24 – 3.13)
5.50 (3.37 – 8.89)
Fisher's p <
In the present study, we divide genes into groups, or modules, that undergo similar changes in their expression patterns over different forms of heart failure. The modules indicated by our analysis are exploited to identify putative cis-regulatory motifs which may bind transcription factors that contribute to the observed pathological expression patterns. Our approach was motivated by the hypothesis that genes with similar expression patterns are regulated by the same set of transcription factors, and therefore are likely to have similar cis-regulatory motifs in their upstream promoter region. The power of this assumption was recently demonstrated in the yeast model , where motifs were discovered that were able to predict gene expression patterns. To our knowledge, this article is the first in which this assumption has been applied to human tissue or to the study of a specific disease in which a holdout set of genes has been used to validate predictions. We report 17 putative cis-regulatory motifs, including ELK-1, a CACC-box, and a pyrimidine-rich initiatory variant that are predicted to play a role in cardiac gene expression. Our predictions may serve as a base for experimental studies seeking to understand how genes are reorganized during heart failure.
We discovered several motifs that are similar to known regulatory elements. Of these, many are variations of either core promoter (TATA, Initiator) or proximal enhancer (ELK-1, CACC, SP-1) elements. We show here that these motifs are more likely to be found together in promoter sequences that drive specific changes in gene expression during heart disease (Table 1). This implies that the motifs discovered here are strong candidates for modulating disease-related changes in gene expression. Our study highlights the importance of core and proximal promoter motifs in determining changes in transcription, and suggest that they play a larger role in the combinational regulatory code than has previously been ascribed.
Promoter regions of genes in the contractile module are characterized by the presence of a TATA variation (TANWWR), CACC box (GGGRWGG), MEF2 (YTAWWWWWTR), SRF (CCWWWWWWGG), and an E-box (CAGCTG). Single-promoter experimental studies have associated both all of these motifs with certain contractile promoters [30–37]. For instance, an AT-rich sequence that resembles a TATA-box has been previously identified in the promoter of TNNI3 , TNNC1 , MYH7 , and MB . Similarly, the CACC box has also been previously identified in certain contractile-related promoter sequences, including TNNI3 [30, 31], TNNC1 , and MB . All of these experimentally verified occurrences were discovered by our motif discovery strategy, along with previously unknown occurrences in nine additional promoter sequences, including ACTA1, MYL2, TNNC1, MYH7, TPM1, HSPB1, CRYAB, CSRP3, and PTGDS. Our results complement the single-promoter experimental studies, and suggest that many of these motifs play a larger role in mediating cardiac gene expression than previously anticipated.
At the level of the core promoter, energy generation and protein translation genes are controlled by surprisingly similar regulatory programs. Of four motifs discovered as significantly enriched in both energy and translation sequences, three share nearly identical GC-rich core sequences (Figure 4), including an Elk-1 binding site (CCGGAA), a motif with a GC-rich core sequence (GCGC), and a degenerate SP-1 like motif (CCGCC). These motifs were all found to be enriched in additional promoter sequences not used in the motif discovery procedure that share the expression patterns with the energy or translation module (not shown; p < 0.02) and were important for our naïve Bayes classification analysis. These results suggest that these factors may play a role in mediating gene expression changes during cardiac disease.
The Elk-1 prediction has particular relevance to the study of cardiac disease, because Elk-1 activity is modulated by both MAPK-p38 and calcineurin signaling pathways  that are implicated in animal models of heart failure (Reviewed in ). Calcineurin signaling is activated by calcium, particularly under conditions of calcium overload that are a nearly universal effect of end-stage heart failure. The literature on MAPK-p38 in heart failure suggests that p38 is activated, at least in the early stages of animal models of hypertrophy . Unfortunately, the two signaling pathways have opposite effects on Elk-1 activity [41, 38], making it difficult to predict how a particular heart disease will affect the transcription of Elk-1 targets. Experimental studies in human patients have directly demonstrated changes in Elk-1 signaling during heart failure , however, highlighting the importance of discovering heart-specific targets. Our analysis suggests that Elk-1 occurrences bind preferentially in the promoters of energy and translation genes and may be relevant to understanding changes in expression during cardiac disease.
We also discovered a novel motif with a GC-rich core sequence (GCGC) that occurs many times in the promoter of genes in the energy and translation modules. In addition to this motif, we also found a longer, non-position specific motif highly enriched in the same promoter sequences that closely resembles two nearby position-specific GC-rich half sites (Additional file 4). Moreover, we find that at least seven promoters contain either directly adjacent or partially overlapping occurrences of the short GCGC-rich sequence (dark yellow motif, shown in Figure 6), which is consistent with the idea that this finding represents half of a longer motif. Occurrences of the longer motif are highly conserved through evolution compared to surrounding sequence (Additional file 4), strongly suggesting that this motif plays a functional role.
A non-canonical initiator motif (YYCTTTYY) appears to be a nearly universal feature of translation promoter sequences. In addition to its sequence, the motif is highly specific to the positive strand and the area immediately surrounding the transcription start site (Figure 5, bottom right panel). Given the pyrimidine-rich sequence and the unambiguous bias in position, it is clear that this motif is a degenerate initiator consensus common to the promoter of translation genes. A similar motif has been identified in a previous study  and found to be enriched in translation promoter sequences. Here, we identified this motif in the promoter of a group of co-regulated genes, suggesting that it may play a direct role in transcriptional regulation. Given that the initiator consensus is the initial binding site for RNA polymerase II, it is not far-fetched to speculate that small differences in the sequence may play a role in determining either the affinity of polymerase II or the energy required to separate the DNA strands and initiate transcription; both of which could potentially modulate the quantity of transcript produced. Nonetheless, we cannot rule out the possibility that this motif it is just more likely to occur in combination with another motif (such as Elk-1) that more directly determines the level of transcription.
We describe an analysis of public microarray experiments that identifies groups of genes, or co-regulatory modules, that undergo similar changes in expression over different forms of hypertrophic and dilated cardiomyopathy. Three of these modules were associated with cardiac disease by previous microarray studies. The promoter sequence of genes in these modules were used to identify putative regulatory motifs predicted to cause the observed changes in gene expression. Our analysis discovered 17 regulatory motifs, including a CACC-box, Elk-1, a degenerate initiator sequence, and a novel GC-rich motif. Searching for the reported pattern of motifs in additional promoter sequences (not used for motif discovery) reveals that promoters with each pattern are significantly more likely to drive similar changes in gene expression in the cardiomyopathies analyzed in this study. Our analysis reveals motifs predicted to play a role in gene expression changes associated with several different types of human heart disease.
Iterative Alignment/Modular Motif Selection
Gene Set Enrichment Analysis
Left ventricular assist device
We thank Eva Oxford, Rebecca Smith, Christan Zemlin, and Joe Stein for providing discussion and comments on the final manuscript. In addition to reading the final manuscript Frank Middleton and Tom Duncan provided thoughtful comments throughout the duration of the study that helped shape its direction. We also think our reviewers for insightful comments and careful reading of our manuscript. Work supported by NIH grant RO1-HL071635-05.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.