Identification of gene co-regulatory modules and associated cis-elements involved in degenerative heart disease

Background Cardiomyopathies, degenerative diseases of cardiac muscle, are among the leading causes of death in the developed world. Microarray studies of cardiomyopathies have identified up to several hundred genes that significantly alter their expression patterns as the disease progresses. However, the regulatory mechanisms driving these changes, in particular the networks of transcription factors involved, remain poorly understood. Our goals are (A) to identify modules of co-regulated genes that undergo similar changes in expression in various types of cardiomyopathies, and (B) to reveal the specific pattern of transcription factor binding sites, cis-elements, in the proximal promoter region of genes comprising such modules. Methods We analyzed 149 microarray samples from human hypertrophic and dilated cardiomyopathies of various etiologies. Hierarchical clustering and Gene Ontology annotations were applied to identify modules enriched in genes with highly correlated expression and a similar physiological function. To discover motifs that may underly changes in expression, we used the promoter regions for genes in three of the most interesting modules as input to motif discovery algorithms. The resulting motifs were used to construct a probabilistic model predictive of changes in expression across different cardiomyopathies. Results We found that three modules with the highest degree of functional enrichment contain genes involved in myocardial contraction (n = 9), energy generation (n = 20), or protein translation (n = 20). Using motif discovery tools revealed that genes in the contractile module were found to contain a TATA-box followed by a CACC-box, and are depleted in other GC-rich motifs; whereas genes in the translation module contain a pyrimidine-rich initiator, Elk-1, SP-1, and a novel motif with a GCGC core. Using a naïve Bayes classifier revealed that patterns of motifs are statistically predictive of expression patterns, with odds ratios of 2.7 (contractile), 1.9 (energy generation), and 5.5 (protein translation). Conclusion We identified patterns comprised of putative cis-regulatory motifs enriched in the upstream promoter sequence of genes that undergo similar changes in expression secondary to cardiomyopathies of various etiologies. Our analysis is a first step towards understanding transcription factor networks that are active in regulating gene expression during degenerative heart disease.


Background
Heart disease is the leading cause of death in the developed world. Chronic heart disease is usually associated with tissue remodeling that induces maladaptive changes in gene expression and the cellular composition of cardiac tissue. Different forms of the disease are widely believed to progress according to distinct programs of gene expression that converge in end stage heart failure to similar phenotypes [1]. Microarrays have been used to characterize these differences, typically by focusing on changes in gene expression that exceed a statistical threshold [2,3]. Such methods of gene selection have proven useful for classifying different etiologies [4,5] and explaining certain aspects of the pathophysiology [6][7][8][9]. However, such a strategy is not able to identify the network of regulatory factors that facilitate gene expression in healthy tissue and during cardiac disease. In the present study, we apply a set of basic analytical tools to identify regulatory factors using microarray data and the upstream promoter sequence of each gene. We apply these tools to predict cis-regulatory motifs involved in remodeling cardiac tissue in different types of human cardiomyopathy.
It is well established in yeast [10] and cultured human cells [11] that genes involved in a common physiological function tend to be regulated as groups. In such a group, often called a co-regulatory module [12], genes undergo similar changes in expression that act to roughly preserve their expression ratio over different physiological conditions and intrinsic genetic cues. Our goal is to identify such modules in human cardiomyopathies, under the assumption that these modules can provide information about the regulatory factors that control expression. Our analysis uses publicly available microarray data for human ventricular tissue remodeling due to a variety of cardiac disease states. To identify likely regulatory modules in this data, we applied a hierarchical clustering algorithm to the Pearson correlation between gene expression levels across the different cardiomyopathies. Resulting clusters were visualized and characterized based on Gene Ontology annotations for function. With this analysis, we identified 35 modules, the largest of which are enriched in genes whose primary function is related to energy generation or protein translation.
Next, we addressed the question of what controls the coordinated changes in gene expression that are observed during heart disease. It is well accepted that changes in gene expression are encoded by the combinatorial activity of several different transcription factor proteins working in concert [13][14][15]. Changes over different physiological conditions presumably involve the activity of different combinations of transcription factors; genes whose expression is controlled by the same set of transcription factors may be expected to undergo similar changes in expression [15]. Transcription factors associated with the regulation of a gene can be identified by the presence of characteristic cis-regulatory motifs in the upstream promoter sequence to which they bind. Therefore, we sought to identify putative regulatory motifs involved in transcriptional regulation of genes composing the different co-regulatory modules. Our motif discovery strategy identified 17 motifs, and we validated their function with additional bioinformatic analysis using other genes.

Microarray Data Normalization and Batch Effect Correction
The first step in our basic experimental plan (outlined in Figure 1) was to identify genes that are co-expressed across the spectrum of different heart diseases. All ventricular microarray experiments based on the Affymetrix U133A or U133 2.0 platform for which raw CEL files were available, were collected from Gene Expression Omnibus [16] and Harvard's Cardiogenomics website (http://www.car diogenomics.org, Jan 2007). CEL files for the U133A and U133 2.0 platform were RMA normalized separately using Flowchart for identifying and testing putative motifs that affect gene transcription in cardiac disease Figure 1 Flowchart for identifying and testing putative motifs that affect gene transcription in cardiac disease.
Bioconductor's Affy package [17]. Data sets were combined by dropping probesets not present in the U133A array from the U133 2.0 data. To correct for batch effects we used a recently described method [18]. First, samples were grouped categorically by biological condition and lab of origin, as in Figure 1. The same category of "normal ventricle" was assigned to control samples from the GSE1145, GSE2240, and GSE3585 datasets, and the category "ischemic cardiomyopathy" was assigned to the corresponding samples from GSE1145 and GSE974 datasets. All other samples were assigned a separate category. Next, the Empirical Bayes batch effect correction method [18] was applied using an R script kindly provided on the authors' website. We dropped one outlier from the final table (GSM14948) because we noted abnormally low expression of many cardiac genes, potentially indicating changes in the cellular composition of the tissue.
We verified the effectiveness of batch-effect correction using hierarchical clustering of the 149 different conditions. Whereas prior to the correction of batch effects, hierarchical clustering grouped samples by different labs, after correction samples were not grouped together (not shown). We also ensured that our correction preserved legitimate biological variation by training a naïve Bayes which classifier correctly classified the disease type with 75% accuracy (not shown). The resulting data table was a matrix with rows representing each probe in the Affymetrix U133A platform, and columns representing each of the 149 individual microarray samples.
The mean and variance of each probe was calculated for each of the 16 biological conditions described in Additional file 1, leaving a data table with rows representing each of the Affymetrix U133A probesets and 16 columns corresponding to each biological condition. This data table was used for subsequent analysis, and is referred to below as the "cardiomyopathy data".

Creating Pairwise Correlations and Assembling the Correlation Matrix
All highly expressed genes in normal human heart tissue with a mean expression greater than 20,000 units in the Gene Atlas whole-heart data [19] were selected for analysis. This included 298 different Affymetrix probesets monitoring the expression of 222 unique genes. A matrix representing the degree of co-expression between each pair of these probesets was calculated by taking the Pearson correlation coefficient of the cardiomyopathy data table using R [20]. To visualize the matrix of Pearson coefficients for these highly expressed genes, Pearson correlations in this matrix were assigned a color-scale (Figure 2, right) that ranges between red indicating high correlation (R = +1), fading to black indicating no correlation (R = +0), fading to green indicating negative correlation (R = -1).
We sorted this matrix using a procedure based on hierarchical clustering. Clustering was preformed using the R package "Cluster". The data presented here are based on complete-linkage, Euclidean distance hierarchical clustering. Subsequently, gene order was refined using a procedure that maximizes the correlation between neighboring clusters while maintaining the same network of connections in the dendrogram. Briefly, the algorithm proceeds from top to bottom along the dendrogram. At each branch, the mean correlation between neighboring clusters left and right of the new branch on the dendrogram is calculated for both potential orders of the clusters below the branch. The order is set such that it maximizes the mean correlation between neighboring clusters. Since the procedure is applied from top to bottom, it does not affect the composition of clusters, only the order of clusters in the visualization. The resulting visualization is depicted in Figure 2, and details for three clusters are shown in Figure  3.

Choosing the Optimal Number of Modules that Describe the Data
To determine the optimal number of clusters, we used the hypothesis that genes that are clustered together will share a biological function. Based around this idea, we introduce a measure of uniformity for the biological function of genes in each cluster. This measure can be expressed as: Q i = Sum j [(N c ij /N ij ) * (N ij -1)/N m i ]/i, where N c ij is the number of probesets that share the biological function annotation assigned to cluster j using the minimal Fisher's exact p-value assigned using the Bioconductor package "topGO", N ij is the number of probesets in cluster j, N m i is the mean cluster size (expressed as the number of probesets) when the data is broken into i clusters. The measure was designed such that the contribution of trivial clusters with size 1 is 0, and the contribution from smaller clusters is reduced. The value of this measure for each number of clusters between 2 and 60 is plotted in Additional file 2. At 35 clusters, the measure reaches a global maximum. This number of clusters is used in all subsequent analysis.

Assembling Promoters for Motif Discovery Analysis
Transcription start sites for each gene selected for detailed promoter analysis were obtained using the database of transcription start sites version 6.0 [21]. When multiple transcription start sites were reported, we chose the most frequent start site reported to be expressed in the heart or pericardium. Promoter sequences from -1 kb to +200 bp were obtained from the ENSEMBL genome browser, release 49, March 2008 [22], as this range allowed us to focus on the strength of IAMMS position filtering algo-rithm to identify motifs in the core and proximal promoter. A background group of 129 promoter sequences (-1 kb to +200 bp) whose genes show uniform expression across all human tissues was constructed. To build the list of genes, we calculated the ratio of maximum expression to the sum of expression in all tissues across the Gene Atlas. We took the probes with the lowest ratio (indicating uniform expression across tissues). Analysis using DAVID [23] revealed that this group is not enriched in any biological function. For each of these genes, 1 kb of promoter sequence and the entire 5' UTR were obtained using BioMart. The UTR was then truncated at +200 bp to yield the same -1 kb to +200 bp as the foreground set.

IAMMS De Novo Motif Discovery Search
The upstream region of genes in the contractile, energy generation, and protein translation sets were scanned against the background promoter set using IAMMS, as described previously [24]. In the initial stage, IAMMS detected 10,240, 19,665, and 16,719 motifs to test for enrichment in the contractile, energy, and translation promoter groups, respectively. The p-value of enrichment of each motif in each group was calculated using the hypergeometric distribution, and a p-value threshold of 1e-5 was used to select significantly enriched motifs. At this significance threshold, maximal expected false discovery rates of 0.10, 0.20, and 0.17 were estimated using the Identification of gene expression modules in cardiac disease Figure 2 Identification of gene expression modules in cardiac disease. Highly expressed genes in the heart were sorted using hierarchical clustering (dendrogram shown on left) based on the similarity in their expression pattern over different types of cardiac disease. These relationships are visualized on a correlation matrix (center), in which each pixel represents the correlation between a pair of genes, and each row represents the relationship between one gene and all others. The color scale (right) represents the degree of correlation, ranging from strongly positive (red) to uncorrelated (black) to strongly negative (green). Boxes along the diagonal of the matrix correspond to modules formed by chopping the dendrogram at the level indicated by the dotted line (determined by maximizing the number of genes in each module associated with the same gene ontology term; see methods). Boxes shown in bold contain genes with particular relevance to cardiac disease, and are shown in detail in figure  3. The bold portion of the dendrogram is also shown in Figure 3.
Bonferroni method for the contractile, energy generation, and translation promoter sets, respectively.
After motif detection, variations that distinguish contractile, energy, and translation genes were discovered by identifying IUPAC degenerate consensus sequences that optimally separate genes in the each module from those in the other modules. For nearly identical motifs detected independently in energy and translation promoter sequences, occurrences were optimized with respect to genes in the contractile module. To avoid over-fitting, only sequences with a Pearson correlation of 0.9 or above were scanned for enrichment (values of 0.7 were used for Initiator and TATA sequences). Similarity between sequences was measured by concatenating column vectors from the position-weight-matrix representation of a motif, and taking the Pearson correlation between vectors.

Detection of Known Motifs
To complement the de novo search, our next objective was to identify known motifs that are enriched in the promoter. We used the Gene Set Enrichment Analysis (GSEA) software package [25] to identify phylogenetically conserved motifs that are correlated with the contractile,  Figure 2). Genes in each module are given by MGI ID along the vertical axis. The same gene order is presented along the horizontal axis, as shown for (I). Module (I) is strongly enriched in genes whose products function primarily in the cardiac muscle contractile apparatus (6/9). Note that some genes are represented by two probes (e.g. TPM1) and counted only once. The dendrogram for this module is given on the left (bold section in Figure 2). A heatmap representing the relative expression of each gene over the different cardiac diseases is shown on the right. In the heatmap, color indicates the Z-score of expression relative to the mean over all diseases. Modules (II) and (ii) are enriched in genes involved in generation of precursor metabolites and energy (16/20). Modules (III) and (iii) are entirely comprised of genes involved in protein translation (20/ 20). Genes whose primary function is not known to match these groups are indicated using italic font. energy generation, or protein translation module. For each module, the 298 highly expressed probesets were ranked based on their correlation to the mean expression patterns of each module. Gene lists were compared against the curated motif gene list ("all", v. 2.5) using GSEA v.2.0.4, requiring motifs to occur in at least 8 of the genes. All motifs that have a false discovery rate corrected q-value less than 0.1 are reported.

Determining Motif Interval
To determine motif intervals, we built a histogram of the number of occurrences in a sliding window. The mean number of occurrences was calculated based on the size of the motif, assuming a GC content of 50%. We took the interval nearest the transcription start site for which the histogram was above the 99% confidence interval of expected occurrences given a Poisson distribution (as illustrated in Figure 5). Intervals were determined for motifs for which we expected to find a significant bias, either because it was detected by IAMMS as position specific, or (for TATA) because previous literature indicated a significant bias [26].

Naïve Bayes Classifier
The Pearson correlation was calculated between the mean expression patterns for genes in the contractile, energy, and translation module, and all genes designated as "present" in the Gene Atlas heart data [19]. All promoter sequences in the database of transcription start sites [21] corresponding to genes with Pearson correlation above a "high" threshold (0.8, test) or below a "low" threshold (0, background) were obtained for analysis. Other values of "high" and "low" correlation thresholds gave similar results (Additional file 5). The number of occurrences in the respective intervals of each of the motifs in Figure 4 were determined using custom Java software based on the BioJava [27] libraries for position-enriched motifs. GSEA annotations were taken from the file "c3.all.v2.5.symbols.gmt" obtained from the GSEA website, and were applied to each promoter sequence using custom java software. These two sources of annotation were combined into a single data table with columns representing the different IUPAC consensus sequences or GSEA annotations (those shown in Figure 4), and rows representing test and background promoter sequences. Values in this table represented the number of occurrences of each IUPAC consensus sequence in the specified interval in each promoter sequence, or a categorical variable indicating whether the gene is a member of each GSEA gene set.
This data table was used to train and test a naïve Bayes classifier. We use an implementation of the naïve Bayes classifier in the "e1071" R package with default parameters for both training and prediction. The classifier was trained using values based on the promoter sequences of contractile, energy, and translation module depicted in Figure 6 (high co-expression), and 150 of promoters with a Pearson correlation lower than 0 (low co-expression). The classifier was then tested on the remaining promoter sequences not used in training. Default parameters were used for both training and prediction. Statistical analysis of the procedure was completed using Fisher's exact test in R.

Identifying Gene Expression Modules
Our overall strategy is outlined in Figure 1. The first objective was to identify groups of co-regulated genes (called co-regulatory modules). To this end, we use gene expression microarray data from 149 human ventricle samples collected from public microarray sources, including Gene Expression Omnibus and Harvard's CardioGenomics website (see Methods). Samples were separated into 16 disease conditions (Additional file 1). Different conditions include cardiomyopathies of eight different etiologies, including familial, hypertrophic, idiopathic, ischemic, post-partum, viral, and dilated cardiomyopathy. The response of ischemia and myocardial infarctioninduced myocardial failure to treatment by a left ventricular assist device (after LVAD) was also included. Control ("normal") ventricle samples from three different labs were included and treated separately to correct microarray samples for batch effects (see Methods for the normalization procedure).
We started with the identification of co-regulated gene pairs. Figure 2 shows the matrix of pairwise correlation for the 222 highest expressed genes in the Gene Atlas heart data. In our visualization scheme, each pixel represents the degree of co-regulation between a pair of genes. Color represents the square of the Pearson correlation (preserving the sign), with red representing a high Pearson correlation of +1 and green a negative correlation of -1 (see the color scale in Figure 2). Each row (or column) represents the correlation between a single gene and all others. Since the order of genes in the matrix is the same from left to right and bottom to top, the visualization is symmetric about the diagonal. The order of genes is chosen using hierarchical clustering (see Methods), locating co-regulated genes near one another in the matrix.
To identify the most biologically relevant gene expression modules, it was important to identify the best possible divisions between groups of genes provided by the clustering algorithm. We used the hypothesis that the most accurate groups inferred from the data should share a biological function. Therefore, we defined a measure to maximize the number of genes classified as the same biological function in each module (uniformity) and maximize module size (see methods). We found that our Top scoring motifs detected in the promoter sequence of genes in the contractile, energy generation, and protein translation modules (Figure 3) Identifying the range of motif enrichment Figure 5 Identifying the range of motif enrichment. Each plot tracks the number of occurrences of a given motif in enriched promoter sequences within a sliding window. The lower dotted line indicates the number of occurrences expected by chance. The range in position was determined by taking the approximate point at which the largest peak crosses the 99th percentile (upper dashed line) calculated using a Poisson distribution. For each motif, this range is marked by the colored lines above and below each plot.
Specific patterns of top-scoring motifs in the promoter sequence of genes in either the contractile (top), energy generation (center), or protein translation (bottom) modules Figure 6 Specific patterns of top-scoring motifs in the promoter sequence of genes in either the contractile (top), energy generation (center), or protein translation (bottom) modules. Gray bars represent 435 bp of promoter and 5' UTR sequence (-400  +35 bp). Colored boxes indicate occurrences of each motif using the color patterns shown in the legend (bottom). Occurrences are drawn to scale, and overlapping sites are drawn at half-height to ease viewing. The scale bar is given on the top, with numbers relative to the transcription start site (gray dashed line, +1 bp). Colored lines above the scale indicate the position range used for each motif. Note that the pyrimidine-rich sequence (black) is directly adjacent to the transcription start site in most genes of the translation module, but not in contractile or energy generation genes. Also note the complete absence of the CCGGAA (green) motif in the contractile promoter sequences. Instead, contractile promoters have a high proportion of a different G-rich motif (GGGRWGGG, red), occurring between -400 bp and the transcription start site. measure reaches a global maximum when the data is divided into 35 groups (Additional file 2), indicating that this number is the optimal balance between group size and uniformity of biological function. The divisions indicated by this analysis are denoted on the correlation matrix visualization (Figure 2) by the white, teal, and yellow squares along the diagonal. These divisions are also indicated by the red-dotted line along the dendrogram (Figure 2, left).
Three of the larger modules, enriched in genes with functional relevance to cardiac disease, are shown in detail in Figure 3. In the module shown in Figure 3-I, 6/9 genes (bold font) have a primary function related to muscle contraction (Gene Ontology term GO:0006939). Specific functions include two of three genes in the troponin complex (TNNI3 and TNNC1), tropomyosin (TPM1), cardiac alpha actin (ACTC1), and the ventricular myosin light chain isoform (MYL2). One additional muscle-specific gene (CSRP3) is also likely to play a supporting role in contraction. The gene with the lowest correlation to others in the module, ACTA1, is represented by the least-intense red row (r = 0.70, with TNNC1 and CSRP3). This lower correlation is indicative of a relatively large decrease in expression in this gene in one of the dilated cardiomyopathy expression sets (heat maps are shown in Additional file 3). Exceptions to the primary function (names in italics), such as myobglobin (MB), are nonetheless strongly co-regulated with other genes in the module (r = 0.83 with TNNC1).
In addition to the contractile module, other modules with a primary function related to energy generation and translation are shown in Figure 3. In addition to the large modules shown in Figure 3II and 3III, small neighboring modules enriched for genes with the same function are also shown (Figure 3ii and 3iii). Modules shown in Figure  3II and 3ii contain 16/20 unique genes whose primary function is relevant to the generation of precursor metabolites and energy (GO:0006091). One of the exceptions encodes a selenoprotein (SEPW1), an oxygen free-radical scavenger with a role in mitigating the oxidative stress associated with energy generation [28,29]. Modules shown in Figure 3III and 3iii contain 20/20 unique genes that encode proteins in the 80S ribosome complex (GO:0006412). Previous studies have linked both energy generation and protein translation genes with particular types of cardiomyopathy [4,7], highlighting the relevance of these modules to cardiac disease.

Identification of Putative cis-Elements in the Promoter of Co-regulated Genes
The next step was to identify motifs enriched in these promoter sequences that may be able to explain the observed patterns of co-regulation ( Figure 1B). Our hypothesis was that co-regulated genes should share a common set of transcription factor proteins, which could be detected by the presence of a specific pattern of regulatory motifs in the upstream promoter sequence. The three groups highlighted in Figure 3 were chosen for analysis because of their relevance to expression patterns previously observed in cardiac disease [4,7]. We used two strategies to analyze the core promoter region of co-regulated genes. First, the motif discovery algorithm Iterative Alignment/Modular Motif Selection (IAMMS) [24] was used to identify putative regulatory sites de novo. Second, we searched for an enrichment of known regulatory motifs using the gene set enrichment analysis (GSEA) tools [25]. To minimize the possibility of false-positive errors, a stringent p-value cutoff threshold (p < 1e-5) was chosen based on a Bonferroni correction for IAMMS, and a corrected q-value cutoff (q < 0.1) for GSEA (see methods). Sequences are presented both as sequence logos and as IUPAC consensus sequences using the degeneracy codes R (A or G), Y (C or T), W (A or T), S (C or G), and N (A, T, C, or G).
The highest scoring motifs enriched in contractile, energy generation, and translation groups are shown in Figure 4. Among the highest scoring motifs enriched in contractile promoter sequences (Figure 4, top) was found to be a degenerate TATA consensus found in all 12 contractile promoter sequences, but only rarely found in energy generation or translation promoters. Several C-rich sequencers were also discovered by IAMMS to be enriched in contractile promoter sequences. One C-rich variation closely resembles a CACC-box known to be involved in the transcriptional regulation of several cardiac genes [30][31][32][33]. Our CACC-box includes all of these experimentally validated occurrences, strongly suggesting a functional role for this motif in the other genes as well. Several additional motifs similar to previously characterized cardiacmuscle specific transcription factors were discovered using the GSEA software, including motifs annotated as MEF2 [33], SRF [34], and E-box [35]. GSEA also detected a strong enrichment in a motif annotated as ERR1 (TGACCT), with no previously known role in regulating contractile genes.
Motifs enriched in energy generation and translation promoter sequences are shown in the middle and bottom of Figure 4, respectively. Of the motifs detected as enriched in each set, three nearly identical motifs were detected independently in energy and translation promoters. Similar motifs are GC-rich, including a 4 bp core sequence GCGC, an SP-1-like motif (CCGCC), and a sequence nearly identical to the core of Elk-1 (TTCCGG). Individual motifs that separate energy from translation promoters include a degenerate pyrimidine-rich initiator element enriched near the transcription start site in translation genes (YYCTTTYY), and a GC-rich sequence found to be enriched only in energy generation genes (GCGGA). Energy generation promoters were also found by GSEA to contain the ERR1 motif (TGACCT), which was also discovered independently in the contractile promoter sequences.
For each motif we identified the range of positions in which it is enriched relative to the transcription start site using the approach illustrated in Figure 5. Plots depict the number of occurrences of a given motif found inside a window of fixed size surrounding each point. The top dashed line indicates the 99 th percentile of expected occurrences based on a Poisson distribution. We define the range of enrichment to be the interval near the transcription start site in which the number of occurrences of a motif increases above this dashed line, shown by the colored boxes surrounding the plot. The range of any significant bias in distribution discovered using this approach is indicated in Figure 4, and used in subsequent analysis.
Our analysis shows that for each module there is a pattern of regulatory motifs present in nearly all promoter regions that distinguish it from the other modules. Figure 6 shows these patterns plotted to scale in the promoter region of genes in the contractile, energy generation, and protein translation modules. The CACC-box (shown by the red box) occurs in all contractile promoters and only 7/37 energy generation and translation promoter sequences. Similarly, the TATA-like motifs (YAYWWA and TAN-WWR) can be found in all contractile promoters (dark blue box), but only 4/37 energy generation and translation promoters. Translation promoters are characterized by the presence of a pyrimidine-rich initiator-like element (black box) that appears surrounding the transcription start site in nearly all translation genes (18/19), whereas this motif occurs only twice in energy generation promoters. Occurrences of the Elk-1 motif (green box) and the GC-rich motif (dark yellow box) are in enriched energy generation and translation promoters, but occur rarely in contractile sequences.
We test the hypothesis that the unique pattern of motifs discovered in highly expressed cardiac genes plays a role in the co-regulation of additional genes with similar expression patterns compared to the contractile, energy generation, and protein translation module. Our bioinformatic approach to this question is outlined in Figure  1(C-D). First, we identified additional promoter sequences from across the cardiac transcriptome that were not used in motif discovery, but contain the same pattern of motifs as those in Figure 6 (Figure 1C). Next, we looked at the expression patterns of these additional genes and assessed whether these promoters are more likely to be coregulated with those used to build the model ( Figure 1D). As shown in Table 1, genes with the reported pattern of motifs are significantly more likely to be co-expressed with each module compared with those in the larger population (p = 5.65e-4, 2.03e-3, and 1.16e-11, respectively). The most accurate classifier was based on genes in the Protein Translation module where the odds of having the motif pattern shown in Figure 6 for promoter sequences of highly co-regulated genes (r > 0.8) were 35:51. Conversely, for promoters with low (r < 0) co-expression, the odds of having the motif pattern were 178:1429. The overall strength of association between the presence of the motif pattern and high levels of co-regulation were estimated using the odds ratio OR = (35/51)/(178/1429) = 5.5 (95 th Confidence Interval = 3. 37 -8.89), shown at the bottom of Table 1. The classifier based on the contractile and energy promoters also recognized co-regulated genes significantly better than chance, with odds ratios of 2.7 (95 th confidence interval = 1.46 -5.28) and 1.9 (95 th confidence interval = 1.24 -3.13), respectively. Moreover, the statistical significance of each classifier is robust across a wide range of definitions for high and low co-regulation (Additional file 5). This provides strong evidence that the pattern of motifs described here plays a role in the changes in gene expression observed in heart disease. Contingency tables show the relationship between the pattern of motifs in the core promoter region and the expression patterns of genes in the test group (n = 3473), not used during motif the discovery procedure. The correlation thresholds were set to r > 0.8 (high) and r < 0 (low). The odds ratio measures the strength of association between the pattern of motifs in each promoter and the pattern of expression. The 95% confidence interval of the odds ratio is shown in parentheses. The p-value calculated using Fisher's exact test is given in the bottom row.

Discussion
In the present study, we divide genes into groups, or modules, that undergo similar changes in their expression patterns over different forms of heart failure. The modules indicated by our analysis are exploited to identify putative cis-regulatory motifs which may bind transcription factors that contribute to the observed pathological expression patterns. Our approach was motivated by the hypothesis that genes with similar expression patterns are regulated by the same set of transcription factors, and therefore are likely to have similar cis-regulatory motifs in their upstream promoter region. The power of this assumption was recently demonstrated in the yeast model [15], where motifs were discovered that were able to predict gene expression patterns. To our knowledge, this article is the first in which this assumption has been applied to human tissue or to the study of a specific disease in which a holdout set of genes has been used to validate predictions. We report 17 putative cis-regulatory motifs, including ELK-1, a CACC-box, and a pyrimidine-rich initiatory variant that are predicted to play a role in cardiac gene expression. Our predictions may serve as a base for experimental studies seeking to understand how genes are reorganized during heart failure.
We discovered several motifs that are similar to known regulatory elements. Of these, many are variations of either core promoter (TATA, Initiator) or proximal enhancer (ELK-1, CACC, SP-1) elements. We show here that these motifs are more likely to be found together in promoter sequences that drive specific changes in gene expression during heart disease ( Table 1). This implies that the motifs discovered here are strong candidates for modulating disease-related changes in gene expression. Our study highlights the importance of core and proximal promoter motifs in determining changes in transcription, and suggest that they play a larger role in the combinational regulatory code than has previously been ascribed.
At the level of the core promoter, energy generation and protein translation genes are controlled by surprisingly similar regulatory programs. Of four motifs discovered as significantly enriched in both energy and translation sequences, three share nearly identical GC-rich core sequences (Figure 4), including an Elk-1 binding site (CCGGAA), a motif with a GC-rich core sequence (GCGC), and a degenerate SP-1 like motif (CCGCC). These motifs were all found to be enriched in additional promoter sequences not used in the motif discovery procedure that share the expression patterns with the energy or translation module (not shown; p < 0.02) and were important for our naïve Bayes classification analysis. These results suggest that these factors may play a role in mediating gene expression changes during cardiac disease.
The Elk-1 prediction has particular relevance to the study of cardiac disease, because Elk-1 activity is modulated by both MAPK-p38 and calcineurin signaling pathways [38] that are implicated in animal models of heart failure (Reviewed in [39]). Calcineurin signaling is activated by calcium, particularly under conditions of calcium overload that are a nearly universal effect of end-stage heart failure. The literature on MAPK-p38 in heart failure suggests that p38 is activated, at least in the early stages of animal models of hypertrophy [40]. Unfortunately, the two signaling pathways have opposite effects on Elk-1 activity [41,38], making it difficult to predict how a particular heart disease will affect the transcription of Elk-1 targets. Experimental studies in human patients have directly demonstrated changes in Elk-1 signaling during heart failure [42], however, highlighting the importance of discovering heart-specific targets. Our analysis suggests that Elk-1 occurrences bind preferentially in the promoters of energy and translation genes and may be relevant to understanding changes in expression during cardiac disease.
We also discovered a novel motif with a GC-rich core sequence (GCGC) that occurs many times in the promoter of genes in the energy and translation modules. In addition to this motif, we also found a longer, non-position specific motif highly enriched in the same promoter sequences that closely resembles two nearby position-specific GC-rich half sites (Additional file 4). Moreover, we find that at least seven promoters contain either directly adjacent or partially overlapping occurrences of the short GCGC-rich sequence (dark yellow motif, shown in Figure  6), which is consistent with the idea that this finding represents half of a longer motif. Occurrences of the longer motif are highly conserved through evolution compared to surrounding sequence (Additional file 4), strongly suggesting that this motif plays a functional role.
A non-canonical initiator motif (YYCTTTYY) appears to be a nearly universal feature of translation promoter sequences. In addition to its sequence, the motif is highly specific to the positive strand and the area immediately surrounding the transcription start site ( Figure 5, bottom right panel). Given the pyrimidine-rich sequence and the unambiguous bias in position, it is clear that this motif is a degenerate initiator consensus common to the promoter of translation genes. A similar motif has been identified in a previous study [43] and found to be enriched in translation promoter sequences. Here, we identified this motif in the promoter of a group of co-regulated genes, suggesting that it may play a direct role in transcriptional regulation. Given that the initiator consensus is the initial binding site for RNA polymerase II, it is not far-fetched to speculate that small differences in the sequence may play a role in determining either the affinity of polymerase II or the energy required to separate the DNA strands and initiate transcription; both of which could potentially modulate the quantity of transcript produced. Nonetheless, we cannot rule out the possibility that this motif it is just more likely to occur in combination with another motif (such as Elk-1) that more directly determines the level of transcription.

Conclusion
We describe an analysis of public microarray experiments that identifies groups of genes, or co-regulatory modules, that undergo similar changes in expression over different forms of hypertrophic and dilated cardiomyopathy. Three of these modules were associated with cardiac disease by previous microarray studies. The promoter sequence of genes in these modules were used to identify putative regulatory motifs predicted to cause the observed changes in gene expression. Our analysis discovered 17 regulatory motifs, including a CACC-box, Elk-1, a degenerate initiator sequence, and a novel GC-rich motif. Searching for the reported pattern of motifs in additional promoter sequences (not used for motif discovery) reveals that promoters with each pattern are significantly more likely to drive similar changes in gene expression in the cardiomyopathies analyzed in this study. Our analysis reveals motifs predicted to play a role in gene expression changes associated with several different types of human heart disease.