Genome-wide transcriptome study using deep RNA sequencing for myocardial infarction and coronary artery calcification

Background Coronary artery calcification (CAC) is a noninvasive measure of coronary atherosclerosis, the proximal pathophysiology underlying most cases of myocardial infarction (MI). We sought to identify expression signatures of early MI and subclinical atherosclerosis in the Framingham Heart Study (FHS). In this study, we conducted paired-end RNA sequencing on whole blood collected from 198 FHS participants (55 with a history of early MI, 72 with high CAC without prior MI, and 71 controls free of elevated CAC levels or history of MI). We applied DESeq2 to identify coding-genes and long intergenic noncoding RNAs (lincRNAs) differentially expressed in MI and high CAC, respectively, compared with the control. Results On average, 150 million paired-end reads were obtained for each sample. At the false discovery rate (FDR) < 0.1, we found 68 coding genes and 2 lincRNAs that were differentially expressed in early MI versus controls. Among them, 60 coding genes were detectable and thus tested in an independent RNA-Seq data of 807 individuals from the Rotterdam Study, and 8 genes were supported by p value and direction of the effect. Immune response, lipid metabolic process, and interferon regulatory factor were enriched in these 68 genes. By contrast, only 3 coding genes and 1 lincRNA were differentially expressed in high CAC versus controls. APOD, encoding a component of high-density lipoprotein, was significantly downregulated in both early MI (FDR = 0.007) and high CAC (FDR = 0.01) compared with controls. Conclusions We identified transcriptomic signatures of early MI that include differentially expressed protein-coding genes and lincRNAs, suggesting important roles for protein-coding genes and lincRNAs in the pathogenesis of MI.

• Immune response, lipid metabolic process, and interferon regulatory factor were enriched in these 68 protein-coding genes. • Alternatively, only 3 coding genes and 1 lincRNA were differentially expressed in high coronary artery calcification (CAC) cases versus controls. • APOD, encoding a component of high density lipoprotein, was significantly downregulated in both early MI and high CAC compared with the control group after adjusting for sex and 9 clinical vascularrelated covariates, suggesting a potential novel target for the treatment and prevention of atherosclerotic disease.

Background
Myocardial infarction (MI) is a leading cause of death in men and women worldwide [1,2] Genetic inheritance is a major component to MI risk, particularly for early onset MI [1]. Coronary artery calcification (CAC) is directly correlated with quantity of coronary atherosclerotic plaque [3]. CAC detected by computed tomography is a noninvasive measure of coronary atherosclerosis and a CAC score is a strong independent predictor of future MI [4] including early MI [5,6]. Genome-wide association studies (GWAS) have identified common and rare genetic variants associated with both CAC and early MI, including variants in the 9p21, SORT1 and PHACTR1 loci [7][8][9][10][11]. However, the molecular mechanisms underlying early MI and CAC remain unclear. In particular, data are sparse regarding gene expression signatures for early MI and for subclinical coronary atherosclerosis, detected as high CAC. RNA sequencing for atherothrombotic cardiovascular disease offers a complementary genome-wide molecular approach to investigate disease-related mechanisms by measuring expression abundance of protein-coding genes (mRNA) and long intergenic non-coding genes (lincRNAs) in specific tissues. Altered expression levels in disease can reflect the effects of genetic variation, environmental effects, interaction between genetic variation and environmental effects, and the effects of the disease process itself or drugs used for its treatment. We conducted deep paired-end RNA sequencing (RNA-Seq) on whole blood samples collected from Framingham Heart Study (FHS) participants. Blood is an easily accessible tissue relevant for expression profiling of cardiovascular disease and its risk factors, has the advantage of providing information on patients' real-life state in contrast with cell-lines, and can be extended to very large sample sizes for biomarker screening.
Our current study aimed to generate and characterize coding and noncoding gene expression signatures of early-onset MI and CAC in whole blood collected from a single large cohort, the Framingham Heart Study (FHS), and to further examine the relationships between high CAC and early-onset MI based on expression profiling of whole blood. We studied 198 European ancestry individuals (55 with history of early MI, 72 with high CAC without MI, and 71 control participants free of elevated CAC levels or history of MI) with whole-blood RNA-Seq. To the best of our knowledge, this is a first whole-transcriptome study using RNA-Seq in participants with a history of prior MI or coronary atherosclerosis detected by the presence of CAC in a single study sample. We first performed a genome-wide transcriptome screen to identify blood-specific transcripts including mRNA and lincRNAs. Second, we conducted association analyses between MI/CAC and the expression levels of individual mRNAs and of lin-cRNAs. Last, we categorized the functional pathways of differentially expressed genes (DEGs) between MI/ CAC and controls to identify biological functions of the differentially expressed genes. Using deep coverage RNA-Seq data, we identified 12,062 protein-coding genes and 3707 lincRNAs expressed at relatively high levels in blood as well as significant MI-specific expression signatures, with eight genes (15%) supported in an independent cohort with RNA-Seq data. We sought to provide insights into mechanisms through which transcriptome-level variation may influence the development of subclinical coronary atherosclerosis and, ultimately, clinical MI.

Study population and sample collection
The FHS started in 1948 with 5209 randomly ascertained participants from Framingham, MA, who underwent biennial examinations to investigate cardiovascular disease and its risk factors [12]. In 1971, the Offspring cohort [13,14] (comprised of 5124 children of the original cohort and the children's spouses) and in 2002, the Third Generation (consisting of 4095 children of the Offspring cohort), were recruited [15]. Participants of the Framingham Heart Study (FHS) Offspring cohort who attended examination 8 (n = 202, 57 with early MI, 74 with high CAC without MI, and 71 control participants free of elevated CAC levels and MI matched with age and sex) were included, constituting a total of 198 individuals. The clinical characteristics of the FHS Offspring participants included in this study are presented in Table 1, and they are European ancestry. The study protocol was reviewed by the Boston University Medical Center Institutional Review Board, and all participants gave written informed consent.

RNA library preparation, sequencing, and data processing
Fasting peripheral whole blood samples (2.5 ml) were collected in PAXgene ™ tubes (PreAnalytiX, Hombrechtikon, Switzerland), incubated at room temperature for 4 h for RNA stabilization, and then stored at − 80 °C. Total RNA was isolated from frozen PAXgene blood tubes by Asuragen, Inc., according to the company's standard operating procedures for automated isolation of RNA from 96 samples in a single batch on a King-Fisher ® 96 robot. RNA was extracted; most globin RNA was removed (GLOBINclear Kits, Life Technologies, and Grand Island, NY, USA). 10 ng of total RNA was used as input for RNA-Seq library construction using Ovation RNAseq v2 (NuGEN Technologies, Inc., San Carlos, CA), following the guidelines for the Ovation SP Ultralow Multiplex System (NuGEN Technologies, Inc., San Carlos, CA). After the final amplification step, libraries were size selected between 250 to 450 base pairs. Library quality was verified for each sample using MiSeq (Illumina, Inc., San Diego, CA), sequencing with 75-bp paired-end reads. Sequencing data production was carried out with Illumina HiSeq 2000 (Illumina, Inc.; 75-bp pair-ended reads, 1 library/sample per lane) for 202 individuals, yielding 150 million paired-end reads (average) per individual. Reads were mapped to the NCBI v37 Homo sapiens reference genome using Tophat. Complete RNA-seq data are available in dbGaP (see data access note below). Details of RNA isolation, preparation of cDNA from RNA, and RNA-sequencing and data processing are available in the Additional file 1: Supplemental Material.
Data quality of raw sequencing data (.fastq) for each sample is assessed using FASTQC, and 75 bp pairedreads are aligned to the human reference genome sequence (hg19) using Bowtie2 within Tophat2 [16]. Samples with a low overall mapping rate (less than 50%) are defined as outliers and excluded from downstream analysis. We also sequenced 9 samples twice. For samples sequenced twice, the sample with higher number of sequenced reads and unique mapping rate was retained in the analysis. After assessment of quality based on mapping rate, sex mismatch checking using gene markers on chromosome Y, and outlier detection by principle component analysis (PCA), a total of 198 samples remained for use in all downstream analysis of this study. The flowchart of this analysis pipeline is shown in Fig. 1.

Characterization of coding genes and non-coding lincRNAs that are specifically expressed in blood using deep RNA-Seq
After alignment, using an annotation file (Ensembl) [17], fragments per kilobase of transcript per million mapped reads (FPKM) values are derived using Cufflinks [18] to be used as expression measurements for each feature (22,881 protein-coding genes and 7364 lincRNAs). For Illumina BodyMap RNA-seq data which contains 16 human tissues, the raw .fastq files were downloaded and processed as described above to obtain FPKM values for each transcript. After normalizing data and removing batch effects [19], a linear regression model is applied to identify coding genes and lincRNAs that are expressed at a higher level in blood than in 16 other tissues in the BodyMap. Candidate genes are viewed in IGV [20,21]. Expression measurements for coding-genes and lincRNAs were quantified by using Cufflinks as reported recently [22,23].

Estimation of hidden confounders and their association with known clinical phenotypes and technical variables
Besides known batch effects (sequencing batch) and known covariates, in order to take into account known and unknown technical effects, and other unwanted variations (e.g., proportions/frequencies of the various cell types present in whole blood) in the analysis, we applied the Surrogate Variable Analysis (SVA) [24] to calculate hidden variables, but in the meanwhile preserving biological heterogeneity in high-throughput experiments.
These computed hidden variations are needed to be adjusted in the downstream analysis model to decrease false positives. FPKM values of all transcripts estimated (including coding genes, lincRNAs, antisense transcripts, pseudogenes, etc) were used as the input for SVA package to estimate the surrogate variable (SV). Two significant SVs were finally selected and included in the downstream association analysis.

Identification of coding-gene and lincRNA expression signatures associated with MI and CAC
For our differential expression analysis, we first filtered for stably expressed mRNAs and lincRNAs in whole blood (FPKM > 0.1 in > = 10% of samples). After filtering non-expressed genes, for each of 12,062 proteincoding genes and 3707 lincRNAs, we applied DESeq2 [25] to identify genes differentially expressed in MI and high CAC, respectively, compared with the control group. Since sex are statistically different among early MI, high CAC and control (P = 0.00037), we adjusted for sex besides known batch effects and hidden confounders (i.e. 2 SVs estimated from the above step) in the model in which the raw count is the outcome (dependent variable), disease status (MI/CAC/control) is the predictor, and age, known batch effects and hidden confounders were included as covariates. Since these 198 participants were selected from un-related families, no family structure was adjusted for in the model. We applied the package DESeq2 in R to estimate the disease effects.

Protein-coding gene expression association analysis
A total of 12,062 tests were performed examining the associations between each of the expressed coding-genes and MI and high CAC, respectively, compared with the control group. Following a Bonferroni multiple test correction (Benjamini and Hochberg method), a false discovery rate (FDR) < 0.1 (corresponding to a nominal P value of 5.61E-4) was used to define significant associations between the expression level of code-genes and the disease status (differentially expressed coding genes with MI and high CAC, respectively). In addition, for genes differentially expressed between MI and control at FDR < 0.1, we performed a secondary, hypothesis-generating analysis to test expression level changes between MI and high CAC.

LincRNA expression association analysis
A total of 3707 tests were performed examining the associations between each of the expressed lincRNAs and MI and high CAC, respectively, compared with the control group. The same significance level of FDR < 0.1 (corresponding to a nominal P value of 6.33E-05) was used to define significant associations between the lin-cRNA expression level and the disease status (differentially expressed coding genes with MI and high CAC, respectively).

Gene ontology enrichment analysis to examine biological functions of gene signatures
For the 435 coding genes expressed moderately and highly in blood, we submitted their unique gene symbol to the DAVID website [26,27] to identify GO molecular function categories, KEGG pathways, and CGAP Bio-Carta Pathways over-represented among the 435 coding genes compared to background (all human genes). We accounted for multiple testing using Bonferroni-corrected significance levels: 0.05/1472 = 0.00003 (cellular component) or 0.05/8972 = 0.000006 (biological process). The same analysis was conducted for the 68 coding genes differentially expressed between MI and control participants (FDR < 0.1).

Gene set enrichment analysis (GSEA)
The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets for use with GSEA software [28]. Gene sets from the MSigDB were used to search for pathways that may be more modestly altered in MIassociated gene expression data. A ranked gene list was generated by ranking t-value of 12,062 filtered genes in differentially expressed gene analysis using DeSeq2 with RNA-Seq count data. Then a total of 3283 gene sets in GSEA C2 category [28] (MSigDB) were used for the enrichment analysis. FDR < 0.25 was used to define the significance of enrichment. Gene set size of < 15 and > 500 were filtered out, resulting in filtering out 1442/4725 gene sets. Therefore, the remaining 3283 gene sets were used in the analysis.

Comparison of RNA-Seq and exon array platform and validation of MI gene signatures detected from deep RNA-Seq data using Affymetrix exon array data
In a sample of 198 samples with high quality RNA-Seq data being used in the analysis of this study, 193 RNA samples were also analyzed by Affymetrix Exon-array to obtain gene-level measurements. In order to be comparable between these two platforms, We first created a custom BED file based on the coordinates of core probe sets on exon-arrays so that only reads mapping to coreprobe sets are used by RSeQC [29] to obtain gene-level RPKM values. We found that the overall correlation of coding genes between RNA-Seq and Exon-array is only a little lower (r^2 = 0.56, Pearson correlation coefficient r = 0.75) than a previous study (r^2 = 0.62) [30], indicating the high quality of RNA-Seq human blood data considering that the previous study [30] was conducted in cell line samples.
We have identified 68 coding genes differentially expressed between 55 early MI and 71 control participants in whole blood, and then these genes were classified as MI genes. In order to estimate the relationship of the MI genes between RNA-Seq data in this study and Exon array data obtained before, MI genes were mapped to Exon array dataset via gene symbol. PCAs were performed for each dataset across the mapped genes using z-score normalized data. The relationship was subsequently defined quantitatively using GSEA. The samples in each dataset were divided into two groups: case versus control (MI vs. control). Then all of the genes in Exon array dataset were ranked by the signal to noise statistic (t value) in the MI case: control comparison. GSEA was used to determine if the gene set (MI-associated genes) detected from deep RNA-Seq data were significantly enriched in the above ranked gene list generated from Exon array data.

Replication of MI expression signatures using IlIumina RNA-Seq in an independent Rotterdam cohort
Rotterdam Study RNA-Seq data was generated as part of the BIOS project and is described in detail elsewhere [31]. In short, total RNA was globin cleared using Ambion GLOBINclear and sequenced to a minimum yield of 15 M paired-end reads on HiSeq 2000. Data was aligned to reference genome hg19 using STAR, followed by quantification of all GENCODE v16 genes using custom scripts. We then extracted the counts of 56,515 transcripts for 807 participants of the Rotterdam Study (RS-I, II and III). Using edgeR, counts per million (CPM) mapped reads were generated for each sample, and transcripts with CPM < 1 in more than 90% of samples were excluded, allowing 15,331 transcripts in further analysis. In summary, among 404 individuals (15 from RS-I and 389 from RS-II, 190 male and 214 female), there are 28 MI cases vs. 376 controls and 41 CHD cases vs. 363 controls for the differential expression analyses.
For each coded phenotype (MI/CHD), linear regression analysis was performed in R, correcting for age, gender, flow-cell and the number of sequenced reads. A custom linear regression script was used as the dataset was too large to be processed by edgeR or DESeq. The effect sizes and uncorrected p-values were reported for each of the candidate genes of the discovery analysis.
See Additional file 1: Supplemental Material for details of Rotterdam cohort and its measurement of CAC.

Sample characteristics and RNA quality of the blood samples
Two hundred two FHS Offspring participants with a history of early MI, high CAC, or neither and with whole blood expression data were selected for this RNA sequencing study. After exclusion for poor quality and outliers of two participants with history of early MI and two participants with high CAC, 198 samples remained for inclusion in the analysis. Our final study sample with RNA-Seq of whole-blood RNA consisted of 198 European ancestry individuals (55 with history of early MI, 72 with high CAC without MI, and 71 control participants free of elevated CAC levels or history of MI).
Clinical characteristics of these study participants are presented in Table 1. Participant selection was targeted to match for age and sex across three groups. As shown in Table 1, there was no difference in age (mean = 68, P = 0.76), but there were differences in the distributions of sex (P = 0.00037) and several other clinical covariates across groups including diastolic blood pressure, total cholesterol, HDL-C, smoking, diabetes and treatments for hypertension, dyslipidemia, or diabetes as well as use of aspirin (P < 0.05).
We compared the RNA quality across the three groups. The concentration and yield of RNA were slightly different (P = 0.03) with a relatively higher concentration and yield in controls, respectively. However, as shown in Additional file 1: Table S1, there are no differences among groups for RNA quality score i.e., RNA integrity number (RIN), and 260/280 ratio.

Transcriptome profiling of coding and non-coding genes in human whole blood
By sequencing one sample per lane using Illumina HiSeq platform, we generated high-quality and deep coverage RNA-Seq data for 201 blood samples besides nine replicates. On average, there were 150 million paired-end 75 bp reads for each sample. The overall unique mapping rate is 75% and the concordant pair alignment rate was 62%. Sequencing summary and mapping statistics are shown in Additional file 1: Fig. S1. For the nine samples sequenced twice, we checked the correlation of all measured transcriptome in whole blood between samples sequenced twice, which ranged from 0.77 to 0.99 (Additional file 1: Table S2). After QC, in the final analysis, for samples sequenced twice, we selected the one with the higher number of total unique mapped reads. Samples were also checked for sex mismatch. After exclusion for poor quality RNA and outliers, 198 samples remained for inclusion in the analysis. Of  We further identified protein-coding genes and lin-cRNAs that were more highly expressed in whole blood by comparing to expression in other tissues in the Illumina Human Body Map RNA-Seq data. The Illumina Human BodyMap 2.0 data [33] includes RNA-Seq data for 16 other human tissues, and we processed the Illumina Human BodyMap 2.0 data with the same tools and parameters as for our blood RNA-Seq data. In this comparison, ~ 60% of the detectable coding-genes were expressed much higher in our blood samples than in 16 other human tissues. As shown in the heat map of 52 genes highly expressed in blood (log 2 [FPKM] > 6 in 100% samples) and 36 lincRNAs moderately expressed in blood (log 2 [FPKM] > 1 in 100% samples), half of coding genes are expressed specifically in blood (Additional file 1: Fig.  S2a), and two thirds of lincRNAs are expressed higher in blood compared to other 16 human tissues. By evaluating the expression profiles of 36 lincRNAs expressed highly in blood (log 2 [FPKM) > 1 in all samples), two clusters were identified as shown in Additional file 1: Fig.  S2b. Among them, the expression level of 26 lincRNAs (top panel) is substantially higher in blood compared to 16 other human tissues in the Human BodyMap data, including the FAM157A lincRNA, a known lincRNA known to be overexpressed in blood, that contains 14 exons and has 2 transcripts (splice variants) as shown in the Ensembl Genome Browser [34].
Finally, we compared the overall expression profiling on Illumina RNA-Seq and Affymetrix Exon-array platforms in all 193 participants with samples with expression data from both platforms. The correlation of expression level between the two platforms is on average 0.745 with a median of 0.748, which is slighter lower (R^2 = 0.56) than a prior report with an R^2 of 0.62 between RNA-Seq and Exon-array data for 5 cell line samples [30]. The correlation result is high, with overall r = 0.75. Additional file 1: Fig. S3a shows the correlation plot for one sample, and Additional file 1: Fig. S3b shows the distribution of r values for all 193 samples.

Differential expression analysis to identify mRNA and lincRNA signatures for early MI and high CAC
At a false discovery rate (FDR) < 0.1, we found 68 coding genes and two lincRNAs differentially expressed between early MI and controls (Table 2), with 21 genes overexpressed in MI cases and 49 down-regulated, including two lincRNAs. By contrast, only three coding genes and  one lincRNA (RP11-245 J9) were differentially expressed between high CAC and controls (Table 3). Notably, APOD, encoding a component of high-density lipoprotein, was expressed significantly lower in both early MI (FDR = 0.007) and in high CAC (FDR = 0.01) compared with controls, respectively, highlighting a novel candidate for both MI and subclinical atherosclerosis. A supervised clustering analysis of the 69 expression signatures (68 MI and 3 CAC genes with 2 genes shared by both MI and CAC) shows significant expression changes across the three groups (Fig. 2a). Principle component analysis (PCA) of 68 MI gene signatures (Fig. 2b) also shows a separation pattern between MI and controls. In addition, for DEGs, particularly for APOD detected in both MI and high CAC, boxplots show a clear trend in early MI and high CAC relative to controls (Fig. 2c). APOD is ranked as the top gene in both MI (log 2 fold change = − 0.2, FDR = 0.007) and high CAC (FDR = 0.01) gene signatures. Of note, APOD was moderately expressed in blood with an average of FPKM of 3.79 across all 198 samples.
To assess whether our findings may represent transcriptomic signatures confounded by baseline clinical variables/covariates, established vascular risk factors, or of drug treatments for MI or its risk factors, we further adjusted for 9 covariates that differed among the three groups (see Table 1) in the primary simple differential analysis model in which we had adjusted for sex, known batch effects, and hidden confounders (i.e. two SVs). The 9 clinical covariates include diastolic blood pressure, total cholesterol, HDL-cholesterol, cigarette smoking, diabetes, and drug treatment for hypertension, dyslipidemia, or diabetes as well as use of aspirin. For all 70 MI genes reported in Table 2, after adjustment for these covariates,  except for APOD and DUS1L, there was overall attenuation of the significance for MI signatures, and associations for 9 genes remained significant at FDR < 0.1 (Additional file 1: Table S6). APOD remained significantly downregulated in both early MI (FDR = 0.003, beta/log2FC = − 0.23) and high CAC (FDR = 0.01, beta/ log2FC = − 0.21). We hypothesize that APOD may represent a novel target for the treatment and prevention of atherosclerotic disease. Finally, for the 68 coding genes differentially expressed between MI and control at FDR < 0.1, our secondary analysis found four genes also differentially expressed between MI and high CAC at a Bonferroni corrected p-value ≤ 0.0007 (0.05/68). Their expression level was changed in the same direction across the three groups. Three genes (IRF7, HBG2, and HBG1) were up-regulated in MI compared to high CAC and controls, and one gene (PEX26) was down-regulated in MI compared to high CAC and controls.

Pathway enrichment analysis to identify biological function pathway signatures for early MI
To explore biological functions and pathways in which the gene signatures of MI might act, we found that annotations of the 68 MI genes were highly enriched for a few GO categories including protein complex binding and phosphoprotein (FDR < 0.05), compare to all human genes as background.
Gene set enrichment analysis (GSEA) [28] was conducted to explore for enrichment of gene sets for the up-regulated and down-regulated genes in participants with MI compared to controls. In a total of 3283 gene sets in GSEA C2 category [28] at FDR < 25% enrichment level, 215 gene sets were significantly up-regulated (out of 1938) and eight gene sets were significantly down-regulated (out of 1345). As shown in Table 4 for gene sets significantly enriched at FDR < 5%, all 36 gene sets were up-regulated in MI with a positive Normalized Enrichment Score, including several interferon response pathways, insulin receptor recycling, known targets of transcription factor STAT3, and a gene set related to epigenetic regulation in which 17 genes were significantly silenced by methylation (p-value = 0, and FDR q-value = 0.02). The up-regulation of gene expression in the group with early MI compared to controls (Additional file 1: Fig. S4) supports the hypothesis for an epigenetic role in MI pathogenesis.
Other interesting gene sets at significantly enriched FDR < 25% but FDR > 5% include sets for hypoxia, oxidative phosphorylation, inflammatory response, obesity and cholesterol biosynthesis (Additional file 1: Table S7), consistent with a number of pathophysiological pathways previously implicated in coronary artery disease and MI. New knowledge regarding these regulatory changes may improve our ability to functionally characterize susceptibility variants associated with diseases and related risk factors.

Validation of MI expression signatures using Affymetrix exon-array expression data
Of 198 RNA-Seq samples, there was an overlap of 193 with the previous 5626 Affymetrix Exon-array data. Using these 193 common samples, only one gene (CLDN8, beta = 0.16, p = 2.84e-06, FDR = 0.05) was differentially expressed between MI and controls, a finding that was validated in exon array data at FDR < 0.1. For 10,595 expressed genes found in both platforms, we first computed the statistic t value for each gene on each platform by comparing MI cases with controls. We found a significant correlation of the t values between RNA-Seq and exon array (r = 0.21, P < 1e-324, Additional file 1: Fig. S5), which is consistent with previous reports [30].
The replication rate of expression signatures may be low when performing single gene comparisons between different platforms (e.g., RNA-Seq and exon array) [28] or across different high-throughput studies. Therefore, we conducted a GSEA analysis to validate whether the 68 DEG considered as an entire gene set is significantly enriched using exon array MI cases versus controls. For the protein coding genes, 66 unique genes were found on the exon-array. 17,873 exon array genes were ranked by t value (from positive to negative) in the MI case:control comparison. The enrichment is significant (nominal p-value < 0.001) with Normalized Enrichment Score = − 2.14 ( Fig. 3), indicating overall downregulation in MI compared with the controls in exon array data, consistent with our findings from RNA-Seq (47 of 68 down-regulated genes). Additional file 1:  Table S8 reports 35 leading edge genes among the set of 68 genes. The leading-edge subset can be interpreted as the core group of genes that accounts for the gene set's enrichment signal [28].

Independent replication of MI expression signatures using Illumina RNA-Seq in the Rotterdam study cohort
We further replicated our MI gene signatures in an independent cohort, the Rotterdam Cohort study (N = 807) in which Illumina RNA-Seq was conducted. Among our 70 MI genes (68 coding and 2 lincRNAs), 60 coding genes were detectable (CPM > 1 in > 10% of samples) and therefore analyzed in the Rotterdam RNA-Seq dataset. We

Table 4 Gene set enrichment analysis (GSEA) for enrichment of gene sets/pathways for the up-regulated and downregulated genes in participants with MI compared to controls
A total of 3283 gene sets in GSEA C2 category [28] were tested. At FDR < 25% enrichment level, 215 gene sets were significantly up-regulated (out of 1938) and eight gene sets were significantly down-regulated (out of 1345) found a significant correlation of the effect size of MI between our discovery FHS RNA-Seq and Rotterdam replication RNA-Seq (Spearman.r = 0.53, P < 1.3e− 05).

NES
Specifically, among 60 genes tested, we found 9 genes were differentially expressed between MI cases (n = 28) and controls (n = 376) at P < 0.05 (Table 2), and 8 were expressed differently in the same direction as indicated in bold in Table 2. In addition, among 8 coding genes supported by p value and direction, two genes were also differentially expressed between CHD cases and controls in the same direction (HNRNPR with P = 0.01, and NPDC1 with P = 0.0026). Furthermore, among 60 genes tested, 77% (46 genes) of the associations are in the same effect direction between MI and controls, indicating consistency of effect, although a larger sample size is needed to replicate significant associations with MI.

Discussion
Although GWAS have identified many genetic variants associated with MI and subclinical coronary atherosclerosis (e.g. high CAC), the totality of evidence suggests that many GWAS variants are located in non-coding genomic regions. Genes reported for these variants are based on their proximity to nearby genes and limited to annotated protein-coding genes that may not represent the causal genes responsible for the traits studied, and much of the functional genomics of coronary artery disease remains unknown. Whole transcriptome studies using RNA-Seq can simultaneously comprehensively profile both coding and non-coding genes and transcripts associated with disease status, providing new knowledge and functional biological insights into human diseases. We first systematically characterized expression patterns of coding mRNAs and non-coding lincRNAs in whole blood through a high-coverage of RNA-sequencing experiment (one sample per lane). Much more highly-expressed coding genes are identified comparing to lincRNAs in whole blood, which is consistent with a previous report [32]. When compared to 16 other human tissues, a larger percentage of lincRNAs versus coding genes are expressed especially in whole blood, indicating that lincRNAs might be more tissue/cell-type specific compared to coding genes. Further studies with a greater diversity of tissue/cell data generated from the same research participants are needed to confirm this finding.
We identified coding and non-coding gene expression signatures associated with prior early MI, and a few expression signatures were also discovered for high CAC, a noninvasive measure of coronary atherosclerosis, which precedes most cases of MI [3,4]. Of note, APOD, encoding a component of high density lipoprotein, was expressed significantly lower in both early MI (FDR = 0.007) and in high CAC (FDR = 0.01) compared with controls, respectively. Altered expression of APOD was not reported to be significantly associated with coronary heart disease in our prior FHS investigation [35], but the cases were not of early onset and only half of the cases had prior MI. Furthermore, a prior separate FHS investigation found that protein level of APOD is also decreased in MI new-onset patients compared to controls [36], providing orthogonal evidence for APOD as an attractive novel candidate for clinical and subclinical atherosclerosis. Tsukamoto et al. reported altered response to myocardial infarction in Apod knockout mice [37], revealing APOD as a cardioprotective gene using a mouse model of lethal atherosclerotic coronary artery disease. In addition, high levels of APOD protein in humans are associated with protective inflammatory levels and fatty liver in initial human studies [38,39]. Further investigation of APOD in mouse models and larger human studies will be needed for experimental validation of this mechanism, and to allow investigation of underlying mechanisms related to atherosclerotic coronary artery disease.
In addition, despite our relatively modest sample size, we identified 71 gene expression signatures for MI and CAC in whole blood. Pathway analysis for these genes highlighted immune response, lipid metabolic processes, and interferon regulatory factor as potential pathways involved in disease progress/pathogenesis, consistent with the known pathophysiology of coronary artery disease.  Table 2), of which 66 unique genes found in Exon array platform. The list of 17,873 coding genes ranked by t value of comparing early MI with control on Affymetrix Exon array data (N = 193 samples). Normalized Enrichment Score (NES) = −2.14, Nominal p-value = 0 and FDR q-value =0 Only a few lincRNA expression signatures were found to be associated in either MI or CAC, which might be due to a relatively small sample size and much lower abundance of lincRNAs in human tissues. Because we undertook deep sequencing coverage, we were able to identify many lincRNA specifically expressed in blood that are likely not reliably detected in lower coverage RNA-Seq experiments. The lincRNA RP11-245 J9.5 associated with CAC is of interest. It expressed high in peripheral blood and log2FC = − 0.6 (Fig. 2d). This gene is also known as PSMD6-AS2, a gene that could disrupt expression of a proteasome subunit. This gene was identified as differentially expressed in a study of atherosclerotic macrophages [40]. Another proteasomal subunit, PSMC3 mutation is associated with subcutaneous calcifications [41], indicating that this lincRNA RP11-245 J9.5 might be involved in atherosclerosis by regulating several proteasome subunits. However, further replication of its association with CAC in independent studies is needed to warrant future experiment mechanism studies of this lincRNA and identification of its functional targets. LincRNA may act as key transcriptional regulators in different stages of biological systems, from chromatin regulation to transcription regulation [22]. Future studies are warranted to explore the relationship between these lincRNA signatures and their regulated mRNA targets and specific biological process in atherosclerosis, and the implications for new therapeutic targets for treatment and prevention of clinical MI and subclinical atherosclerosis.
Pathway enrichment analysis identified interesting results including a pathway/gene set called "STAT3_TAR-GETS_DN". Signal transducer and activator of transcription 3 (STAT3) protein has been linked to cardiovascular disease through multiple pathways in experimental and animal studies [42,43]. STAT3 is a key regulator of cellto-cell communication in the heart, modulates proliferation, differentiation, survival, oxidative stress, and/or metabolism in cardiomyocytes, fibroblasts, endothelial cells, progenitor cells, and various inflammatory cells [44]. It has been well documented that monocytes and macrophages produce inflammatory cytokines to repair the injury during myocardial infarction and hypertrophy [45]. The early activation of STAT3 during diseased stage could be the protective response of system to reduce the cardiac death and remodeling through transcription factor STAT3 binds to promoter region of cardio-protective genes in nucleus [43].
In our study, we have adjusted for well-known risk factors for MI and subclinical atherosclerosis. Future studies in larger cohorts with clinically apparent coronary heart disease or subclinical atherosclerosis will allow exploration of the role of specific risk factors in the progression of subclinical atherosclerosis to clinical atherosclerosis at the molecular level. In addition to our modest sample size, the major limitation of our study is the use of whole blood RNA, which includes heterogeneity of leukocyte cell types, although we adjusted for differences of cell types in our study. Future RNA-Seq experiments in affected tissues/cells such as atherosclerotic aortic root cells and coronary artery endothelial cells are warranted.
While we acknowledge there are limitations to our pilot study, we believe we have identified several lessons for future applications of whole blood RNA sequencing for discovery of coronary atherosclerosis genes. Among the limitations, RNA-Seq experiments are still costly, and as with our study, the resulting relatively small sample size of these experiments may continue to limit statistical power to study rare RNA species such as lincRNAs that require deep coverage sequencing. Further, non-strand-specific RNA-Seq protocol limits accurate discovery of antisense transcripts and might lead to bias for quantifying genes that overlap with anti-sense transcripts. Finally, unmeasured confounders may affect results of association studies. Nevertheless, we conclude that blood RNA sequencing analysis is feasible and may detect a much fuller and informative spectrum of gene expression than is seen on gene chip arrays, although careful RNA extraction and high resolution sequencing will be required for early studies.

Conclusion
In summary, we identified significant MI-specific expression signatures, with eight genes (15%) supported in an independent cohort with RNA-Seq data. Of note, APOD, encoding a component of high-density lipoprotein, was significantly downregulated in both early MI and in high CAC compared with controls, indicating a novel candidate target for the treatment and prevention of atherosclerotic disease. Our findings provide insights into mechanisms through which transcriptome-level variation may influence the development of subclinical coronary atherosclerosis and, ultimately, clinical MI.