An integrated transcriptome and epigenome analysis identifies a novel candidate gene for pancreatic cancer

Background Pancreatic cancer is a highly lethal cancer with limited diagnostic and therapeutic modalities. Methods To begin to explore the genomic landscape of pancreatic cancer, we used massively parallel sequencing to catalog and compare transcribed regions and potential regulatory elements in two human cell lines derived from normal and cancerous pancreas. Results By RNA-sequencing, we identified 2,146 differentially expressed genes in these cell lines that were enriched in cancer related pathways and biological processes that include cell adhesion, growth factor and receptor activity, signaling, transcription and differentiation. Our high throughput Chromatin immunoprecipitation (ChIP) sequence analysis furthermore identified over 100,000 regions enriched in epigenetic marks, showing either positive (H3K4me1, H3K4me3, RNA Pol II) or negative (H3K27me3) correlation with gene expression. Notably, an overall enrichment of RNA Pol II binding and depletion of H3K27me3 binding were seen in the cancer derived cell line as compared to the normal derived cell line. By selecting genes for further assessment based on this difference, we confirmed enhanced expression of aldehyde dehydrogenase 1A3 (ALDH1A3) in two larger sets of pancreatic cancer cell lines and in tumor tissues as compared to normal derived tissues. Conclusions As aldehyde dehydrogenase (ALDH) activity is a key feature of cancer stem cells, our results indicate that a member of the ALDH superfamily, ALDH1A3, may be upregulated in pancreatic cancer, where it could mark pancreatic cancer stem cells.


Background
Although pancreatic cancer is the tenth most commonly diagnosed cancer in the U.S., it is the fourth most common cause of cancer mortality with close to 37,000 deaths per year in the U.S. [1]. The aggressive nature of this cancer is further emphasized by the dismal five-year survival rate of less than 5% [2]. Pancreatic ductal adenocarcinoma (PDAC), the most common form of pancreatic cancer, is thought to arise through a multistep process involving intermediate precursor lesions known as pancreatic intraepithelial neoplasias (PanINs) [3]. Genome-wide sequencing approaches have revealed a complex set of somatic alterations in pancreatic tumors such as single base mutations, amplifications, deletions and complex rearrangements that drive cancerous growth through specific signaling pathways [4,5]. The genes most frequently altered in pancreatic cancer are KRAS, TP53, CDKN2A and SMAD4, leading to an activation of growth promoting and cell survival pathways, and inactivation of tumor suppressors and apoptotic pathways [4].
In addition to an accumulation of somatic mutations that affect the DNA sequence directly, epigenetic events can lead to an increase in cell growth or survival and thereby contribute to tumor development [6]. DNA methylation and post-translational modification of histone proteins are epigenetic mechanisms that regulate gene expression in normal and cancerous cells. DNA methylation at specific genes can increase or decrease in cancer and is often associated with an inactivation of tumor suppressor genes [7]. Post-translational modification of histones mediates appropriate gene expression by regulating access of the transcriptional machinery to the underlying DNA [8]. Changes in DNA methylation have been investigated in pancreatic cancer [9][10][11] but genome wide maps of histone modification patterns have not been reported in this disease.
To enhance our understanding of the pancreatic cancer genome, we catalogued transcribed sequences and potential functional elements in cell lines derived from normal and neoplastic pancreatic tissues by next generation RNA-sequencing and chromatin immunoprecipitation (ChIP) with massively parallel sequencing. Differentially expressed genes and pathways were identified, and gene expression was correlated to histone modification patterns that mark active and repressed chromatin.

RNA preparation
Total RNA was isolated from~5×10 6 cells using the mirVana™ miRNA isolation kit (Ambion) according to the manufacturer's protocol (retaining the small RNA fraction in the pool of total RNA). DNAse treatment was performed with TURBO DNase (Ambion); RNA quantity was assessed by Nano Drop (Thermo Scientific) and RNA quality by using the Agilent 2100 Bioanalyzer (Agilent Technologies). RNA quality (RIN) scores for RNA prepared from cell lines were > 9.0.

ChIP and template preparation
Chromatin immunoprecipitation (ChIP) was performed with the ChIP-IT™ Express kit (Active Motif ) according to the manufacturer's protocol. Briefly, cells were grown to log phase and cross-linked with 1% formaldehyde for 10 minutes at room temperature. The fixation reaction was stopped by a glycine stop solution and cells were washed in PBS and homogenized in lysis buffer supplemented with protease inhibitors. Shearing was performed by sonication in a Diagenode Bioruptor (30-sec on/off pulses for 15-20 min at high setting) to obtain 200-500 bp DNA fragments. After centrifugation, the supernatant was used for chromatin immunoprecipitation (from~1×10 7 cells each) using antibodies for: anti-mono-methylated histone H3 at lysine 4 (H3K4 me1) (Abcam, ab8895), anti-tri-methylated histone H3 at lysine 4 (H3K4me3) (Abcam, ab12209), anti-tri-methylated histone H3 at lysine 27 (H3K27me3) (Abcam, ab6002) and anti-RNA polymerase II (Pol II) (Abcam, ab817). Input samples consisted of 10 μL sonicated chromatin. For each ChIP, 20 μg sheared chromatin was incubated overnight at 4°C on an end-to-end rotator, with 8μg antibody, protein G magnetic beads and protease inhibitors. Finally, the ChIP and input DNA samples were reverse cross-linked, treated with proteinase K and purified with the QIAquick PCR Purification Kit (Qiagen).

Library preparation and sequencing
For mRNA-seq sample preparation, the mRNA-seq Sample Prep Kit (Illumina) was used according to the manufacturer's protocol. Briefly, 1 μg total RNA was used for polyA mRNA selection using Sera-mag oligo (dT) beads, followed by thermal fragmentation at 94°C. First strand cDNA synthesis was performed using Superscript II reverse transcriptase and random primers. Second strand synthesis was performed using DNA Polymerase I followed by end repair with T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase (PNK). Finally, the double-stranded cDNA was ligated to Illumina paired-end (PE) adaptors and size selection performed for fragments in the 350±25 bp range. Libraries were amplified using Phusion DNA polymerase, followed by purification with the QIAquick PCR Purification Kit (Qiagen). Amplified libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4-6 pM on two Genome Analyzer (GAII) lanes.
MicroRNA-seq sample preparation was performed with the Small RNA v1.5 sample prep Kit (Illumina) according to the manufacturer's protocol. Briefly, 3′ and 5′ small RNA adapters were ligated to 10 μg total RNA followed by reverse transcription with Superscript II and amplification using Phusion DNA polymerase. This was followed by size selection of cDNA in the 100-108 bp range and measurement using a Bioanalyzer. Amplified miRNA-seq libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4-6 pM on one Genome Analyzer (GAIIx) lanes.
For ChIP-seq sample preparation, pooled ChIP reactions (10 ng) were used to prepare a single library using the ChIP-seq Sample Prep Kit (Illumina) according to the manufacturer's protocol. Briefly, DNA ends were repaired with T4 DNA polymerase, Klenow polymerase and T4 PNK. The Klenow fragment (3′ to 5′ exo minus) was then used to generate a protruding 3′ A base and adapter oligos were ligated using DNA ligase. ChIP DNA in the 200±25 bp range was recovered and amplified using Phusion polymerase. Library concentration was assessed using a Bioanalyzer. Amplified libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4-6 pM on 2-3 Genome Analyzer (GAII or GAIIx) lanes. All nextgeneration sequencing was carried out at the National Cancer Institute's Center for Cancer Research Sequencing Facility operated by the Advanced Technology Program of SAIC-Frederick, Inc. (Gaithersburg, MD).

Sequence alignment and analysis
mRNA-sequencing: paired-end RNA-seq was repeated twice for each cell line to obtain reliable numbers of sequence reads (Additional file 1: Table S1A). The median number of reads (35 bp) generated per paired-end experiment was 8.9 M (8.6 to 11.2 M). Total reads were combined for each cell line: 19,778,468 for the hTERT-HPNE and 17,769,249 for PANC-1 cells. No significant difference was observed in the number of reads generated between the two cell lines (Student's t-test, P=0.53). Paired-end reads were mapped to the RefSeq database (National Center for Biotechnology Information (NCBI) build 37) using the Burrows-Wheeler Aligner (BWA) software with default parameters that allow up to 3 alignments for each read and up to 2 mismatches for the seed sequence (the first 25 bp of each read) [15]. Reads that failed to map to RefSeq were mapped to the Ensembl database, which includes additional transcripts and pseudogenes [16]. Remaining unmapped reads were mapped to the human genome assembly (NCBI build 37). The analysis pipeline is shown in Additional file 2: Figure S1. Gene expression was calculated in reads per kilobase of exon per million mapped sequence reads (RPKM) [17]. After filtering, we included 11,249 genes expressed above 1 RPKM in at least one cell line for further analysis. The edgeR package was used to identify genes that were differentially expressed in the two cell lines using total tag count values for each gene [18]. Data was normalized using the weighted trimmed mean of log-ratio values (using hTERT-HPNE as a reference) to account for library size; statistical analysis was performed using the Fisher's exact test. We considered genes to be differentially expressed between cell lines when the Benjamini-Hochberg false discovery rate (FDR) was less than 0.05 and the fold change difference in expression was equal to or greater than 3.
miRNA-sequencing: After trimming adaptor sequences and filtering low-quality reads, a total of 9,262,094 and 18,006,865 single-end short (22 bp) sequence reads were obtained from PANC-1 and hTERT-HPNE cells, respectively (Additional file 1: Table S1B). The miRNAkey software [19] was used to remove adaptor sequence from the 3′-ends of reads, to map raw reads to the miRBase mature database (release 16) [20], calculate normalized RPKM expression values and quantify differentially expressed miRNAs. Two hundred and fifty eight miRNA genes were detected at 1 RPKM or higher in at least one of the cell lines. Genes were considered differentially expressed with a fold change difference ≥ 3 and a Bonferroni corrected Chi-square P-value < 0.05.
Pathway enrichment analysis: Gene set enrichment analysis of differentially expressed mRNA and miRNA genes was performed based on KEGG (Kyoto Encyclopedia of Genes and Genomes) [21] and GO (Gene Ontology) [22] annotations using GOseq, as it accounts for the bias of over-detection of differential expression for long and highly expressed transcripts [23]. The Wallenius approximation method with Benjamini-Hochberg corrected P-values was used to determine enriched pathways; and an FDR < 0.05 was considered significant.
ChIP-sequencing: ChIP-seq single-end experiments were run on 2-3 lanes to obtain an adequate number of reads (Additional file 1: Table S1C). Matching input DNA was sequenced to obtain a reference for each cell line. The total number of short (35 bp) reads generated per lane varied from 4,509,035 to 27,577,446 with a median of 19,279,525. Raw reads were combined for each histone modification mark and RNA Pol II as well as for input DNA for each of the cell lines. No significant difference was observed in the number of reads generated in the two cell lines for the different histone modification marks or for RNA Pol II (Student's t-test, P-value range: 0.17-0.83). Alignment to the human genome assembly (NCBI build 37) was performed using Bowtie, allowing for up to 2 mismatches and only one (best) alignment [24]. The SICER software [25] was used to identify qualified peaks (islands) of histone and Pol II binding by comparing sequence reads from immunoprecipitated and input DNA. A consolidating window size of 200 bps was used, a gap size of 200 bps (H3K4me1, H3K4me3 and Pol II) or 600 bps (H3K27me3), effective genome size of 81%, ratio of enrichment between experimental data (PANC-1) and control (hTERT-HPNE) ≥3 and an FDR <0.05. Differentially enriched epigenetic marks across the two cell lines were identified by the SICER-df function. A cis-regulatory element annotation system (CEAS) was used to attain summary statistics on ChIP enrichment peaks based on location in promoters, gene bodies or nongenic regions using the RefSeq database [26].

Integration of ChIP-seq and RNA-seq
Genes were divided into quartiles based on digital expression levels (RPKM values) for each cell line. Global profiling curves were generated for genes in each of the quartile groups by plotting the read distributions (tags were binned into 25 bps bins and trimmed based on Poisson distribution) of different histone modification marks and RNA Pol II within 5,000 bp up-and downstream of transcription start sites (TSS) for RefSeq genes.

Genomic copy number alterations
To assess if a bias existed in read mapping for ChIP-Seq because of chromosomal copy number gains or losses, we extracted DNA (Gentra Puregene Tissue Kit, Qiagen) from the hTERT-HPNE and PANC-1 cell lines, and genotyped at NCI's Cancer Genomics Research Laboratory using the Illumina Human Hap 1M-Duo chip. The Illumina GenomeStudio software was used to compute affinity-normalized probe intensities (normalized genotype probe intensities, log R ratio (LRR) and B allele frequency (BAF)), quality scores and to call genotypes. Renormalized (quantile) LRR and BAF values were then analyzed using a custom software pipeline, R-Gada [27], to detect whole-chromosome and segmental events as previously described [28].
We only observed chromosomal copy number alterations in the PANC-1 cell line but not in the hTERT-HPNE cells. Adjacent events were merged if they had an identical state and distance between segments was <1 Mb. Each event was then classified as copy number gain, loss or copy neutral loss of heterozygosity (CNLOH). Since we observed only 200 (0.49%) regions enriched for RNA Pol II binding and 41 (0.57%) regions enriched for H3K27me3 binding in regions that were lost in PANC-1 cells, these were excluded from further analysis. To estimate a possible bias in mapping, we then calculated the number of regions enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells per 1 Mb for each autosomal chromosome for gain, CNLOH and copy neutral events. For interclass, unpaired comparisons, we used the Mann-Whitney U-test to assess differences in the distribution of genomic regions enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells.
To assess if genes selected for replication by TaqMan expression analysis (see below) were located in regions of copy number gain or loss, we also genotyped the following cell lines and assessed chromosomal abnor-malities as described above: AsPC-1, BxPC-3, Capan-1, CFPAC-1, Hs 766T, SU.86.86 and SW 1990.
Assessment of differential mRNA expression analysis using Real-Time qPCR An assessment of differential expression levels was attempted for 4 genes using custom mRNA Taqman expression assays (Applied Biosystems): ALDH1A3 (Hs001 67476_m1), ITGBL1 (Hs00191224_m1), NFE2L3 (Hs00 852569_g1) and SEMA4B (Hs00384240_m1). Fresh frozen pancreatic tissue samples (10 normal and 10 tumor derived) were obtained from the Mayo Clinic in Rochester MN (approved by the Institutional Review Board of the Mayo Clinic). All tumors were from patients diagnosed as adenocarcinoma of the pancreas, with tumor percentages ≥70%; all normal derived samples were adjacent to tumors (unmatched to tumors), confirmed by a pathologist to be normal by pathology and with ≥80% epithelial component. Eleven additional pancreatic cancer cell lines were purchased from ATCC and cultured according to their guidelines: AsPC-1, BxPC-3, Capan-1, Capan-2, CFPAC-1, Hs 766T, HPAF-II, MIA PaCa-2, Panc 10.05, SU.86.86 and SW 1990. An additional 39 pancreatic cancer cell lines (see Additional file 3: Table S6 for names of cell lines and references) were kindly provided by Dr. Udo Rudloff, Surgery Branch, NCI, NIH, and maintained in DMEM with 10% FBS. Frozen tissue sections (10 μm cut in at −20°C in a Cryostat) and cell lines (log phase) were harvested for RNA isolation as described above. Total RNA was reverse transcribed using Superscript III first-strand synthesis kit (Invitrogen). RT-qPCR was performed in triplicates using Taqman gene expression Master Mix (Applied Biosystems) and a 7900HT sequence detecting system (Applied Biosystems). The Cy 0 method was used to obtain the best fit estimators of reaction parameters for real-time PCR data [29]. The ΔΔCt method was used to calculate expression values by normalizing to the geometric mean of two housekeeping genes (B2M #Hs00187842_m1 and PPIA #Hs9999990 4_m1; Applied Biosystems).

Results
Using a genome-wide approach, we have assessed global gene expression profiles and regulatory elements in two cell lines derived from normal pancreatic ducts (hTERT-HPNE) and pancreatic ductal adenocarcinoma (PANC-1) by massively parallel sequencing-by-synthesis.
Transcriptome analysis of pancreatic cell lines by mRNAseq and miRNA-seq The total number of paired-end mRNA-sequence reads generated for each cell line is listed in Additional file 1: Table S1A. We constructed our mRNA-seq analysis pipeline to align reads sequentially to the RefSeq, Ensembl and Human Genome (NCBI build 37) databases using BWA [15] (Additional file 2: Figure S1). By comparing the cancer derived PANC-1 cells to the normal pancreatic tissue derived hTERT-HPNE cells, we identified 1,983 differentially expressed mRNA genes at a threshold of 3-fold difference or greater (Additional file 4: Table S2). Of these genes, 971 were expressed at higher levels and 1,012 at lower levels in the cancer derived cells compared to the normal derived pancreatic cells. Examples of mRNA genes that were expressed at higher levels in the PANC-1 cell line include growth factors, receptors, and signaling molecules such as SHH, PDGFB, SMAD3, AKT2 and multiple WNT genes (WNT6, WNT7B, WNT9A and WNT10A). Genes expressed at lower levels in the PANC-1 cells include CDKN1A, CDKN2B, HHIP, PDGFRA, PDGFRB and MMP1. In addition, many genes encoding extracellular matrix (ECM) proteins and their receptors (integrins) were differentially expressed in the two cell lines.
The total number of single-end miRNA sequence reads is listed in Additional file 1: Table S1B. We aligned reads to miRNA genes in the miRBase mature database, using miRNAKey [19]. Of the 258 miRNA genes expressed at 1 RPKM or higher, 128 were expressed at higher levels and 35 at lower levels in the cancer derived PANC-1 as compared to normal derived hTERT-HPNE cell line (Additional file 5: Table S3). Differential miRNA expression levels included higher levels of MIR767, MIR135B, MIR1269, MIR182, MIR183, and MIR203 and lower levels of MIR494, MIR424, MIR381, MIR452, and MIR155 in PANC-1 cells.
We used KEGG [21] and GO [22] enrichment analysis to categorize differentially expressed mRNA and miRNA genes into biologically relevant pathways and processes. The fifteen significantly enriched KEGG pathway categories included: pathways in cancer, focal adhesion, ECM-receptor interaction, Wnt signaling and cell adhesion molecules (Table 1). An enrichment of genes related to cardiomyopathy was also noted, but the list of genes overlapped considerably with two other pathways: focal adhesion and ECM-receptor interaction. Similarly, we detected significant enrichment of 56 GO biological process categories including: signal transduction, cell differentiation and cell adhesion (Additional file 6: Table  S4). We also detected 25 GO molecular function categories including: signal transducer activity, receptor activity, transcription factor activity, calcium ion binding, actin binding and growth factor activity (Additional file 7: Table S5).

Epigenome analysis of normal and cancer derived pancreatic cell lines by ChIP-seq
To evaluate genome-wide distribution of epigenetic marks that associate with active or repressed expression, we performed ChIP-seq in the PANC-1 and hTERT-HPNE cell lines using antibodies specific for three histone modification marks: H3K4me1, H3K4me3 and H3K27me3, and for RNA Pol II. Reads (Additional file 1: Table S1C) were aligned to the human genome (NCBI build 37) using Bowtie [24]. Binding sites for histone modification marks and RNA Pol II were identified by comparing immunoprecipitated chromatin with input DNA using SICER [25]. The total number of all binding sites for H3K4me1, H3K4me3, H3K27me3, and RNA Pol II identified in each cell line at a FDR of ≤0.001 was   Table S1C). The most prominent binding sites in promoters (up to 3 kb upstream of known transcription start sites) were seen for H3K4me3 (26.6% of all sites in the hTERT-HPNE cell line and 27.3% in the PANC-1 cell line), in accord with its role in activation of gene expression. A similar pattern of binding sites was seen for H3K4me3 and RNA Pol II in the two cell lines, particularly in exons (range 2.7-3.1%), untranslated regions (range 2.8-4.0%) and introns (41.2-41.4% for RNA Pol II and 49.2-51.9% for H3K4me3). Similar overall binding patterns were noted for H3K4me1 and H3K27me3 in both cell lines with regard to promoters (4.1-7.6%), exons (0.9-1.8%) and untranslated regions (0.7-1.4%). A greater number of intronic binding sites were seen for H3K4me1 (54.3-56.7%) as compared to H3K27me3 (36.8-38.1%), and fewer sites were noted in distal intergenic regions for H3K4me1 (28.9-37.6%) as compared to H3K27me3 (50.5-51.5%) (Additional file 8: Figure S2).

Integration of ChIP-seq and mRNA-seq
To correlate enrichment for histone modification marks to gene expression, we divided genes into four equal subsets based on digital gene expression levels (in RPKM) from highest to lowest. We then summarized sequence reads from ChIP-seq relative to transcription start sites (TSS) (+/− 5,000 bps) for all RefSeq genes ( Figure 1 for the PANC-1 cell line and Additional file 9: Figure S3 for the hTERT-HPNE cell line) to assess their patterns in promoter elements of genes expressed at different levels. Read density for H3K4me1 and H3K4me3 was increased surrounding the TSS of highly expressed genes in both cell lines, especially for the latter histone mark. The distribution of reads is bimodal, with an increased number of reads upstream and downstream of the TSS and a valley in-between. The peaks are located at approximately 600-800 bps upstream and 1,200-1,400 bps downstream of the TSS for H3K4me1 and at approximately 300-400 bps up-and downstream of the TSS for H3K4me3 with the latter being more prominent.   RNA Pol II binding sites were prominent over highly expressed genes with a peak of greatest density right around the TSS. Genes expressed at lower levels gradually lost binding of histones modified at lysine 4 and RNA Pol II (Figure 1, Figure 2, Additional file 9: Figure S3). In contrast, these genes are characterized by a relatively high density of reads for H3K27me3 with the greatest enrichment at approximately 200-300 bps downstream of the TSS (Figure 2A and B). Notably, when comparing epigenetic marks across the two cell lines, the greatest fraction of peaks enriched ≥3 fold in the PANC-1 cells was seen for RNA Pol II (71.8% increased, 28.2% decreased) and the greatest number of peaks decreased ≥3 fold in the PANC-1 cells was seen for the negative H3K27me3 histone modification mark (81.0% decreased, 19.0% increased), implying that the cancer cells may be better poised for active transcription (Table 2A). This appears to happen on a genome wide level and can be seen in Figure 2 where H3K27me3 sequence tags "dip" around the TSS of highly expressed genes in the PANC-1 cell line (Figure 2A) but not the hTERT-HPNE line ( Figure 2B). To verify that this differential enrichment in RNA Pol II and H3K27me3 binding sites was not caused by a bias in read mapping due to differences in DNA copy number, we genotyped the cell lines using the Illumina Human Hap 1M chip. We noted chromosomal alterations in the PANC-1 cells but not in the hTERT-HPNE cells. To assess mapping bias, we compared binding sites for RNA Pol II and H3K27me3 in genomic regions in the PANC-1 cells showing either gain or copy neutral loss of heterozygosity (CNLOH) relative to copy neutral genomic regions. We did not observe statistically significant differences in the distribution of regions enriched for RNA Pol II binding in regions showing copy number gain (P=0.52) or CNLOH (P=0.11) as compared to copy neutral genomic regions. Similarly, we did not observe differences in the distribution of regions enriched for H3K27me3 binding in regions showing gain (P=0.61) or CNLOH (P=0.89) as compared to copy neutral genomic regions, suggesting that the observed differences in epigenetic marks in the two cell lines are not due to a difference in copy number alterations.

Overexpression of ALDH1A3 in pancreatic cancer cell lines and tumors
We used the transcriptome and epigenome data described above to select genes for assessment of expression differences in independent sets of normal and tumor derived pancreatic tissue samples and cell lines. We based our selection of genes on the strong differences in RNA Pol II and H3K27me3 binding sites in the two cell lines by applying the following criteria: 1.) located within 5 kb of a region enriched ≥ 3 fold for RNA Pol II binding in the PANC-1 cells, 2.) located within 5 kb of a region enriched ≥ 3 fold for H3K27me3 binding in the hTERT-HPNE cells, and 3.) showed ≥ 3 fold increased expression in the PANC-1 as compared to Table 2 Characteristics of epigenetic marks that were increased or decreased over 3 fold in the PANC-1 as compared to the hTERT-HPNE cell line A. Epigenetic marks that were enriched or depleted in PANC-1 as compared to hTERT-HPNE pancreatic cells  hTERT-HPNE cells. This resulted in a total of 60 genes (labeled in bold in Additional file 4: Table S2). We then selected 4 genes from this list based on varying levels of expression differences and location in genomic regions that either showed copy-number gain (ALDH1A3, SEMA4B) or did not (ITGBL1, NFE2L3) in PANC-1 cells. We first assessed gene expression by RT-qPCR in 10 normal and 10 tumor derived pancreatic tissue samples as well as in 11 additional pancreatic cancer cell lines. By selecting differentially expressed genes that were also enriched in positive marks and depleted in negative marks we confirmed differential expression of one of the four genes, ALDH1A3, in this set of tumor and normal derived pancreatic tissues and cell lines (Figure 3, panels A and B). This gene was expressed on average 1.93 fold higher in the tumor derived samples as compared to the normal derived pancreatic tissue samples (P Mann-Whitney U test =0.034), and 3 of the 10 tumor samples had over 3 fold higher expression than the average of the normal samples (range 3.32-4.10). The ALDH1A3 gene was expressed on average 1,308 fold higher in the tumor cell lines as compared to the normal derived hTERT-HPNE cells, and 23.5 fold higher in the tumor cell lines as compared to the average of the normal derived tissue samples. The difference between ALDH1A3 expression levels in the tumor derived cell lines as compared to the tissue samples may be due to the higher degree of cellular heterogeneity in the latter set. Interestingly, by analyzing copy number alterations, we noted that its expression is increased in cell lines showing copy number gain of the genomic region it resides in (PANC-1, SU.86.86 and BxPC-3) as well as in cell lines without gain (Hs 766T, AsPC-1, Capan-1, SW 1990 and CFPAC-1) as shown in Figure 3B. We confirmed differential expression for the three other genes in the hTERT-HPNE and PANC-1 cell lines by RT-qPCR, but did not observe significantly differential expression in the larger set of cell lines and tissue samples. A second validation in an independent set of 39 additional pancreatic cancer cell lines (Additional file 3: Table S6) showed that ALDH1A3 was expressed on average 1,630 fold higher in the tumor cell lines as compared to hTERT-HPNE cells, and 78.5 fold higher in the tumor cell lines as compared to the average of the normal derived tissue samples ( Figure 3C). Increased RNA Pol II binding was seen in the promoter region of ALDH1A3 (12.39 fold), and decreased H3K27me3 binding (0.23 fold) in a region overlapping and surrounding exons 7 and 8, in the PANC-1 as compared to the hTERT-HPNE cells (Table 2B). In addition, binding of the negative mark H3K27me3 featured prominently over the first half of the gene in hTERT-HPNE cells (Additional file 10: Figure S4).

Discussion
Generating global datasets of transcribed sequences and epigenetic marks can form a foundation for genomic and functional investigations of the molecular mechanisms underlying specific cancers. Such datasets have been generated for many tissue types as part of the ENCODE project [30]. To begin to establish genomic datasets for the pancreas we analyzed expressed sequences and signatures of functional elements in normal and pancreatic cancer derived cell lines by massively parallel sequencing. We used a diploid cell line derived from normal pancreatic ducts and immortalized with the TERT gene, hTERT-HPNE, and a commonly used hypertriploid pancreatic ductal carcinoma derived cell line, PANC-1, for our studies. In this process, we identified differentially expressed genes that define pathways and cellular processes important for cancer, such as growth factor and receptor activity, signal transduction, focal adhesion, ECM-receptor interaction and cell differentiation. Enrichment of genes in these pathways reiterates the increased growth potential and diminished adhesive properties associated with the development and progression of pancreatic cancer. Many of the differentially expressed genes encode ligands, receptors and signaling molecules in the wingless and sonic hedgehog signaling pathways, implicated in embryonic development and particular cancers, including that of the pancreas [31][32][33]. Genes important for focal adhesion and extracellular matrix receptor interaction were also prominent among differentially expressed genes featuring numerous integrin, collagen and laminin subtypes possibly indicating an underlying potential for altered proliferation and interaction with the microenvironment of the tumor derived cell line in an in vivo setting. MicroRNA expression has not been assessed before in pancreatic cancer by next generation sequencing methods. Our results in the two cell lines agree with previously published data on the upregulation of MIR93, MIR95, MIR135B, MIR181C, MIR181D, MIR182, MIR183, MIR190, MIR196B and MIR203, and downregulation of MIR20A and MIR29C in pancreatic intraepithelial neoplasms (PanIns) or pancreatic adenocarcinomas (PDACs) as compared to normal pancreatic tissue [34][35][36]. In agreement, we noted substantially higher (from 6 to over 200 fold) expression of the former group of genes and lower (3-18 fold) expression of the latter group in our dataset. MIR135B, MIR182 and MIR183 are overexpressed in a wide range of other cancer types such as bladder, colon, prostate cancer and glioma [37][38][39][40][41]. In addition, we noted differential expression of miRNA genes not previously reported to be deregulated in cancer. Some of the largest differences were observed for MIR767 and MIR1269, both absent in the hTERT-HPNE cells but highly expressed in the cancer derived PANC-1 cells. The function of MIR767 and MIR1269 is largely unknown although the latter may be important during differentiation of human embryonic  stem cells [42]. Our results confirm previous findings suggesting that altered miRNA expression contributes to cancerous growth in the pancreas and provide new candidate miRNA genes that require validation in larger sample sets of normal and cancerous pancreatic tissues. Epigenomic assessment of histone modification patterns revealed an enrichment of mono-and trimethylated H3K4 in promoters that was highest in actively transcribed genes and gradually lost with lower expression. The transcriptional start site of actively transcribed genes was occupied by RNA polymerase, surrounded by a layer of nucleosomes with trimethylation at H3K4 and an outer core of nucleosomes with mono-methylation at H3K4. Trimethylation at H3K27 showed an inverse correlation with gene expression. This pattern of modification marks is in accordance with the activating and silencing effects associated with these histone modifications in other cell types [43,44]. Notably, an overall depletion of H3K27me3 marks and enrichment of RNA Pol II sites was seen in the cancer derived cell line, as compared to the normal derived cell line, indicating a possible mechanism of general alleviation of negative epigenetic regulation in pancreatic cancer. By focusing on this apparent difference between epigenetic marks in the two cell lines when selecting genes for further examination, we confirmed higher levels of ALDH1A3 gene expression in an independent set of tumor derived pancreatic tissues and cell lines in comparison to normal derived pancreatic samples. Of note is that of the genes selected for further assessment, all four showed very similar results in the initial set of cell lines using RT-qPCR, whereas only one showed a similar expression difference in the additional sample sets analyzed. This is probably due to the small initial sample set. Validation in a second larger set of pancreatic cancer cell lines confirmed higher expression levels of ALDH1A3 as compared to the normal derived cell line and normal derived tissue samples. A limitation of our study is the small number of samples, especially for normal derived pancreatic cell lines and for normal and tumor derived tissue samples.
A general function for members of the aldehyde dehydrogenase (ALDH) superfamily is to serve as detoxification enzymes by converting aldehydes to carboxyl acids (for review see [45]). The different ALDH subfamilies and their members have additional catalytic functions; the ALDH1A family is responsible for the conversion of retinaldehyde to retinoic acid (RA) and its three members (ALDH1A1 through ALD1A3) are thus important regulators of RA signaling. Increased aldehyde dehydrogenase activity is commonly seen in many cancer types where it appears to specifically mark cancer stem cells [46,47]. These cells tend to be highly tumorigenic and invasive, and are often resistant to conventional therapy [48]. The presence of aldehyde dehydrogenase positive cells in tumor from patients with pancreatic cancer has been associated with decreased survival, and ALDH1A1 expression (ALDH1A3 was not tested) correlated with invasion of pancreatic cancer cell lines [49]. Of the 19 ALDH isoforms, ALDH1A1 has most often been linked to the increased ALDH activity of cancer stem cells [50]. However, ALDH1A3 expression has been shown to correlate better with aldehyde activity in breast cancer stem cells than ALDH1A1 expression, and correlate with tumor grade, stage and metastasis [51]. Although we confirmed increased mRNA levels of ALDH1A3 in an independent set of pancreatic tissue samples and two series of cell lines, our finding should be validated in additional sample sets to thoroughly investigate the relationship between ADH1A3, cancer stem cells and pancreatic cancer.

Conclusion
The high resolution and read depth of next generation sequencing has allowed us to accurately quantify gene expression and locate epigenetic marks, thus enabling a detailed analysis of the pancreatic cancer genome. Although limited by a small set of samples, this study has thoroughly catalogued the transcriptome and epigenome of two pancreatic cell lines, confirmed previous genes and pathways deregulated in pancreatic cancer and generated a series of hypotheses worth following-up in future studies to investigate key events in pancreatic carcinogenesis.

Additional files
Additional file 1: Table S1. (See figure on previous page.) Figure 3 Expression of the ALDH1A3 gene in pancreatic cell lines and tissue samples. Expression levels for ALDH1A3 were determined by RT-qPCR in (A) normal derived pancreatic tissue samples (N-1 though N-10) and tumor derived pancreatic tissue samples (T1 through T10), as well as (B) in pancreatic normal (n=1) and tumor (n=12) derived cell lines. Expression levels were normalized to the geometric mean of B2M and PPIA and are shown in arbitrary units (A.U.). Chromosomal gain of the ALDH1A3 gene region is indicated below panel (B) as G: copy number gain, N: no gain. Copy number alterations were not assessed for cell lines marked with "-". (C) ALDH1A3 expression levels were assessed in additional pancreatic cancer cell lines (n=39) by RT-qPCR.