A systematic comparison and evaluation of high density exon arrays and RNA-seq technology used to unravel the peripheral blood transcriptome of sickle cell disease
© Raghavachari et al.; licensee BioMed Central Ltd. 2012
Received: 23 November 2011
Accepted: 29 June 2012
Published: 29 June 2012
Transcriptomic studies in clinical research are essential tools for deciphering the functional elements of the genome and unraveling underlying disease mechanisms. Various technologies have been developed to deduce and quantify the transcriptome including hybridization and sequencing-based approaches. Recently, high density exon microarrays have been successfully employed for detecting differentially expressed genes and alternative splicing events for biomarker discovery and disease diagnostics. The field of transcriptomics is currently being revolutionized by high throughput DNA sequencing methodologies to map, characterize, and quantify the transcriptome.
In an effort to understand the merits and limitations of each of these tools, we undertook a study of the transcriptome in sickle cell disease, a monogenic disease comparing the Affymetrix Human Exon 1.0 ST microarray (Exon array) and Illumina’s deep sequencing technology (RNA-seq) on whole blood clinical specimens.
Analysis indicated a strong concordance (R = 0.64) between Exon array and RNA-seq data at both gene level and exon level transcript expression. The magnitude of differential expression was found to be generally higher in RNA-seq than in the Exon microarrays. We also demonstrate for the first time the ability of RNA-seq technology to discover novel transcript variants and differential expression in previously unannotated genomic regions in sickle cell disease. In addition to detecting expression level changes, RNA-seq technology was also able to identify sequence variation in the expressed transcripts.
Our findings suggest that microarrays remain useful and accurate for transcriptomic analysis of clinical samples with low input requirements, while RNA-seq technology complements and extends microarray measurements for novel discoveries.
KeywordsSickle cell disease RNA-Seq Exon arrays Transcriptome Clinical genomics
With the completion of the human genome project, the monitoring of changes in the entire human transcriptome is an increasingly attractive method for dissecting the molecular basis of disease processes [1, 2]. In this regard, the ability to utilize a patient’s transcriptome to detect the onset of disease, monitor its progression, and even to suggest treatment modalities with the highest probability of success would greatly enhance the quality of medical care and treatment [3–6]. Peripheral whole blood is a nucleic acid-rich and inflammatory cell-rich information reservoir. Analytical methodologies to detect transcriptomic changes in these cells may reveal novel biomarkers for disease diagnosis and treatment.
Until recently, high throughput microarray technologies have been the method of choice in clinical studies with limited amounts of RNA from blood samples, biopsy specimens, or enriched cell populations, to obtain gene expression profiles. The high density Affymetrix Human Exon 1.0 ST array (Exon array) with 1.2 million probesets targeting every known and predicted exon in the entire genome has been successfully employed in clinical investigations for obtaining gene expression profiles and associated alternative splicing events in disease processes [7–9]. Despite these successes, inherent limitations in the dynamic range of arrays and the lack of complete coverage for detecting alternative splicing events have constrained the application of this technology.
With the advent of next-generation sequencing technologies, RNA-seq has emerged as a powerful tool for transcriptome analysis [10–12]. By mapping millions of RNA-seq reads to individual transcripts, estimation of expression levels of individual exons and whole transcripts can be performed. It is likely that the microarray-based (analog) gene-expression profiling technology will be replaced by digital sequencing based gene-expression profiling (RNA-seq) [8, 13, 14]. The purported advantages of RNA-seq include generation of expression data for individual annotated genes with nearly unlimited dynamic range; ability to comprehensively detect novel transcripts and mRNA variants resulting from alternative promoter usages, splicing, polyadenylation and sequence variation; and lowered background. However, the technology also brings with it new issues such as the requirement for large amounts of starting material, cumbersome library preparation; novel systematic biases during sample preparation and sequencing that must be accounted for when analyzing the data. Additionally, data processing of multireads and splice junctions have been problematic when mapping the sequences back to a genome.
Considering the merits and limitations of each of the technologies, we undertook the current study using complex whole blood specimens from patients with sickle cell disease (SCD) and matched healthy controls to address the following major questions:
(1). How do the technologies compare with regard to sensitivity, specificity, and variability of gene expression data? (2). Do the differentially expressed transcripts and alternatively spliced exons in SCD correlate well between RNA-seq and Exon arrays? (3). Does the abundant expression of globin transcripts in SCD interfere with RNA-seq analysis? (4). Are we able to discover novel differentially expressed transcripts in SCD using RNA-seq? (5). Can we detect sequence variants in the expressed transcripts?
Although previous comparative studies [15, 16] have reported the advantages of RNA-Seq and microarrays, to our knowledge, our study is the first to use a human monogenic disease model to compare RNA-Seq and microarrays. We report here our observations on the SCD transcriptome from RNA-seq and Exon arrays with the belief that our findings will be useful to clinical investigators in choosing the appropriate genomic technique for understanding molecular mechanism of diseases for diagnosis and the development of novel therapeutics.
The study was approved by the National Institute of Diabetes and Digestive and Kidney Diseases institutional review board (NIH protocol 03-CC-0015) and written informed consents were obtained from study participants. Patients selected for this study included patients (n = 6) with sickle cell disease of mean age 41.6 ± 10.1 and the control group (n = 4) of self-identified African American subjects of mean age 42.2 ± 8.9, without sickle cell disease. The biosamples were collected in steady state condition and none of the individuals was receiving anti-platelet medication.
Isolation of RNA from whole blood specimens
Blood specimens (2.5 ml) collected in PAXgene™ tubes from each subject were incubated at room temperature for 4 h for RNA stabilization and then stored at − 80 °C. RNA was extracted from whole blood using the PAXgene™ Blood RNA System Kit following the manufacturer's guidelines. Briefly, samples were removed from −80 °C and incubated at room temperature for 2 hours to ensure complete lysis. Following lysis, the tubes were centrifuged for 10 min at 5,000 × g the supernatant was discarded and 500 μL of RNase-free water added to the pellet. The tube was vortexed thoroughly to re-suspend the pellet, centrifuged for 10 min at 5000 × g and the entire supernatant was discarded. The pellet was re-suspended in 360 μL of buffer BR1 by vortexing and further purification of RNA was done following the manufacturer's protocol with on-column DNase digestion. Quality of the purified RNA from was verified on an Agilent® 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA); RNA concentrations were determined using a NanoDrop® ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE).
Total RNA from six SCD patients and four healthy controls were used for detailed analysis on RNA-seq and exon array platforms. Data analysis was carried out on 6 SCD and 3 healthy controls after removing a control sample which was identified to be an outlier based on principal component analysis (PCA) of the transformed data from RNA-seq.
Depletion of globin transcripts
Globin mRNA was depleted from a portion of each total RNA sample isolated from PAXgene tubes using the GLOBINclear™-Human kit (Ambion, Austin, TX). In brief 2 μg of total RNA from human whole blood was mixed with a biotinylated Capture Oligo Mix in hybridization buffer. The mixture was incubated for 15 minutes to allow the biotinylated oligonucleotides to hybridize with the globin mRNA species. Streptavidin magnetic beads were then added, to capture the globin mRNA and the magnetic beads were then pulled to the side of the tube with a magnet and the RNA, depleted of the globin mRNA, was transferred to a fresh tube. The RNA was further purified using a rapid magnetic bead-based purification method .
Preparation of biotinylated cDNA targets for exon array hybridizations in GCAS
50 ng RNA samples were amplified using the WT-Ovation™ Pico RNA Amplification System (NuGEN, San Carlos, CA) in a 3 step process as recommended by the manufacturer. All processes were performed in an automated manner using the genechip array station (GCAS). In brief, first strand cDNA was prepared using a unique first strand DNA/RNA chimeric primer mix and reverse transcriptase. In the second step, DNA/RNA Heteroduplex Double Stranded cDNA was generated which serves as the substrate for SPIA amplification - a linear isothermal DNA amplification process developed by NuGEN. RNase H degrades the RNA in the DNA/RNA heteroduplex at the 5´ end of the first cDNA strand exposing part of the unique priming site for the initiation of the next round of cDNA synthesis. The process of SPIA DNA/RNA primer binding, DNA replication, strand displacement and RNA cleavage is repeated, resulting in rapid accumulation of SPIA cDNA. Following amplification and purification, 3 μg of the amplified cDNA were processed with the WT-Ovation Exon Module to produce sense strand ST-cDNA. Following purification and quantitation, 5 μg ST-cDNA was fragmented and labeled with the FL-Ovation™ cDNA Biotin Module using a proprietary two-step fragmentation and labeling process. The first step is a combined chemical and enzymatic fragmentation process that yields single-stranded cDNA products in the 50 to 100 base range. In the second step, this fragmented product is labeled via enzymatic attachment of a biotin-labeled nucleotide to the 3-hydroxyl end of the fragmented cDNA generated in the first step. Hybridization, washing and laser scanning of Affymetrix Human Exon 1.0 ST microarrays were performed according to the manufacturer’s protocol (Affymetrix, Santa Clara, CA). Hybridization was performed at 45 °C overnight, followed by washing and staining using FS450 fluidics station. Scanning was carried out using the 7 G GCS3000 scanner.
Microarray data collection and annotation
Exon-level core RMA-sketch intensity values for each of the chips were collected using Affymetrix Expression Console (EC) Software (Affymetrix, Santa Clara, CA). The 284,258 core probesets were annotated using the Affymetrix annotation file from Netaffx (http://www.netaffx.com, HuEx-1_0-st-v2.na29.hg18.probeset.csv).
Analysis of exon arrays
to the data.
In the above formula, y ijk is the log 2 expression intensity, i is treatment, j is replicate within treatment and k indexes exons. The μ is the mean and the two fixed factors are the treatment effect (A) and exon effect (C). The random factor (β) is the sample within treatment effect and (ϵ) is the error. The fixed interaction between treatment and exon (AC) models the alternative splicing event. In this study, the treatment effect is sickle cell or control. The significance of a detected alternatively spliced event is denoted p-AC. Alternatively spliced events are declared if p-AC < = 10^-8 and the maximum absolute interaction effect (maxik|ACik|) is greater than or equal to 2. A p-AC < = 10^-8 corresponds to less than 1% false discovery rate (FDR) using the method of Benjamini and Hochberg .
Library construction for RNA-seq
High quality total RNA at 1.5 μg was used for analysis on the Illumina GAII analyzer on six SCD samples and four healthy controls. cDNA library preparation and sequencing reactions were carried out using Illumina library prep, clustering and sequencing reagents following the manufacturer's recommendations (http://www.illumina.com). Briefly, mRNAs were purified using poly-T oligo-attached magnetic beads and then fragmented. The first and the second strand cDNAs were synthesized and end repaired. Adaptors were ligated after adenylation at the 3'-ends. After gel purification, cDNA templates were enriched by PCR. cDNA libraries were validated using a High Sensitivity Chip on the Agilent2100 Bioanalyzer™ (Agilent Technologies, Palo Alto, CA). The samples were clustered on a flow cell using the cBOT. After clustering, the samples were loaded on the Illumina GA-II machine. The samples were sequenced using a single lane with 36 cycles. Initial base calling and quality filtering of the Illumina GA-II image data were performed using the default parameters of the Illumina GA Pipeline GERALD stage (http://www.illumina.com).
Mapping and analysis of RNA-seq data
The raw data Fastq sequence files obtained from GAII were mapped to the human genome (build HG18) to get genomic addresses using Bowtie/Tophat  allowing up to two mismatches. Reads that mapped to more than 10 locations were discarded. We obtained ~15.1 million reads per sample. We mapped reads both to exons of known RefSeq transcripts (human genome build 18) and to Affymetrix probe selection region coordinates. Reads mapped to Refseq exons and to Affymetrix probeset selection regions were counted using the CoverageBed method in BedTools . Reads were counted for exons within each RefSeq transcript. In order to compare RNA-seq data fairly to the Exon microarray, we counted reads mapped to each probeset selection region (or probeset) within each exon.
Transcript cluster level reads were counted per probeset within each transcript cluster. Very low count transcript clusters (fewer than 6 samples with 6 or more counts) were ignored. This filtered out a total of 7,146 transcript clusters leaving 11,562 for further statistical analysis. In order to normalize the data, transcript cluster counts were divided by the median transcript cluster count for that sample, and logarithm base 10 transformed, yielding transformed, normalized counts. After principal component analysis (PCA) of the transformed data, one outlier, a control sample was detected and disregarded from further analysis leaving 9 samples. A two sample t-test (sickle cell N1 = 6, control N2 = 3) was used on the normalized, transformed data to test for differential expression. Alternative splicing analysis was computed using the ExonANOVA as with the microarrays data. A conservative and reasonable background limit at 4.5 RMA units was applied for Exon arrays and 1.0 in RPKM units as recommended by Mortazavi et al.  was applied for RNA_seq. The RMA level of 4.5 is slightly above the median RMA for detected exons (Affymetrix DABG value p < 0.01, Affymetrix’s recommendation for a conservative detection of exons).
In order to identify novel transcription events, we counted reads mapping to each 200 base pair region of the genome. Only populated bins (5 or more samples had 6 or more reads) bins were considered further. This filter retained 187,764 bins for analysis. We disregarded bins that fell within annotated RefSeq exons. The remaining 48,462 bins, describe novel, unannotated transcription events. To normalize these data, counts were divided by the 90th percentile of counts for that sample and base 10 logarithm transformed. p-values were required to be 0.0005 or less (corresponding to FDR < 0.15), and the fold-change was required to be 2-fold or greater. If differential expression was found, it was classified as a novel transcript.
Real time Q-PCR analysis
First-strand cDNA was synthesized using 500 ng of RNA and random primers in a 20 μl reverse transcriptase reaction mixture using Invitrogen’s Superscript cDNA synthesis kit (Invitrogen, Carlsbad, CA) following the manufacturer’s directions. Quantitative real-time PCR assays were carried out on 11 differentially expressed genes from both the platforms with the use of gene-specific double fluorescently labeled probes in a 7900 Sequence Detector (PE Applied Biosystems, Norwalk, CT). Probes and primers were obtained from Applied Biosystems. In brief, PCR amplification was performed in a 384 well plate with a 20-μl reaction mixture containing 300 nm of each primer, 200 nm probe, 200 nm dNTP in 1x real time PCR buffer and passive reference (ROX) fluorochrome. The thermal cycling conditions were 2 min at 50° C and 10 min at 95° C, followed by 40 cycles of 15 sec denaturation at 95° C and 1 min annealing and extension at 60° C. Samples were analyzed in duplicate and the Ct values obtained were normalized to the housekeeping gene ß actin. The comparative CT (ΔΔ CT) method  which compares the differences in CT values between groups was used to achieve the relative fold change in gene expression between SCD and Controls.
Principal component analysis
Evaluation of dynamic range, within platform reproducibility - coefficient of variation and sensitivity
With the expectation that deeper sequencing and large amounts of starting material are needed to adequately cover low abundance transcripts, we tested the sensitivity of each of the platforms by examining the expression of transcripts above background. With the application of conservative and reasonable background limits at 4.5 RMA units for Exon arrays and 1.0 in RPKM units for RNA_seq to one sample for a control subject C3 (Figure 2), we were able to detect 6% more transcripts (12,310/11,662 = 1.06) above background in exon arrays than in RNA-seq.
Differentially expressed genes in SCD
Selected Differentially Expressed Genes Grouped by Pathways of Interest
Apoptosis/cell cycle regulation
basal cell adhesion molecule
BMP2 inducible kinase
death-associated protein kinase 2
damage-specific DNA binding protein 1, 127 kDa
DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y
DEAD (Asp-Glu-Ala-Asp) box polypeptide 60
DEAH (Asp-Glu-Ala-His) box polypeptide 29
dedicator of cytokinesis 5
growth differentiation factor 15
G1 to S phase transition 1
acid phosphatase 1, soluble
acyl-CoA synthetase long-chain family member 6
acyl-CoA synthetase medium-chain family member 3
arachidonate 5-lipoxygenase-activating protein
Rho guanine nucleotide exchange factor (GEF) 12
ATPase, H + transporting, V1 subunit D
RAN binding protein 10
RAP1 GTPase activating protein
Ras association (RalGDS
regulator of G-protein signaling 2, 24 kDa
serpin peptidase inhibitor, clade G (C1 inhibitor)
serpin peptidase inhibitor, clade I (neuroserpin), 1
transcription factor Dp-1
transferrin receptor 2
transmembrane protein 86B
trafficking protein, kinesin binding 2
ubiquitin-conjugating enzyme E2O
Oxidative stress/antioxidants/Stress Response
aminolevulinate, delta-, synthase 2
carbonic anhydrase I
carbonic anhydrase II
glutathione peroxidase 1
nuclear receptor coactivator 4
nuclear factor I
ornithine aminotransferase (gyrate atrophy)
oxysterol binding protein 2
oxysterol binding protein-like 6
selenium binding protein 1
sorting nexin 4
sortilin-related receptor, L(DLR class) A repeats
transferrin receptor 2
basal cell adhesion molecule (Lutheran blood group)
chemokine (C-C motif) receptor-like 2
CD36 molecule (thrombospondin receptor)
CDC42 effector protein (Rho GTPase binding) 1
centrin, EF-hand protein, 2
CDGSH iron sulfur domain 2
major histocompatibility complex, DP alpha 1
major histocompatibility complex, class II, DR beta 5
interferon, alpha-inducible protein 27
interferon-induced protein 44
interferon-induced protein 44-like
interferon-induced protein with tetratricopeptide 1
interleukin 1, beta
interleukin 1 receptor, type II
interleukin 1 receptor accessory protein
interleukin 3 receptor, alpha (low affinity)
intersectin 1 (SH3 domain protein)
T-cell acute lymphocytic leukemia 1
TBC1 domain family, member 22B
t-complex 11 (mouse)-like 2
Red Cell genes
ankyrin 1, erythrocytic
ankyrin repeat domain 9
basigin (Ok blood group)
erythroid associated factor
erythroblast membrane-associated protein
glycophorin A (MNS blood group)
glycophorin B (MNS blood group)
hemoglobin, alpha 1
hemoglobin, epsilon 1
hemoglobin, gamma A
Kell blood group, metallo-endopeptidase
Kruppel-like factor 1 (erythroid)
chromosome 10 open reading frame 10
chromosome 12 open reading frame 29
chromosome 13 open reading frame 15
chromosome 14 open reading frame 45
chromosome 17 open reading frame 99
chromosome 1 open reading frame 105
chromosome 1 open reading frame 128
chromosome 2 open reading frame 24
chromosome 2 open reading frame 88
complement component 4B
complement component 4 binding protein, alpha
chromosome 4 open reading frame 18
chromosome 5 open reading frame 32
chromosome 5 open reading frame 4
chromosome 6 open reading frame 192
chromosome 7 open reading frame 54
Genes identified as differentially expressed in SCD were subjected to gene ontology enrichment analysis to determine their molecular functions. This analysis selected 10 highly significant functional pathways including cell cycle regulators, apoptosis, oxidative stress response, inflammation and immune response, free radical scavenging, protein modification, and hematopoiesis. (Additional file 1: Figure S1). Examination of these pathways suggests that these differentially regulated genes are consistent with the homeostatic response to known pathobiological stresses in SCD, including oxidative and hemolytic stress, vascular injury, and participation in repair. We also observed upregulated expression of several reticulocyte specific genes such as ankyrin1, erythroid associated factor, hemoglobins, nuclear associated factor, glycophorin, transferrin receptor, and selenium binding protein, as expected with prominent reticulocytosis in SCD, thereby validating the performance of the two technologies in identifying biological alterations in the sickle cell disease model.
Effect of globin reduction on RNA-seq data quality
Currently, RNA-seq requires enrichment steps to select Poly A RNA for library construction from total RNA. Since ribosomal RNA represents over 90% of the RNA within a given cell, studies have shown that its removal increases the sensitivity to retrieve data from the remaining portion of the transcriptome. In large clinical studies where whole blood PAXgene RNA is used, there is an additional interference by the high levels of globins in whole blood RNA. This is further complicated in hematologic diseases such as sickle cell where globins account for more than 70% of mRNA.
In order to determine if globins interfere with the sequence reads and affect the sensitivity of transcriptome analysis on RNA-seq platform, we reduced the globins on one sickle cell patient sample and compared the globin reduced and non-reduced samples on RNA-seq. Scatter plot analysis of the normalized transcript read counts for these two samples showed a high correlation (R = 0.93, Additional file 2: Figure S2). This suggests that the globin transcripts in the sickle cell sample do not affect the sequence reads and do not introduce much bias in the analysis.
Validation by QPCR analysis
Detection of alternatively spliced exons
Highly Significant Alternatively Spliced Genes
activating transcription factor 6 beta
Cell Death/Immune Response
coactivator-associated arginine methyltransferase 1
cyclin D-type binding-protein 1
cytochrome c oxidase subunit IV iso1
dynactin 2 (p50)
Hermansky-Pudlak syndrome 1
Inflammatory Response; Cell Signaling
insulin induced gene 1
Lipid metabolism, Molecular Transport
nudix type motif 4
Cell signaling; Hematopoiesis
nudix motif 4 pseudogene
Cell signaling; Hematopoiesis
Rh blood group, CcEe antigens
Agglutination of red blood cells
Rh blood group, D antigen
Agglutination of red blood cells
tenascin XA (pseudogene)
unc-13 homolog D
hemoglobin, alpha 1
Transport of oxygen
Discovery of novel transcripts to be differentially expressed in SCD
Highly Significant Novel Differentially Expressed 200 bp Regions
End- Base 200
Analysis for sequence variation in expressed transcripts
Whole transcriptome sequencing (RNA-seq) is a powerful transcriptional profiling technology using next generation sequencing platforms [4, 23–25] and has signaled a new age in clinical genomics. Several recent studies have indicated that RNA-seq will be more useful than the current microarray technology due to the increased dynamic range of signal of sequencing [4, 26–28] and its ability to identify the exact location of transcription boundaries at single base resolution. RNA-seq also provides the sequence information needed to identify single nucleotide variants, map variant transcription start sites, and detect novel transcript splicing. These features make RNA-seq particularly useful for studying complex transcriptomes, such as those found in the clinical blood samples.
Although attractive, clinical application of RNA-seq is feasible only if the tool can demonstrate high specificity, sensitivity and reproducibility with limited amount of starting material. Although current technology for transcriptome sequencing requires at least 100 ng total RNA (tens of thousands of cell equivalents), along with additional enrichment steps to select for poly(A) + RNA and/or to reduce the content of ribosomal RNA (rRNA) prior to NGS library construction, to minimize the loss of input material researchers tend to start with a minimum of 1 microgram total RNA. The utility of RNA-seq data generated is believed to be sensitive to read length, mapping and assembly of reads and statistical and computational challenges. However, with the current availability of substantially improved mapping software, these challenges are expected to be well tackled. Microarrays, on the other hand, are known to suffer from reduced dynamic range of signals due to saturation biases and high background, and non-specific or cross-hybridization resulting in false-positive signals, especially for transcripts that have low expression levels.
Considering these challenges inherent to each of the high throughput platforms, we undertook this comparative study in an attempt to better understand the relative merits of high density exon microarrays and RNA-seq for biomarker discovery in the clinical setting. We chose the sickle cell disease because strong differential expression has been previously observed, and the phenotype of the disease is in manifested in blood, an accessible tissue for study. Sampling whole blood, a globin abundant tissue, allowed us to examine the potential interference of high abundant globin transcripts during sequencing and also to potentially discover novel genes that are associated with sickle cell disease from cell types such as nucleated red cells in addition to the conventional peripheral blood mononuclear cells. We believe that this is the first study that has compared the 2 platforms on a monogenic human disease model using easily accessible whole blood clinical specimens mimicking a large scale clinical research project.
In this study we used 50 ng of total RNA on exon arrays without any globin reduction or poly(A) + enrichment to identify differentially expressed genes and alternative splicing events in sickle cell disease. The same samples were also analyzed by RNA-seq using 1.5 micrograms of total RNA. RNA-seq analysis generated an average of 83% mappable reads from the whole blood samples after poly(A) + enrichment. The globin reduction process had insignificant effect on the mappable read count, even for sickle cells samples which have high levels of globin RNA suggetsing that it is not necessary to reduce the globins while analyzing whole blood samples by RNA-seq.
As expected, comparison of the dynamic range of the two platforms confirmed that RNA-seq has a dramatically larger range varying from 4 to 16 fold. This enticing feature of RNA-seq effectively removes the saturation biases inherent to the array platform. Examination of the technical reproducibility and coefficient of variation of the array and RNA-seq platforms at the gene and exon levels, as a function of the mean expression level, indicated that the coefficient of variation for microarray is much lower than that for RNA-seq and is also independent of the expression level for each transcript, suggesting that technical variability within group is higher in the RNA-seq platform than the arrays. A similar observation has been reported by Marioni et al. . They observed extremely high CVs when the read counts were low, a domain where Poisson counting error dominates in RNA-seq. In this domain, microarray produces moderately low CV (20%), suggesting that microarray may in fact be more effective at detecting expression changes for low-abundance genes.
Our comparative analysis of detection sensitivity with material from clinical samples revealed that even with the usage of 30 times less starting material (50 ng vs 1.5 micrograms) Exon arrays could detect as many transcripts above background as in RNA-seq. It should be noted that the sequencing depth in this study (~10 million reads) is comparable with most published RNA-seq studies. Xu and others  from their comparative study using GG Exon arrays to RNA-seq reported that although both platforms detect similar expression changes at the gene level, the Exon array is more sensitive at the exon level and deeper sequencing is required to adequately cover low abundance transcripts . It should be mentioned here that with the latest much improved sequencing instruments, it would be easier to generate ~ 80 million reads and this would substantially increase the sensitivity of detection in RNA-seq platform.
We found 331 transcripts with differentially expressed transcripts in SCD. These included genes involved in pathways related to sickle cell disease such as inflammatory response, oxidoreductase pathways, stress response, cell signaling and apoptosis. Of these 331 transcripts which showed a high degree of correlation (R = 0.64), 96 genes were identified by both the technologies. A similar observation has also been reported by correlating gene expression arrays and RNA-seq on their study on differentially expressed liver and kidney tissues . Only 11 genes out of the 331 genes from the current study showed an opposite trend in differential expression in sickle cell disease, suggesting that the number of false positives was small, using either method.
Gene ontology analysis of these genes helped to classify their molecular functions into ten highly significant functional pathway such as cellular cycle regulators, apoptosis, oxidative stress response, inflammation and immune response, free radical scavenging, protein modification, and hematopoiesis. Examination of these pathways suggests that differentially regulation may be in response to oxidant and hemolytic stress, vascular injury and participation in repair and homeostasis [17, 30–34]. Interestingly, GDF15 expression was upregulated, which also has been observed in thalassemia intermedia, and associated with repression of hepcidin, an important mediator of the inflammatory response on erythropoiesis .
We also observed upregulated expression of several reticulocyte specific genes as expected in SCD where a higher proportion of reticulocytes are observed. This finding validates the performance of both technologies in identifying alterations relevant to sickle cell disease. From a biological perspective, the whole blood expression profile provided a window into real time erythrocyte expression profiles. Insights into the transcription profile of these red blood cells may contribute greatly to our understanding of mechanism of disease, prognosis, and responses to therapeutics.
Using ExonAnova analysis on RNA-seq data, we identified 16 alternatively spliced genes. While further validation of these splice variants is needed, it is interesting to note that both RHCE and RHD are components of the important Rh antigen system on red cells. However the potential implications of altered Rh splicing in SCD is still unclear. Deficiency of UNC13D is known to result in defective exocytosis of cytolytic granules of cytotoxic T lymphocytes and natural killer cells, causing immune dysregulation . Whether altered expression of UNC13D in SCD could contribute to the relative immune compromise of SCD may merit future investigation.
To illustrate the power of RNA-seq in detecting differential transcription not associated with known genes, we scanned the entire genome for novel differential expression focusing only on unannotated genomic regions. By doing so, we found an interesting region to include an apparently novel, minor exon between exons 4 and 5 in the ALAS2 gene with SCD patients showing at least six times higher expression levels compared to the control subjects (p = 0.003). This could suggest alternative splicing in SCD which might serve as an ALAS2 transcription regulator. Follow up of this suggestion would require a functional analysis of this newly identified region of ALAS2 but is beyond the scope of the current study but is planned for the future. ALAS2 gene expression is restricted to the erythroid lineage  and plays a pivotal role in heme synthesis. In addition to heme-mediated feedback inhibition of enzymatic function, ALAS-2, a member of a small family of genes is modulated by iron . This ability of RNA-seq to identify regions in detail holds great promise for the future discovery of novel transcripts and biomarkers in clinical genomic studies.
Another key advantage of RNA-seq over existing technologies for transcriptomic studies is its ability to identify sequence variations in expressed transcripts. To illustrate the feasibility of identifying sequence variation in expressed genes, we focused on the known single nucleotide mutation in SCD in which glutamic acid-6 is replaced by valine (GAG replaced by GTG). We were able to successfully detect this mutation in all the sickle cell patients. Interestingly, we were also able to identify, that same mutation in heterozygous combination with a hemoglobin C beta globin variant having glutamic acid replaced by lysine (GAG replaced by AAG) in one compound heterozygous sickle cell patient, thereby demonstrating the ability of RNA-seq to reliably identify heterozygous single base mutations in the expressed transcripts.
In conclusion, our study clearly illustrates a high level of concordance between the array platform and the RNA-seq technology, and suggests that the high density Exon array still remains a powerful tool to generate meaningful data when the amount of material is limited. Although RNA-seq is still in the early stages of use in clinical studies, it has clear advantages over the array based transcriptomic methods, based on its ability to discover novel transcripts, identify sequence variants, and increased dynamic range of signals leading to increase fold change in measured expression levels. With the rapid evolution of NGS instruments and library preparation methods with multiplexing barcodes, longer read lengths and large number of paired end reads associated with reduced cost per lane is highly feasible in the near future. The use of picogram to few nanogram amounts to total RNA for RNAseq still needs to be optimized in order to capture low abundance transcripts.
We believe that the results from this study provide guidelines on the choice of tools in the form of arrays or RNA-seq for clinical transcriptomic studies using limited amount of starting material. We believe that the selection of an appropriate tool for clinical genomic studies is mostly driven by the biological question underlying the study: whether a formal hypothesis is being tested or is the study intended to better describe the complete transcriptome and discover novel transcripts. An emerging approach is to apply both RNA-seq and arrays in combination, in large scale clinical studies, where RNA-seq is used first to define the transcriptome elements associated with the disease in question, followed by high throughput and reliable screening of these elements on thousands of patient samples using the arrays. Integrating data from both microarray and RNA-seq experiments may open up new possibilities for creating meaningful informational networks which will aid our understanding of disease pathology and development of novel therapeutics.
We gratefully acknowledge Drs. Kairong Cui and Keji Zhao for their help in library construction and sequencing. We thank Dr. Shurjo Sen NHGRI for his valuable time and advice in analyzing the sequence data. We acknowledge the help of Ms. Marlene Peters-Lawrence for her assistance in collecting these specimens from volunteers and thank the sickle cell and healthy control subjects who participated in this study. This work was funded by the NHLBI Division of Intramural Research (1 ZIA HL006013-03) and by the CIT Division of Computational Bioscience.
- Melton SD, Genta RM, Souza RF: Biomarkers and molecular diagnosis of gastrointestinal and pancreatic neoplasms. Nat Rev Gastroenterol Hepatol. 2010, 7 (11): 620-628.PubMedPubMed CentralGoogle Scholar
- Rudan I: New technologies provide insights into genetic basis of psychiatric disorders and explain their co-morbidity. Psychiatr Danub. 2010, 22 (2): 190-192.PubMedGoogle Scholar
- Weitzel JN, Blazer KR, Macdonald DJ, Culver JO, Offit K: Genetics, genomics, and cancer risk assessment: State of the Art and Future Directions in the Era of Personalized Medicine. CA Cancer J Clin. 2011, 61 (5): 327-359.PubMedPubMed CentralGoogle Scholar
- Yang X, Jiao R, Yang L, Wu LP, Li YR, Wang J: New-generation high-throughput technologies based 'omics' research strategy in human disease. Yi Chuan. 2011, 33 (8): 829-846.PubMedGoogle Scholar
- Offit K: Personalized medicine: new genomics, old lessons. Hum Genet. 2011, 130 (1): 3-14.View ArticlePubMedPubMed CentralGoogle Scholar
- Lam CW, Lau KC, Tong SF: Microarrays for personalized genomic medicine. Adv Clin Chem. 2011, 52: 1-18.View ArticleGoogle Scholar
- Abdueva D, Wing MR, Schaub B, Triche TJ: Experimental comparison and evaluation of the Affymetrix exon and U133Plus2 GeneChip arrays. PLoS One. 2007, 2 (9): e913.View ArticlePubMedPubMed CentralGoogle Scholar
- Bradford JR, Hey Y, Yates T, Li Y, Pepper SD, Miller CJ: A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling. BMC Genomics. 2010, 11: 282.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu JM, Camilli A: Discovery of bacterial sRNAs by high-throughput sequencing. Methods Mol Biol. 2011, 733: 63-79.View ArticlePubMedPubMed CentralGoogle Scholar
- Cirulli ET, Singh A, Shianna KV, Ge D, Smith JP, Maia JM, Heinzen EL, Goedert JJ, Goldstein DB: Screening the human exome: a comparison of whole genome and whole transcriptome sequencing. Genome Biol. 2010, 11 (5): R57.View ArticlePubMedPubMed CentralGoogle Scholar
- Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, Lao K, Surani MA: RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat Protoc. 2010, 5 (3): 516-535.View ArticlePubMedGoogle Scholar
- Nagalakshmi U, Waern K, Snyder M: RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010, Chapter 4: 11-13. Unit 4 11PubMedGoogle Scholar
- Tariq MA, Kim HJ, Jejelowo O, Pourmand N: Whole-transcriptome RNAseq analysis from minute amount of total RNA. Nucleic Acids Res. 2011, 39 (18): e120.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhao J, Grant SF: Advances in whole genome sequencing technology. Curr Pharm Biotechnol. 2010, 12 (2): 293-305.View ArticleGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 1509-1517.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu GE: Recent Applications of DNA Sequencing Technologies in Food, Nutrition and Agriculture. Recent Pat Food Nutr Agric. 2011, 3 (3): 187-195.View ArticlePubMedGoogle Scholar
- Raghavachari N, Xu X, Munson PJ, Gladwin MT: Characterization of whole blood gene expression profiles as a sequel to globin mRNA reduction in patients with sickle cell disease. PLoS One. 2009, 4 (8): e6484.View ArticlePubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological). 1995, 57 (1): 289-300.Google Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111.View ArticlePubMedPubMed CentralGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842.View ArticlePubMedPubMed CentralGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628.View ArticlePubMedGoogle Scholar
- Bieche I, Parfait B, Tozlu S, Lidereau R, Vidaud M: Quantitation of androgen receptor gene expression in sporadic breast tumors by real-time RT-PCR: evidence that MYC is an AR-regulated gene. Carcinogenesis. 2001, 22 (9): 1521-1526.View ArticlePubMedGoogle Scholar
- Ma Q, Lu AY: Pharmacogenetics, pharmacogenomics, and individualized medicine. Pharmacol Rev. 2011, 63 (2): 437-459.View ArticlePubMedGoogle Scholar
- Mardis ER: A decade's perspective on DNA sequencing technology. Nature. 2011, 470 (7333): 198-203.View ArticlePubMedGoogle Scholar
- Taylor BS, Ladanyi M: Clinical cancer genomics: how soon is now?. J Pathol. 2010, 223 (2): 318-326.PubMedGoogle Scholar
- Tommasi S, Danza K, Pilato B, De Summa S: Innovative technology for cancer risk analysis. Ann Oncol. 2011, 22 (Suppl 1): i37-43.View ArticlePubMedGoogle Scholar
- Schweiger MR, Kerick M, Timmermann B, Isau M: The power of NGS technologies to delineate the genome organization in cancer: from mutations to structural variations and epigenetic alterations. Cancer Metastasis Rev. 2011, 30 (2): 199-210.View ArticlePubMedGoogle Scholar
- Settle SH, Sulman EP: Tumor profiling: development of prognostic and predictive factors to guide brain tumor treatment. Curr Oncol Rep. 2010, 13 (1): 26-36.View ArticleGoogle Scholar
- Xu W, Seok J, Mindrinos MN, Schweitzer AC, Jiang H, Wilhelmy J, Clark TA, Kapur K, Xing Y, Faham M, et al: Human transcriptome array for high-throughput clinical studies. Proc Natl Acad Sci U S A. 2011, 108 (9): 3707-3712.View ArticlePubMedPubMed CentralGoogle Scholar
- Morris CR: Mechanisms of vasculopathy in sickle cell disease and thalassemia. Hematology Am Soc Hematol Educ Program. 2008, 177-185.Google Scholar
- Jison ML, Munson PJ, Barb JJ, Suffredini AF, Talwar S, Logun C, Raghavachari N, Beigel JH, Shelhamer JH, Danner RL, Gladwin MT: Blood mononuclear cell gene expression profiles characterize the oxidant, hemolytic, and inflammatory stress of sickle cell disease. Blood. 2004, 104 (1): 270-280.View ArticlePubMedGoogle Scholar
- Steinberg MH, Brugnara C: Pathophysiological-based approaches to treatment of sickle cell disease. Annu Rev Med. 2003, 54: 89-112.View ArticlePubMedGoogle Scholar
- Raghavachari N, Xu X, Harris A, Villagra J, Logun C, Barb J, Solomon MA, Suffredini AF, Danner RL, Kato G, et al: Amplified expression profiling of platelet transcriptome reveals changes in arginine metabolic pathways in patients with sickle cell disease. Circulation. 2007, 115 (12): 1551-1562.View ArticlePubMedPubMed CentralGoogle Scholar
- Kato GJ, Hebbel RP, Steinberg MH, Gladwin MT: Vasculopathy in sickle cell disease: Biology, pathophysiology, genetics, translational medicine, and new research directions. Am J Hematol. 2009, 84 (9): 618-625.View ArticlePubMedPubMed CentralGoogle Scholar
- Tanno T, Bhanu NV, Oneal PA, Goh SH, Staker P, Lee YT, Moroney JW, Reed CH, Luban NL, Wang RH, et al: High levels of GDF15 in thalassemia suppress expression of the iron regulatory protein hepcidin. Nat Med. 2007, 13 (9): 1096-1101.View ArticlePubMedGoogle Scholar
- Feldmann J, Callebaut I, Raposo G, Certain S, Bacq D, Dumont C, Lambert N, Ouachee-Chardin M, Chedeville G, Tamary H, et al: Munc13-4 is essential for cytolytic granules fusion and is mutated in a form of familial hemophagocytic lymphohistiocytosis (FHL3). Cell. 2003, 115 (4): 461-473.View ArticlePubMedGoogle Scholar
- Cotter PD, Willard HF, Gorski JL, Bishop DF: Assignment of human erythroid delta-aminolevulinate synthase (ALAS2) to a distal subregion of band Xp11.21 by PCR analysis of somatic cell hybrids containing X; autosome translocations. Genomics. 1992, 13 (1): 211-212.View ArticlePubMedGoogle Scholar
- Cox TC, Bawden MJ, Martin A, May BK: Human erythroid 5-aminolevulinate synthase: promoter analysis and identification of an iron-responsive element in the mRNA. The EMBO journal. 1991, 10 (7): 1891-1902.PubMedPubMed CentralGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1755-8794/5/28/prepub