Study design
Institutional Review Board approval was obtained for research use of human samples in this project (#IRB 75–87). A pilot study was performed using FFPE and fresh frozen pairs for seven women diagnosed with benign breast disease to evaluate the performance of two library preparation protocol, Illumina’s TruSeq RNA Exome and NEBNext rRNA Depletion (Fig. 1a). To evaluate the precision of SNPs identified by the two protocols, we also performed whole exome sequencing (WES) for the three selected fresh frozen samples. The TruSeq Exome protocol exhibited better performance in bioinformatics metrics and was selected to process all study samples and technical controls in the main study. A total of 158 samples including study samples and technical controls were submitted for RNA extraction (Fig. 1b). Forty samples failed library preparation due to low RNA input quantity. The remaining samples were submitted for RNA sequencing in ten sequencing batches. Batches were designed so that samples with similar RNA quality were included in the same batch. This helped to minimize the potential sequencing bias toward high quality samples in the same batch. To examine the potential sequencing batching effect, the same two technical controls (FFPE and FFzn pair for the same subject) were included in each sequencing batch 1–7. For sequencing batch 8–10 where samples are of low RNA quality, we only included the FFPE technical control as the FFzn technical control would potentially attract more sequencing reads and bias the quantification of other low-quality study samples. Besides the two technical controls, we also included 11 study replicate samples in different sequencing batches. Thorough bioinformatics evaluations were performed to identify samples passing the qc metrics.
RNA quantitation and quality
Total RNA concentration was determined using the Qubit 2.0 Fluorometer and RNA HS Assay (Life Technologies Corp., Carlsbad, CA). RNA integrity was assessed and recorded with DV50, DV100, and DV200 values using the RNA 6000 Nano Kit on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA), but was not used for sample exclusion in the library preparation. DV values are a commonly used metric that represents the proportion of RNA fragments in a sample with greater length than the numeric value (i.e., DV200 equals the percentage of RNA fragments > 200 nucleotides).
RNA exome library preparation and sequencing
40–100 ng of experimental FFPE RNA, FFPE control RNA, or 20 ng fresh frozen control RNA was used for library preparation using the TruSeq RNA Library Prep for Enrichment and the Illumina Exome Panel-Enrichment Oligos kit (llumina, Inc., San Diego, CA) following the manufacturers protocol for FFPE RNA or high-quality total RNA respectively. As per the protocol, fragmentation of FFPE RNA was not performed. Following adaptor ligation and enrichment, the libraries were quantitated by Qubit and pooled for subsequent exome capture based on available yield. Up to a 4-plex pooling strategy was used for the exome capture, with capture groups consisting of 200 ng, 100 ng, 50 ng, 40 ng, and 30 ng of input library for each sample in the pre-capture pool. Up to 12 samples (3 pools) were batched for sequencing and included a paired FFPE control RNA and fresh frozen control RNA in each batch. Following two rounds of hybridization to the capture probes, the pools were PCR amplified and purified using AMPure XP beads. The amplified and enriched libraries were quality assessed using a combination of the Qubit dsDNA HS Assay (Invitrogen), the Bioanalyzer DNA 7500 Assay (Agilent Technologies), and KAPA Library Quantification Kit for Illumina (KAPA Biosystems, Boston, MA). The three capture pools for each batch were combined in equal molar amounts and sequenced across 3 lanes of an Illumina NextSeq 500 High Output flowcell using 75 × 2 bp paired end reads. Each flowcell generated a minimum of 700 million reads passing filter.
rRNA depletion library preparation and sequencing
20–100 ng of FFPE RNA and paired fresh frozen RNA was used for library preparation using the NEBNext rRNA Depletion Kit (Human/Mouse/Rat) and Ultra II Directional RNA Library Prep Kit for Illumina (New England Biolabs Inc., Ipswich, MA), following the manufacturers protocol for highly degraded (RIN ≤ 2) or intact (RIN > 7) samples respectively. Fragmentation is based on RIN value of RNA input and conducted as outlined in the protocol. Fragmentation for FFPE RNA was not performed. Experimental FFPE RNA and paired fresh frozen RNA from the same patient was used if available using similar input amounts for each sample type. A total of 13 libraries were prepared, including six patient pairs. Libraries were quality assessed using a combination of the Qubit dsDNA HS Assay (Invitrogen), the Bioanalyzer DNA 7500 Assay (Agilent Technologies), and KAPA Library Quantification Kit for Illumina (KAPA Biosystems, Boston, MA). Libraries were combined in equal molar amounts and sequenced across three lanes of an Illumina NextSeq 500 High Output flowcell using 75 × 2 bp paired end reads. Each flowcell generated a minimum of 800 million reads passing filter.
Whole exome sequencing of fresh frozen samples
Three fresh frozen samples were submitted for whole exome sequencing at Mayo Clinic molecular genomic facility. In brief, paired-end libraries were prepared with 1.0 μg of genomic DNA in accordance with the manufacturer's protocol (Agilent Technologies, Inc, Santa Clara, Calif). Whole-exon capture was performed with 750 ng of the prepped library following the protocol for the SureSelect Human All Exon v5 + UTRs 75 Mb kit (Agilent Technologies, Inc). The purified capture products were then amplified with use of SureSelect Post-Capture Indexing forward and Index polymerase chain reaction reverse primers (Agilent Technologies, Inc) for 12 cycles. Concentration and size distribution of the completed captured libraries were assessed on Qubit (Invitrogen, Waltham, Mass) and Bioanalyzer DNA 1000 chip (Agilent Technologies, Inc). Libraries were sequenced at an average coverage of about 80× in accordance with standard protocol of the cBot and HiSeq 3000/4000 PE Cluster Kit (Illumina, San Diego, Calif). The flow cells were sequenced as 150 × 2 paired end reads on the HiSeq 4000 with the HiSeq 3000/4000 sequencing kit and collection software (HCS version 3.3.52; Illumina). Base calling was performed with Real-Time Analysis version 2.7.3 (Illumina). All procedures were performed in accordance with the manufacturer's instructions.
RNA-seq alignment and gene quantification
After sequencing procedure, raw FASTQ files were processed through Mayo’s internal MAP-RSeq pipeline [15] (Version 3.0). MAP-RSeq uses a variety of publicly available bioinformatics tools tailored by in-house developed methods. Briefly, the aligning and mapping of reads are performed using Star aligner [16] against hg38 genome build. The gene and exon counts are generated by FeatureCounts [17] using the gene definitions files from Ensembl v78. Quality control was carried out using RSeqQC [18]. Gene expression data was normalized to counts per million (CPM) and transcript per million (TPM) using Trimmed Mean of M-values (TMM) method as implemented in edgeR [19] followed by log2 transformation.
Estimation of SNP confirmation rate and false positive rate
SNPs were identified using GATK haplotype caller [20] and further filtered by RVBoost [21]. For the pilot study, SNP confirmation rate (precision) was calculated for each mutation type (C>T, C>G, etc.) as:
$$SNP \,confirmation\,rate = SNP_{rna} \cap SNP_{dna} /SNP_{rna} ,$$
(1)
where SNPRNA represents the SNPs identified in RNA-seq data and SNPDNA represents the SNPs identified in WES data for the same set of samples. For both pilot and main study, false positive rate (FPR) between either technical or study replicate samples was calculated as:
$$FPR = N_{MAD > lfc} /N_{total} ,$$
(2)
where Ntotal denotes the total number of genes and NMAD>lfc denotes the number of genes with maximum absolute difference (MAD) above a certain logarithm fold change cutoff.
Bioinformatics QC and model building
We defined three bioinformatics metrics for QC purpose, including sample-wise median correlation of gene expression (median_cor_expr), number of genes mapped to genic regions (gene_reads), number of detectable genes with transcript per million (TPM) larger than 4 (gene_tpm4). TPM was calculated based on:
$$TPM = \frac{FPKM}{{\sum FPKM}} \times 10^{6} ,$$
(3)
FPKM was calculated as described earlier [22]. For each sample, we firstly calculate its Spearman rank correlation of gene expression with each of the rest of samples in the cohort. Then, ‘median_cor_expr’ for each sample is the median Spearman correlation value with the rest of samples in the cohort. After thorough bioinformatics evaluation, samples meeting any of the below criteria were flagged as QC-fail:
-
1.
Sample-wise median gene expression correlation smaller than 0.75
-
2.
Gene mapped reads smaller than 25 million gene mapped reads,
-
3.
Less than 11,400 # of detected genes with TPM > 4
For each of the sequencing batch, if the technical controls/replicates failed QC, the whole batch of samples will be flagged as QC-fail as well. For our study, all technical controls/replicates have passed QC. A decision tree model was built based on CART Modeling via rpart R package to learn the relationship between pre-sequencing QC metrics, such as RNA qubit or pre-capture library qubit, and QC pass/fail status predicted by post sequencing bioinformatics metrics. Samples were split into training and testing set with a ratio of 7:3. Repeated cross validation were performed to optimal parameters (complexity parameter) during model training. Similarly, an alternative model based on logistic regression was also constructed:
$$\ln \left( {\frac{p\left( X \right)}{{1 - p\left( X \right)}}} \right) = \beta_{0} + \beta_{1} X$$
(4)
All models were built based on Caret R package [23] and all statistical analysis was carried out in R under R version 4.0.3.