
Table 1 Bioinformatic output measures for small RNA sequencing quality control

From: Biomarker discovery: quantification of microRNAs and other small non-coding RNAs using next generation sequencing

QC metric

Description

Raw Reads

According to Illumina guidelines for small RNA sequencing, 1–2 M reads is an accepted range for expression profiling experiments, while 2–5 M reads is the accepted range for discovery applications.

Size

To avoid background noise from small fragments of degraded RNA, we removed all reads <15 nt. Size filtering can easily be modified to target a specific small RNA species, for example 15–28 nt (miRNAs), 24–31 nt (piRNAs), or 15–40 nt for all small ncRNAs.

Quality

Quality (Q) is based on the Phred score, which estimates the probability of a sequencing error at each base. Q = 10 corresponds to a 1/10 probability of an incorrect base call, or 90 % accuracy; Q = 20 (1/100; 99 %); Q = 30 (1/1000; 99.9 %); and Q = 40 (1/10000; 99.99 %). We removed reads with a quality score <30.

Adapter-Adapter

Adapter detection can be adjusted to allow one or more mismatches in the first 10 nt when identifying and trimming the adapters. To enrich for high-quality reads, we set our adapter detection threshold to a perfect 10-nt match. Ligation of the 3′ and 5′ adapters to each other happens by chance at a very low rate. However, this can become an important issue for libraries prepared from very small amounts of RNA. We removed all adapter-adapter reads.

RNAs > 40 nt

This metric refers to RNA reads longer than 40 nt. In most cases these reads map to midsize and larger non-coding RNA populations. The percentage of reads >40 nt can vary (1 %–50 %) depending on the library preparation method used.

Surviving Reads

This metric shows the number of reads that pass all the quality and trimming filters described above. A good-quality library should have a survival rate between 50 % and 100 %, depending on the method used.

Unmapped

Due to sequencing errors, stringent QC filters, or nucleic acid from other species (usually added as a control, e.g. PhiX), a very small percentage of reads do not map to any human genomic location.

Unique & Multi-Mapped

In contrast to other types of sequencing (DNA and larger RNA), the percentage of reads that map to multiple genomic locations in small RNA sequencing is expected to be high (>50 %), because several small RNAs are encoded at more than one genomic location. This redundancy is thought to be a compensatory mechanism or a response to ncRNA knockouts caused by random mutations.

miRNA

We used miRBase to align our reads to known miRNA species. A high percentage of reads aligned to miRNAs is expected. However, this percentage can vary depending on the source and quality of RNA.

Other ncRNAs

Rfam and NCBI’s piRNA databases were used to map our reads to other small RNA species. The number of these reads is very small compared to miRNAs. However, as with miRNAs, the number of reads mapping to other sncRNAs depends on the source and quality of the RNA.

(Repeat, Coding gene, Unknown)

This refers to an additional portion of reads that map to repetitive sequences, coding genes, and unknown sequences in the human genome. The number of these reads is expected to be low.

miRNA Count

We set a detection threshold at one count per miRNA (present at least once in each of the libraries tested) in order to get a better picture of lowly expressed miRNAs. However, for quantification and discovery studies, we recommend higher detection thresholds, usually >10 or >20 counts per miRNA, to avoid background noise and false positives.

  1. Important quality control (QC) measures for bioinformatic analysis of our high-throughput biomarker discovery pipeline
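
The Quality, Size, RNAs > 40 nt, and Surviving Reads rows above can be illustrated with a short Python sketch. It is not the pipeline used in the article: the function names are hypothetical, the quality strings are assumed to be Phred+33 encoded (as in current Illumina FASTQ files), the Q < 30 filter is assumed to apply to the mean quality of a read (the table does not say whether the cutoff is per base or per read), and the order of the checks is an assumption.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string (assumption: Phred+33 ASCII encoding)."""
    return [ord(ch) - 33 for ch in quality_string]


def classify_read(seq: str, quality_string: str,
                  min_len: int = 15, max_len: int = 40,
                  min_mean_q: float = 30.0) -> str:
    """Assign a trimmed read to one QC category.

    Assumptions (not stated in the table): the quality cutoff is applied
    to the mean Phred score of the read, and the checks run in this order.
    """
    if not seq or len(seq) < min_len:
        return "too_short"            # likely degradation fragments
    quals = decode_phred33(quality_string)
    if sum(quals) / len(quals) < min_mean_q:
        return "low_quality"
    if len(seq) > max_len:
        return "longer_than_max"      # reported separately as RNAs > 40 nt
    return "surviving"


def survival_rate(categories: list[str]) -> float:
    """Percentage of reads classified as surviving."""
    if not categories:
        return 0.0
    return 100.0 * categories.count("surviving") / len(categories)


if __name__ == "__main__":
    # Toy reads: 'I' encodes Q = 40 and '5' encodes Q = 20 in Phred+33.
    reads = [
        ("A" * 22, "I" * 22),
        ("A" * 12, "I" * 12),
        ("A" * 22, "5" * 22),
        ("A" * 45, "I" * 45),
    ]
    labels = [classify_read(seq, qual) for seq, qual in reads]
    print(labels)
    print(f"Q30 error probability: {phred_to_error_prob(30):.4f}")
    print(f"Survival rate: {survival_rate(labels):.1f} %")
```

Changing `min_len` and `max_len` reproduces the size windows listed in the Size row, for example 15–28 nt for miRNAs or 24–31 nt for piRNAs.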