An integrated clinical and genomic information system for cancer precision medicine

Background Increasing affordability of next-generation sequencing (NGS) has created an opportunity for realizing genomically-informed personalized cancer therapy as a path to precision oncology. However, the complex nature of genomic information presents a huge challenge for clinicians in interpreting the patient’s genomic alterations and selecting the optimum approved or investigational therapy. An elaborate and practical information system is urgently needed to support clinical decisions as well as to test clinical hypotheses quickly. Results Here, we present an integrated clinical and genomic information system (CGIS) based on NGS data analyses. Major components include modules for handling clinical data, NGS data processing, variant annotation and prioritization, drug-target-pathway analysis, and a population cohort explorer. We built a comprehensive knowledgebase of genes, variants, and drugs by collecting annotated information from public and in-house resources. Structured reports for molecular pathology are generated using standardized terminology in order to help clinicians interpret genomic variants and utilize them for targeted cancer therapy. We also implemented many features useful for testing hypotheses to develop prognostic markers from mutation and gene expression data. Conclusions Our CGIS software is an attempt to provide useful information for both clinicians and scientists who want to explore genomic information for precision oncology. Electronic supplementary material The online version of this article (10.1186/s12920-018-0347-9) contains supplementary material, which is available to authorized users.


Background
Deep sequencing is about to become a part of clinical tests, but the probabilistic and complex nature of the results makes it vastly different from conventional clinical tests, which are deterministic and simple to use without sophisticated informatics analysis. Systematic interpretation of genomic alterations obtained from NGS data remains challenging, especially for clinical application. In particular, determining the clinical and biological significance of each variant in terms of its diagnostic, therapeutic, and prognostic implications for individual patients poses considerable difficulties because the biological annotations of the human genome, variations, and therapeutics provided by various parties are inconsistent [1]. Furthermore, the complexity of the NGS data analysis procedure makes it unrealistic for practicing oncologists to grasp the meanings and uncertainties of the results easily without ongoing education in genomics and bioinformatics. Thus, a systematic and easy-to-understand interpretation system with a readily accessible knowledgebase is urgently needed to identify specific genomic alterations and genotype-matched therapeutic options with clinical relevance, the most critical step in implementing precision oncology.
Recently, several groups have reported CGIS implementations that address the computational and clinical issues involved. PathOS is a web-based CGIS incorporating variant filtering, curation and reporting, but it was designed mostly for targeted (amplicon) gene sequencing and does not include variant-level recommendation of targeted drugs [2]. CVE was developed as an R package to identify drivers and resistance mechanisms and to assess druggability, but it lacks support for patient cohort populations [3]. Most systems focus on either NGS data processing and annotation or information management issues relevant to clinical applications. Thus, it would be desirable to develop a comprehensive information system that supports diverse features helpful for cancer precision medicine, not only for clinical service providers but also for medical scientists. Here, we describe a CGIS implementation of such features and discuss the key bioinformatic challenges in software development.

Overview of system and features
The aims of our CGIS software are (1) to provide a clinical report of recommended therapies with full variant-level annotation based on NGS data analysis and (2) to support medical scientists in exploring patient cohort data to test hypotheses for developing patient stratification schemes, molecular biomarkers, and alternative treatment options.
Representative features are as follows, in the order of information processing summarized in Fig. 1:
A) NGS data processing, which includes variant calling from whole exome sequencing (WES) data and expression quantification from whole transcriptome sequencing (WTS, a.k.a. RNA-seq) data. We call somatic single nucleotide variants (SNVs), insertions and deletions (INDELs), and copy number variations (CNVs) using Mutect [4], Strelka [5], and EXCAVATOR [6], respectively. The MapSplice-RSEM [7,8] pipeline is used for RNA-seq quantification to ensure accuracy in spite of its long computation time. Galaxy [9] pipelines for WES and WTS data processing are shown in Additional file 1: Figure S1 and Additional file 2: Figure S2, respectively. We also provide Galaxy workflow files for WES and WTS data processing in Additional files 3 and 4, respectively, so that they can be imported into another Galaxy server. Additionally, users can upload their own FASTQ files into our BioCloud system to process NGS data and obtain the various reports described below. A step-by-step demonstration of this procedure is given in Additional file 5.
B) Import of clinical information from the patient's medical record, which includes de-identification and encryption using the standard data model of NCI Clinical Data Elements (https://gdc.cancer.gov/clinical-data-elements).
C) Variant annotation and prioritization to identify driver alterations or targeted drugs. Genomic alterations were curated at both the gene and variant levels to identify function-affecting variants in cancer genes of the COSMIC database [10].
D) Targeted therapy with clinical relevance to obtain "actionable" targets of different significance. Many curated resources were amassed to establish the list of actionable target genes and variants (i.e. cases where the targeted drugs are available clinically).
E) Pathway view of genomic alterations and available targets. Key pathway genes are manually curated for several cancer types to enhance mechanistic understanding that might lead to alternative therapies.
F) Patient stratification and survival analysis, which help medical scientists test clinical hypotheses for the purpose of developing diagnostic or prognostic molecular markers. We support patient classification by the mutual exclusivity of somatic mutations and by gene expression signatures.
G) Clinical report system to support clinical decisions in an easy-to-use GUI format.

BioDataBank
High-quality interpretation of individual genomic variants inevitably requires a vast amount of information collection, proper data modeling, curation of raw data, and integration to build a comprehensive knowledgebase. BioDataBank is our knowledgebase encompassing genes, proteins, gene variants in cancer, population (cohort) data, and drugs for clinical therapy. Table 1 lists the resources that we integrated to build the BioDataBank. Specifically, cancer gene variants were catalogued from the COSMIC [10] and TCGA databases. Curated information on targeted drugs in clinical use or in clinical trials was amassed from various databases such as OncoKB [11], MyCancerGenome [12], and the Personalized Cancer Medicine Knowledge Base [13] (see Table 1).

Cohort database and selection of background patients
Patient grouping and management is an essential part of CGIS to identify other patients with similar mutations or gene expression pattern, which can be used to predict the progress of the disease as well as to identify appropriate therapies. For example, identifying patients with similar molecular characteristics makes it possible to interrogate clinical questions like 'how did the cancer progress?' and 'what would be the effective or non-effective treatments?'.
Cancer omics data at population scale is also important for patient stratification to identify subtypes on a molecular basis. Our cohort database contains the TCGA multi- To help researchers find patient cohorts that meet their study goals, we implemented a filtering scheme to select the patient cohort based on clinical or molecular features, including histological subtypes, risk factors, mutational features, diagnosis, therapeutic actions, and treatment outcome at the individual patient level. For example, EGFR was the most frequently mutated gene among female and lifelong never-smoker patients, whereas TP53 mutation was prevalent in other patients, which can be readily confirmed using our cohort explorer for the TCGA LUAD cohort (shown in Additional file 6: Figure S4).
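The filtering scheme can be illustrated with a minimal sketch. The patient records, field names, and helper functions below are hypothetical and are not taken from the CGIS code base; the toy data are chosen only to mirror the EGFR versus TP53 pattern described above.

```python
from collections import Counter

# Hypothetical per-patient records; field names are illustrative only.
patients = [
    {"id": "P1", "sex": "female", "smoking": "never", "mutated_genes": {"EGFR", "TP53"}},
    {"id": "P2", "sex": "female", "smoking": "never", "mutated_genes": {"EGFR"}},
    {"id": "P3", "sex": "male", "smoking": "current", "mutated_genes": {"TP53", "KRAS"}},
    {"id": "P4", "sex": "male", "smoking": "former", "mutated_genes": {"TP53"}},
]


def select_cohort(records, **criteria):
    """Keep patients whose clinical fields equal every given criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def gene_frequencies(records):
    """Count how many patients carry at least one mutation in each gene."""
    counts = Counter()
    for r in records:
        counts.update(r["mutated_genes"])
    return counts


# Female never-smokers versus all remaining patients.
subset = select_cohort(patients, sex="female", smoking="never")
rest = [r for r in patients if r not in subset]
top_subset = gene_frequencies(subset).most_common(1)[0]
top_rest = gene_frequencies(rest).most_common(1)[0]
```

In this toy cohort the most frequently mutated gene is EGFR in the selected subgroup and TP53 in the remaining patients; a real cohort explorer would apply the same kind of selection to the full clinical and molecular feature set.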

Variant annotation and Druggability
The variant calling process using the WES Galaxy pipeline produces VCF (variant call format, containing details of variants) and BAM (binary alignment map, containing aligned reads) files, which are imported into the variant annotation and prioritization module of CGIS. We used Oncotator as the main tool for annotating genomic point mutations and short indels [14]. Since many transcripts can be made from the same gene, transcript selection is an important issue in variant annotation. For example, the EGFR chr7:55259515 T > G mutation can be annotated as p.L858R only through proper choice of transcript among many different EGFR transcripts. To resolve this issue, we use UniProt's canonical sequence as the reference and collect all transcripts that produce the canonical protein sequence upon translation. We further added transcripts concordant with all clinically actionable variants in MyCancerGenome [12]. The resulting list of transcripts was provided to Oncotator [14] with the command line option (-c) to make these transcripts the primary annotation targets. An example of variant annotation results is shown in Fig. 2a.
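The transcript selection rule can be sketched as follows. This is only an illustration under simplifying assumptions: the transcript identifiers and protein sequences are invented placeholders, translations are assumed to be precomputed, and the function name is hypothetical.

```python
def select_primary_transcripts(transcripts, canonical_protein, extra_ids=()):
    """Return IDs of transcripts whose translation equals the canonical protein,
    followed by any explicitly requested IDs (e.g. transcripts required by
    curated actionable variants)."""
    selected = [t_id for t_id, protein in transcripts.items()
                if protein == canonical_protein]
    for t_id in extra_ids:
        if t_id in transcripts and t_id not in selected:
            selected.append(t_id)
    return selected


# Toy data: transcript ID -> translated protein sequence (placeholders).
transcripts = {
    "TX_A": "MKTLLV",   # matches the canonical sequence
    "TX_B": "MKTLLV",   # different transcript, same protein
    "TX_C": "MKTL",     # truncated isoform
    "TX_D": "MKSLLV",   # differs from canonical, but needed by a curated variant
}
primary = select_primary_transcripts(transcripts, "MKTLLV", extra_ids=["TX_D"])
```

Here `primary` contains the two transcripts encoding the canonical protein plus the explicitly added one; such a list would then be handed to the annotation tool as its set of preferred transcripts.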
Drugs targeting specific variants of the patient are of prime interest. As listed in Table 1, we compiled various resources on cancer drugs for targeted therapies both in clinical usage and in preclinical development. Specifically, we categorized drugs into three groups: 1) in-house curated drugs for actionable targets, which include the FDA-approved drugs, 2) drugs reported in PubMed abstracts, obtained from systematic text mining and manual curation, and 3) OncoKB [11] drugs, which are classified into four levels of reliability according to clinical applicability. We carefully characterized (potentially) clinically relevant alterations and assigned available drugs to somatic mutations at the variant and gene levels. For the in-house curated drugs for actionable targets, we included the FDA-approved drugs and drugs in clinical trials referenced by highly reliable sources such as MyCancerGenome [12], IntOGen [15], the Handbook of targeted cancer therapy [16], and manual searches in the New England Journal of Medicine. Drugs from text mining were obtained from VarDrugPub [17], which identified variant-gene-drug relations in all PubMed abstracts using a machine learning method.
We further provide a filtering utility to select genes of known importance in cancer as well as variants based on patient frequency and functional impact (Fig. 2a). The list of known cancer genes was obtained from the Cancer Gene Census of COSMIC (616 genes) [10]. Users may also select the cancer drug targets in clinical practice (26 genes) that were curated by the MD Anderson Personalized Cancer Medicine Knowledge Base [13]. These two sets of cancer genes may be the prime targets of personalized treatment and can be selected with the checkbox filters shown in Fig. 2a.
Users often want to examine the details of a specific mutation. We provide three interactive plots for efficient variant exploration. The mutation distribution plot (Fig. 2b) shows the mutation position on the gene structure with functional domains. Mutation frequency among TCGA patients with the same cancer type is shown in the needle plot format. We also show the read alignment plot (Fig. 2c) so that users can check the validity of mutation calls and allele frequencies. To implement this feature without carrying the large BAM file, our NGS pipeline creates a reduced BAM file that contains only the read alignments near the mutation points. Lastly, we support the co-mutation plot to examine the landscape of somatic mutations and CNVs (Additional file 7: Figure S3). Mutations in a specific patient can be readily compared with a cohort population such as the TCGA data. In sum, our variant annotation and prioritization scheme based on knowledge of cancer genes and targeted drugs provides an efficient way of scrutinizing the clinical relevance of somatic variants in a given cancer type.
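One way to define the regions kept in such a reduced BAM file is to take a fixed window around each variant and merge overlapping windows. The sketch below assumes a flank of 100 bp and a hypothetical helper name; neither is specified in the text, and the chr17 position is illustrative.

```python
def regions_around_variants(variants, flank=100):
    """Merge +/- flank windows around (chrom, pos) variants into region strings."""
    windows = sorted((chrom, max(1, pos - flank), pos + flank)
                     for chrom, pos in variants)
    merged = []
    for chrom, start, end in windows:
        # Extend the previous window if this one overlaps it on the same chromosome.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [f"{c}:{s}-{e}" for c, s, e in merged]


# The first position is the EGFR example from the text; the others are illustrative.
variants = [("chr7", 55259515), ("chr7", 55259600), ("chr17", 7577120)]
regions = regions_around_variants(variants)
```

The resulting `chrom:start-end` strings follow the region syntax accepted by common alignment tools (for example `samtools view` on an indexed BAM file), so only reads overlapping these windows need to be retained.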

Patient stratification and survival analysis
Proper stratification of patients is the most fundamental concept of targeted precision medicine. We implemented the two most commonly used methods of grouping patients, based on mutation and gene expression data. Survival analysis of the resulting patient groups can be carried out interactively so that clinicians can test hypotheses of survival benefit.
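The text does not state which survival estimator is used. As an illustration only, the sketch below assumes the widely used Kaplan-Meier product-limit estimator, applied to one patient group at a time; the function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times:  follow-up time of each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy group of five patients (times in months).
curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
```

Computing one curve per patient group (for example altered versus unaltered, or high versus low score) gives the step functions that are compared in a survival plot.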

Mutual exclusivity among driver mutations based on signaling networks
In tumors, not one but several alternative driver alterations in different genes can lead to similar downstream events. A key observation is that when a member of a substitutive gene set is altered, the selection pressure on the other members is diminished or even nullified. As a result, the mutation pattern of alternative driver genes appears almost mutually exclusive among different patients. We use the Mutex program [18] to identify mutually exclusive sets of genes with a common downstream effect on the signaling network, and we implemented survival analysis for altered vs. unaltered patient groups. An example of the TP53 signaling module targeting the HIF1A gene is shown in Fig. 3a, taking TCGA LUAD as the patient cohort. Note that gene alterations include both somatic mutations and CNVs here.
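The idea of mutual exclusivity can be made concrete with a simple pairwise test. The sketch below is not the Mutex algorithm, which searches gene groups on a signaling network; it is a plain one-sided hypergeometric test, with a hypothetical function name, that asks how likely it is to see so few co-altered patients if the two genes were altered independently.

```python
from math import comb


def exclusivity_p_value(altered_a, altered_b, n_patients):
    """One-sided hypergeometric p-value for observing this few co-altered patients.

    altered_a, altered_b: sets of patient IDs altered in gene A and gene B
    n_patients:           total number of patients in the cohort
    """
    a, b = len(altered_a), len(altered_b)
    both = len(altered_a & altered_b)
    lowest = max(0, a + b - n_patients)  # smallest possible overlap
    favorable = sum(comb(a, k) * comb(n_patients - a, b - k)
                    for k in range(lowest, both + 1))
    return favorable / comb(n_patients, b)


# Toy cohort of 10 patients: two genes altered in disjoint halves.
p_exclusive = exclusivity_p_value({1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, 10)
# Same cohort: two genes altered in exactly the same patients.
p_overlap = exclusivity_p_value({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, 10)
```

A small p-value (here 1/252 for the disjoint pair) indicates fewer co-alterations than expected by chance, whereas the fully overlapping pair gives a p-value of 1.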

Patient grouping by gene expression signatures
DNA sequencing will not be sufficient to optimally select patients for all classes of targeted therapy. In fact, other types of high-throughput technologies, including RNA sequencing, DNA methylation profiling, and small RNA profiling, are being extensively used to identify cancer subtypes and to further improve our understanding of their biological mechanisms. Among these, RNA sequencing is the closest to clinical application [19]. For example, OncotypeDX, based on the expression profile of 21 genes, accurately predicts recurrence of early-stage ER-positive breast cancer, demonstrating the possibility of molecular prognosis [20]. We implemented a scheme to sort patients according to a risk score based on the expression values of pre-defined genes (Fig. 3b). The score was derived from the average expression value of 103 genes that defined the metastatic subgroup in our in-house study. Patients in the TCGA LUAD cohort were ranked by the score in the waterfall plot, and we defined the highest and lowest 60 patients as the high and low score groups, respectively. The difference in the overall survival rate between the two groups indicates that the corresponding signature genes may have prognostic value in lung adenocarcinoma. Notably, the list of scoring genes and the threshold for defining patient groups are provided by users interactively. Thus, the system is flexible enough to test diverse clinical hypotheses.
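The scoring and grouping scheme described above can be sketched in a few lines. The function names, gene labels, and expression values are hypothetical; in the example from the text the signature has 103 genes and each group contains 60 patients.

```python
def signature_scores(expression, signature_genes):
    """Average expression of the signature genes for each patient."""
    return {
        patient: sum(values[g] for g in signature_genes) / len(signature_genes)
        for patient, values in expression.items()
    }


def split_by_score(scores, group_size):
    """Rank patients by score; return (high_group, low_group) of group_size each."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:group_size], ranked[-group_size:]


# Toy expression matrix: patient -> {gene: expression value}.
expression = {
    "P1": {"G1": 8.0, "G2": 6.0},
    "P2": {"G1": 2.0, "G2": 4.0},
    "P3": {"G1": 5.0, "G2": 5.0},
    "P4": {"G1": 1.0, "G2": 1.0},
}
scores = signature_scores(expression, ["G1", "G2"])
high, low = split_by_score(scores, group_size=1)
```

Because the gene list and the group size are plain arguments, the same two functions cover any user-supplied signature and threshold; the ranked scores correspond to the waterfall plot, and the two groups are the inputs to the survival comparison.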

Altered key pathways
The eventual development of acquired resistance has been a near-universal observation with targeted cancer therapy. Even in patient samples where acquired resistance emerges, alterations often converge on specific gene modules or pathways, suggesting that even these scenarios could be managed with drugs or drug combinations that target this biochemical and signaling bottleneck [19]. To address this scenario, we defined and unified the altered key pathways for each cancer type that demonstrate how multiple signaling pathways interact via cross-talk and feedback. An example of altered key pathways is shown in Fig. 4a for lung adenocarcinoma. Note that genes are colored according to the abundance of activating or suppressing aberrations (mutations and CNAs). Drugs targeting each gene in the pathway are also listed to help users search for available drugs targeting genes on the upstream or downstream path (Fig. 4b).

Fig. 4 Aberrant key pathways for LUAD. a Mutated genes (BRAF, SETD2 and ARD2 in this case) in the given patient are indicated by a thick red border. The background color is determined by the CNAs (gain in red and loss in blue), with the color depth reflecting the frequency of patients who were affected by mutation or CNAs. b A click on a pill icon opens a window that shows available drugs targeting the gene of interest (BRAF in this example). Drugs are color-coded according to the approval status

Conclusion
Our CGIS software was designed both for clinicians seeking an easy-to-understand report of genomic analysis and for medical scientists who want to explore genomic information to test clinical hypotheses for biomarker development. We integrated ample genomic information from diverse public resources, with manual curation where necessary. We also devised and implemented several novel ideas and tools for investigating the roles of variants, exploring population cohorts, stratifying patients based on genomic data, and identifying drugs based on the pathway view. This is a prototype result of our project, and we will continue to develop more features and modules for enhanced function and convenience.