An integrated clinical and genomic information system for cancer precision medicine
BMC Medical Genomicsvolume 11, Article number: 34 (2018)
Increasing affordability of next-generation sequencing (NGS) has created an opportunity for realizing genomically-informed personalized cancer therapy as a path to precision oncology. However, the complex nature of genomic information presents a huge challenge for clinicians in interpreting the patient’s genomic alterations and selecting the optimum approved or investigational therapy. An elaborate and practical information system is urgently needed to support clinical decision as well as to test clinical hypotheses quickly.
Here, we present an integrated clinical and genomic information system (CGIS) based on NGS data analyses. Major components include modules for handling clinical data, NGS data processing, variant annotation and prioritization, drug-target-pathway analysis, and population cohort explorer. We built a comprehensive knowledgebase of genes, variants, drugs by collecting annotated information from public and in-house resources. Structured reports for molecular pathology are generated using standardized terminology in order to help clinicians interpret genomic variants and utilize them for targeted cancer therapy. We also implemented many features useful for testing hypotheses to develop prognostic markers from mutation and gene expression data.
Our CGIS software is an attempt to provide useful information for both clinicians and scientists who want to explore genomic information for precision oncology.
Deep sequencing is about to become a part of clinical tests, but the probabilistic and complex nature of the results makes it vastly different from conventional clinical tests that are deterministic and simple to use without sophisticated informatics analysis. Systematic interpretation of genomic alterations obtained from NGS data remains challenging especially intended for clinical application. In particular, determining clinical and biological significance of each variant in terms of the diagnostic, therapeutic, and prognostic implications for individual patients poses considerable difficulties due to the inconsistency in biological annotations on human genome, variations, and therapeutics from various parties . Furthermore, the complexity in NGS data analysis procedure makes it unrealistic for practicing oncologists to grasp meanings and uncertainties of the results easily without ongoing education in genomics and bioinformatics. Thus, a systematic and easy-to-understand interpretation system with a readily accessible knowledgebase is urgently needed to identify specific genomic alterations and genotype-matched therapeutic options with clinical relevance, the most critical step in implementing precision oncology.
Recently several groups reported implementation of CGISs which addressed computational and clinical issues involved. PathOS is a web-based CGIS incorporating variant filtering, curation and reporting, but it was mostly for targeted (amplicon) gene sequencing and did not include variant-level recommendation of targeted drugs . CVE was developed as an R package to identify drivers, resistance mechanisms and to assess druggability, but lacks support for patient cohort population . Most systems are focused on either NGS data processing and annotation, or information management issues relevant to clinical applications. Thus, it would be desirable to develop a comprehensive information system that supports diverse features helpful for cancer precision medicine not only for clinical service providers but also for medical scientists. Here, we describe a CGIS implementation of such features and discuss the key bioinformatic challenges in software development.
Overview of system and features
The aims of our CGIS software are (1) to provide a clinical report of recommended therapies with full variant-level annotation based on NGS data analysis and (2) to support medical scientists for exploring patient cohort data to test hypotheses for developing patient stratification schemes, molecular biomarkers, and alternative treatment options.
Representative features are as follows in the order of information processing as summarized in Fig. 1:
NGS data processing which includes variant calling from whole exome sequencing (WES) data and expression quantification from whole transcriptome sequencing (WTS, a.k.a. RNA-seq) data. We calculate the somatic single nucleotide variants (SNVs), insertions and deletions (INDELs), and copy number variations (CNVs) using Mutect , Strelka , and EXCAVATOR , respectively. The MapSplice-RSEM [7, 8] pipeline was used for RNA-seq quantification to warrant accuracy in spite of long computation time. Galaxy  pipelines for WES and WTS data processing are shown in Additional file 1: Figure S1 and Additional file 2: Figure S2 respectively. We also provide Galaxy workflow files for WES and WTS data processing in Additional files 3 and 4 respectively so that those files can be imported into another Galaxy server. Additionally, users can upload their own FASTQ files into our BioCloud system for processing NGS data and for getting the various reports described below. Step by step demonstration for this procedure is fully described in Additional file 5.
Import of clinical information from patient’s medical record, which includes de-identification and encryption using standard data model of NCI Clinical Data Elements (https://gdc.cancer.gov/clinical-data-elements).
Variant annotation and prioritization to identify driver alterations or targeted drugs. Genomic alterations were curated at both the gene and variant levels to identify function-affecting variants in cancer genes of the COSMIC database .
Targeted therapy with clinical relevance to obtain “actionable” targets of different significance. Many curated resources were amassed to establish the list of actionable target genes and variants (i.e. cases where the targeted drugs are available clinically).
Pathway view of genomic alterations and available targets. Key pathway genes are manually curated for several cancer types to enhance mechanistic understanding that might lead to alternative therapies.
Patient stratification and survival analysis which facilitate medical scientists to test clinical hypotheses for the purpose of developing diagnostic or prognostic molecular markers. We support patient classification by the mutual exclusivity of somatic mutations and by the gene expression signatures.
Clinical report system to help clinical decision in an easy-to-use GUI format.
High-quality interpretations of individual genomic variants inevitably requires vast amount of information collection, proper data modeling, curation of raw data, and integration to build a comprehensive knowledgebase. BioDataBank is our knowledgebase encompassing gene, protein, gene variants in cancer, population (cohort) data, and drugs for clinical therapy. Table 1 is the list of resources that we integrated to build the BioDataBank. Specifically, cancer gene variants were catalogued from the COSMIC  and TCGA databases. Curated information on targeted drugs in clinical use or in clinical trials were amassed from various databases such as OncoKB , MyCancerGenome , and the Personalized Cancer Medicine Knowledge Base  (see Table 1).
Cohort database and selection of background patients
Patient grouping and management is an essential part of CGIS to identify other patients with similar mutations or gene expression pattern, which can be used to predict the progress of the disease as well as to identify appropriate therapies. For example, identifying patients with similar molecular characteristics makes it possible to interrogate clinical questions like ‘how did the cancer progress?’ and ‘what would be the effective or non-effective treatments?’. Cancer omics data at population scale is also important for patient stratification to identify subtypes on molecular basis. Our cohort database contains the TCGA multi-omics data with clinical information for each patient. It includes SNVs, CNVs, RNA expression data on 1845 tumor and 1929 normal samples across three focused cancer types (breast invasive carcinoma, glioblastoma multiforme, and lung adenocarcinoma) currently. To support researchers to find patient cohorts that meet their study goals, we implemented a filtering scheme to select the patient cohort based on their clinical or molecular features, including histological subtypes, risk factors, mutational features, diagnosis, therapeutic actions, and treatment outcome at an individual patient level. For example, EGFR was the most frequently mutated gene among female and lifelong never-smoker patients, whereas TP53 mutation was prevalent in other patients, which can be readily confirmed using our cohort explorer for the TGCA LUAD cohort (shown in Additional file 6: Figure S4).
Variant annotation and Druggability
The variant calling process using the WES Galaxy pipeline produces VCF (variant calling format containing details of variants) and BAM (binary alignment map for aligned reads) files, which are imported to the variant annotation and prioritization module of CGIS. We used Oncotator as the main tool for annotating genomic point mutations and short indels . Since many transcripts can be made from the same gene, transcript selection is an important issue in variant annotation. For example, EGFR chr7:55259515 T > G mutation can be annotated as p.L858R only through proper choice of transcript among many different EGFR transcripts. In an effort to resolve this issue, we use the UniProt’s canonical sequence as the reference to collect all transcripts that produce the canonical protein sequence in translation. We further added transcripts concordant with all clinically actionable variants in MyCancerGenome . Resulting list of transcripts was provided to Oncotator  with the command line option of (−c) to make these transcripts as primary annotation targets. An example of variant annotation results is shown in Fig. 2a.
Drugs targeting specific variants of the patient are of prime interest. As listed in Table 1, we compiled various resources on cancer drugs for targeted therapies both in clinical usage and in preclinical development. Specifically, we categorized drugs into three groups – 1) in-house curated drugs for actionable targets which include the FDA-approved drugs, 2) drugs reported in PubMed abstracts obtained from systematic text mining and manual curation, and 3) OncoKB  drugs that classified drugs in four levels of reliability according to clinical applicability. We carefully characterized (potential) clinically relevant alterations and assigned available drugs to somatic mutations at the variant and gene levels. For the in-house curated drugs for actionable targets, we included the FDA-approved drugs, drugs in clinical trials referenced by highly reliable sources such as MyCancerGenome , IntOGen , Handbook of targeted cancer therapy , and manual searches in the New England Journal of Medicine journal. Drugs from text mining were obtained from VarDrugPub  that identified the variant-gene-drug relations in all the PubMed abstracts using a machine learning method.
We further provide filtering utility to select genes of known importance in cancer as well as variants based on patient frequency and functional impact (Fig. 2a). The list of known cancer genes was obtained from the Cancer Gene Census of COSMIC (616 genes) . Users may also select the cancer drug targets in clinical practice (26 genes) that were curated by MD Anderson personalized cancer medicine Knowledgebase . These two sets of cancer genes may be the prime targets of personalized treatment and can be focused by the checkbox filtering as shown in Fig. 2a.
It is often the case that users want to examine the details of specific mutation. We provide three interactive plots for efficient variant exploration. The mutation distribution plot (Fig. 2b) shows the mutation spot on the gene structure with functional domains. Mutation frequency among TCGA patients with the same cancer type is shown in the needle plot format. We also show the read alignment plot (Fig. 2c) so that users can check the validity of mutation calls and allele frequencies. To implement this feature without carrying the large-sized BAM file, our NGS pipeline creates a reduced BAM file that contained the read alignments near the mutation points only. Lastly, we support the co-mutation plot to examine the landscape of somatic mutations and CNVs (Additional file 7: Figure S3). Mutations in a specific patient can be readily compared with the cohort population such as the TCGA data.
In sum, our variant annotation and prioritization scheme based on knowledge of cancer genes and targeted drugs provides an efficient way of scrutinizing clinical relevance of somatic variants in a given cancer type.
Patient stratification and survival analysis
Proper stratification of patients is the most fundamental concept of targeted precision medicine. We implemented two most commonly used methods of grouping patients based on mutation and gene expression data. Survival analysis of resulting patient groups can be carried out interactively to facilitate hypothesis test of survival benefit for clinicians.
Mutual exclusivity among driver mutations based on signaling networks
In tumor, not one but several alternative driver alterations in different genes can lead to similar downstream events. A key observation is that when a member of a substitutive gene set is altered, the selection pressure on the other members is diminished or even nullified. As a result, the mutation pattern of alternative driver genes appears almost mutually exclusive among different patients. We use Mutex program  to identify mutually exclusive set of genes with a common downstream effect on the signaling network and implemented survival analysis for altered vs. unaltered patient groups. An example of the TP53 signaling module targeting HIF1A gene is shown in Fig. 3a, taking TCGA LUAD as the patient cohort. Note that the gene alteration includes both somatic mutations and CNVs here.
Patient grouping by gene expression signatures
DNA sequencing will not be sufficient to optimally select patients for all classes of targeted therapy. In fact, other types of high-throughput technologies, including RNA sequencing, DNA methylation profiling, and small RNA profiling, are being extensively used to identify cancer subtypes and to further improve our understanding of their biological mechanisms. RNA sequencing is the closest to the clinical applications . For example, OncotypeDX based on expression profile of 21 genes predicts accurately recurrence of early-stage ER-positive breast cancer, demonstrating the possibility of molecular prognosis . We implemented a scheme to sort out patients according to the risk score based on expression value of pre-defined genes (Fig. 3b). The score was derived from the average expression value of 103 genes that defined the metastatic subgroup in our in-house study. Patients in the TCGA LUAD cohort were ranked by the score in the waterfall plot, and we defined the highest and lowest 60 patients as high and low score groups respectively. The difference in the overall survival rate between two groups indicates that the corresponding signature genes may have prognostic value in lung adenocarcinoma. Notably, the list of scoring genes and threshold for defining patient groups are provided by users interactively. Thus the system is flexible enough test diverse clinical hypotheses.
Altered key pathways
The eventual development of acquired resistance has been a near universal observation with targeted cancer therapy. Even in patient samples where those acquired resistance emerges, alterations often converge on specific gene modules or pathways, suggesting that even these scenarios could be managed with drugs or drug combinations that target this biochemical and signaling bottleneck . To address this scenario, we defined and unified the altered key pathways for each cancer type that demonstrate how multiple signaling pathways interact via cross-talk and feedback. An example of altered key pathways is shown in Fig. 4a for lung adenocarcinoma. Note that genes are colored according to the abundance of activating or suppressing aberrations (mutations and CNAs). Drugs targeting each gene in the pathway are also listed to help users search available drugs targeting genes on up- or down-stream path (Fig. 4b).
Our CGIS software was designed both for clinicians seeking for an easy-to-understand report of genomic analysis and for medical scientists who want to explore genomic information to test clinical hypotheses for biomarker development. We integrated ample genomic information from diverse public resources with manual curation if necessary. We also devised and implemented several novel ideas and tools for investigating roles of variants, exploring population cohorts, patient stratification based on genomic data, and drugs based on pathway view. This is just a prototype result of our project and we will continue to develop more features and modules for enhanced function and convenience.
Availability and requirements
Project name: Clinical and Genomic Information System (CGIS) for cancer precision medicine.
Project home page: http://18.104.22.168
Operating system(s): Platform independent.
Other requirements: Node.js version 4.4.7 or higher, MySQL version 5.6 or higher, D3.js, igv.js.
License: Proprietary (allowed for non-commercial use only).
Any restrictions to use by non-academics: restricted by the license.
Binary Alignment Map
Copy Number Aberration
Copy Number Variation
Insertion or Deletion
Next Generation Sequencing
Sing Nucleotide Variant,
The Cancer Genome Atlas,
Variant Call Format
Whole Exome Sequencing
Whole Transcriptome Sequencing
Ghazani AA, Oliver NM, St. Pierre JP, et al. Assigning clinical meaning to somatic and germ-line whole-exome sequencing data in a prospective cancer precision medicine study. Genet Med. 2017;19:787.
Doig KD, Fellowes A, Bell AH, et al. PathOS: a decision support system for reporting high throughput sequencing of cancers in clinical diagnostic laboratories. Genome Med. 2017;9(1):38.
Mock A, Murphy S, Morris J, Marass F, Rosenfeld N, Massie C. CVE: an R package for interactive variant prioritisation in precision oncology. BMC Med Genet. 2017;10(1):37.
Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213–9.
Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics. 2012;28(14):1811–7.
Magi A, Tattini L, Cifola I, et al. EXCAVATOR: detecting copy number variants from whole-exome sequencing data. Genome Biol. 2013;14(10):R120.
Wang K, Singh D, Zeng Z, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12(1):323.
Afgan E, Baker D, van den Beek M, et al. The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016;44(W1):W3–W10.
Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45(D1):D777–83.
Chakravarty D, Gao J, Phillips S, et al. OncoKB: a precision oncology Knowledge Base. JCO Precis Oncol. 2017;1:1–16.
Taylor AD, Micheel CM, Anderson IA, Levy MA, Lovly CM. The path(way) less traveled: a pathway-oriented approach to providing information about precision Cancer medicine on my Cancer genome. Transl Oncol. 2016;9(2):163–5.
Johnson A, Zeng J, Bailey AM, et al. The right drugs at the right time for the right patient: the MD Anderson precision oncology decision support platform. Drug Discov Today. 2015;20(12):1433–8.
Ramos AH, Lichtenstein L, Gupta M, et al. Oncotator: Cancer variant annotation tool. Hum Mutat. 2015;36(4):E2423–9.
Rubio-Perez C, Tamborero D, Schroeder MP, et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell. 2015;27(3):382–96.
Karp DD, Falchook GS. Handbook of targeted cancer therapy. Sacramento: Wolters Kluwer; 2014.
Variant-Gene-Drug Relations Database. http://vardrugpub.korea.ac.kr. Accessed 10 Jul 2017.
Babur Ö, Gönen M, Aksoy BA, et al. Systematic identification of cancer driving signaling pathways based on mutual exclusivity of genomic alterations. Genome Biol. 2015;16(1):45.
Hyman DM, Taylor BS, Baselga J. Implementing genome-driven oncology. Cell. 2017;168(4):584–99.
Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast Cancer. N Engl J Med. 2004;351(27):2817–26.
Parker JS, Mullins M, Cheang MCU, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160–7.
Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8.
The publication cost of this article was funded by the Technology Innovation Program of the Ministry of Trade, Industry and Energy, Republic of Korea (10050154).
Availability of data and materials
About this supplement
This article has been published as part of BMC Medical Genomics Volume 11 Supplement 2, 2018: Proceedings of the 28th International Conference on Genome Informatics: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-11-supplement-2.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figure S1. Galaxy workflow for WES data processing. (PNG 221 kb)
Figure S2. Galaxy workflow for WTS data processing. (PNG 111 kb)
Galaxy workflow file (json data format) for WES data processing, it can be imported to another Galaxy server. (GA 53 kb)
Galaxy workflow file (json data format) for WTS data processing, it can be imported to another Galaxy server. (GA 23 kb)
Instruction for users to upload their own FASTQ files into our BioCloud system so that they can process the NGS data and get the various reports described in main script. (PDF 1060 kb)
Figure S4. An example of filtering process to select a patient cohort based on clinical information or properties. A. Selection of female and lifelong never-smoker patients in the TCGA LUAD cohort. (“Cohort Selection” menu is located in left-top side of the page) B. Driver genes were sorted by mutation frequency by clicking the “# Mutations” label at the bottom. The sorting result confirmed that EGFR is the most frequently mutated gene among these patients, whereas TP53 mutation was prevalent in other patients as shown in Additional file 7: Figure S3. (PNG 179 kb)
Figure S3. Cohort explorer for the whole TCGA LUAD cohort and our patient (1) Significant driver genes identified by MutSigCV . Each horizontal bar represents total count of mutations on the corresponding gene in the cohort. Color scheme indicates the coding properties of mutations. (2) The gray bar represents –log10(p-values) of each driver gene. (3) Sample-wise count of mutations with coding properties color-coded. (4) Clinical features of samples. (5) Mutations found in our patient are plotted at left-most side (i.e. the first column). (PNG 120 kb)