Bioinformatics workflows for genomic analysis of tumors from Patient Derived Xenografts (PDX): challenges and guidelines

Bioinformatics workflows for analyzing genomic data obtained from xenografted tumor (e.g., human tumors engrafted in a mouse host) must address several challenges, including separating mouse and human sequence reads and accurate identification of somatic mutations and copy number aberrations when paired normal DNA from the patient is not available. We report here data analysis workflows that address these challenges and result in reliable identification of somatic mutations, copy number alterations, and transcriptomic profiles of tumors from patient derived xenograft models. We validated our analytical approaches using simulated data and by assessing concordance of the genomic properties of xenograft tumors with data from primary human tumors in The Cancer Genome Atlas (TCGA). The commands and parameters for the workflows are available at https://github.com/TheJacksonLaboratory/PDX-Analysis-Workflows.


Introduction
Patient-Derived Xenograft (PDX) models are in vivo preclinical models of human cancer for translational cancer research and personalized therapeutic selection [1][2][3][4][5][6][7]. Previous studies have demonstrated engrafted human tumors retain key genomic aberrations found in the original patient tumor [3,8,9] and that treatment responses of tumor-bearing mice typically reflect the responses observed in patients [6,10]. PDXs have been used successfully as a platform for pre-clinical drug screens [6,7,10], to facilitate the development of potential biomarkers of drug response and resistance [6,7,11], and to select appropriate therapeutic regimens for individual patients [8].
The Jackson Laboratory (JAX) PDX Resource has over 400 PDX models from more than 20 different types of cancer. A schematic summarizing the processes used for model generation, quality control, and characterization process for the resource is shown in Figure 1. Genome characterization of PDX tumors includes the identification of somatic mutations, copy number alterations, and transcriptional profiles. Over 100 of the models have been assessed to date for responses to various therapeutic agents. The integration of results from dosing studies with genomic data for the models has been successfully applied to the identification of novel genomic biomarkers associated with treatment responses [12].
To generate accurate calls for mutations and copy number variants for human tumors engrafted in a mouse host, several challenges had to be addressed. First, because human stroma is replaced by mouse cells and tissues during tumor engraftment, sequence data generated for PDX tumors includes both mouse and human sequences. As the protein-coding regions of the mouse and human genomes are 85% identical on average, there is a high risk of introducing false positive variants in functional regions and erroneous gene expression [13][14][15].
Second, because the tumor material used to create models in the JAX PDX Resource consisted of material that remained following clinical pathology assessment (i.e. material was not collected specifically for xenograft model creation), paired normal samples were not available for the majority of tumor samples used to generate the PDXs. The absence of normal tissue complicates the ability to distinguish germline variants from somatic alterations (point mutations, indels and copy number aberrations) in the tumor [16][17][18][19]. Third, false positive (FP) variants due to errors in sequencing and mapping require additional filtering steps in the computational workflow [20][21][22]. Finally, it has been reported previously that the immunodeficient host mice are susceptible to forming B-cell human lymphomas during engraftment due to Epstein-Barr virus (EBV)associated lymphomagenesis [23][24][25][26][27]. Systematic screening of PDX tumor samples for EBV transformation is an important step in quality assurance for the integrity of PDX repositories.
Here, we describe bioinformatics analysis workflows and guidelines (https://github.com/TheJacksonLaboratory/PDX-Analysis-Workflows) that we developed for the for the analysis of genomic data generated from PDX tumors (http://www.tumor.informatics.jax.org/mtbwi/pdxSearch.do). These workflows incorporated established tools and public databases and were tailored to address the specific challenges mentioned above by tuning parameters and addition of filters. We demonstrate how our methods, using simulated and experimental data, improve the accuracy in the detection of somatic alterations in PDX models. We also developed a classifier based on expression data to systematically identify and filter out EBV transformed samples. Finally, to verify the effectiveness of our workflows, we show the overall concordance of the genomic and transcriptomic profiles of the PDX models in the JAX PDX resource with relevant tumor types from The Cancer Genome Atlas (TCGA).

Workflow for calling somatic point mutations and indels in PDX tumors
A schematic of the variant calling workflow we implemented for human tumors engrafted in mice is shown in Figure 2A and 2B (see Methods).
Preprocessing and removal of mouse reads. Human and mouse DNA reads were classified by Xenome [13], which had shown reliable performance in separate studies [28], and only human reads were used for subsequent variant calling. The percentage of mouse reads within the PDX samples in the JAX resource has a median value of 5.30% (range: 0.00163% -65.1%) ( Figure 3A). Using simulated CTP datasets, we verified that omitting the Xenome step to filter the mouse reads resulted in very low precision ( Figure 3B), i.e. large number of FPs, in the absence of the quality hard filters (Supplementary Table S1). These FPs were due to mouse reads being aligned to the reference genome with mismatches and subsequently called as variants with low quality scores (QD).
While the default thresholds for GATK hard filtering parameters [29] removed a large proportion of the FPs, applying Xenome to filter for human reads yielded superior performance in terms of substantially higher precision, as well as improvement in recall. In addition, Xenome filtering maintained the correlation between the predicted versus actual allele frequencies, which would otherwise decrease with higher mouse contamination (Supplementary Table S2).

Filtering germline variants.
To enhance filtering out germline variants from somatic mutations, we sequenced and analyzed 20 normal blood samples using the CTP targeted panel. As shown in Supplementary Figure S1A and S1B, 87% of the variants identified in normal blood had allele frequencies of 40% -60% or >90% across all the samples, indicating the presence of heterozygous or homozygous common variants, respectively. Ninety-one percent of the variants identified in these 20 samples were annotated in the public germline databases.
4% of these variants were not found in public germline databases, but were recurrent in these normal samples or across the PDX tumors in our collection (Supplementary Figure S1C) and so were added to our list of putative germline variants. Only 5% of all of the variants in the 20 samples were private events.
Based on these observations, the variants in each PDX tumor with an allele frequency of 40% -60% or >90%, and present in either public germline database or our list of putative germline variants (Supplementary Table S3) were filtered out as germline variants (Supplementary Table S3). This was a more conservative approach given that these known germline variants in regions of copy number alterations where the ratio of both alleles were not balanced would not be filtered. Figure 3C shows that the germline filters effectively rectified the estimated somatic mutational load in the PDX tumors (Supplementary Table S5) by about  four-fold reduction (Supplementary Table S4), which was reasonable as a large proportion of the variants were expected to be germline.

Filtering false positives due to systematic errors. Putative somatic variants with
no known effects in cancer that recur across large numbers of PDX samples are potentially FPs arising from sequence assembly based error in the reference the genome, sequencing errors or alignment errors in low mappability regions [30].
To detect these, we filtered out the variants at loci that were recurrently mutated in ≥25% of PDX tumors ( Figure 2C). The distribution of tumor types for each of these recurrently mutated positions (n=52) was highly similar to the overall distribution of tumor types in the PDX resource (Supplementary Figure S2A) with Pearson correlation coefficient >0.9 (Supplementary Figure 2B). This implies that these mutations were systematic errors and were not selected for any tumor type, and thus, biologically irrelevant. Filtering these highly recurrent loci did not significantly reduce the predicted mutational load per tumor ( Figure 3C and Supplementary  Figure S1A). To address the balance of false positive and false negative mutation calls, we "rescued" variants that were initially filtered out based on curated annotations available in the JAX-Clinical Knowledgebase (CKB, https://ckb.jax.org/) [31]. The criteria for rescuing variants included those with 1) known or predicted gain or loss of protein function, 2) potential treatment approach for any cancer type and 3) drug sensitivity and resistance effects in clinical or preclinical studies (Supplementary Table S4). We also included an additional indel caller, Pindel [32], in the workflow in order to increase the sensitivity of indel prediction. As Pindel results contained a large number of FPs, we only included those that were present in the JAX-CKB by the same criteria.
Overall, 127 unique variants from 52 genes (1.03% of the total and 2.21 % of the filtered unique variants detected by the CTP platform) were rescued from 381 PDX CTP samples. Nine of these mutations have been validated to be present in the PDX model ( Figure 3D). Almost all were initially filtered as germline events, as many well-known actionable cancer mutations (e.g. BRAF V600E and KRAS G12C) are present in the dbSNP database and were filtered if they fall within the germline allele frequency. Two other variants that were not called by GATK initially but were detected by Pindel were rescued as they were annotated clinically relevant.  Figure 3D).

Gene expression analysis in PDXs
A schematic overview of the PDX gene expression workflow is provided in Figure   4A (see Methods).

Screening of EBV-associated lymphomas by RNA-Seq expression data.
We observed that the EBV-associated lymphoma tumors that arise in PDX samples display a distinct and highly reproducible expression pattern, regardless of the platforms in which the expression was measured (RNA-Seq, Affymetrix Human  Figure S5). This expression profile was also independent of the tissue of origin of the tumors from which the EBV-associated lymphomas were derived. Given the high similarity in expression profiles, we identified a gene signature based on the most differentially expressed genes between EBV-associated lymphomas and non-EBV-associated tumors (data not shown). Using gene set analysis, we observed that genes associated with Blymphocytes and other immune processes were over-expressed, while cell-to-cell communication and adherence genes were suppressed (data not shown). We developed a classifier that scored each PDX sample based on the expression levels of the genes in the gene signature (Supplementary Table S6). This single score, when applied on RNA-Seq data, was able to effectively distinguish PDX tumors that were either EBV-transformed or originated from human lymphomas from non-lymphoma PDX tumors ( Figure 4B). Overall, 8.5% (32 out of 376) of the nonlymphoma PDX samples with RNA-Seq data in the PDX resource progressed to EBV-associated lymphomas. These tumors were further confirmed to be CD45 positive by immunohistochemistry (IHC) staining, which is the primary tool at JAX to identify PDX tumors that are EBV-transformed.

Copy Number Variant (CNV) analysis in PDXs
A schematic overview of the PDX CNV workflow is provided in Figure 5A  Effect of mouse DNA on CNV calls. We studied the effect of mouse contamination on array data by hybridizing DNA of the NSG mouse on the human SNP array, and observed that the signal intensity from mouse DNA is negligible (Supplementary Figure S6). Samples with higher mouse content are more likely to result in failure of the standard array quality control or the analysis workflow, due to lower amount of human DNA to give sufficient probe signal, thus enabling samples with substantial mouse contamination to be screened out.

Absence of matched normal to call somatic copy number aberrations.
We  Table S7), indicating that the single-tumor CNV analysis was sufficiently robust.

Establishing the appropriate baseline to call copy number gains and losses.
We analyzed the effects of using different baselines for "normal state" to compute copy number gains and losses using a list of significantly amplified and deleted genes from TCGA (Supplementary Figure S8). When the overall cancer genome ploidy was used as the normal baseline, we observed a balance of a larger proportion of the significantly amplified being called copy number gain, and similarly a larger proportion of the significantly deleted genes being called copy number loss among the PDX samples (Supplementary Figure S9). However, more of both significantly amplified and deleted genes were being classified as amplified when copy number aberrations were calculated relative to the diploid state. While the average ploidy could be estimated differently across the samples for the same model, the copy number changes relative to ploidy remained consistent ( Figure 5B and Supplementary Figure S7).

Effects copy number aberrations on expression changes.
We observed that the estimated copy number gains and losses of known oncogenes (n=23) and tumor suppressor genes (n=40) [33], relative to the average ploidy per PDX sample, generally results in expression fold change (relative to the average expression at copy number normal state) in the same direction (Supplementary Table S8) [11,34,35]. Most of these genes show significant over-expression with copy number gain and significant under-expression with copy number loss across the PDX samples (p<0.05) ( Figure 5C and Supplementary Figure S10). This shows that the baselines to call copy number gain and loss, and over and under-expression, were correctly established. This significant observation, however, did not hold when we did a global analysis across all genes instead of selected oncogenes and tumor suppressor genes. This was because many genes were not expressed in the respective tissue types even though they were in regions affect by copy number alterations, and the expression of many genes, despite being non-altered regions, could be regulated by other mutations or epigenetic mechanisms in the tumors.

Comparison of genomic and transcriptomic profiles of PDX models and TCGA patient tumors
Due to the lack of paired-normal samples for the PDX models in the JAX PDX Resource, we were unable to experimentally validate the somatic calls predicted from the various workflows. To determine if the results of our genomic analysis workflows were similar to known somatic profiles of the same tumor type, we compared the overall genomic and transcriptomic profiles for selected tumor types between the JAX PDX resource and patient tumor cohorts in the TCGA.

Frequently mutated genes in primary patient tumors in TCGA detected in the
PDX resource. The distribution of somatic coding non-silent mutational load of the CTP genes for each tumor type was comparable between PDX and TCGA ( Figure 6A). Despite the much smaller sample size for each PDX tumor type, we still observed higher mutational load in colorectal cancer and melanoma.
Nonetheless, the overall mutational load remained higher in PDX tumors, which could be possibly due to the fact that the PDX tumors were sequenced at a higher coverage (>900X) using the CTP targeted panel, and thus more variants were detected per base pair compared to exome sequencing (~100X) of TCGA tumors.
Moreover, known germline variants with allele frequency outside the range of 40% -60% and >90%, possibly due to errors in allele frequency estimation or copy number aberrations at the variant position, as well as private germline variants, were not filtered. The mutations in TCGA were curated with partial experimental validations, hence the mutation count and FP rate were expected to be lower. Given that there were more samples in the TCGA cohorts, we compared the genes that were mutated at 5% frequency with genes that were mutated in at least one sample within the same tumor type in the PDX resource. Almost all genes mutated at high frequencies in TCGA tumors were mutated in PDX tumors, with significant p-values (p < 1´10 -4 ) by Fisher's exact test ( Figure 6B, Supplementary  Table S9). This indicates that the key drivers by mutation within each cancer type were preserved in PDX tumors.  Table S10) were able to independently cluster both TCGA samples and the PDX samples by their tumor types ( Figure 6C). We observed clusters of genes that were highly expressed in specific tumor types in TCGA were recapitulated in the PDX expression data (hypergeometric p-value <  Table S11).

Copy number profiles of primary patient tumors in TCGA recapitulated in PDX
resource. We showed that the frequency of genome-wide copy number aberrations for each tumor type in the PDX resource (Supplementary Table S12  Consistent with the earlier observations, there was a weaker concordance with TCGA data when amplification and deletion was called relative to the diploid state (Supplementary Figure S14B).

Discussion
The application of PDX models in pre-clinical research and personalized therapy requires that the engrafted human tumors are accurately characterized for tumorspecific mutations [3]. The development of bioinformatics workflows to call somatic mutations (SNVs, Indels), copy number aberrations and gene expression from PDX sequencing or array data requires balancing sensitivity and specificity [22,30], especially when paired normal samples for engrafted tumors are not available. Using genomic and transcriptomic data from models in the JAX PDX Resource, we conducted a systematic analysis to address several key data analysis challenges and tailored our workflows to optimize the sensitivity and specificity of the results.
Our recommendations for the somatic mutation calling from PDX DNA sequencing data in the absence of paired-normal samples are as follows: • Remove mouse reads with Xenome (or equivalent) to eliminate variants called from mouse reads mapping to the human reference genome • Filter with germline variant databases to improve somatic mutation calling • Filter highly recurrent mutations to remove false positives arising from sequencing or analysis related errors • Rescue clinically relevant variants which were filtered in the upstream steps as they were likely to be present as important mutations in the tumor Despite implementing multiple filters to remove putative germline and other FP mutations, the mutation rate remains higher in PDX tumor types when compared to TCGA. One possible reason for this difference that is not related to the informatics challenges described in this paper is that many of the human tumor samples used to generate PDX models arose from metastatic lesions and from patients with prior treatment whereas many of the tumor samples used for TCGA were early stage tumors. PDX tumors were thus expected to harbor more mutations due to longer tumor evolution [36,37]. Also, previous studies have noted that PDX engraftment success is better for late stage tumors that are likely to have more aggressive phenotypes than early stage tumors [38,39]. As such, there is a likelihood for biased selection towards such tumor subtypes in the engrafted tumors that are known to harbor more mutations than tumors from early stages.
For evaluation of gene expression differences in individual tumors, matched normal tissue is ideal but not available for PDX models in the JAX PDX Resource.
To compare gene expression among the engrafted tumors, we used expression zscores across all tumor types as the best proxy for calling over-and underexpression. In a subset of PDX samples in which both expression and copy number data are available, we estimated the "normal" expression of each gene with the average expression for samples with normal copy number state, given that sufficient samples are available for the tumor type. While this approach neglects other mechanisms of gene regulation, we were able to better estimate the normal expression for some genes like MYC which tends to be frequently amplified and over-expressed across many tumor types. For copy number, we defined the "normal" state of each PDX tumor using the estimated ploidy to call relative gain and losses as this takes into account errors in ploidy estimation.
As one approach to assessing the results of our genomic characterization workflows, we compared the JAX PDX models with patient cohorts in TCGA at the genomic and transcriptomic level. Other than small differences in genomic mutations, the engrafted PDX tumors reflected the human tumors in copy number variations and gene expression. Using colorectal cancer as an example, we demonstrated that the integration of different data types showed that known perturbed pathways in cancer were altered in a consistent manner across PDX and

Somatic point mutation and indel calling workflow
Preprocessing and removal of mouse reads. DNA sequence data generated from PDX tumors underwent initial data processing as follows: (i) sequence reads with 70% of the bases having a quality score <30 (Q30) were discarded, (ii) bases with quality scores less than Q30 were trimmed from the 3' end of the read, (iii) sequence reads with <70% of bases remain after trimming were discarded, (iv) both reads from pair-end sequencing were discarded if either read was discarded.
If <50% of the total reads remained following the preprocessing steps, the sample was removed from the analysis. Following the initial data processing step described above, mouse reads were identified and filtered out using Xenome v1.0.0 [13]. Only read pairs with both reads classified as human were included in further analyses.
Sequence reads that passed all pre-processing steps were mapped to the reference human genome (build GRCh38.p5 with 262 alternate loci) using the BWA-MEM alignment tool with ALT-Aware mapping (Supplementary Figure S14) [40,41]. Because low sequence coverage leads to poor sensitivity in variant calling, samples with less than 75% of the target region covered at least at ³100X by human reads were excluded from further analysis.  In addition, we verified that these default thresholds were able to detect all the known mutations in the CTP samples [45]. The average number of variants before and after quality filtering across the CTP samples is shown Supplementary Table   S4.

Benchmarking of PDX somatic mutation workflow
To benchmark the PDX somatic mutation workflow, a simulated dataset ( Table S1).
Generation of simulated sequence reads. SeqMaker was used to generate simulated sequencing data based on human genome assembly GRCh38 with varying sequencing depth, read length, duplication rate, sequencing error and base quality range [51]. Reference sequences were extracted from target region of the CTP panel. Sequence reads for 5 samples were simulated using predicted mutations from PDX models of different cancer types from the CTP dataset to represent different spectrum of mutations, with a range of allele frequency to mimic germline and somatic mutations. For each simulated sample, we generated three technical replicates at 500X, 1000X and 1500X coverage.

RNA-Seq expression workflow
Data processing and expression estimation. Prior to alignment to the human transcriptome, sequences from PDX tumors were processed for sequence quality.
Only sequences with base qualities ≥30 over 70 percent of read length were used in downstream analyses. Quality trimmed reads were then analyzed using the default parameters of Xenome v1.0.0 (k=25) [13] Table S10). For each PDX sample, the upper-quantile normalized counts from RSEM of the classifier genes were transformed into z-scores using the mean and standard deviation computed across all PDX samples for each gene.

Classifier for EBV-associated PDX lymphomas
Subsequently, a sign corresponding to the direction of regulation in the classifier table was multiplied to each z-score and the sum of these modified z-scores resulted in a single score for each PDX sample. A classifier score of >3.0 was used to identify a PDX tumor sample as a potential EBV-associated lymphoma.

Copy Number Variant (CNV) workflow
Assessing the effects of mouse DNA on SNP array. DNA of the NSG mouse was hybridized on the Affymetrix SNP 6.0 array, and the signal intensity was extracted from the CEL files using Affymetrix Power Tools (apt-cel-extract). The mouse content for each PDX sample was estimated by the mouse reads proportion computed by Xenome of the mutation calling pipeline for the CTP sequencing of the same PDX sample. [55][56][57] were used to extract the B-allele frequency (BAF) and Annotation of CNV segments. The resultant copy number segments were annotated with loss of heterozygosity (LOH) and log2 ratio of total copy number relative to diploid state (copy number 2) and predicted ploidy from ASCAT. A segment was defined as LOH when the major-allele copy number was ≥ 0.5 and the minor-allele copy number was ≤ 0.1. Gene-level copy number and LOH were estimated by intersecting the genome coordinates of copy number segments with genome coordinates of genes (Ensembl annotation version 84 for genome assembly GRCh38). In cases where a segment boundary was contained within a gene's coordinates, the most conservative (lowest) estimate of copy number was used and the gene was annotated with the number of overlapping segments.

Single-tumor CNV analysis. PennCNV-Affy and Affymetrix Power Tools
Defining copy number gain and loss. The low-level copy number gain or loss of a gene was defined by the log2 ratio of the copy number relative to the average ploidy of the sample or diploid state with a threshold of ±0.4 respectively. We compiled a list of genes with focal copy number aberrations that were significantly amplified (n=273) or deleted (n=820) in the 8 tumor types (Supplementary Table   S8) from the GISTIC 2.0 analysis from the TCGA FireBrowse website (http://firebrowse.org/). Using this set of genes, we compared the proportion of genes that would be classified as gain and loss when using different baselines (diploid state 2 or ASCAT predicted ploidy) for PDX models listed in Supplementary Table S12.
Comparison of copy number aberrations with gene expression. Using annotations from the Cancer Census resource [33] we analyzed the relationship between copy number aberrations and gene expression using a list of 23 oncogenes that are commonly amplified in cancers and a list of 40 tumor suppressor genes that are commonly deleted in cancers. These genes were classified into copy number states of high-level loss (log2(CN/ploidy) < -1), normal (-1 ≤ log2(CN/ploidy) ≤ +1) and high-level gain (log2(CN/ploidy) > +1). The expression fold change of each gene was calculated as the log2(TPM+1) relative to the mean expression across PDX samples with a stringent normal copy number state (-0.4 ≤ log2(CN/ploidy) ≤ 0.4).
The significance of expression changes of each gene for the entire PDX resource with copy number gain or loss relative to the normal state was calculated using the Student's t-Test.

Comparison between PDX and TCGA data
Somatic mutations. We calculated the distribution of mutational load (number of non-silent, coding mutations in exonic regions per sample) of the CTP genes for 6 tumor types with at least 10 models in the PDX resource (colorectal cancer, lung adenocarcinoma, lung squamous cell carcinoma, melanoma, bladder carcinoma and triple-negative breast cancer , Supplementary Table S5). MAF files for somatic mutations based on whole-exome sequencing of the TCGA samples of 6 tumor types [60][61][62][63][64] were obtained from TCGA Data Portal and were used to compute the mutation frequency for CTP genes only. The Fisher's exact test was used to test the significance of overlap of mutated genes between the PDX resource and TCGA patient cohorts for each tumor type. The genes in each PDX resource were considered if they were mutated in at least one sample, while the genes in each TCGA tumor cohort were considered if they were mutated with at least 5% frequency, due to a much larger sample size.    The same set of genes (omitting non-expressed genes) is used to cluster the expression z-score by Hierarchical clustering of PDX RNA-Seq models across different tumor types. Gene sets were identified to be high expression in specific tumor types TCGA and PDX separately and were found to share significant overlap. (D) Correlation frequency of genes that are over-expressed (z-score of log2(TPM+1) > 1, green) or under-expressed (z-score of log2(TPM+1) < -1, orange) across each tumor type between PDX models and TCGA samples. (E) Correlation of frequency of copy number gain (red) or loss (blue) of selected genes frequently amplified or deleted in TCGA tumors predicted by GISTIC analysis for each tumor type between PDX and TCGA datasets. Supplementary Figure S4 This figure shows the benchmarking of the CTP variant calling pipeline using 45 simulated sequencing datasets different samples, sequencing coverages, and mouse DNA content (see Supplementary Table S2) using precision, recall and F1 score based on the input variants for each sample. Complete: variant calling pipeline with all steps included; NoXenome: variant calling pipeline with Xenome omitted; NoAltaware: variant calling pipeline using hg38 reference with alternate sequences but using standard BWA for mapping instead of BWA-ALT-Aware; all: all variants called by the pipeline; pass: variants annotated as "PASS" in the pipeline which pass the hard filters, minimum read depth and minimum alternate allele frequency of the variant.

RNA-
Presence of alternate loci in the genome assembly. The GRCh38.p5 human genome assembly includes 262 regions of alternate loci to account for human chromosomal regions that exhibit sufficient variability to prevent adequate representation by a single sequence [29]. As such, we aligned the reads to both primary and alternate chromosomal reference sequences using BWA-MEM with ALT-aware. When alignment is performed using BWA-MEM only, the recall of the variants is much lower (~30%) than the standard pipeline with or without hard-filtering (Supplementary Figure S14 and Supplementary Table S2). This shows that using an alignment tool not catered for alternate loci mapping reduces the overall sensitivity of the variant calling due to lesser reads being correctly mapped. The correlation of allele frequencies also decreases and the reduction in median allele frequency increases up to 15% (Supplementary  Table S3). Expression fold change of each gene across all PDX samples, defined by fold change of log2(TPM+1) relative to the mean expression of samples with a stringent normal copy number state (-0.4 < log2(CN/ploidy) < 0.4). Here, a higher-level copy number gain and loss is defined as log2(CN/ploidy) > +1 and log2(CN/ploidy) < -1 respectively. The normal copy number state is defined as -1 < log2(CN/ploidy) < +1. Significance in differences in expression by Student's t-test (*: p-value < 0.005, NS: non-significant).  Figure S14 (A) Comparison of frequency of copy number gain (red) or loss (blue) of selected genes frequently amplified or deleted in TCGA tumors predicted by GISTIC analysis for each tumor type between PDX and TCGA datasets using predicted ploidy as a reference state. (B) Comparison of frequency of copy number gain (red) or loss (blue) of genes frequently amplified or deleted in TCGA tumors predicted by GISTIC analysis for each tumor type between PDX and TCGA datasets using diploid as a reference state.  Table S1  This table summarizes the results from the benchmarking studies of the CTP variant calling  pipeline using 45 simulated sequencing datasets different samples, sequencing coverages,  and mouse DNA content.  Supplementary Table S2 This table shows the correlation and difference in median of alternate allele frequencies between input and true positive variants for all the simulated samples. ALL: all variants called by the pipeline; PASS: variants annotated as "PASS" in the pipeline which pass the hard filters, minimum read depth and minimum alternate allele frequency of the variant.