Identification and ranking of recurrent neo-epitopes in cancer

Background Immune escape is one of the hallmarks of cancer and several new treatment approaches attempt to modulate and restore the immune system’s capability to target cancer cells. At the heart of the immune recognition process lies antigen presentation from somatic mutations. These neo-epitopes are emerging as attractive targets for cancer immunotherapy and new strategies for rapid identification of relevant candidates have become a priority. Methods We carefully screen TCGA data sets for recurrent somatic amino acid exchanges and apply MHC class I binding predictions. Results We propose a method for in silico selection and prioritization of candidates which have a high potential for neo-antigen generation and are likely to appear in multiple patients. While the percentage of patients carrying a specific neo-epitope and HLA-type combination is relatively small, the sheer number of new patients leads to surprisingly high reoccurence numbers. We identify 769 epitopes which are expected to occur in 77629 patients per year. Conclusion While our candidate list will definitely contain false positives, the results provide an objective order for wet-lab testing of reusable neo-epitopes. Thus recurrent neo-epitopes may be suitable to supplement existing personalized T cell treatment approaches with precision treatment options.

While not yet fully understood, immune response and recognition of tumor cells containing specific peptides depends critically on the ability of the MHC class I complexes to bind to the peptide in order to present it to a T cell. Neo-antigens can be created by a multitude of processes like aberrant expression of genes normally restricted to immuno-privileged tissues, viral etiology or by tumor specific DNA alterations that result in the formation of novel protein sequences. Furthermore there is now evidence for neo-epitopes generated *Correspondence: dieter.beule@bihealth.de 1 Core Unit Bioinformatics, Berlin Institute of Health Charitéplatz 1 10117 Berlin, Germany 3 Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC) Robert-Rössle-Str. 10 13092 Berlin, Germany Full list of author information is available at the end of the article from alternative splicing [6] and alterations in non-coding regions [7].
With the advent of affordable short read sequencing, comprehensive neo-antigen screening based on whole exome sequencing has become feasible and many cancer immune therapeutic approaches try to utilize detailed understanding of the neo-epitope spectrum to create additional or boost pre-existing T cell reactivity for therapeutic purposes [8,9]. However, in practice the selection and validation of the most promising neo-epitope candidates is a difficult and time-consuming task. The typical approach is based on the private mutational catalogue of the individual patient: exome sequencing data is subjected to bioinformatics analysis and used to predict neoepitopes and their binding affinities to the MHC class I complex. Our study aims to complement this approach by a precision medicine perspective. We search and prioritize neo-epitope candidates which have a high potential for neo-antigen generation and are likely to appear in multiple patients. These neo-antigens hold the potential for development of off the shelf T cell therapies for sub groups of cancer patients. We use epidemiological data to give rough estimates for the expected number of patients in these groups.
Candidate prediction always relies on somatic variant detection workflows and affinity prediction algorithms based on machine learning, see e.g. [10]. Binding prediction far from perfect [11] especially for rarer HLA types, and may also depend on mutational context [12]. Catalogues of the neo-epitope landscape across various cancer entities have been created by various authors [13][14][15]. While neoantigen landscape is diverse and sparse [13], here we provide an unbiased, comprehensive ranking of candidates, defined as neo-epitopes arising from recurrent mutations, predicted to be binding to a specific HLA-1 allele. The candidates are ranked according to the expected number of target patients.

Data sets
Somatic variants for different cancer entities have been determined using matched pairs of tumor and blood whole exome or whole genome sequencing in the TCGA consortium. We downloaded the open-access somatic variants from GDC data release 7.0 [16], consisting of 33 TCGA projects and 10,182 donors in total. Details of the somatic variant calling can be found in [17]. We excluded patients without corresponding entries in the clinical information tables, and 7 projects with less than 100 samples, yielding 9,641 samples covering 26 cancer studies. Figure 1 provides an overview of the complete bioinformatics process, from the GDC somatic single nucleotide variants to the identification of the candidates.

Variant selection
For each sample we selected all single nucleotide variants obtained by the "mutect2" pipeline, that had a "Variant_Type" equal to "SNP", a valid ENSEMBL transcript ID and a valid protein mutation in "HGVSp_Short". From these variants, we selected those with a "Vari-ant_Classification" equal to "Missense_Mutation". We checked that all variants had a "Mutation_Status" equal to (up to capitalisation) "Somatic", that the total depth "t_depth" was the sum of the reference "t_ref_count" and the alternate "t_alt_count" alleles counts, and that the genomics variant length is one nucleotide. To avoid high number of false positives we consider only variants that are supported by at least 5 reads and have a VAF of at least 10%. Furthermore we removed any variant that occurs with more than 1% in any population contained in the ExAC database version 0.31 [18], by coordinates liftover from the GRCh38 to hg19 human genome versions. This way we obtained 26 cancer entity data sets containing a total of 9,641 samples with an overall 1,384,531 variants.

Recurrent protein variant selection
We define recurrence strictly on the protein/amino acid exchange level, i.e. different nucleotide acid variants leading to the same amino acid exchange due to code redundancy will be counted together. Recurrent protein variants are defined within each TCGA study. A protein variant is deemed recurrent when it appears in at least 1% of all the patients in the cohort. As cancer types are only considered when the number of patients involved in the studies is greater than 100, this threshold ensures that every recurrent variant has been observed in at least 2 patients for a given cancer type. To be conservative, the recurrence frequency has been computed using, for the denominator, all patients with clinical information in the study, including those without high-confidence missense SNVs. Using this definition, the total number of recurrent amino acid changes is 1055. A variant recurrent in multiple cancer types is counted multiple times in the above number, the number of unique recurrent variants regardless of the cancer is 869. Additional file 1 shows the most frequent amino acid exchanges across 25 cancer entities, as no variant from project TCGA-KIRC's donors is labeled as recurrent.
Recurrent variants occurring at the same positions (for example when gene's IDH1 codon R132 is mutated to amino acid H, C, G or S) have been merged into 819 variants suitable for comparisons with the cancer hot spots lists [14]. 122 out of the 819 merged variants belong to the set of 470 cancer hotspot variants, and 5 (PCBP1:L100, SPTLC3:R97, EEF1A1:T432, BCLAF1:E163 & TTN:S3271) to the set of presumptive false positives hotspots listed in the supplementary material of [14].

MHC class i binding prediction and epitopes selection
For all recurrent variants identified, we assess in silico their predicted propensity that the amino-acid exchange generates a binding neo-epitope.
A variety of machine learning algorithms have been developed to determine the MHC binding in silico, see ref. [19] for review. Most methods are trained on Immune Epitope Database (IEDB) [20] entries and use allele specific predictors for frequent alleles, while pan-methods are applied to extrapolate to less common alleles. We predicted the MHC class I binding using NetMHCcons [21] v1.1, which predicts peptides IC 50 binding, and classifies these predictions as non-binder, weak and strong binders, based on the relative ranking of binding predictions. As the range of IC 50 binding values strongly depend on the HLA-1 allele [22], we have used the NetMHCcons classification to select our neo-epitope candidates.
For a given recurrent variant and a given HLA-1 type, the epitope prediction pipeline can produce multiple Workflow overview a Overview of the recurrent neo-epitope candidates generation process: TCGA studies are selected for at least 100 donors with clinical annotations. For each of these studies, recurrent strongly supported missense Single-Nucleotide Variants are collected. Neo-epitopes binding to 11 HLA-1 types are predicted, redundancy is removed from that set (see B) and strong binders are retained. b Example of epitope redundancy: the 18 amino-acids long sequence surrounding recurrent variant GLRA3:S274L generates 7 binding neo-epitopes for the type HLA-A*02:01. Our pipeline retains only the strongest predicted binder for a given variant and HLA-1 type pair (the first, with an IC 50 of 8.8 nM in the example). c Number of SNVs occuring in genes classified as Oncogenes or Tumor Suppressors by Vogelstein et al. [28], at various point of the variant selection and neo-epitope selection process overlapping epitopes candidates, differing by their length and/or their position (see Fig. 1B). To remove such size redundancy, only the epitope with the lowest predicted mutant sequence IC 50 is retained. This procedure also removes non-overlapping epitopes, to keep only at most one epitope per recurrent protein variant and HLA-1 type. For comparison we also compute the IC 50 for the respective wild type peptide.

Data QC
To ensure that the proportion of variants caused by technical artifacts is small, we have computed the proportion of SNVs called in poly-A, poly-C, poly-G or poly-T repeats of length greater than 6 have been computed for each data study [23], for unique variants (that occur in only one patient across a project cohort), and for variants that are observed more than once in a cohort (Additional file 3). For comparison, we have computed the expected frequency of such events, assuming that all possible 11mers (the mutated nucleotide at the center, flanked by 5 nucleotides on each side) are equiprobable, regardless of their sequence.
Based on this equiprobable model, we have computed the probability that the number of mutations found in repeat locii is equal to or greater than the observed numbers. When considering variants appearing more than once, this probability is not significant for all studies; when unique variants are considered, those appear in repeat locii significantly more often than expected by chance in 7 out of 26 studies (TCGA-COAD, TCGA-KIRP, TCGA-LIHC, TCGA-READ, TCGA-SKCM, TCGA-TGCT & TCGA-UCAC, significance level set to 0.05 after Benjamini-Hochberg multiple testing correction).

Mice
ABabDII mice (described in detail in [24]) have been used for this study. They are transgenic for entire human TCRα and TCR-β gene loci, as well as for HHD molecule [25] and deficient for the murine Tcr-α and -β chains, as well as for murine β2m and H2-D b genes. The mice used in the study were generated and housed under SPF conditions (caged enriched with bedding material, 3-5 mice/cage, standard light/dark cycle, food and water ad libitum) at the Max-Delbrück-Center animal facility. All animal experiments were approved by the Landesamt für Arbeitsschutz, Gesundheitsschutz und technische Sicherheit, Berlin, Germany.

Generation of mutation-specific t cells in ABabDII mice
For each candidate, 3 ABabDII mice between 8 to 12 weeks old (6 in total) underwent immunisation. They were injected subcutaneously with 100 μg of mutant short peptide (9-10mers, JPT) supplemented with 50 μg CpG 1826 (TIB Molbiol), emulsified in incomplete Freund's adjuvant (Sigma). Repetitive immunizations were performed with the same mixture at least three weeks apart. Mutationspecific CD8 + T cells in the peripheral blood of immunized animals were assessed by intracellular cytokine staining (ICS) for IFNγ 7 days after each boost. All 6 animals were peptide-reactive. The 6 mice were sacrificed for spleen preparation by cervical dislocation after isofluran anasthesia.

Patient number estimates and HLA-1 frequencies
HLA-1 frequency data f h for the U.S. population was retrieved from the Allele Frequency Net Database (AFND) [26]. Frequency data were estimated by averaging the allele frequencies of multiple population datasets from the North American (NAM) geographical region. The major U.S. ethnic groups were included and sampled under the NAM category. Cancer incidence data for the U.S. population (N d ) was retrieved from the GLOBOCAN 2012 project of the International Agency for Research on Cancer, WHO [27].
Assuming that the fraction of a recurrent variant in the U.S. population affected by cancer entity d (r d ) is identical to the observed ratio of that variant in the corresponding TCGA study, the number of patients of HLA-1 type h whose tumor contain the variant is expected to be The summation runs over 18 diseases d for which both the TCGA projects and the cancer incidence data are available.

Recurrent variants and candidates
From the GDC repository [16], we have collected somatic variants for 33 TGCA studies. After removing patients without clinical meta-data, and studies with less than 100 patients, we have selected 1,384,531 high-confidence missense SNPs from 9,641 patients, see methods for details. Using this data, 1,055 variants are deemed recurrent (Additional file 1), as they can be found in more than 1% of the patients in the respective study cohort. These recurrent variants correspond to 869 unique protein changes, as some appear in multiple cancer entities. 77 of the recurrent variants occur in at least 3% of their cohort (43 unique protein changes).
From these 869 unique protein changes, we have generated candidates that are predicted to be strong MHC class I binders in frequent HLA-1 types that we considered for initial selection. 415 (48%) of them lead to a strong binder prediction. In total, there are 772 candidates that are recurrent in a cancer entity cohort, and predicted as binding for a considered HLA-1 type. These candidates are non-redundant among all the 9-, 10-& 11-mers containing the variant: the selection process retains only the peptide sequence with the lowest predicted IC 50 . Figure 1 and Table 1 provide an overview of the variant selection and neo-epitope candidates generation processes, while Additional file 2 lists all neo-epitopes (weak and strong predicted binders) after removing redundancy.
Despite large differences between variant selection protocols, 123 variants deemed recurrent by the above process can be found among the 470 variants identified in the cancer hotspot datasets [14] (Additional file 4). This overlap is strongly dependent on how frequent those variants are observed: there are 54 common variants out of the 61 variants observed more than 10 times over our dataset The 7 studies displayed at the bottom have not been used for the determination of recurrent vairants, as the number of patients is less than 100. The number of strong binders includes all occurrences of neo-epitopes candidates, so a candidate may be counted multiple times when it is predicted to be binding several HLA-1 types

Enrichment in known cancer related genes
We observe that recurrent variants occur substantially more frequently in known cancer-related genes than in other genes (Fig. 1c). Initially approximatively one percent of all observed variants are found in genes that have been described [28,29] as oncogenes (54 genes) or tumor suppressor genes (71 genes). When recurrent unique protein changes are considered, the fraction of known oncogenes or tumor suppressor genes is substantially increased to 13% and 6.5% respectively (a χ 2 test between unique protein changes and unique recurrent variants gives a P value smaller than 10 −16 ). These fractions only marginally increase to 14% and 7% when only the unique protein changes leading to predicted strong binders for frequent HLA-1 types are considered (a χ 2 test between unique recurrent variants and strong binders gives a non-significant P value). Additional file 5 shows a similar enrichment of known cancer-related genes per cohort. We observe that the enrichment is stronger for oncogenes than for tumor suppressors. This might be expected, as activating mutations in oncogenes are mainly distributed on a few protein positions, while loss of function mutations in tumor suppressors are generally distributed more broadly along the protein sequence.
It is interesting to observe that several of the highly prevalent neo-epitope candidates occur in genes that are involved in known immune escape mechanisms: RAC1:P29S is recurrent in study SKCM (melanoma), is predicted to lead to strong binding neo-epitopes for HLA-A*01:01 and HLA-A*02:01, and is reported to up-regulate PD-L1 in melanoma [30]. CTNNB1:S33C is recurrent in studies LIHC (liver hepatocellular carcinoma) and UCEC (uterine corpus endometrial carcinoma), is predicted to lead to strong binding neo-epitopes for HLA-A*02:01, and has been shown to increase the expression of the Wnt-signalling pathway in hepatocellular carcinoma [31], leading to modulation of the immune response [32] and ultimately to tumor immune escape [33]. In a separate study, Cho et al. [34] show that this mutation confers acquired resistance to the drug imatinib in metastatic melanoma. Finally, FLT3:D835Y recurrent in study LAML (acute myeloid leukemia), is predicted to lead to a strong binding neo-epitope for HLA-A*01:01, HLA-A*02:01 and HLA-C*06:02, and following Reiter et al. [35], Tyrosine Kinase Inhibitors promote the surface expression of the mutated FLT3, enhancing FLT3-directed immunotherapy options, as its surface expression is negatively correlated with proliferation.
While the described mechanisms are probably sufficient to explain immune escape in tumor evolution, the candidates could nevertheless be viable targets for adoptive T cell therapy or TCR gene therapy.

Recurrent neo-epitopes in patient populations
Upon assumption of statistical independence, the product of the frequency of a recurrent variant with the frequency of class I alleles in the population and the incidence rates of cancer types provides an estimate for the number of patients that carry that specific candidate. Using the number of newly diagnosed patients per year and HLA-1 frequency in the US population, we are able to compute the expected number of patients for 18 cancer entities for which both cancer census data and a TCGA study are available. The occurrence numbers for individual candidates range from 0 to 2,254 for PIK3CA:H1047R in breast cancer patients of type HLA-C*07:01; Table 2 presents a summary of expected patient numbers for the complete set of candidates. We estimate that, in the US alone, the previously discussed RAC1:P29S mutation might be present in 628 new patients carrying the HLA-A*02:01 allele each year (in 556 melanoma patients and in 72 lung small cell, head & neck or uterine carcinomas patients, see Additional file 6 for details). For the CTNNB1:S33C mutation, the total number of HLA-A*02:01 patients in the US is expected to be 364, from uterine corpus, prostate and liver cancer types. As another example, 115 myeloid leukemia patients in the US are expected to be of type HLA-A*02:01 and carry the FLT3:D835Y mutation. Figure 2 shows the cumulative expected number of patients that carry a specific epitope, and with matching HLA-1 type, for the 50 candidates with the highest expected patients number. The number of patients is derived from the sum over all cancer entities, including those in which the candidate is not recurrent according to our criteria. For example, among newly diagnosed US patients of type HLA-C*04:01, 88 prostate cancer patients are expected to carry the mutation PIK3CA:R88Q, even though its observed frequency in the PRAD study is as low as 0.2%. The data shown in Fig. 2 can be found in Additional file 6.

Accessible patient population
As our current understanding of peptide immunogenicity is still incomplete [36], not all candidates predicted by our pipeline can be expected to trigger an immunogenic response in patients. To further evaluate the usefulness of our results we consider the list of candidates (neoepitope and HLA type pairs) selected form our ranking. Assuming a T cell therapy could be generated for every candidate we can compute the number of patients that would benefit, see methods. Because of imperfections in candidate prediction, not all candidates hold the potential for an effective T cell therapy, and these ineffective candidates can be thus viewed as "false positives". Because it is impossible to create a reliable estimated for the fraction of these false positives due to the complexity of the underlying algorithm and biological process we decided to consider a broad range of possible values from 50% to 95%, cf. Figure 3. Using a subset of 6868 patients for whom HLA types were known, we predict the number of patients for whom such positive response might be expected, as a function of the proportion of "false positives" in our candidates. To estimate the impact of such "false positives", we have randomly flagged 1000 times 337, 539, 607 & 640 candidates as "false positives", which is corresponding to a fraction of about 50%, 80%, 90% and 95% of the total 674 candidates. This procedure left us with 1000 sets of 337, 135, 67 & 34 candidates that were not flagged as "false positives". Figure 3 shows that for a pessimistic 90%

Confirmational evidence
A limited validation of our method was performed in two steps: first, we confirmed that our pipeline was able to identify candidates that have been previously reported as eliciting spontaneous CD8 + T-cell responses in cancer patients in whom the target epitopes were subsequently discovered [37,38]. Both sets together (Additional file 8) contain 37 epitopes, 35 of which could be mapped to an ENSEMBL transcript (33 unique genes). For 27 of these epitopes our pipeline predicted strong binding with the specific HLA-1 type reported in the corresponding wetlab investigations. Another 5 epitopes where predicted as weak binders, some of the latter are also predicted to be strong binders in other HLA-1 types. Our pipeline classified 70% of a set of known tumor neo-antigens as strong binders and another 14% as weak binders. 4 out of 34 unique identifiable variants studied by van Buuren et al. [38] and Fritsch [37] are found among our set of high confidence missense variants, but only one (CTNNB1:S37F) fulfills the 1% recurrence threshold (9 uterine carcinoma patients). This variant was shown to trigger immunological response against HLA-A*24:02 [39], which isn't in the set of alleles that we have systematically tested. However, our prediction show that the same peptide might also be reactive against HLA-C07:02.
Finally, the CDK4:R24C peptide (sequence ACDPHS-GHFV, see Additional file 8) is not predicted to bind to HLA-A*02:01, even though it leads to confirmed T cell response [40], and has been related to cutaneous malignant melanoma and hereditary cutaneous melanoma [41], [42]. Taken together, these results show that our candidate prediction pipeline is able to recapitulate most clinically validated neo-epitopes reported in [38] and [37], and that some of these neo-epitopes occur from recurrent variants.
We have also performed preliminary validation for two candidates: RAC1:P29S & TRRAP:S722F binding to HLA-A*02:01 (Fig. 4). We utilized ABabDII mice, transgenic animals that harbour the human TCRαβ gene loci, a chimeric HLA-A2 gene and are deficient for mouse TCRαβ and mouse MHC I genes. These mice have been shown to express a diverse human TCR repertoire [24,43] and thus mimic human T cell response. They were immunized at least twice with mutant peptides and IFNγ producing CD8 + T cells were monitored in ex vivo ICS analysis 7 days after the last immunization. CD8 + T cells were purified from spleen cell cultures of reactive mice using either IFNγ -capture or tetramer-guided FACSort. Sequencing of specific TCR α and β chain amplicons that were obtained by RACE-PCR revealed that this procedure yields an almost monoclonal CD8 + T cell population (not shown). In both cases, tested neo-antigen candidates lead to T cell reactivity, confirming not only predicted MHC binding by our pipeline but also immunogenicity in vivo in human TCR transgenic mice. Therefore this workflow also allows to generate potentially therapeutic relevant TCRs to be used in the clinics for cancer immunotherapy.

Discussion
By virtue of the underlying mutational processes, the genome architecture and accessibility as well as for functional reasons within the disease process, certain somatic mutations will be present in multiple patients while still being highly specific to the tumor [14]. Using existing cancer studies and neo-epitope binding predictions to MHC class I proteins, we propose a ranking of candidates which mutation occur frequently in observed cancer patient cohorts. The candidates are ranked according to the expected number of target patients. For one candidate, the target patients are defined as those who bear the candidate's mutation, and whose HLA types contain the candidate's. The expected number of target patients is proportional to the HLA type frequency in the population, and to the frequency of the mutation in the cancer cohorts. Taking into account the fact that MHC binding is a necessary but not sufficient condition for T cell activity, and the limitations of MHC binding prediction algorithms, our method provides an objective ranking of neo-epitopes based on recurrent variants, as a basis for the development of off-the-shelf immunotherapy treatments.
Despite numerous mechanisms of immune evasion, neo-epitopes are important targets of endogenous immunity [5]. In some cases at least, it has been shown that they contribute to tumor recognition [44], achieve high objective response (in melanoma, see ref. [45,46]), and a single of them is presumably sufficient for tumor regression [47]. Moreover, positive association has been shown between antigen load and cytolytic activity [48], activated T cells [13] and high levels of the PD-1 ligand [49]. Taken together, these results suggest that neo-epitopes occupy a central role in regulating immune response to cancer, and that this role can be exploited for cancer immunotherapy. Even though the question of negative selection for strong binding neo-epitopes and its relation to other immune evasion mechanisms like HLA loss or PD-L1, CTLA4 dis-regulation is still open [50]. A recent CRISPR screen suggest that more then 500 genes are essential for cancer immunotherapy [51].
Targeting neo-epitopes based on non-recurrent, private somatic variants requires generation of private TCRs or CARs for each individual patient, which is challenging [52]. Successful treatments based on genetically engineered lymphocytes has been shown for epitopes arising from unmutated proteins, i.e. public epitopes: MART-1 and gp100 proteins have been targeted in melanoma cases [53]. In another trial, Robbins et al. [54] have studied longterm follow-up of patients who were treated with TCRtransduced T cells against NY-ESO-1, a protein whose expression is normally restricted to testis, but which is frequently aberrantly expressed in tumor cells. They show that treatment may be effective for some patients. These results show that immune treatments based on public variants can be beneficial, suggesting that similar success may potentially be achieved using candidates based on recurrent variants.
However, targeting such non somatic epitopes presents safety and efficacy concerns [2]. The administration of T cells transduced with MART-1 specific T-cell receptor have led to fatal outcomes [55]. Cross-reactivity of TCR against MAGE-A3 (a protein normally restricted to testis and placenta) caused cardiovascular toxicity [56]. Neoepitopes based on recurrent somatic variants potentially alleviate such problems, as the target sequences are truly restricted to tumor cells.
Our computation of expected targetable patient groups assumes that neither the cancer type nor the patient's mutanome are associated with the patient's HLA-1 alleles. In a recent study, Van den Eyden et al. [50] show that there is little (if any) antigen depletion due to the negative selection pressure from the immune response. Molecular evolution methods applied to somatic mutations show that Fig. 4 Recognition of predicted epitopes by CD8 + T cells. Epitopes for recurrent mutations that have been identified in silico to bind to HLA-A*02:01 using our pipeline were synthesized and used for immunization of human TCR transgenic ABabDII mice. Examples (RAC1:P29S and TRRAP:S722F) of ex vivo ICS analysis of mutant peptide immunized ABabDII mice 7 days after the last immunization are shown. Polyclonal stimulation with CD3/CD28 dynabeads was used as positive control, stimulation with an irrelevant peptide served as negative control (data not shown) nearly all mutations escape negative selection [57]. Taken together, these results suggest that the expected probability of a recurrent variant being present in a patient somatic mutations pool should not be affected (significantly) by the patient's HLA-1 alleles.
The neo-epitope landscape is diverse and sparse [13]. Few neo-epitopes are predicted to be both strong binders and present in multiple patients. In their analysis, Hartmaier et al. [58] estimate that neo-epitopes suitable for precision immuno-therapy might be relevant for about 0.3% of the patients, which is in agreement with our results. However, the absolute number of patients is still considerable, see Table 2. Our study shows that a relatively large number of patients (about 1% of newly diagnosed patients) might benefit from a small library of candidates proven to generate immunological response. These numbers must be compared to "conventional" personalised immunotherapy, where a immunologically active candidate must be identified for each new patient for which efficacy and safety are always unknown. Even if a substantial part of the neo-epitopes we suggest turns out to be false positives due to the limitation of prediction algorithms and understanding of immune response, there is potential to help tens of thousands of patients.

Conclusions
Off the shelf immune treatments can be faster, less costly and safer for individual patients, because each neoepitope based treatment scheme can be reused on hundreds of patients per year. In this respect, they might open the way to supplement existing personalized cancer immune treatments approaches with precision treatment options.
We believe that our ranking provides a rational order for testing for and selecting off the shelf neo-epitope based therapies. Our preliminary in vivo mouse experiments show that this in principle feasible.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12920-019-0611-7. Additional file 2: Neo-epitope candidates from recurrent variants. Recurrent variants leading to binding (strong and weak binders) neo-epitopes for one of the 11 HLA types considered. Peptide length redundancy has been removed from the variant list, and each variant is listed only once, even if it is recurrent in multiple study cohorts.
Additional file 3: Frequency of Single Nucleotides Variants (SNVs) that fall in a poly-A, poly-C, poly-G or poly-T sequence of length at least 6. The variants that appear only once in the whole study are colored in blue, while the variants that appear more than once are colored in red. The dotted line shows the expected fraction of such variants, if the sequences were all random. Except for the LIHC study, all variants that occur more than once in the cohort are found in difficult-to-sequence regions less than expected by chance.

Additional file 4:
Overlap between recurrent variants and hotspot variants. The overlap is based on the codon position, so that all variants occurring at the same protein sequence position are pooled together. The recurrent variants that match the alternate codon definition in Chang et al. are added to the overlap. The recurrent variants are pooled by codon and sorted by decreasing occurrence frequency in the study. The overlap between hotspots and highly recurrent variants is high, and the common variants fraction decreases when recurrent variants become less frequent. The overlap between recurrent variants and the list of suspected false positive hotspots compiled by Chang et al. ([14]) is very limited. Inset: Venn diagram of the total overlap between the recurrent variants called in this study, and the hotspot variants described in Chang et al. ([14]).

Additional file 5:
Oncogenes and tumor suppressors. Number of variants occurring in genes classified as tumor suppressor genes and oncongenes by ( [28]), for each study. The numbers are given for the full set of variants, among recurrent variants only and among variants leading to neo-epitope candidates. As each protein change is considered only once, the total number of variants is always smaller or equal to the sum over all studies, as protein changes appearing in multiple studies are counted only once in the total.
Additional file 6: Expected number of target patients for each neo-epitope candidates, for the 18 cancer entities with associated epidemiological data. The expected number of patients is the product between the number of new cases, the observed variant frequency and the HLA type frequency in the US population. The total expected number of patients for each candidate is the sum over the expected number of candidates by study.

Additional file 7:
Expected frequency of patients with at least one candidate not labelled as false positive. For each TCGA cohort, we have selected at random 1000 times 50%, 20%, 10% and 5% from the candidates, to conservately model a high rate of false positive within the candidates. From these selected candidates, we have computed the expected frequency of patients with a HLA-1 allele and a mutation matching at least one selected candidate.
Additional file 8: Confirmation Status with Gold Standard Data Set. The protein changes described in van Buuren et al. ( [38]) and Fritsch et al. ( [37]) have been mapped to the ENSEMBL protein set and neo-epitopes have been computed using our standard pipeline. 26 of these epitopes are exactly recovered by the pipeline, for one of them the pipeline predicts a strong binder for a shorter peptide, and 5 of them are predicted to be weakly binding. Column 7 to 10 are copied from ( [37]) and ( [38]).
Additional file 9: ARRIVE checklist concerning the animals used for the experimental validation of the in vivo presentation of two peptides.