A Novel Neoantigen Discovery Approach based on Chromatin High Order Conformation: Mapping the Neoantigen to 3D Genome

High-throughput sequencing technology has yielded reliable and ultra-fast sequencing of DNA and RNA. For the tumor cells of cancer patients, combining the results of DNA and RNA sequencing makes it possible to identify potential neoantigens that stimulate a T cell immune response. However, when somatic mutations are abundant, it is computationally challenging to efficiently prioritize the identified neoantigen candidates according to their ability to activate the T cell immune response. Numerous prioritization or prediction approaches have been proposed to address this issue, but none of them considers the original DNA loci of the neoantigens from the perspective of the 3D genome. Here we trace the DNA origins of immune-positive and immune-negative neoantigens in the context of the 3D genome and find that 1) DNA loci of the immune-positive neoantigens tend to cluster genome-wide; 2) DNA loci of the immune-positive neoantigens tend to belong to the active chromosomal compartment (compartment A) in some chromosomes; and 3) DNA loci of the immune-positive neoantigens tend to be located at specific regions in the 3D genome. We believe that 3D genome information will enable more precise neoantigen prioritization and discovery and will eventually benefit precision and personalized medicine in cancer immunotherapy.


Introduction
In a variety of human malignancies, immunotherapies that boost the ability of endogenous T cells to destroy cancer cells have demonstrated therapeutic efficacy 1. Clinical experience in a substantial fraction of patients suggests that endogenous T cells mount this cancer-killing activity because the T cell receptor (TCR) is able to recognize peptide epitopes that are displayed on major histocompatibility complexes (MHCs) on the surface of the tumor cells. These cancer rejection epitopes may be derived from two origins. The first is non-mutated proteins to which T cell tolerance is incomplete, for instance because of their restricted tissue expression pattern. The second is peptides that are not encoded by the normal human genome, so-called neoantigens 1. With the development of genome sequencing, it has been revealed that tens to thousands of different somatic mutations are generated during cancer initiation and progression. Most of these mutations are passenger mutations, meaning that they confer no obvious growth advantage, and they are often caused by genomic instability within the tumor cells. A limited number of cancer mutations are driver mutations, which interfere with normal cell regulation and help to drive cancer growth and resistance to targeted therapies 2. Both passenger mutations and driver mutations can be nonsynonymous, altering protein-coding sequences and causing the tumor to express abnormal proteins that cannot be found in normal cells. As the cell metabolizes its proteins, those with abnormal sequences are cut into short peptides, namely epitopes, which are presented on the cell surface by the major histocompatibility complex (MHC, known as human leukocyte antigen (HLA) in humans) and may be recognized by T cells as foreign antigens 2.
In theory, therefore, if the potential neoantigens can be identified via sequencing technology, one can synthesize the epitope peptides in vitro and validate their efficacy experimentally (in a cancer cell line or a mouse model) before clinical practice 1,2. Indeed, cancers with a single dominant mutation can often be treated effectively by targeting the dominant driver mutation 2,3. However, when somatic mutations are abundant, which is the case in most cancer types, it is computationally challenging to efficiently prioritize the identified neoantigen candidates according to their ability to activate the T cell immune response 4. Over the last decades, numerous neoantigen prediction approaches have been proposed to address this issue [5][6][7]. These approaches can be classified into two major categories: protein 3D structure-based approaches, which consider the 3D conformation of the pMHC and the TCR, and protein sequence-based approaches, which consider the amino acid sequence of the protein antigens. Among the structure-based approaches, molecular dynamics (MD) methods are used to explore the contact affinity of the pMHC-TCR complex in the specific cases where high-quality pMHC 3D structures are available [8][9][10]. In most cases, however, modelling or simulation by protein docking and threading has to be used because high-quality pMHC 3D structures are lacking.
Most approaches belong to the sequence-based category, as much larger data sets are available for training and validation 11,12 and these approaches are usually very efficient to set up 4,13.
Early sequence-based methods, such as BIMAS 14 and SYFPEITHI 15, relied on position-specific scoring matrices (PSSMs) defined from experimentally confirmed peptide binders of a particular MHC allele 4. Later, more advanced methods based on machine-learning techniques were developed to capture and utilize the nonlinear nature of the pMHC-TCR interaction, and these indeed demonstrated better performance than the PSSM-based methods. Consensus methods that combine multiple tools to obtain more reliable predictions were also developed, such as CONSENSUS 16 and NetMHCcons 17, and demonstrated better performance. However, the performance gain depends on the weighting scheme, which comes at an increased computational cost. Because peptide binding has not been investigated for the large majority of HLA alleles, pan-specific methods such as NetMHCpan 6,7 were developed, which allow HLA type-independent prioritization of neoantigens.
NetMHCpan, one of the most widely adopted tools in neoantigen prioritization, first trains a neural network on multiple public datasets; the affinity of a given peptide-MHC pair, considering the polymorphic HLA types HLA-A, HLA-B, or HLA-C, is then computed with the trained network. NetMHCpan 7 and NetMHCIIpan 18 perform remarkably well, even compared to allele-specific approaches 4,19. However, although several assessments and criteria have been proposed with the aim of a fairer and more effective comparison [19][20][21], there are to date no recent independent benchmark studies that can be used to recommend specific tools. More importantly, to the best of our knowledge, none of the neoantigen prediction methods mentioned above considers the mutated DNA loci of the neoantigens from the perspective of the 3D genome, which carries much richer information than the amino acid sequence alone 22. In this work, we trace the DNA origin of immune-positive and immune-negative neoantigens in the context of the 3D genome and report several observations.

Neoantigen proximity in individual chromosome (Intra-chromosome)
We generated all peptide pairs among the immune-positive peptides and all peptide pairs among the immune-negative peptides. Then, on each chromosome (intra-chromosomal), we extracted each pair's contact frequency from the IMR90 and hESC Hi-C data. The results are shown in Table 1 and Table 2. Taken together, these results show that the DNA loci of positive peptides tend to be more proximate than those of negative peptides on chr1, chr7, chr10, and chr12, while the DNA loci of negative peptides tend to be more proximate than those of positive peptides on chr2, chr5, chr8, chr11, and chr20.

Neoantigen proximity in the whole genome (Inter-chromosome)
For the inter-chromosomal peptide pairs, both positive and negative, we also collected their contact frequencies and calculated the average values. As shown in Figure 1, on both the hESC and IMR90 Hi-C data, immune-positive peptide pairs are more proximate to each other than immune-negative peptide pairs. The corresponding P-values are close to zero.
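
The pairwise averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the toy contact matrix, and the 40 kb bin size in the example are hypothetical, and the contact map is assumed to be a symmetric matrix indexed by bin. For inter-chromosomal pairs the same averaging would apply to a genome-wide matrix with genome-wide bin indices.

```python
from itertools import combinations
from statistics import mean


def locus_to_bin(position_bp, bin_size_bp):
    """Map a genomic coordinate (in base pairs) to a Hi-C bin index."""
    return position_bp // bin_size_bp


def mean_pair_contact(loci_bp, contact_matrix, bin_size_bp):
    """Average contact frequency over all unordered pairs of loci.

    Returns 0.0 when fewer than two loci are given (no pairs exist).
    """
    bins = [locus_to_bin(p, bin_size_bp) for p in loci_bp]
    values = [contact_matrix[i][j] for i, j in combinations(bins, 2)]
    return mean(values) if values else 0.0


# Toy symmetric 4-bin contact matrix (hypothetical values, not real Hi-C data).
toy_matrix = [
    [10, 6, 2, 1],
    [6, 10, 5, 2],
    [2, 5, 10, 7],
    [1, 2, 7, 10],
]

# Hypothetical loci of immune-positive and immune-negative peptides.
positive_loci = [5_000, 45_000, 90_000]   # fall into bins 0, 1, 2
negative_loci = [5_000, 130_000]          # fall into bins 0, 3

positive_mean = mean_pair_contact(positive_loci, toy_matrix, 40_000)
negative_mean = mean_pair_contact(negative_loci, toy_matrix, 40_000)
```

A higher mean contact frequency is read here as greater spatial proximity, which is the comparison reported per chromosome in Table 1 and Table 2.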

Neoantigen distribution on active and inactive compartment
For each chromosome, we computed the compartment type (A or B) of each chromosomal region (bin), as shown in Figure 2A and 2B. We then assigned each positive and negative peptide the A/B compartment type of its corresponding bin. We found that in some chromosomes, immune-positive neoantigens tend to be located in compartment A more often than immune-negative neoantigens, as shown in Figure 2C.
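
The assignment step can be illustrated with a short sketch. It assumes that a per-bin compartment value is already available for the chromosome and that positive values denote compartment A; that sign convention, the function names, and the toy values are assumptions made for illustration and are not taken from this work.

```python
def compartment_of(position_bp, compartment_values, bin_size_bp):
    """Return 'A' or 'B' for a locus from the sign of its bin's value.

    Assumption: positive values mark compartment A, all others B.
    """
    value = compartment_values[position_bp // bin_size_bp]
    return "A" if value > 0 else "B"


def fraction_in_a(loci_bp, compartment_values, bin_size_bp):
    """Fraction of the given loci that fall into compartment A."""
    labels = [compartment_of(p, compartment_values, bin_size_bp) for p in loci_bp]
    return labels.count("A") / len(labels)


# Toy per-bin compartment values for one chromosome (hypothetical).
toy_values = [0.8, 0.3, -0.5, -0.9, 0.2]

# Hypothetical peptide loci in base pairs, with 40 kb bins.
positive_loci = [10_000, 50_000, 170_000, 90_000]    # bins 0, 1, 4, 2
negative_loci = [90_000, 130_000, 10_000, 150_000]   # bins 2, 3, 0, 3

positive_fraction = fraction_in_a(positive_loci, toy_values, 40_000)
negative_fraction = fraction_in_a(negative_loci, toy_values, 40_000)
```

Comparing the two fractions chromosome by chromosome corresponds to the kind of percentage comparison summarized in Figure 2C.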

The radius position distribution of neoantigen
We developed a novel molecular dynamics-based approach to model the 3D conformation of the human genome from both the hESC and IMR90 Hi-C data. We then mapped the chromosomal loci corresponding to the positive and negative peptides onto the constructed 3D genome and calculated their radial distance to the nucleus center, as shown in Figure 3A. We found that the loci of immune-positive peptides tend to be located closer to the nuclear periphery than those of negative peptides, as Figure 3B demonstrates. We then used the radius position as an immunogenicity predictor and found that, surprisingly, this feature alone can discriminate the immune-positive peptides from the immune-negative peptides, as shown in Figure 3C.
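
To make the use of a single feature as a predictor concrete, the sketch below computes the area under the ROC curve with the radius position as the score. How the ROC curve in Figure 3C was computed is not stated here, so this is only an illustration: it uses the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve, and the function name and toy radii are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive has a higher
    score than a randomly chosen negative, counting ties as one half.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy radial distances (hypothetical, arbitrary units). A larger radius
# means a locus lies closer to the nuclear periphery.
positive_radii = [0.9, 0.8, 0.6]
negative_radii = [0.7, 0.4, 0.3]

auc = roc_auc(positive_radii, negative_radii)
```

An AUC of 0.5 would indicate no discriminative power, while values approaching 1.0 indicate that larger radii separate positive from negative peptides.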

Chromatin 3D modeling
For individual chromosomes, we adopted our previous molecular dynamics (MD) based chromatin 3D modeling method 26 at a resolution (bin size) of 40 kb for the IMR90 and hESC Hi-C data. For whole-genome 3D modeling, we used the 500 kb resolution genome-wide Hi-C contact maps of IMR90 and hESC, and we extended and re-developed our previous individual-chromosome modeling method so that it can handle whole-genome data. Once the genome-wide 3D structure was obtained, we calculated the radial distance of each bin to the center of the nucleus to obtain the radius position of each neoantigen peptide.
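
The radial distance calculation can be sketched as below. The definition of the nucleus center is not specified in the text, so the sketch assumes it is the centroid of all bin coordinates; that assumption, the function names, and the toy coordinates are hypothetical.

```python
import math


def centroid(coords):
    """Centroid of a list of (x, y, z) points, used here as the nucleus center."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def radial_distances(coords):
    """Euclidean distance of every bin to the centroid of all bins."""
    center = centroid(coords)
    return [math.dist(c, center) for c in coords]


# Toy 3D coordinates for four consecutive bins (hypothetical, arbitrary units).
toy_coords = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, -2.0, 0.0)]
radii = radial_distances(toy_coords)

# Radius position of a peptide whose locus lies at 1.2 Mb, with 500 kb bins.
peptide_bin = 1_200_000 // 500_000
peptide_radius = radii[peptide_bin]
```

In a genome-wide model, the bin index would additionally need an offset for the chromosome on which the locus lies.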

Discussion
In cancer immunotherapy, neoantigen therapy is a rising and promising topic, as it can be genuinely personalized and precise. However, when somatic mutations are abundant, it is computationally hard to efficiently prioritize the identified neoantigen candidates according to their ability to activate the T cell immune response, and numerous prioritization or prediction approaches have been proposed to address this issue. To the best of our knowledge, none of the existing approaches considers the original DNA loci of the neoantigens from the 3D genome perspective. Here, we traced the DNA origin of immune-positive and immune-negative neoantigens in the context of the 3D genome and found that 1) DNA loci of the immune-positive neoantigens tend to cluster genome-wide; 2) DNA loci of the immune-positive neoantigens tend to belong to the active chromosomal compartment (compartment A) in some chromosomes; and 3) DNA loci of the immune-positive neoantigens tend to be located at specific regions in the 3D genome. We believe that 3D genome information will enable more precise neoantigen prioritization and discovery and will eventually benefit precision and personalized medicine in cancer immunotherapy.

Figure 1. Contact frequency distribution comparison between immune-positive and immune-negative peptide pairs on hESC and IMR90 Hi-C data.

Figure 2. Neoantigen distribution on active and inactive compartment. (A) Example of compartment A and compartment B submatrix on chromosome 1. (B) Example of compartment A/B values on chromosome 1. (C) Distribution of the percentage of immune-positive and immune-negative peptides in compartment A.

Figure 3. The radius position distribution of neoantigen. (A) Chromosomal loci corresponding to positive and negative peptides on the 3D genome structure. (B) Radius position distribution of the positive peptides compared to the negative peptides. (C) ROC curve demonstrating the discriminative power of radius position in immunogenicity prediction.

Table 1. Average contact frequency of immune-positive peptide pairs and immune-negative peptide pairs based on IMR90 Hi-C data.

Table 2. Average contact frequency of immune-positive peptide pairs and immune-negative peptide pairs based on hESC Hi-C data.