Allele-specific disparity in breast cancer

Background In a cancer cell the number of copies of a locus may vary due to amplification and deletion and these variations are denoted as copy number alterations (CNAs). We focus on the disparity of CNAs in tumour samples, which were compared to those in blood in order to identify the directional loss of heterozygosity. Methods We propose a numerical algorithm and apply it to data from the Illumina 109K-SNP array on 112 samples from breast cancer patients. B-allele frequency (BAF) and log R ratio (LRR) of Illumina were used to estimate Euclidian distances. For each locus, we compared genotypes in blood and tumour for subset of samples being heterozygous in blood. We identified loci showing preferential disparity from heterozygous toward either the A/B-allele homozygous (allelic disparity). The chi-squared and Cochran-Armitage trend tests were used to examine whether there is an association between high levels of disparity in single nucleotide polymorphisms (SNPs) and molecular, clinical and tumour-related parameters. To identify pathways and network functions over-represented within the resulting gene sets, we used Ingenuity Pathway Analysis (IPA). Results To identify loci with a high level of disparity, we selected SNPs 1) with a substantial degree of disparity and 2) with substantial frequency (at least 50% of the samples heterozygous for the respective locus). We report the overall difference in disparity in high-grade tumours compared to low-grade tumours (p-value < 0.001) and significant associations between disparity in multiple single loci and clinical parameters. The most significantly associated network functions within the genes represented in the loci of disparity were identified, including lipid metabolism, small-molecule biochemistry, and nervous system development and function. No evidence for over-representation of directional disparity in a list of stem cell genes was obtained, however genes appeared to be more often altered by deletion than by amplification. Conclusions Our data suggest that directional loss and amplification exist in breast cancer. These are highly associated with grade, which may indicate that they are enforced with increasing number of cell divisions. Whether there is selective pressure for some loci to be preferentially amplified or deleted remains to be confirmed.


Background
Higher organisms such as humans are diploid, which means that all chromosomes except the sex chromosomes are present in two copies (2 n). During DNA replication, unique genomic segments of original DNA are normally copied just one time, but in specific situations, these fragments are either not replicated entirely or they are copied more than once. The variation between normal and actual copy number of a DNA segment is referred to as copy number differences, which in normal cells are referred to as copy number variations (CNVs) and in tumour cells as copy number alternations (CNAs) [1]. Such chromosomal variations have been associated with genetic disorders (for CNVs) and the progression of cancer (for CNAs) [2]. The size of CNVs in normal tissues is often shorter than CNAs in tumour tissues, which can cover a large region of the human genome (Mb) [3]. Array-based comparative genomic hybridization (CGH) is used for detection of CNVs and CNAs [4,5].
Recently, new techniques have been developed to study CNVs and CNAs at a higher resolution using high-density arrays of single nucleotide polymorphisms (SNPs) [6]. SNPs are variations in the single-base locus of the genome within members of a species or paired chromosomes of an individual. In general, SNPs have precise positions on the chromosome and two alleles (maternal and paternal), and SNP-based arrays are powerful tools for detection of allele-specific chromosomal aberrations and loss of heterozygocity (LOH) status [7]. Recent advances in the systematic study of genomic alterations in human cancer samples have made it possible to understand this disease better and identify key genes with underlying roles in oncogenesis [8]. In the present study, we describe the development of a numerical algorithm based on Euclidian distances to extract the maximum level of information from SNP array data in blood-tumour pairs, with regard to both chromosomal aberrations and loss of heterozygosity. We apply the algorithm to Illumina 109K SNP array data and show how it can be used to mine information on a single-SNP, cytogenic region as well as in gene lists and at the pathway level. We focus on the disparity of CNAs in tumour samples, which were compared to those in blood in order to identify the directional loss of heterozygosity.

Materials
The 112 samples used here are a subset of a larger patient series consisting of 920 samples, which were collected from breast cancer patients to study the effect of disseminating tumour cells to the blood and bone marrow [9]. The samples were collected at five different hospitals in the Oslo region and have so far been extensively studied on both the clinical and biological levels [9,10].

DNA isolation
Tumour tissues were snap-frozen at -80°C. Samples of mononucleus peripheral blood cells were drawn in tubes and centrifuged. Before freezing, the cell pellet was resuspended to a concentration of 30 × 10 6 cells/ml in a freezing solution containing RPMI added with 10% DMSO and 20% FCS as well as Dnase. The samples were then frozen at -70°C before being stored in liquid nitrogen. Before DNA isolation, 240 μl Prot.K (AB-Applied Biosystems) and 400 μl 1 × PBS without CaMg were added to the frozen samples on ice. The samples were incubated at 55°C until thawed and then transferred after mixing to the vessels in the Applied Biosystems 340A Nucleic Acid Extractor. The remaining process of DNA isolation followed the standard procedures for the extractor using a phenol-chloroform based method. DNA from frozen tumour tissue was extracted according to standard procedures of the Applied Biosystems 340A Nucleic Acid Extractor.
All patients provided informed consent, and the study was approved by the local regional ethical committees.

Genotyping
The genotyping used has been previously described [11], and blood-tumour pairs were genotyped using the Human-1 109K BeadChip array (Illumina, San Diego, CA, USA). For each sample, the corresponding log R ratio (LRR) and B allele frequency (BAF) were extracted for both blood and tumour using the Illumina Beadstudio Genotyping software, which is based on the transformation of two channels' allelic intensity values [7]. Briefly Illumina's Beadstudio software performs normalization on the two raw allelic intensity values (X, Y) for each locus (SNP). X and Y values are respectively denoted as allele A and allele B. The normalized allele intensities are transformed to total signal intensity, R (R = X + Y), and relative allelic signal intensity ratio, theta (θ = arctan(Y/X)(2/π)) [12].Illumina provides two supplementary parameters, LRR and BAF, which refer to the normalized and calibrated form of R and θ. BAF represents the proportion of one SNP allele (B) to the total number of A and B alleles (N B /(N A + N B )) for each given loci. The Illumina Infinium platform detects chromosomal aberrations by comparing the normalized intensity of a subject sample and a reference sample using two modes of analysis: 1) a single-sample mode, which derives reference values from canonical genotyping clusters (0, 0.5, 1) created from clustering on normal reference samples; 2) Paired-sample mode, which makes intensity (R) comparisons between a subject sample and its paired reference sample. Genomic plots of LRR (LRR = log 2 (R subject /R reference )) and BAF of the three canonical clusters are used to detect chromosomal aberrations [1,13,14]. Comparing custom-clustered data to HapMapclustered data, we observe improved scoring using custom clustering. The genotypes for both blood and tumour were therefore extracted using custom clustering [15]. To improve the results of genotyping for those samples that have not performed well, we used the custom-clustered data.

Description of computational algorithm
Pre-processing of data Traditional Array CGH provides information on CNAs based on the Log R intensity ratios. The benefit of the Illumina SNP array analysis is that it provides additional information related to B allele frequency. The Beadstudio software presents information, based on two allelic intensity values, related to both Log R ratio and B allele frequency. To analyze the Illumina output, we propose a computational algorithm consisting of the following steps for each chromosome (illustrated by the flowchart in Additional file 1, Figure S1).
We denote b ij and t ij as the blood sample and tumour sample, respectively. Here, i = 1,..., r and j = 1,..., c represent the number of loci and the number of samples, where r = 109352 and c = 112. Figure 1 illustrates the algorithm below for one locus. The Illumina data were pre-processed by sorting BAF and LRR values by chromosome and subsequently by position. All loci residing on chromosome Y were removed from our calculations. We denote BAF and LRR as × and y, respectively. We used the blood data as a reference and assumed a threshold to divide it into three regions (AA, AB, and BB). Monomorph loci and loci with less than three distinct genotype groups were considered noninformative and removed. The BAF value varies between 0 and 1 and the signal for the three genotypes on the x axis clusters around (AA~0, AB~0.5, BB~1); the assumed threshold covers samples in the following regions: a. AA -homozygous: After filtering, we obtained 96,273 loci which modify index j into j = 1,..., r', where r' is the total number of loci without non-informative data.

Allelic disparity
In the horizontal direction, for each locus of Blood', we obtained mean (μ B i = 1 n n j=1 b ij ) and standard deviation( 2 ). In the next step, heterozygous blood samples were extracted and labelled, as well as the entire tumour. To identify which tumour samples shifted from the heterozygous toward either one of the homozygous regions, we measured the distance (D = x B -x T ) of associated samples in both the x and y axes, where D is the allelic disparity of Blood' and Tumor' samples. We deal only with those samples where For each locus, the frequencies of disparity of individuals were calculated. Here, let n(t AA j ) and n(t BB j ) denote the numbers of samples for j shifted toward AA-and BB-homozygous, respectively. Based on the sample frequencies of each SNP, we defined a score (S) denoting the propensity of each SNP to lose either the B or A allele: We can apply and modify an arbitrary threshold via Score (S).
Furthermore, since we identify the number of samples shifted toward A or B-allele, we defined the percentage of samples shifted toward minor allele (r(t BB j )) , the percentage of samples shifted toward major allele (r(t AA j )) , and the percentage of heterozygous with disparity (r(t AB j )) , where  Figure 1 Position of one blood sample in heterozygous. The allelic disparity, amplification or deletion of one paired sample (blood-tumour) is illustrated in the heterozygous region. The position of one blood sample is denoted by (x 1 ,y 1 ) in the heterozygous region (AB). The same sample in the tumour can be either unaltered or shifted toward minor allele (AA-homozygous-amplification (x 3 ,y 3 ), AA-homozygous-deletion (x 4 , y 4 )), major allele (BB-homozygous-amplification (x 2 ,y 2 ) or BB-homozygous-deletion (x 5 ,y 5 )). B-allele frequency is used for allelic disparity and Log R ratio for amplification or deletion.

AA BB AB
.100 (2) and The percentage of allele-specific disparity is given by Based on the genotypes in blood, we tested for each SNP whether it fulfilled the criteria for Hardy-Weinberg (HWE) equilibrium [19]. Given one degree of freedom and a 0.95 confidence interval (CI), the threshold for the chi-square value is 3.841. Values equal to or below this threshold indicate that a locus is in HWE.

Pathway and network analysis
Ingenuity Pathway Analysis (IPA) [20] version 9.0 (Release Date: 2011-03-22), Content version: 3206 (Release Date: 2011-02-11)) was used to analyse selected sets of genes in order to identify over-represented canonical pathways and networks. IPA core analysis allows us to find interactions between genes and proteins, related networks, functions and canonical pathways in the context of biological processes. The selected gene list was uploaded into IPA. The only filter used for the network analysis was "only consider molecules and/ or relationships where species = Human". The analysis included both direct and indirect relationships as well as endogenous chemicals, and for the network analysis the maximum number of molecules allowed per network was 140. In the network analysis, molecules of interest which interact with other molecules in the Ingenuity Knowledge Base are identified as Network Eligible molecules and serve as "seeds" for generating networks. The networks are scored with a numerical value. The score is based on 1) the number of Network Eligible molecules in the network, 2) size of the network, 3) Network Eligible molecules in the given dataset, and 4) the number of molecules in the IPA database that could potentially be included in the network. The network score is the -log(Fisher's Exact test result), which is based on the hypergeometric distribution. Network function displays the functional annotations associated with a specific network. Top networks represent associated network functions based on a score which reflects the negative logarithm of the p-value, which aggregates the likelihood of the genes in the network being found together due to random chance. Score ≥ 2 was considered significant. For the canonical pathways, the significance of the association between our defined data sets and the canonical pathways are assessed in two ways by the software: 1) the ratio of the number of molecules from our gene lists that map to a given canonical pathway and 2) False Discovery Rate (FDR) (Benjamini and Hochberg [21]) by the negative logarithm of B-H p-values indicating the likelihood that the association between the genes in our gene sets and a specific canonical pathway are explained by chance alone. Top biological functions were assigned to each network by the negative logarithm of B-H p-value, corrected for multiple testing (Source: IPA online manual).

Chi-square test
The statistical chi-square test in R version 2.11.1 was used to estimate the association between the observed level of disparity and, among other, clinical and tumourrelated parameters (TP53 mutation, ER, PR, HER2, Lymph node status, and subtypes of breast cancer).

Cochran-Armitage trend test
The Cochran-Armitage trend test [21] is applied in R version 2.11.1 to detect the linear trend of ordered categorical data in a 2 × 3 contingency table. In this article, we used the Cochran-Armitage trend test for clinical parameters of grade and tumour size.

Detection of amplification and deletion
Based on the log R ratio, amplifications and deletions were detected with the following procedure: 1) LRR for tumour data were smoothed by the piecewise constant function (PCF) algorithm [22,23];2) Obtained PCF values were transformed into non-logarithmic values of smoothed LRR; 3) Mean of samples per locus was calculated for the data obtained in 2) and a histogram of the mean values was made ( Figure 2); 4) To distinguish a proper threshold for amplification, deletion and normal region, three higher peaks were selected around 0.925, 1 and 1.12 in the non-logarithmic smoothed LRR; 5) Based on these three peaks, ±0.07 (logarithmic value of 0.95 and 1.05) was chosen as the threshold, where > 0.07 was classified as amplification and -0.07 ≤ normal ≤ 0.07 a nd < -0.07 as deletion (Figure 1).

Results
In a two-dimensional space, a point is distinguished by two characters × and y; here, × is denoted by B allele frequency and y by Log R ratio (Figure 1). The two-dimensional position of one sample (x 1 ,y 1 ) in the heterozygous region (AB) in the blood (blue circle) is compared to its position in the tumour (red circle) to identify shifts toward AA-homozygous-Amplification (x 3 ,y 3 ), AA-homozygous-Deletion (x 4 ,y 4 ), BB-homozygous-Amplification (x 2 ,y 2 ), or BB-homozygous-Deletion (x 5 ,y 5 ). Horizontal shifting is shown by vector-a (B allele frequency) and vertical with vector-b (Log R ratio). By using the algorithm introduced in the Methods (Description of computational algorithm) on a dataset consisting of genotypes from 112 bloodtumour paired samples collected from breast cancer patients, we extracted information on changes in genotypes (allelic loss, here denoted as allelic disparity). The algorithm was applied separately to each autosomal chromosome as well as the × chromosome. In all, 96,273 loci were left after removing non-informative loci (see Methods on pre-processing of data) distributed across all chromosomes. The number of loci per chromosome varied between 1268 loci on chromosome 21 to 8559 on chromosome 1 (the numbers of SNPs and genes before and after filtering are given in Table 1 for each chromosome).

Overall trends for allelic disparity in the genome
It is assumed that a locus has been subjected to allelespecific disparity if the signal from a heterozygous genotype shifts toward one or the other homozygous genotype at a distance larger than two standard deviations of the mean of the heterozygote for all samples. In order to identify the loci with the highest levels of disparity, we defined a score (S in eq. (1)) based on the sample frequency of asymmetrical shift for each heterozygous genotype, and all loci above or equal to 75 were selected (see Methods on allelic disparity), leaving 13,198 loci (Additional file 2, Figure S2). A full list of the 13,198 loci with information related to the genotypes in blood (number of samples in each genotype, number of samples in homozygote for both minor and major alleles, test for Hardy-Weinberg equilibrium, percent heterozygosity, number of missing genotypes, minor allele frequency, minor and major alleles) are given in Additional file 3, Table S1 as well as the information regarding genotypes in tumour (score (S), percentage and number of disparity toward minor or major allele (in eq. 2 and 3, respectively), percentage of heterozygote with disparity in eq. (4), percentage of allele-specific disparity in eq. (5) and preferential allele).

Chromosomal regions enriched for allelic disparity
In order to identify chromosomal regions with a high degree of disparity, we grouped the 13,198 loci according to cytoband localisation. The 13,198 loci fulfilling the criteria of score 75 or higher were distributed on 778 cytobands with 1-77 loci per cytoband. The cytobands having 30% and higher disparity are highlighted in Figure 3 ( Table 2). Several loci have been previously reported in association with breast cancer.

Single-SNP allelic disparity
To further identify loci with a high degree of disparity, we looked at the single-SNP level (Additional file 3, Table S1). For all SNPs, we counted the number of alleles in our dataset and obtained the actual genotype. A compelling example is the SNP rs325349 (A/G) on chromosome 5 located in CDH9, for which 91.7 percent (n = 11) of the heterozygous samples showed disparity in the tumour and all shift in one direction, toward the major allele (A) (Figure 4 A and 4B). SNP rs543304 located in BRCA2 on chromosome 13 also showed a high level of disparity, where 30 percent of the studied samples were heterozygous with 26 percent displaying disparity (n = 9), the majority toward the minor allele (G, n = 8) (Figure 4 C and 4D). Another example is rs4241189 located in HECW2 on chromosome 2, where 33 percent of the samples were heterozygous, 28 percent (n = 11) showing disparity (primarily toward the minor allele (G), n = 10) (Figure 4 E and 4F).

Allelic disparity in relation to clinical parameters
We observed significantly different frequencies of allelic disparity events according to grade, from well differentiated (G1) to poorly differentiated (G3) breast cancers (p-value < 0.001 (Table 3)). Histologic grade is based on the microscopic appearance of the cancer cells, which are graded into four categories: Grades 1, 2, 3 and 4. Grade 1 tumours are well-differentiated and resemble normal cells; these tumours tend to grow slowly and are considered less aggressive. Grade 2 tumours are moderately differentiated, but grade 3 and 4 tumours tend to differentiate poorly and grow rapidly [24]. Figure 5 shows the trend from well differentiated (G1) toward poorly differentiated (G3) for the disparity level of samples.
Allelic disparity was also found to be associated with HER2, ER, PR and TP53 mutation status. A significant higher level of disparity was found in samples with negative status for either one of the receptors HER2, ER, or PR (Table 3). Samples harbouring mutations in TP53 were also more likely to have a high degree of disparity compared to samples with wild type TP53 (Table 3).
To focus on specific genes in considering the allelic disparity of individual loci, we used only the loci in which the event was observed in 50 percent or more of the samples, leaving 176 loci (in 158 unique genes, indicated in Additional file 3, Table S1). Additional file 4, Table S2 shows loci with high levels of disparity  Summary of all studied loci and genes in each chromosome before filtering of the data as well as the numbers of loci and genes showing disparity in both minor allele (AA) and major allele (BB) after removing the non-polymorphic or non-informative loci. associated with the clinical or molecular parameters such as stage, grade, Her2 and TP53 mutation status. We applied multiple testing corrections. Since the number of loci was 176 (less than 1000), we summarized pvalues in Additional file 4,  Table S2.

Allelic disparity and pathway analysis
In the list of 176 loci that show a high level of allelic disparity, we used IPA to perform a core analysis of the gene set and identify the most significantly associated canonical pathways (CP) as well as over-represented networks ( Table 4). The top bio functions in IPA are shown in Figure 6. Table 4

Allelic disparity and stem cell genes
An overview of the gene sets used in this study is given in Additional file 5, Figure S3. For each gene set, we  We then applied a standard chi-square test to compare the observed level of allelic disparity between the stem cell gene list and the list of non-stem cell genes (n = 13,198), as well as the levels of amplification and deletion. Genes were scored as disparity genes if more than 50% of loci related to a given gene showed disparity (score ≥75) (genes w here t he SNPs were e qual ly distributed were not included, Table 5A). We did not see a significantly higher level of disparity overall in loci associated with the stem cell genes compared to the nonstem cell genes when "collapsed" by associated or closest gene (p-value = 0.8008). We also analysed overall copy number changes between the two gene-sets (stem cell and non-stem cell) and observed that the stem cell genes are more often altered by deletion than by amplification compared to the non-stem cell genes (p-value = 5.41E-04, Table 5B).

Chromosomal regions enriched for allelic disparity
We first identified chromosomal regions which were enriched for allelic disparity. Several of these loci have been previously reported in association with breast cancer. For instance, the genetic variation in 5q11.1 has previously been associated with the risk of breast cancer in multiple GWAS studies [27]. Bergamschi et al. reported 5q11.1 to be more frequently deleted in basallike subtype tumours [28]. In the same study, they also found 3q12.2 loss to be associated with Luminal B tumours [28,11]. Chromosomal aberrations of 3q13. 13 have been detected in several types of cancer, including breast, gastric, colon, and squamous cell carcinomas of head and neck; furthermore, KIAA1524 residing in this region has been associated with an aggressive form of breast cancer [29].

Single-SNP allelic disparity
Next, we further identified the SNPs that most strongly contributed to allelic disparity. Such SNPs were, for instance, SNP rs543304 located in BRCA2, rs325349 located in CDH9, and rs4241189 located in HECW2. Abnormalities or mutations of BRCA2 gene have been linked to increased risk of both familial breast and ovarian cancer [29]. The encoded protein is involved in repairing cell damage and maintaining normal growth of breast cells [29]. CDH9 encodes a type II classical cadherin from the cadherin family. The cadherins are integral membrane proteins important for cell-cell adhesion [30]. HECW2 encodes an E3 ubiquitin-protein ligase and mediates ubiquitination of TP73, a p53 family member regulating cell growth and apoptosis. The enzyme acts to stabilize TP73 and enhance activation of transcription by TP73 [31].  Top three associated network functions and related scores from IPA. Top bio functions contain 1) diseases and disorders, 2) molecular and cellular functions.

Allelic disparity in relation to clinical parameters
Allelic disparity was observed in association to grade (degree of differentiation), HER2, ER, PR and TP53 mutation status in a number of genes. SNP rs1041242 found associated with stage is located in the MYCBP2 gene on chromosome 13 (q22.3). MYCBP2 is known as a probable E3 ubiquitin-protein ligase which mediates ubiquitination and subsequent proteasomal degradation of target proteins. MYCBP2 may function as a facilitator or regulator of transcriptional activation by MYC [29]. Three SNPs in the gene ST7 on the 7q31.1-q31.2 region were also found associated with stage. ST7 has been reported as a tumour-suppressor gene involved in a variety of other human cancers and cell lines derived from breast tumours have been shown to harbor mutations in ST7 [32]. Locus rs3824271 found associated with ER status is located in the CSMD1 gene on chromosome 8p23.2, a region reported to be deleted in up to 50% of breast cancers [33]. Deletion of 8p23.2 has been associated with high histologic grade [34].

Biological pathway analysis (IPA)
The main network functions identified through pathway analysis were lipid metabolism, small molecule biochemistry, genetic disorder, lipid metabolism, cell death, and cellular development. Activation of lipid metabolism has been reported to be an early event in carcinogenesis [35] and many studies at the single-gene level of lipid metabolism have revealed an effect on tumour genesis [36].

Stem cells
One of the most important characteristics of an embryonic stem cell is its ability of self-regeneration and differentiation. If one assumes that the initial targets of deregulation will be among the stem cell genes, then one may expect that they will represent the "oldest" events in an unstable genome, which would be reinforced in every cell division. We therefore hypothesized that the stem cell genes were likely to have a higher level of disparity in the tumour [37,38]. To test this, we extracted the stem cell genes from 13 published gene sets [39][40][41]. This hypothesis, however, was not confirmed, suggesting that the observed allelic disparity is perhaps not merely a function of the number of divisions after a single event but the result of a more complex and subtle evolutionary process.

Conclusion
Breast cancer is one the most frequent malignant and a multifactorial disease affecting women, and thus it is  reasonable to hypothesize that there is an association between multiple genomic rearrangements and tumour sample disparities. Since most tumours diverge from the diploid level, many studies have reported only gains and losses by array CGH. Our data suggest that directional loss and amplification exist in breast cancer. These are highly associated with grade, which may indicate that they are enforced with increasing number of cell divisions. Whether there is a selective pressure for some loci to be preferentially amplified or deleted remains to be confirmed.