Genotyping of human neutrophil antigens (HNA) from whole genome sequencing data

Background Neutrophil antigens are involved in a variety of clinical conditions including transfusion-related acute lung injury (TRALI) and other transfusion-related diseases. Recently, there are five characterized groups of human neutrophil antigen (HNA) systems, the HNA1 to 5. Characterization of all neutrophil antigens from whole genome sequencing (WGS) data may be accomplished for revealing complete genotyping formats of neutrophil antigens collectively at genome level with molecular variations which may respectively be revealed with available genotyping techniques for neutrophil antigens conventionally. Results We developed a computing method for the genotyping of human neutrophil antigens. Six samples from two families, available from the 1000 Genomes projects, were used for a HNA typing test. There are 500 ~ 3000 reads per sample filtered from the adopted human WGS datasets in order for identifying single nucleotide polymorphisms (SNPs) of neutrophil antigens. The visualization of read alignment shows that the yield reads from WGS dataset are enough to cover all of the SNP loci for the antigen system: HNA1, HNA3, HNA4 and HNA5. Consequently, our implemented Bioinformatics tool successfully revealed HNA types on all of the six samples including sequence-based typing (SBT) as well as PCR sequence-specific oligonucleotide probes (SSOP), PCR sequence-specific primers (SSP) and PCR restriction fragment length polymorphism (RFLP) along with parentage possibility. Conclusions The next-generation sequencing technology strives to deliver affordable and non-biased sequencing results, hence the complete genotyping formats of HNA may be reported collectively from mining the output data of WGS. The study shows the feasibility of HNA genotyping through new WGS technologies. Our proposed algorithmic methodology is implemented in a HNATyping software package with user’s guide available to the public at http://sourceforge.net/projects/hnatyping/.

The HNA-1 antigens are located on the low-affinity Fc-γ receptor IIIb (FCGR3B), CD16b. These receptors bind to the Fc portion of IgG antibodies [7]. The HNA-2 antigen system is located on CD177. The number of neutrophils expressing HNA-2 may increase in pregnancy, infections and myeloproliferative disorders [8]. The HNA-3 antigen system has one antigen, HNA-3a. HNA-3a is expressed by neutrophils, lymphocytes, platelets, endothelial cells, kidney, spleen, and placental cells. The antigen is located in exon 7 of the CLT2 gene (SLC44A2) [9]. The HNA-4 and HNA-5 antigens are located in the β2 integrins. Each antigen system contains only a single antigen. HNA-4a has been located on the αM chain (CD11b) [10]. HNA-5a has been located on the αL integrin unit (CD11a) [10]. Table 1 lists the five antigen systems and their responding genes.
With the invention of polymerase chain reaction (PCR), different PCR-based genotyping assays on HNA have been developed, including sequence-based typing (SBT) [11] as well as PCR-sequence-specific oligonucleotide probes (SSOP) [12], PCR-sequence-specific primers (SSP) [13] and PCR-restriction fragment length polymorphism (RFLP) [14]. Sequence-specific oligonucleotide probes (SSOP) and sequence-specific primers (SSP) were designed to detect variable sequence motifs in PCR-amplified HNA genes, revealing an extensive level of previously detected alleles. Sequence-based typing (SBT) enables all nucleotides to be identified so that it provides a better method for the evaluation of new alleles on HNA genes. PCRbased methods continue to be used in routine antigen detection. However, the next-generation sequencing technology (NGS) enables rapid whole-genome sequencing for more detailed and precise investigation of total variants of genes [15]. It opens the opportunity to redesign genotyping strategies for more effective genetic mapping and genome analysis [16].
In this paper, we developed a bioinformatics tool for typing the HNA antigens from personal human wholegenome sequencing data. The NGS technology was evaluated for its potential in high-resolution HNA typing. Our tool combined with NGS data can produce unambiguous results regardless any new identified variants of the antigen systems.

Implementation
Human whole-genome sequencing with next-generation sequencing usually produces hundreds of gigabases, thus WGS mapping or assembly tools generally requires large memory such that it is intractable to use such tools for specific genotyping. In this study, we developed a lightweight method for the purpose of HNA typing as shown in Figure 1. There are three steps in our genotyping procedure. Firstly, the procedure specifies a set of short DNA  sequences (~200bp) which contain all of the nucleotide variants of antigens. Secondly, the DNA sequences are used to filter WGS datasets. These DNA sequences are only 1/100000 of the human whole genome such that most of the unrelated reads in WGS dataset are removed and only a small set of reads are kept for typing. The final step is the alignment of the filtered reads on to the DNA template. We developed two programs, WgsReadFilter and WgsHnaTyping, for the procedure. These programs only consume small amount of memory to deal with the large WGS datasets. The detail of the procedure is described as following.

The identification of DNA variants for the HNA systems
The nucleotide polymorphisms for recognition of different HNA antigens have been published in literature [3,17,18] ( Table 1). But the previously reported nucleotide polymorphisms are not consistent. For example, the allele for HNA5a was reported to be at ITGAL*2372 by Moritz [3] or at ITGAL*2466 by Xia [18], furthermore, both of these loci are inconsistent with the GenBank sequences (NM_001145056.1 or NM_002209.2) of human ITGAL genes. Consequently, we searched the Single Nucleotide Polymorphism database (dbSNP) to ensure the correctness of the loci for HNA alleles. For the HNA-1 system, there are two sites, 141 and 266, to discriminate HNA-1a, -1b, and -1c antigens. Variants for HNA-2 had not been revealed. Each of the other HNA systems can be recognized with a single nucleotide polymorphism (SNP). Our curated alleles for HNA-3a, -4a, and -5a are identified by the SNP IDs, rs2288904 (HNA3), rs1143679 (HNA4) and rs59554592 (HNA5) and the loci of these SNPs are listed in Table 1. The loci of these SNPs in the human genome (GRCh37.p5) are listed in Table 2. We downloaded +/−100 bp of flanking sequence surrounding the SNPs for all alleles, which is used as DNA templates for filtering and alignment of reads.

The filtering of HNA reads from WGS datasets
To filter the reads, we build a hash table of short keys (default is 14 bp) from the DNA templates stated in the previous section. We examined the non-overlapped occurrences of the keys on each read and the reads with  more than two occurrences of keys are stored for the typing. The filtering process is used to efficiently eliminate most of the unrelated reads to speed up the following alignment process. Only very few reads remain in the output of this step.
The typing of HNA reads by the alignment of the filtered reads The filtered reads were mapped onto the DNA templates in the WgsHnaTyping program. As shown in Figure 2, the mapping procedure executed twice for each DNA template. First, the mapping was performed from the 5′ end to the 3′ end of the template sequence, and then it was performed again from the 3′ end to the 5′ end ( Figure 2A). Then the candidate reads were screened by index keys K L and K R . The prefixes of reads were used as the index key K L when the mapping is from the 5′ end ( Figure 2B). Similarly, the postfixes of reads were used as the index key K L when the mapping is from the 3′ end ( Figure 2C). The default key length was 15 bp. Moreover, the filtered reads were compared with the template sequence by the error detection area. Finally, those reads were aligned if there were less than 1/10 error bases. Genotype calling would then proceed by counting the number of times each allele is observed and using a fixed cut off value. For example, a heterozygous genotype is called if the proportion of the non-reference allele is between 5% and 95%; otherwise, a homozygous genotype would be called. But if the coverage at the SNP site is less than 2, the "unknown" genotype would be called.  The WgsHnaTyping program displays the results of HNA types in a user friendly graphical user interface and also outputs the following two files -(1) an ACE file which recorded the alignment of reads. The ace file can be displayed with the visualization tool UGENE (http://ugene. unipro.ru/), and (2) a TXT file to record the coverage at each allele.

Whole-genome sequencing samples
Six public WGS datasets were downloaded from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/). Table 3 lists the six samples. The WGS datasets were from two pedigree trios: YOR009 and CEPH146 of the 1000Genomes project [19]. YOR009 is an African family. CEPH1463 is an American family from Utah with Northern and Western European ancestry. These DNA samples were isolated from B-lymphocyte cells derived from blood.

Results and discussion
The results of HNA typing for the samples are displayed in both of Figures 3 and 4. These figures show that all of the typing screens from the WgsHnaTyping program in Figure 3. We illustrate the alignment results of alleles using the UGENE program in Figure 4 for the verification of typing results.
Typing result of the HNA1 system In the results, all of the six samples have HNA-1a antigens and only one African was without HNA-1b. Besides, two African samples have HNA-1c which was a rare type for other populations around the world. Figure 4A depicts the consensus of the allele:FCGR3B*141. All of the aligned nucleotides are Guanine and no Cytosine, thus only the HNA-1a type is positive and both the HNA1b and HNA1c types are negative for the sample NA18507. On the contrary, the same location in Figure 4B shows a different situation that both Guanine and Cytosine coexist for the sample NA18506. Combined with the allele (FCGR3B*266) in Figure 4C, it shows that the HNA-1a, HNA-1b and HNA-1c types are all positive. The likely coexistence of 3 alleles (especially FCGR3B*266) for the sample NA18506 may need verification studies at aspects including the specific DNA structure for sequencing base preferences [20], the additional HNA loci on the same chromosome due to the unequal crossingover events [21] and due to the gene duplication events [22], and the heterogeneous neutrophil specimen of mixed clonal lineages. Albeit, the detected coexistence of 3 alleles maybe beneficial for avoiding clinical crisis caused by the false negative results.

Typing result of the HNA3, HNA4 and HNA5 systems
In the typing of HNA-3, -4, -5 systems, there were four homozygous cases of HNA-3aa and two heterozygous cases of HNA-3ab. Similarly, there were three homozygous cases of HNA-5bb and three heterozygous cases of HNA-5ab as well. Besides, all of the six samples were homozygous in the typing of HNA-4. The results showed that the usefulness of genotyping from whole-genome sequencing. It provided an unambiguous analysis for the zygosity of alleles. Figure 4D illustrates a heterozygous case of HNA-5ab. There were sufficient reads aligned which contribute the nucleotides at the SNP 2296C/G.
The SNP confirmed existence of both HNA-5a and HNA-5b antigens in the sample NA18507.

Requirement of sequencing coverage for HNA-Typing
The genomic data from human whole genome sequencing provided an examination of total variance of personal genome including millions of nucleotide differences. However, the identification of significant SNPs is still a challenging task because the number of revealed alleles is gradually increasing. Moreover, the cost of whole genome sequencing is still expensive so far. As a result, we checked the requirement of sequencing coverage for the HNA alleles. We assumed that the identification of undoubted SNPs required at least two occurrences of each haploid for a total of 4 read alignments. We can use the Poisson distribution to compute the probability of a base being sequenced a certain number of times as: Where n is the number of times a base is read, and C stands for coverage. The coverage of WGS dataset was defined as the total bases from the sequences reads divided by the size of human genome. (The estimated diploid genome sizes for human female and male genomes are 6.406G and 6.294G, respectively [23].) Therefore, the probability of the base being sequenced less than 2 times is Finally, the probability of having at less than two occurrences of each haploid can be obtained by calculating P(X 1 ≤ 1 ∪ X 2 ≤ 1). For example, a coverage of 8X will result in a probability of 99.4% for the event which each haploid at a particular locus is sequence at least twice. Table 4 lists the found nucleotides for the critical alleles of HNA antigens and the sequencing coverage. Table 5 lists the probabilities for the event which each haploid at a particular locus is sequence at least twice for coverages 5 through 12.

Conclusions
Our study provides a new approach of genotyping to the HNA systems. The conventional samples, analysed by PCR-based methods, cannot be easily used for testing new variants. However, once there are any new variant such as HNA-1d of the antigen systems discovered and confirmed [21], our bioinformatics tool can be easily modified to explore the WGS datasets again to find the updated and/or undiscovered variants. Our tool may adequately exploit the genotyping advantage of next generation sequencing data either for the HNA systems or for the extended systems such as human leukocyte antigen (HLA) systems.

Availability and requirements
In the HnaTyping software package, both the programs WgsReadFilter and WgsHnaTyping were implemented in C# with the .NET Framework which can be run on 64-bit Windows/Linux. The HnaTyping software with a user manual is available at the Web site: http://sourceforge. net/projects/hnatyping/.