Secure searching of biomarkers through hybrid homomorphic encryption scheme

Background: As genome sequencing technology develops rapidly, there is an increasing need to keep genomic data secure even when it is stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourced matching problem on encrypted data. Method: We propose an efficient method to securely search for the position matching the query data and extract some information at that position. After decryption, only a small number of comparisons with the query information need to be performed in the plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is that it encodes a genomic database as a single element of a polynomial ring. Result: Since our method requires only a single homomorphic multiplication of the hybrid scheme for query computation, it has an advantage over previous methods in parameter size, computational complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search-and-extract the reference and alternate sequences at the queried position in a database of size 4M. Conclusion: Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment.


Background
The rapid development of genome sequencing technology enables us to access large genomic datasets, and it looks poised to enable significant breakthroughs in medical research. While genomic data can be used for a wide range of applications including healthcare, biomedical research, and direct-to-consumer services, it has numerous distinguishing features, and its disclosure can violate personal privacy or lead to genetic discrimination [1][2][3]. Due to these potential privacy issues, it should be managed with care. Various privacy-enhancing techniques based on cryptographic methods have been proposed as tools for the outsourced analysis of genomic data. Recently, it has been suggested that privacy can be preserved through homomorphic encryption (HE), which allows computations to be carried out on ciphertexts. Yasuda et al. [4] gave a practical solution to find the location of a pattern in a text by computing multiple Hamming distance values on encrypted data. Lauter et al. [5] gave a solution to privately compute the basic genomic algorithms used in genome-wide association studies.
Homomorphic encryption can be applied to privacy-preserving sequence comparison, but it is still impractical for the analysis of entire human genome information. For example, Cheon et al. [6] presented a protocol to compute the edit distance on homomorphically encrypted data, but it took about 27 s even on DNA sequences of length 8. It is not easy to efficiently approximate the edit distance over encryption even when the distance to a public human DNA sequence is given [7]. This inefficiency comes from the difficulty of homomorphically evaluating an equality test: encrypting the inputs bit-wise and computing over the encrypted bits incurs an expensive computation cost (at least linear in the data bit-length).
In this paper, we suggest an efficient method to securely search a set of biomarkers using a hybrid Ring-GSW homomorphic encryption scheme.

Problem setting
The iDASH (Integrating Data for Analysis, 'anonymization' and SHaring) National Center organizes the iDASH Privacy & Security challenge for secure genome analysis. This paper is based on a submission to Task 3 of the 2016 iDASH challenge: secure outsourcing of testing for genetic diseases on encrypted genomes. The goal of this task is to privately calculate the probability of genetic diseases by matching a set of biomarkers to encrypted genomes stored in a public cloud service. The requirement is that the entire matching process be carried out using homomorphic encryption so that no information about the database or the query is revealed to the server during computation.
Suppose that the client has a Variant Call Format (VCF) file which contains genotype information such as the chromosome number and the position in the genome. It also contains some information for each position such as the reference and alternate sequences, where each base is one of A, T, G, and C. The client encrypts the information using homomorphic encryption and the server calculates the exact match over the encrypted data. The outcome is the absence/presence of the specified biomarkers, that is, an encryption of 1 if matched and an encryption of 0 otherwise. Finally the client decrypts the result with the secret key of the homomorphic encryption scheme.

Practical homomorphic encryption
Fully homomorphic cryptosystems allow us to homomorphically evaluate any arithmetic circuit without decryption. However, the noise of the resulting ciphertext grows during homomorphic evaluations, slightly with addition but substantially with multiplication. For efficiency reasons, for tasks which are known in advance, we use a more practical Somewhat Homomorphic Encryption (SHE) scheme, which evaluates functions up to a certain complexity. In particular, two techniques are used for noise management of SHE: one is the modulus-switching technique introduced by Brakerski, Gentry and Vaikuntanathan [8], which scales down a ciphertext during every multiplication operation and reduces the noise by its scaling factor. The other is a scale-invariant technique proposed by Brakerski such that the same modulus is used throughout the evaluation process [9].
Let us denote by $[\cdot]_Q$ the reduction modulo $Q$ into the interval $(-Q/2, Q/2] \cap \mathbb{Z}$ of an integer or integer polynomial (coefficient-wise). For a security parameter $\lambda$, we choose an integer $M = M(\lambda)$ that defines the $M$-th cyclotomic polynomial $\Phi_M(X)$. For the polynomial ring $R = \mathbb{Z}[X]/(\Phi_M(X))$, set the plaintext space to $R_t := R/tR$ for some fixed $t \ge 2$ and the ciphertext space to $R_Q := R/QR$ for an integer $Q = Q(\lambda)$. Let $\chi = \chi(\lambda)$ denote a noise distribution over the ring $R$. We use the standard notation $a \leftarrow D$ to denote that $a$ is chosen from the distribution $D$.

The basic scheme
The following is a description of the basic homomorphic encryption scheme based on the hardness of the (decisional) Ring Learning with Errors (RLWE) assumption, which was first introduced by Lyubashevsky et al. [10]. The assumption is that it is infeasible to distinguish the following two distributions. The first distribution consists of pairs $(a_i, u_i)$, where $a_i$ and $u_i$ are drawn uniformly at random from $R_Q$. The second distribution consists of pairs of the form $(a_i, b_i) = (a_i, a_i s + e_i)$, where $a_i$ is uniformly random in $R_Q$ and $s, e_i$ are drawn from the error distribution $\chi$. To improve efficiency for HE, we use sparse secret keys $s$ with coefficients sampled from $\{0, \pm 1\}$ as in [11].
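To make the second distribution concrete, the following is a minimal sketch that draws one RLWE pair $(a, b = a s + e)$ over $\mathbb{Z}_Q[X]/(X^N + 1)$ with a sparse ternary secret. The tiny parameters, the bounded uniform noise standing in for $\chi$, and all function names are assumptions chosen for illustration only; they offer no security.

```python
import random

def negacyclic_mul(a, b, N, Q):
    """Schoolbook product of two polynomials in Z_Q[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                # X^N = -1, so the overflow wraps around with a sign flip.
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out

def sparse_ternary_secret(N, hamming_weight, rng):
    """Secret with exactly `hamming_weight` coefficients in {-1, +1}, rest 0."""
    s = [0] * N
    for idx in rng.sample(range(N), hamming_weight):
        s[idx] = rng.choice([-1, 1])
    return s

def rlwe_sample(s, N, Q, rng, noise_bound=3):
    """Return (a, b) with b = a*s + e; bounded noise stands in for chi."""
    a = [rng.randrange(Q) for _ in range(N)]
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(N)]
    b = [(x + y) % Q for x, y in zip(negacyclic_mul(a, s, N, Q), e)]
    return a, b

rng = random.Random(1)
N, Q = 16, 2**20                      # toy sizes, far too small to be secure
s = sparse_ternary_secret(N, 4, rng)
a, b = rlwe_sample(s, N, Q, rng)

# Whoever knows s can strip a*s and is left with the small error term.
residual = [(bi - ci) % Q for bi, ci in zip(b, negacyclic_mul(a, s, N, Q))]
centered_residual = [r - Q if r > Q // 2 else r for r in residual]
```

Without knowledge of `s`, the pair `(a, b)` is assumed to be indistinguishable from a uniformly random pair, which is the content of the RLWE assumption.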
Given two ciphertexts $\mathrm{ct}' = (c'_0, c'_1)$ and $\mathrm{ct} = (c_0, c_1)$, the homomorphic addition is $\mathrm{ct}_{\mathrm{add}} = ([c_0 + c'_0]_Q, [c_1 + c'_1]_Q)$.
Throughout this paper, we assume that the integer $M$ is a power of two so that $N = M/2$ and $\Phi_M(X) = X^N + 1$. We adapt the conversion and modulus-switching techniques of [12]. The conversion algorithm changes an RLWE encryption of $m = \sum_i m_i X^i$ into an LWE encryption of its constant term $m_0$, and the modulus switching reduces the ciphertext modulus $Q$ down to $q$ while preserving the message. We note that an LWE ciphertext is represented as a vector over $\mathbb{Z}_q$ for some modulus $q$, and the decryption procedure is done by an inner product of the ciphertext and the secret key vector.
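An RLWE ciphertext $\mathrm{ct} = (c_0, c_1)$ has a decryption structure of the form $c_0 + c_1 \cdot s = (Q/t) \cdot m + e \pmod Q$. Writing $c_{i,j}$, $s_j$ and $e_0$ for the coefficients of $c_i$, $s$ and $e$, and using $X^N = -1$, its constant term is $c_{0,0} + c_{1,0}\, s_0 - \sum_{j=1}^{N-1} c_{1,N-j}\, s_j = (Q/t) \cdot m_0 + e_0 \pmod Q$. It can be represented as an inner product of the vector $(c_{0,0}, c_{1,0}, -c_{1,N-1}, \ldots, -c_{1,1})$ and the secret vector $\vec{s} = (1, s_0, s_1, \ldots, s_{N-1})$. Hence the output of the conversion algorithm can be seen as an LWE encryption of $m_0$. It is also easy to check that if $\mathrm{ct} \in \mathbb{Z}_Q^{N+1}$ satisfies $\langle \mathrm{ct}, \vec{s} \rangle = (Q/t) \cdot m + e \pmod Q$, then the output $\mathrm{ct}'$ of the LWE.ModSwitch algorithm satisfies $\langle \mathrm{ct}', \vec{s} \rangle = (q/t) \cdot m + e' \pmod q$ for some $e' \approx (q/Q) \cdot e$. These techniques were proposed for efficient bootstrapping [12], but they play totally different roles in our application. Finally, an LWE ciphertext of modulus $q$ can be decrypted by $\vec{s}$ as follows.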
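The conversion step can be checked numerically. The sketch below, under the assumption that the secret vector is ordered as $(1, s_0, \ldots, s_{N-1})$, builds the LWE vector from a random pair $(c_0, c_1)$ and confirms that its inner product with the secret vector equals the constant term of $c_0 + c_1 \cdot s$ in $\mathbb{Z}_Q[X]/(X^N+1)$. The function names and toy sizes are illustrative, not taken from the authors' implementation.

```python
import random

def negacyclic_mul(a, b, N, Q):
    """Schoolbook product of two polynomials in Z_Q[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                out[k - N] = (out[k - N] - ai * bj) % Q  # X^N = -1
    return out

def rlwe_conv(c0, c1):
    """LWE vector (c0[0], c1[0], -c1[N-1], ..., -c1[1]) of length N + 1."""
    return [c0[0], c1[0]] + [-c for c in reversed(c1[1:])]

N, Q = 8, 2**16
rng = random.Random(0)
c0 = [rng.randrange(Q) for _ in range(N)]
c1 = [rng.randrange(Q) for _ in range(N)]
s = [rng.choice([-1, 0, 1]) for _ in range(N)]

# Full ring "phase" c0 + c1*s, of which only the constant term is kept.
phase = [(x + y) % Q for x, y in zip(c0, negacyclic_mul(c1, s, N, Q))]

lwe_ct = rlwe_conv(c0, c1)
lwe_sk = [1] + s
inner = sum(x * y for x, y in zip(lwe_ct, lwe_sk)) % Q
```

Since only $N + 1$ integers are kept instead of the $2N$ coefficients of $(c_0, c_1)$, the converted ciphertext is roughly half the size and carries no information about the other coefficients of the message.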
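The decryption computes $\lfloor (t/q) \cdot [\langle \mathrm{ct}, \vec{s} \rangle]_q \rceil \bmod t$. If $\langle \mathrm{ct}, \vec{s} \rangle = (q/t) \cdot m + e \pmod q$ for some small enough $e$, it returns the correct message $m$ modulo $t$. More precisely, the decryption procedure works if $|te/q| < 1/2$.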
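The decryption condition and the effect of modulus switching can be illustrated with a toy LWE example. This is a sketch under assumed parameters ($Q = 2^{32}$, $q = 2^{12}$, $t = 16$, dimension 16, bounded noise); the helper names are hypothetical and the code only mirrors the relations $\langle \mathrm{ct}, \vec{s} \rangle = (q/t) \cdot m + e$ and $e' \approx (q/Q) \cdot e$ stated above.

```python
import random

def centered(x, q):
    """Reduce x modulo q into the interval (-q/2, q/2]."""
    r = x % q
    return r - q if r > q // 2 else r

def lwe_encrypt(m, s, q, t, rng, noise=2):
    """Toy symmetric LWE encryption: <ct, (1, s)> = (q/t)*m + e (mod q)."""
    a = [rng.randrange(q) for _ in s]
    e = rng.randint(-noise, noise)
    b = ((q // t) * m + e - sum(x * y for x, y in zip(a, s))) % q
    return [b] + a

def lwe_decrypt(ct, s, q, t):
    """Round (t/q) * [<ct, (1, s)>]_q to the nearest integer, then mod t."""
    phase = centered(sum(x * y for x, y in zip(ct, [1] + s)), q)
    return round(t * phase / q) % t

def mod_switch(ct, Q, q):
    """Scale every entry by q/Q and round to the nearest integer mod q."""
    return [((q * c + Q // 2) // Q) % q for c in ct]

rng = random.Random(0)
Q, q, t, dim = 2**32, 2**12, 16, 16
s = [rng.choice([-1, 0, 1]) for _ in range(dim)]

fresh_ok, switched_ok = True, True
for m in range(t):
    ct = lwe_encrypt(m, s, Q, t, rng)
    fresh_ok = fresh_ok and lwe_decrypt(ct, s, Q, t) == m
    small = mod_switch(ct, Q, q)          # 12-bit entries instead of 32-bit
    switched_ok = switched_ok and lwe_decrypt(small, s, q, t) == m
```

With a ternary secret the rounding error added by `mod_switch` is at most $(1 + 16)/2$, well below $q/(2t) = 128$, so the condition $|te/q| < 1/2$ still holds after switching.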

The Ring-GSW scheme
Gentry et al. [13] suggested a fully homomorphic encryption based on the LWE problem, where the message is encrypted as an approximate eigenvalue of a ciphertext. Ducas and Micciancio [12] described its RLWE variant. The RGSW symmetric encryption scheme consists of the following algorithms.
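• RGSW.ParamsGen(·), RGSW.KeyGen(·): Use the same parameters params and secret key $s$ as in the basic RLWE scheme. Additionally set the decomposition base $B_g$ and exponent $d_g$ satisfying $B_g^{d_g} \ge Q$.
• RGSW.Enc(m, sk): To encrypt $m \in R_t$, pick a column vector $\mathbf{a} \in R_Q^{2d_g}$ uniformly at random and $\mathbf{e} \in R^{2d_g} \cong \mathbb{Z}^{2d_g \cdot N}$ from the discrete Gaussian distribution $\chi$ of parameter $\varsigma$, and output the ciphertext $\mathrm{CT} = [\mathbf{b} \mid \mathbf{a}] + m \cdot G \in R_Q^{2d_g \times 2}$, where $\mathbf{b} = -\mathbf{a} \cdot s + \mathbf{e}$ and the gadget matrix is $G = (I_2 \mid B_g I_2 \mid \cdots \mid B_g^{d_g - 1} I_2)^T \in R^{2d_g \times 2}$.
Let $\mathrm{WD}_{B_g}(\cdot)$ be the decomposition with the base $B_g$, where the dimension of the input vector is multiplied by $d_g$ through this algorithm. The RGSW encryption of $m$ has $m$ as an approximate eigenvalue with respect to the vector $G \cdot (1, s)^T = (1, s, \ldots, B_g^{d_g - 1}, B_g^{d_g - 1} s)^T$, in the sense that $\mathrm{CT} \cdot (1, s)^T = m \cdot G \cdot (1, s)^T + \mathbf{e}$. In [14], the hybrid multiplication between RGSW ciphertexts and RLWE ciphertexts is defined as follows.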
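• Hybrid.Mult(CT, ct): Given an RGSW ciphertext $\mathrm{CT} \in R_Q^{2d_g \times 2}$ and an RLWE ciphertext $\mathrm{ct} \in R_Q^2$, output the RLWE ciphertext $\mathrm{ct}' = \mathrm{WD}_{B_g}(\mathrm{ct}) \cdot \mathrm{CT} \in R_Q^2$. If CT and ct are RGSW and RLWE encryptions of $m$ and $m'$, respectively, their multiplication $\mathrm{ct}'$ is a valid RLWE encryption of $m m'$. For convenience, we will denote the Hybrid.Mult(CT, ct) algorithm by $\odot$, i.e., $\mathrm{CT} \odot \mathrm{ct} \in R_Q^2$.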
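The correctness claim for the hybrid multiplication can be traced in a few lines. The derivation below is a sketch under standard assumptions about the RGSW construction that are not spelled out in full in the text: an RGSW ciphertext of the form $\mathrm{CT} = [\mathbf{b} \mid \mathbf{a}] + m G$ with $\mathbf{b} = -\mathbf{a} s + \mathbf{e}$, a decomposition satisfying $\mathrm{WD}_{B_g}(\mathrm{ct}) \cdot G = \mathrm{ct}$ with small entries, and an RLWE ciphertext with $\langle \mathrm{ct}, (1, s) \rangle = (Q/t)\, m' + e'$.

```latex
\begin{align*}
\mathrm{CT}\cdot(1,s)^{T}
  &= \mathbf{e} + m\cdot G\,(1,s)^{T} \pmod Q,\\
\big\langle \mathrm{WD}_{B_g}(\mathrm{ct})\cdot \mathrm{CT},\ (1,s)\big\rangle
  &= \mathrm{WD}_{B_g}(\mathrm{ct})\cdot\mathbf{e}
     + m\cdot \mathrm{WD}_{B_g}(\mathrm{ct})\cdot G\,(1,s)^{T}\\
  &= \mathrm{WD}_{B_g}(\mathrm{ct})\cdot\mathbf{e}
     + m\cdot\big\langle \mathrm{ct},\ (1,s)\big\rangle\\
  &= \tfrac{Q}{t}\, m\, m'
     + \underbrace{m\, e' + \mathrm{WD}_{B_g}(\mathrm{ct})\cdot\mathbf{e}}_{\text{new noise}} \pmod Q .
\end{align*}
```

The new noise stays small when $m$ itself is small. In particular, if $m$ is a signed monomial $\pm X^{k}$, the term $m e'$ is only a signed rotation of $e'$, which is one reason a monomial query is a natural fit for this operation.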

Privacy-preserving database searching and extraction
Let us consider a database consisting of a set of $n$ tuples. Each tuple is a pair $(d_i, \alpha_i)$ for $i = 1, \ldots, n$, where $d_i$ denotes a data-tag in the domain $\{0, 1, \ldots, T - 1\}$ and $\alpha_i$ represents the corresponding value attribute in the plaintext space $\mathbb{Z}_t \setminus \{0\}$. Note that all the tags should be distinct from each other. For instance, in the case of a personal information database, $\alpha_i$ may be the age of the user whose identity number is $d_i$. Given a query tag $d$ from the tag domain and a query value $\alpha$ from the plaintext space, the matching problem is to determine the existence of an index $i$ such that $(d, \alpha) = (d_i, \alpha_i)$. Now consider the following simplified search query: select $\alpha_i$ if there exists an index $i$ such that $d_i = d$; otherwise zero ($\perp$). The purpose of this section is to store the database and carry out this search query on the public cloud. The server should learn nothing from the encrypted query, and no information other than the final result should be leaked to the user. Throughout this work, we use the semi-honest (honest but curious) adversary model, which is a standard assumption for the evaluation of homomorphic encryption.
Our main idea is the following encoding method of the database, which is suitable for the efficient computation of equality test and extraction: $DB(X) = \sum_{i=1}^{n} \alpha_i X^{d_i} \in R_t$. The user encrypts this polynomial with the RLWE public-key encryption scheme and stores the ciphertext $\mathrm{ct}_{DB}$ in the server. At the query phase, given a query tag $d$, the user encrypts the monomial $X^{-d}$ with the RGSW symmetric encryption scheme and sends the ciphertext $\mathrm{CT}_Q$ to the server. We assume that the RGSW encryption scheme has the same secret key sk as the RLWE encryption scheme.
Given two ciphertexts $\mathrm{CT}_Q \leftarrow \mathrm{RGSW.Enc}(X^{-d})$ and $\mathrm{ct}_{DB} \leftarrow \mathrm{RLWE.Enc}(DB(X))$, the server first performs their multiplication to obtain a ciphertext, denoted by $\mathrm{ct}_{\mathrm{mult}} = \mathrm{Hybrid.Mult}(\mathrm{CT}_Q, \mathrm{ct}_{DB})$. It follows from the previous section that $\mathrm{ct}_{\mathrm{mult}}$ is a valid RLWE encryption of the polynomial $DB(X) \cdot X^{-d} = \sum_{i=1}^{n} \alpha_i X^{d_i - d}$. Since we use the cyclotomic polynomial $\Phi_M(X) = X^N + 1$ of power-of-two degree, the polynomial ring $R$ has the property $X^N = -1$. Thus, for any tag $d$, the constant term of the polynomial $DB(X) \cdot X^{-d}$ is $\alpha_i$ if there is some index $i$ satisfying $d = d_i$, and zero otherwise. Now the server applies the RLWE.Conv algorithm on $\mathrm{ct}_{\mathrm{mult}}$ to compute an LWE encryption $\mathrm{ct}_{\mathrm{conv}}$ of this constant term. This conversion procedure not only prevents the leakage of information that has not been queried but also reduces the size of the output ciphertext by half. In addition, the (optional) modulus-switching procedure can be applied to get a ciphertext $\mathrm{ct}_{\mathrm{res}}$ with a smaller modulus and reduce the communication cost. Finally the user decrypts this LWE ciphertext and gets the desired value $\alpha_i$ or zero ($\perp$). Algorithm 1 summarizes the procedure of secure search-and-extraction.
Our method can be modified to support a secure comparison of data values using a hash (one-way) function. If hashed values of $\alpha_i$ are used as polynomial coefficients, our method returns a hashed value of $\alpha_i$ to the user instead of $\alpha_i$. The user may then check whether the resulting value and the hashed query value are the same without learning any other information about the database.
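The algebra behind the search can be followed on plaintexts alone. The sketch below performs no encryption; it only checks that multiplying the database polynomial by $X^{-d} = -X^{N-d}$ in $\mathbb{Z}_t[X]/(X^N + 1)$ moves the queried value into the constant term. The function names and the toy parameters ($N = 16$, $t = 257$) are assumptions for illustration.

```python
def negacyclic_mul(a, b, N, t):
    """Schoolbook product of two polynomials in Z_t[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % t
            else:
                out[k - N] = (out[k - N] - ai * bj) % t  # X^N = -1
    return out

def encode_db(pairs, N, t):
    """DB(X) = sum_i alpha_i * X^{d_i}, with distinct tags d_i < N."""
    coeffs = [0] * N
    for d, alpha in pairs:
        if not 0 <= d < N or not 0 < alpha < t:
            raise ValueError("tag must be in [0, N) and value in [1, t)")
        if coeffs[d] != 0:
            raise ValueError("tags must be distinct")
        coeffs[d] = alpha
    return coeffs

def query_monomial(d, N, t):
    """Coefficients of X^{-d}: 1 if d == 0, otherwise -X^{N-d}."""
    coeffs = [0] * N
    if d == 0:
        coeffs[0] = 1
    else:
        coeffs[N - d] = (-1) % t
    return coeffs

def search(db_poly, d, N, t):
    """Constant term of DB(X) * X^{-d}: alpha_i if d == d_i, else 0."""
    return negacyclic_mul(db_poly, query_monomial(d, N, t), N, t)[0]

N, t = 16, 257
db_poly = encode_db([(3, 27), (7, 100), (0, 5)], N, t)
hit = search(db_poly, 3, N, t)     # tag present in the database
miss = search(db_poly, 4, N, t)    # tag absent from the database
```

In the encrypted protocol the same product is computed by one hybrid multiplication, and only the constant term is returned to the user after conversion.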
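The hashed variant can be sketched in the clear as follows. The choice of SHA-256, the reduction into $\mathbb{Z}_t \setminus \{0\}$, and the dictionary lookup standing in for the encrypted search-and-extract step are all assumptions made for this illustration; the text does not specify them. Note that with a small $t$ distinct values may collide after reduction.

```python
import hashlib

def hash_to_plaintext(alpha, t):
    """Map an integer attribute to a nonzero element of Z_t (hypothetical)."""
    digest = hashlib.sha256(str(alpha).encode("ascii")).digest()
    return int.from_bytes(digest, "big") % (t - 1) + 1

t = 2**16
database = {3: 27, 7: 100}                       # tag -> value attribute
hashed_db = {d: hash_to_plaintext(a, t) for d, a in database.items()}

def lookup(d):
    """Stand-in for the encrypted search: hashed value, or 0 if no such tag."""
    return hashed_db.get(d, 0)

def matches(d, alpha):
    """User-side check after decryption: compare with the hashed query value."""
    return lookup(d) == hash_to_plaintext(alpha, t)

present = matches(3, 27)    # same tag and same value
absent = matches(4, 27)     # tag not in the database, lookup returns 0
```

Because hashed values are never zero in this encoding, a missing tag can never be mistaken for a match.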

Comparison with related work
Equality test has traditionally been considered difficult to perform under homomorphic encryption because of its large circuit depth [7,15,16]. These approaches evaluate the equality test on each encrypted tuple of the database, so at least $\Omega(n)$ homomorphic operations are required for searching a database of size $n$. In addition, the method of Boneh et al. [17] does not protect the database information from the users, that is, the whole database can be recovered from the resulting ciphertext of a query. In contrast, our method is very efficient in parameter size and complexity since it requires only a single hybrid multiplication. One limitation of this method is that the tags $d_i$ should be bounded by the ciphertext dimension $N$ in order to construct the encoding polynomial $DB(X)$. Since the dimension $N$ has a significant influence on the performance of the HE scheme, too large a value of $N$ makes the scheme impractical. In the next section, we describe how to overcome this problem in the application to genomic data.
Algorithm 1 Procedure of secure search-and-extraction
1: Database encryption: The data owner encodes the genomic information as $DB(X)$ and submits its encryption to the server: $\mathrm{ct}_{DB} \leftarrow \mathrm{RLWE.Enc}(DB(X))$.
2: Query encryption: The user encodes the query tag $d$ and sends its encryption to the server: $\mathrm{CT}_Q \leftarrow \mathrm{RGSW.Enc}(X^{-d})$.
3: Evaluation: The server computes $\mathrm{ct}_{\mathrm{mult}} \leftarrow \mathrm{Hybrid.Mult}(\mathrm{CT}_Q, \mathrm{ct}_{DB})$ and $\mathrm{ct}_{\mathrm{conv}} \leftarrow \mathrm{RLWE.Conv}(\mathrm{ct}_{\mathrm{mult}})$, optionally applies $\mathrm{ct}_{\mathrm{res}} \leftarrow \mathrm{LWE.ModSwitch}(\mathrm{ct}_{\mathrm{conv}})$, and returns the result to the user.
4: Decryption: The user decrypts the returned LWE ciphertext and obtains $\alpha_i$ or zero ($\perp$).

Secure searching of biomarkers
We return to our main goal of Task 3: secure outsourced matching of a set of biomarkers to encrypted genomes. We describe how to encode and encrypt the genotype information of a VCF file in order to apply the privacy-preserving database searching and extraction.
A VCF file contains multiple genotype information lines, each of which consists of a triple $(ch_i, pos_i, SNPs_i)$ of chromosome number, position, and a sequence of SNP alleles. A chromosome identifier $ch$ is one of 1 to 22, X, and Y. A non-negative integer $pos$ represents the reference position, with the first base having position 1, and $SNPs$ is a reference or alternate sequence in $\{A, T, G, C\}^*$. A query from the user is also a triple of the same form, and we aim to decide the absence/presence of this biomarker in the database file.
We represent the sex chromosomes X and Y as 0 and 23, respectively. Then we define an encoding function $E : \mathbb{Z} \times \mathbb{Z} \to \mathbb{Z}$ by $(ch, pos) \mapsto d = ch + 24 \cdot pos$.
In the following, we describe how to encode the SNPs. For convenience we set an upper bound on the length of the SNPs: let $n_{SNP}$ be the maximal number of reference (or alternate) alleles to be compared between the query genome and the user genome in the target database. Each base is represented by two bits as A → 00, T → 01, G → 10, C → 11, and the bit pairs are concatenated. Next we pad a 1 to the left of the bit string in order to mark the starting position of the SNPs. Finally it is zero-padded into a binary string of length $\ell_{SNP} = 2 \cdot n_{SNP} + 1$, and we convert it into an integer value, denoted by $\alpha_i$. If a single nucleotide variant at the given locus is not known, then it is encoded as the all-zero string. For example, 'GC' is encoded as the bit string 1|10|11, which is represented as the integer $11011_{(2)} = 27$. Now consider the case that we wish to encode the reference and alternate alleles together. Let $\alpha_i^{\mathrm{ref}}$ and $\alpha_i^{\mathrm{alt}}$ denote the integer encodings of the $n_{SNP}$ reference alleles and $n_{SNP}$ alternate alleles, respectively. Then we define an encoding $\alpha_i$ by the concatenation of the two encodings, i.e., $\alpha_i = 2^{\ell_{SNP}} \cdot \alpha_i^{\mathrm{ref}} + \alpha_i^{\mathrm{alt}}$ as an integer. Table 1 shows the format of the database file and illustrates some examples of encoded genomic data.
A database file is encoded as a set of pairs $(d_i, \alpha_i)$ for $i = 1, \ldots, n$ such that $d_i = E(ch_i, pos_i)$ and $\alpha_i$ is the encoded integer of the $i$-th SNP allele string. The encodings $d_i$ and $\alpha_i$ are then regarded as the data-tag and the value attribute, respectively. The data user constructs a polynomial $DB(X) = \sum_k c_k X^k$ such that $c_k = \alpha_i$ if $k = d_i$ for some $i$, and $c_k = 0$ otherwise. The user encrypts the polynomial with the RLWE public-key encryption scheme as described above.
The query genes are also encoded as a pair of integers $(d, \alpha)$; however, only the information of $d$ is encrypted, using the RGSW symmetric encryption scheme. That is, the user encrypts the monomial $X^{-d}$.
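The tag and allele encodings described above can be sketched in a few lines of Python. The function names are hypothetical, and representing an unknown variant by an empty string is an assumption of this sketch; the bit assignments, the leading marker bit, and the concatenation rule follow the text.

```python
BASE_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}

def encode_tag(ch, pos):
    """E(ch, pos) = ch + 24 * pos, with X mapped to 0 and Y mapped to 23."""
    ch_num = {"X": 0, "Y": 23}.get(ch)
    if ch_num is None:
        ch_num = int(ch)
    return ch_num + 24 * pos

def encode_snps(seq, n_snp):
    """Integer encoding of an allele string of at most n_snp bases."""
    if not seq:
        return 0                      # unknown variant: all-zero string
    if len(seq) > n_snp:
        raise ValueError("allele string longer than n_snp")
    # Leading 1 marks where the alleles start; left zero-padding up to
    # length 2 * n_snp + 1 does not change the integer value.
    bits = "1" + "".join(BASE_BITS[base] for base in seq)
    return int(bits, 2)

def encode_ref_alt(ref, alt, n_snp):
    """alpha = 2^l_snp * alpha_ref + alpha_alt with l_snp = 2 * n_snp + 1."""
    l_snp = 2 * n_snp + 1
    return (encode_snps(ref, n_snp) << l_snp) + encode_snps(alt, n_snp)

gc_code = encode_snps("GC", n_snp=2)          # bit string 1|10|11
tag_x = encode_tag("X", 5)                    # 0 + 24 * 5
combined = encode_ref_alt("G", "C", n_snp=2)  # (110 << 5) + 111 in binary
```

The marker bit is what distinguishes, for instance, 'A' (encoded as 100 in binary) from 'AA' (10000 in binary), which would otherwise both be all-zero.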

Results and discussion
In this section, we explain how to set the parameters and describe our optimization techniques for the implementation. We also present the results obtained with these techniques. The dataset was randomly selected from the Personal Genome Project. Our implementation is publicly available on GitHub [18].

How to set parameters
Since all the matching computation is performed on encrypted data in the cloud, the security against a semi-honest adversary follows from the semantic security of the underlying HE scheme. The security of the homomorphic encryption scheme relies on the hardness of the RLWE assumption. We derive a lower bound on the ring dimension as $N \ge \frac{\lambda + 110}{7.2} \cdot \log_2 Q$ to get a $\lambda$-bit security level from the security analysis of [11]. Given the ciphertext modulus $Q$, the estimation of noise growth during evaluation [12] and the decryption condition yield an upper bound on the plaintext modulus $t$ that ensures the correctness of decryption after computation. We therefore set $t$ to the largest power-of-two integer less than this upper bound. If the encodings of the allele strings are too large, we divide them into smaller integers so that each of them is smaller than $t$. Then we repeat the algorithm to construct the corresponding polynomial for each integer.
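As a worked instance of the dimension bound, the sketch below evaluates $N \ge \frac{\lambda + 110}{7.2} \cdot \log_2 Q$ for $\lambda = 128$. The value $\log_2 Q = 35$ is an assumption for illustration: it is the largest modulus size compatible with the gadget parameters $B_g = 128$, $d_g = 5$ (so that $B_g^{d_g} = 2^{35} \ge Q$), not a value stated for $Q$ itself.

```python
def min_ring_dimension(sec_bits, log2_q):
    """Lower bound N >= (lambda + 110) / 7.2 * log2(Q)."""
    return (sec_bits + 110) / 7.2 * log2_q

def next_power_of_two(x):
    """Smallest power of two that is at least x."""
    p = 1
    while p < x:
        p *= 2
    return p

bound = min_ring_dimension(sec_bits=128, log2_q=35)   # about 1156.9
ring_dim = next_power_of_two(bound)                    # 2048 = 2^11
```

Under this assumption the smallest admissible power-of-two dimension is $2^{11}$, which agrees with the $\log_2 N = 11$ used for the hashed tags and the FFT dimension $2N = 2^{12}$ reported in the implementation.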

Optimization techniques
As mentioned before, the ring dimension $N$ needs to be larger than the encoded integers $d_i$. However, the encoded integers $d_i$ from VCF files have a bit size of about 32, while a dimension $N$ with about $11 \le \log_2 N \le 16$ is considered appropriate for implementations of HE schemes to achieve both security and efficiency. Hence a direct application of our method to the VCF file would yield an impractical result.
For compression of the tag data and its re-randomization, we make use of a pseudo-random number generator $H(\cdot)$ which transforms a tag $d_i$ into a pair of non-negative integers $d_i^*$ and $d_i^\dagger$ less than $N$. Our implementation adopts SHA-3 and extracts $\log_2 N = 11$ bits of the hashed value for each of $d_i^*$ and $d_i^\dagger$. We construct two polynomials by Algorithm 2. Note that for any $1 \le i \le n$ and $(d^*, d^\dagger) \in \{0, \ldots, N - 1\}^2$, the pair of constructed polynomials $DB^*$ and $DB^\dagger$ satisfy [...] The procedure of database encoding for secure search of biomarkers is described in Algorithm 2.
[...] the correctness for the output ciphertext. We take the following parameters for the gadget matrix $G$: $B_g = 128$ and $d_g = 5$, so that they satisfy the condition $B_g^{d_g} \ge Q$. Each coefficient of the secret key sk is chosen at random from $\{0, \pm 1\}$, and we set the number of nonzero coefficients in the secret key to 64. As in the work of [12], we use a Gaussian distribution of standard deviation $\sigma = 1.4$ to sample random error polynomials.
For the efficiency of homomorphic multiplication, we also used an optimized library for complex FFT, namely the Fastest Fourier Transform in the West (FFTW) [19]. That is, we use a complex primitive $2N$-th root of unity rather than a primitive root in a prime field of order $Q$. We measured a running time of 0.804 s to set up the FFT environment at dimension $2N = 2^{12}$. The key generation of the two schemes takes about 0.247 ms in total. Table 2 presents the time complexity and storage for the evaluation of secure searching of biomarkers. All the experiments were performed on a single Intel Core i5 processor running at 2.9 GHz. The chosen parameters provide a security level of $\lambda = 128$ bits.
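The tag compression step can be sketched with Python's standard `hashlib`. The text states only that SHA-3 is used and that 11 bits are extracted for each of the two indices; the specific variant (SHA3-256), the 8-byte big-endian serialization of the tag, and the choice of the two lowest 11-bit windows are assumptions of this sketch, and the function name is hypothetical.

```python
import hashlib

def hash_tag(d, log2_n=11):
    """Map a tag d to a pair (d_star, d_dagger) of indices below N = 2^log2_n."""
    digest = hashlib.sha3_256(d.to_bytes(8, "big")).digest()
    h = int.from_bytes(digest, "big")
    mask = (1 << log2_n) - 1
    d_star = h & mask                   # lowest 11 bits of the hash
    d_dagger = (h >> log2_n) & mask     # next 11 bits of the hash
    return d_star, d_dagger

tag = 1 + 24 * 160000                   # E(ch=1, pos=160000) as an example tag
d_star, d_dagger = hash_tag(tag)
```

Two distinct tags collide in one index with probability about $1/N$ but in both indices with probability about $1/N^2$, which is presumably why a pair of polynomials is used rather than a single one.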

Conclusions
In this work, we suggested an efficient method to securely search for a query tag and extract the corresponding value from a database using a hybrid GSW homomorphic encryption scheme. We came up with a solution to the secure outsourced matching problem by using polynomial encoding and extraction of the desired value based on the multiplication of an RGSW ciphertext and an ordinary RLWE ciphertext. We then applied this method to find a set of biomarkers in DNA sequences.
Our solution shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment. We list a few open problems that remain. First, we only considered the semi-honest adversary model in this work. Other tools such as a homomorphic authenticated scheme may lead to more efficient protocols in the malicious setting. Another issue is to support k multiple queries while keeping the computation and communication cost below k times that of a single query. We expect much faster performance to be achievable by enabling a batching method.