 Research
 Open Access
 Published:
Optimized homomorphic encryption solution for secure genomewide association studies
BMC Medical Genomics volume 13, Article number: 83 (2020)
Abstract
Background
GenomeWide Association Studies (GWAS) refer to observational studies of a genomewide set of genetic variants across many individuals to see if any genetic variants are associated with a certain trait. A typical GWAS analysis of a disease phenotype involves iterative logistic regression of a case/control phenotype on a singleneuclotide polymorphism (SNP) with quantitative covariates. GWAS have been a highly successful approach for identifying geneticvariant associations with many poorlyunderstood diseases. However, a major limitation of GWAS is the dependence on individuallevel genotype/phenotype data and the corresponding privacy concerns.
Methods
We present a solution for secure GWAS using homomorphic encryption (HE) that keeps all individual data encrypted throughout the association study. Our solution is based on an optimized semiparallel GWAS compute model, a new ResidueNumberSystem (RNS) variant of the CheonKimKimSong (CKKS) HE scheme, novel techniques to switch between data encodings, and more than a dozen cryptoengineering optimizations.
Results
Our prototype can perform the full GWAS computation for 1,000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node (with 28 cores). Our solution for a smaller dataset was awarded cofirst place in iDASH’18 Track 2: “Secure Parallel Genome Wide Association Studies using HE”.
Conclusions
Many of the HE optimizations presented in our paper are generalpurpose, and can be used in solving challenging problems with large datasets in other application domains.
Background
GenomeWide Association Studies (GWAS) refer to observational studies of a genomewide set of genetic variants across many individuals to see if any genetic variants are associated with a certain trait. When applied to human data, GWAS typically focus on associations between singlenucleotide polymorhisms (SNPs) and a quantitative or dichotomous disease outcome, as well as a number of quantitative covariates. However, the reliance on full genotype and phenotype data across thousands of samples raises major privacy concerns for GWAS, and has limited their applicability.
Recent work has focused on secure multiparty computation algorithms to facilitate privacypreserving GWAS, but this approach requires resourceheavy, continuous interactions between users which is impractical for GWAS studies that are aggregated over months or years. To motivate the cryptographic community, the iDASH’18 Organizing Committee ran a special competition track “Secure Parallel Genome Wide Association Studies using Homomorphic Encryption (HE)” to advance the state of the art in GWAS using HE, which is a noninteractive approach to secure computing.
This paper presents our HEbased solution to GWAS. Our solution is based on an optimized GWAS compute model, a new ResidueNumberSystem (RNS) variant of the CheonKimKimSong (CKKS) HE scheme, novel techniques to switch between data encodings, and more than a dozen cryptoengineering optimizations. The solution can perform the full GWAS computation for 1000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node (with 28 cores).
Related work
Several other RNS variants of the CKKS HE scheme were independently proposed in 2018. These include the work by Cheon et al. [1], the implementation in Microsoft SEAL 3.0 (released in October 2018), and the variants developed by other teams who submitted their GWAS solutions to the iDASH’18 competition, including UCSD [2] and IBM Research.
Methods
Semiparallel approach of Sikorska et al. [3]
Logistic regression is widely used to model binary response data in GWAS. For instance, it can be used to examine the relationship between disease status (control versus real cases) with respect to phenotypes (age, weight, height, etc.) and genotypes (such as SNP variations). Let y_{i} denote the disease status for the i^{th} individual in a sample of size N (y_{i}=1 if the individual is a disease case, and y_{i}=0 otherwise), and \(\left (\vec x_{i}^{\prime }, \vec s_{i}\right)\) be the corresponding predictor, where \(\vec x_{i}^{\prime } \in \mathbb {R}^{K}\) corresponds to the phenotypes and \(\vec s_{i} \in \{0,1,2\}^{M}\) to the genotypes of individual i for a set of K phenotypes and M SNPs. The logistic regression model expresses the relationship between y_{i} and the predictor set \(\left (\vec x_{i}^{\prime }, \vec s_{i}\right)\) in terms of the conditional probability \(Pr\left (Y = y_{i}\vec x_{i}^{\prime }, \vec s_{i}\right)\) of disease, as:
where σ is the logistic function, \(\sigma (x) = \frac {1}{1+\exp {(x)}}\); \(\theta _{0}^{\prime } \in \mathbb {R}, \boldsymbol {\theta }^{\prime } \in \mathbb {R}^{K}\) and \(\boldsymbol {\beta } \in \mathbb {R}^{M}\) are the K+M+1 parameters to be determined. For the sake of simplicity, we adopt the canonical notation, that is, \(\boldsymbol {\theta } \equiv \left (\theta _{0}^{\prime }, \boldsymbol {\theta }^{\prime } \right) \in \mathbb {R}^{K+1}\) and \(\vec x_{i} \equiv \left (1, \vec x_{i}^{\prime }\right) \in \mathbb {R}^{K+1}\) for i=1,…,N.
Assuming that the effect of each SNP is independent of each other, it is possible to formulate it as a set of M independent equations, i.e., decompose the computation into M independent logistic regression cases for K+1 parameters. Sikorska et al. [3] proposed a “semiparallel” approach to speed up the logistic regression in the above scenario. The goal is to avoid looping over each SNP by using a vectorized formulation, which includes optimized vector and matrix operations, that allows performing multiple identical actions over different data in a single operation.
The method relies on the assumption that the covariant parameters θ are nearly the same for all SNPs. This assumption allows the reformulation of fitting N vectors in \(\mathbb {R}^{K+1}\), followed by a onestep calculation for M SNPs at once. Therefore Sikorska’s semiparallel logistic regression consists of 2 stages:

1
Estimate the coefficients of the clinical covariates, \(\boldsymbol {\theta } \in \mathbb {R}^{K+1}\);

2
For each of the M SNPs, estimate the corresponding coefficients \(\hat {\boldsymbol {\beta }}\) and pvalue \(\vec p \in \mathbb {R}^{M}\).
The first stage, the estimation of \(\boldsymbol {\theta }, \hat {\boldsymbol {\theta }}\), was widely addressed in the literature, in particular in the iDASH’17 secure genome analysis competition [4–8].
The second stage, the estimation of the SNPcoefficients \(\hat {\boldsymbol {\beta }}\), approximates the optimization problem by a single NewtonRaphson iteration, leading to
where X is a matrix in \(\mathbb {R}^{N \times (K+1)}\) whose rows are the vectors \(\vec x_{i}, i = 1, \ldots, N\); \(\mathbf {W} \in \mathbb {R}^{N \times N}\) is a diagonal matrix with ω_{ii}=ρ_{i}(1−ρ_{i}) and \(\rho _{i} = \sigma \left (\vec x_{i} \cdot \hat \theta ^{(t)}\right)\) for i=1,…,N; H=X^{⊤}WX in \(\mathbb {R}^{(K+1) \times (K+1)}\); \(\zeta _{i} = \log \left (\frac {\rho _{i}}{1\rho _{i}}\right) + \frac {y_{i}\rho _{i}}{\omega _{ii}}, i = 1, \ldots, N\).
Finally, the zvalue for each parameter β_{j}, for j=1,…,M, is given by \(z_{j} = \frac {\hat {\boldsymbol {\beta }}_{j}}{\epsilon _{j}}\), where \(\epsilon _{j} = \sqrt {\left (\mathbf {C}^{1} \right)_{jj}}\) is the error associated to \(\hat {\boldsymbol {\beta }}_{j}\) and C=S^{⊤}W(S−XH^{−1}(X^{⊤}WS)). A more compact expression of it is
with
where H^{†} denotes the adjoint of H.
Our approximations
To optimize the efficiency of our HE solution, we introduced several approximations to the semiparallel method of Sikorska et al. [3].
Logistic regression
We found that the gradient descent method is adequate for estimating θ. Starting from an initial θ^{(0)}, the gradient descent method at each iteration t updates the estimation of the regression parameters
where α_{t} is the learning rate at the tth iteration. Our numerical experiments suggest that a single iteration of the gradient descent procedure with α_{0}=0.015 and θ^{(0)}=0 provides adequate accuracy. For simplicity, we denote α_{0} as α in the rest of the paper.
Logistic function approximation
We used Chebyshev polynomials to approximate the logistic function [9]. From the analysis we performed, we found that a degree1 approximation σ(x)=0.5+0.15625x provides results with sufficient accuracy. Please refer to “Analysis of our approximations” section for further details.
Approximation of ζ
In order to approximate ζ, we considered a Talyor series expansion around \(p = \frac {1}{2}\):
Matrix inversion and division
Instead of calculating the inverse of the matrix H, Cramer’s rule was used: \(\mathbf {H}^{1} = \frac {\text {adj}(\mathbf {H})}{\det (\mathbf {H})}\), where adj(H) is the adjoint of matrix H and det(H) is its determinant. As the division is an expensive operation, it was deferred to a later stage (after decryption).
pvalue calculation
After computing the zvalues on the server, the pvalue computation is performed on the client as depicted in Algorithm 2.
Full procedure
The approximations described above were used to create an optimized procedure for the server computation (Algorithm 1). Note that line ?? of Algorithm 1 is the closed form for ρ that incorporates the parameter estimation of the logistic regression. Therefore \(\hat {\boldsymbol {\theta }}\) does not appear explicitly in Algorithm 1.
The annotated encrypted procedure is presented in Algorithm 3. It will be referenced throughout the rest of this section.
CKKS scheme
Our solution is based an optimized variant of the CheonKimKimSong scheme [10]. We have developed a DoubleChinese Remainder Theorem (CRT), a.k.a, Residue Number System (RNS), variant of the original scheme. Our variant is based on the same security assumptions as the original scheme, but relies on native 64bit integer arithmetic instead of multiprecision integer arithmetic for better performance and parallelization.
The original CKKS scheme is formulated for cyclotomic polynomial rings \({\mathcal R} = {\mathbb Z}[x]/\left < x^{n} +1 \right >\), where n is a ring dimension that is a power of two (CKKS also supports general cyclotomic rings but they are typically less efficient). The current ciphertext modulus is typically defined as Q_{ℓ}=2^{ℓ}, i.e., the scheme works with residue rings \(\mathcal R_{\ell } = \mathcal R/Q_{\ell } \mathcal R = {\mathbb Z_{2^{\ell }}}[x]/\left < x^{n} +1 \right >\). The algorithms are [10]:

Setup(1^{λ}). For an integer L that coresponds to the largest ciphertext modulus level, given the security parameter λ, output the ring dimension n. Set the small distributions χ_{key},χ_{err}, and χ_{enc} over \(\mathcal R\) for secret, error, and encryption, respectively.

KeyGen. Sample a secret s←χ_{key}, a random a→R_{L}, and error e←χ_{err}. Set the secret key sk←(1,s) and public key \(\textbf {pk} \leftarrow (b,a) \in \mathcal R^{2}_{L}\), where b←−as+e(mod Q_{L}).

KSGen_{sk}(s^{′}). For \(s' \in \mathcal {R}\), sample a random \(a' \leftarrow \mathcal {R}_{2 \cdot L}\) and error e^{′}←χ_{err}. Output the switching key as \(\textbf {swk} \leftarrow \left (b', a'\right) \in \mathcal {R}^{2}_{2L}\), where b^{′}←−a^{′}s^{′}+e^{′}+Q_{L}s^{′}(mod Q_{2L}). Set evk←KSGen_{sk}(s^{2}). Set rk^{(κ)}←KSGen_{sk}(s^{(κ)}).

Enc_{pk}(m). For \(m \in \mathcal {R}\), sample v←χ_{enc} and e_{0},e_{1}←χ_{err}. Output ct←v·pk+(m+e_{0},e_{1})(mod Q_{L}).

Dec_{sk}(ct). For \(\textbf {ct} = (c_{0},c_{1}) \in \mathcal {R}_{\ell }^{2}\), output \(\tilde {m} = c_{0} + c_{1} \cdot s \left (\bmod Q_{\ell } \right)\).

CAdd(ct,c). For \(\textbf {ct} = (b, a) \in \mathcal {R}_{\ell }^{2}\) and \(c \in \mathcal {R}\), output ct_{cadd}←(b+c,a)(mod Q_{ℓ}).

Add(ct_{1},ct_{2}). For \(\textbf {ct}_{1},\textbf {ct}_{2} \in \mathcal {R}_{\ell }^{2}\), output ct_{add}←ct_{1}+ct_{2}(mod Q_{ℓ}).

CMult(ct,c). For \(\textbf {ct} \in \mathcal {R}_{\ell }^{2}\) and \(c \in \mathcal {R}\), output ct_{cmult}←c·ct(mod Q_{ℓ}).

Mult_{evk}(ct_{1},ct_{2}). For \(\textbf {ct}_{i} = (b_{i},a_{i}) \in \mathcal {R}_{\ell }^{2}\), let (d_{0},d_{1},d_{2})=(b_{1}b_{2},a_{1}b_{2}+a_{2}b_{1},a_{1}a_{2})(mod Q_{ℓ}). Output \(\textbf {ct}_{\text {mult}} \leftarrow (d_{0}, d_{1}) + \lfloor Q_{L}^{1} \cdot d_{2} \cdot \textbf {evk} \rceil \left (\bmod Q_{\ell } \right)\).

\(\textsc {Rotate}_{\textbf {rk}^{(\kappa)}}(\textbf {ct},\kappa)\). For \(\textbf {ct} = (b,a) \in \mathcal {R}_{\ell }^{2}\) and rotation index κ, output \(\textbf {ct}_{\text {rotate}} \leftarrow \left (b^{(\kappa)}, 0\right) + \lfloor Q_{L}^{1} \cdot a^{(\kappa)} \cdot \textbf {rk}^{(\kappa)} \rceil \left (\bmod Q_{\ell } \right)\).

ReScale (ct, p). For a ciphertext \(\textbf {ct} \in \mathcal {R}_{\ell }^{2}\) and an integer p, output ct^{′}←⌊2^{−p}·ct⌉(mod (Q_{ℓ}/2^{p})).
The CKKS scheme supports an efficient packing of r (up to n/2) real numbers into a single ciphertext. The encoding and decoding operations are defined as follows:

Encode (w, p). For \(w \in \mathbb {R}^{r}\), output the polynomial \(m \leftarrow \lfloor \phi (2^{p} \cdot \mathbf {w}) \rceil \in \mathcal {R}\).

Decode (m, p). For a plaintext \(m \in \mathcal {R}\), output the polynomial \(\mathbf {w} \leftarrow \phi ^{1}\left (m/2^{p}\right) \in \mathbb {R}^{r}\).
Here, ϕ(x) is a certain complex canonical embedding map, which is similar conceptually to inverse Fourier transform.
Our RNS variant of the CKKS scheme
Our CKKS variant performs all operations in RNS. In other words, the poweroftwo modulus Q_{ℓ}=2^{ℓ} is replaced with \(\prod _{i=1}^{\ell } q_{i}\), where q_{i} are samesize prime moduli satisfying q_{i}≡1 mod 2n (for efficient number theoretic transforms (NTT) that convert nativeinteger polynomials w.r.t. each CRT modulus from coefficient representation to the evaluation one, and vice versa). The primes are chosen to be as close to 2^{p} as possible to minimize the error introduced by rescaling.
The two major changes in our variant compared to the original CKKS scheme deal with rescaling and key switching. We also made two other minor changes. First, we use the ternary random discrete distribution for χ_{key} and χ_{enc} instead of the sparse distributions as the lattice attacks for this case are better studied, and the ternary distribution is included in the HE standard [11]. Second, we do additional scaling of plaintexts and ciphertexts to support the use of RNS (only native integer arithmetic) during encoding/decoding.
Rescaling in RNS
To efficiently perform rescaling in RNS from Q_{ℓ} to Q_{ℓ−1}, we replace the scaling down by 2^{p} with scaling down by q_{ℓ}. We choose all q_{i}, where i∈[L], such that 2^{p}/q_{i} is in the range (1−2^{−ε},1+2^{−ε}), where ε is kept as small as possible. To minimize the cumulative approximation error growth in deeper computations, we also alternate q_{i} w.r.t. 2^{p}. For instance, if q_{1}<2^{p}, then q_{2}>2^{p} and q_{3}<2^{p}, etc.
The new rescaling operation to scale down by one level is defined as

ReScaleRNS (ct). For a ciphertext \(\textbf {ct} \in \mathcal {R}_{\ell }^{2}\), output \(\textbf {ct}' \leftarrow \lfloor q_{\ell }^{1} \cdot \textbf {ct} \rceil \left (\bmod Q_{\ell 1} \right)\).
We derive the procedure for computing \( \lfloor q_{\ell }^{1} \cdot \textbf {ct} \rceil \left (\bmod Q_{\ell 1} \right)\) using the CRT scaling technique proposed in [12]. Consider the following CRT representation of a multiprecision integer \(x \in \mathbb {Z}_{Q_{\ell }}\):
where
Then we can write
After rounding and applying the modulo reduction, the last term is removed yielding
The first term can be directly computed in RNS by summing up the products of x_{i} and \(q_{\ell }^{1} \left (\bmod q_{i}\right)\). For the second term, we precompute the residues of \(\left \lfloor \frac {\tilde {q}_{\ell } q^{*}_{\ell }}{q_{\ell }} \right \rfloor \) and multiply them by the corresponding residues of x_{ℓ} during rescaling. Then we add the fractional part, which has the residue of ⌊x_{ℓ}/q_{ℓ}⌉, i.e., 0 or 1, for each CRT modulus q_{i}. Note that the fractional part is negligibly small and hence can be excluded from the implementation.
The computational complexity of rescaling is determined by the computation in the second term of (2). We first need to run one native inverse NTT for residues w.r.t. q_{ℓ} and then ℓ−1 native NTTs to go back to the evaluation representation. All the computations in the first term of (2) are done directly in evaluation representation. Therefore, each rescaling operation requires ℓ nativeinteger NTTs.
The maximum approximation error introduced by rescaling from ℓ to ℓ−1 is \(\left \vert q_{\ell }^{1} \cdot m  2^{p} \cdot m \right \vert \le 2^{\epsilon } \cdot \left \vert 2^{p} \cdot m \right \vert \).
This procedure can be easily generalized to support scaling down by multiple CRT moduli. This case is similar to the first stage of complex scaling in CRT representation described in Section 2.4 of [12].
Key switching
For key switching, we use the CRT decomposition key switching algorithm that was originally proposed in [13] and improved in [12] for the Brakerski/FanVercauteren (BFV) scheme. The advantages of this technique vs. the one used in the original CKKS scheme (initially proposed for the BrakerskiGentryVaikuntanathan scheme in [14]) are that this technique has lower computational complexity for relatively small numbers of levels (up to 8 or so), and does not require an approximately twofold increase in the ring dimension to support the appropriate lattice security level. Both of these benefits were important for our solution.
The operations of the CKKS scheme that are modified by the key switching procedure are rewritten as:

KSGenRNS_{sk}(s^{′}). For \(s' \in \mathcal {R}\), sample a random \(a_{i}' \leftarrow \mathcal {R}_{L}\) and error ei′←χ_{err}. Output the switching key as \(\textbf {swk} \leftarrow \left \{ \left (b'_{i}, a'_{i}\right)\right \}_{i \in [L]} \in \mathcal {R}^{2 \times L}_{L}\), where \(b^{\prime }_{i} \leftarrow a'_{i} s' + e'_{i} + {\tilde {q}_{i}} \cdot {q_{i}^{*}} \cdot s' \left (\bmod Q_{L} \right)\). Set evk←KSGenRNS_{sk}(s^{2}). Set rk^{(κ)}←KSGenRNS_{sk}(s^{(κ)}).

MultRNS_{evk}(ct_{1},ct_{2}). For \(\textbf {ct}_{i} = (b_{i},a_{i}) \in \mathcal {R}_{\ell }^{2}\), let (d_{0},d_{1},d_{2})=(b_{1}b_{2},a_{1}b_{2}+a_{2}b_{1},a_{1}a_{2})(mod Q_{ℓ}). Decompose d_{2} into its CRT components \(\phantom {\dot {i}\!}[d_{2}]_{q_{i}}\) and output
$$\textbf{ct}_{\text{mult}} \leftarrow (d_{0}, d_{1}) + \sum\limits_{i=1}^{\ell} [d_{2}]_{q_{i}} \cdot \textbf{evk}_{i} \left(\bmod Q_{\ell} \right). $$ 
\(\textsc {RotateRNS}_{\textbf {rk}^{(\kappa)}}(\textbf {ct},\kappa)\). For \(\textbf {ct} = (b,a) \in \mathcal {R}_{\ell }^{2}\), output
$$\textbf{ct}_{\text{rotate}} \leftarrow \left(b^{(\kappa)}, 0\right) + \sum\limits_{i=1}^{\ell} \left[a^{(\kappa)}\right]_{q_{i}} \cdot \textbf{rk}^{(\kappa)}_{i} \left(\bmod Q_{\ell} \right), $$where \(\left [a^{(\kappa)}\right ]_{q_{i}}\) are CRT components of a^{(κ)}.
Each keyswitching operation requires one inverse NTT (ℓ nativeinteger NTTs) to switch d_{2} (or a^{(κ)} for rotation) from evaluation to coefficient representation and then ℓ NTTs (ℓ^{2}−ℓ nativeinteger NTTs) to go back to evaluation representation for each CRT component. Hence, the total complexity in terms of nativeinteger NTTs is ℓ^{2}.
This key switching procedure also supports a second level of decomposition by extracting basew digits in each residue using the procedure described in Appendix B.1 of [13].
Noise estimates
We present here heuristic noise estimates for the RNS variant of CKKS using the canonical embedding norm, which corresponds to the infinity norm for the evaluation of a polynomial \(\mathcal {R}\) at 2n complex roots of unity. For more details on the canonical embedding mapping and norm, the reader is referred to [10]. The main differences between our expressions and those in [10] are due to the use of ternary uniform distribution and a different key switching technique.

Encoding and Encryption. The bound for fresh encryption \(B_{\text {clean}}=6 \sigma \left (4 \sqrt {3} n + \sqrt {n} \right)\), where σ is the standard deviation for error distribution. The decoding is correct as long as 2^{p}>n+2B_{clean}.

Addition. The bound for homomorphic addition B_{add}=B_{1}+B_{2}, where B_{i} is the noise bound for ith ciphertext.

Rescaling. The noise bound for rescaling is \(B_{\text {rescale}} = q_{\ell }^{1} \cdot B + B_{\text {scale}}\), where B is the input noise and \(B_{\text {scale}}=\sqrt {3} \left (12 n + \sqrt {n} \right)\).

Rotation. The noise bound for rotation (key switching) is \(B_{\text {ksw}}=\frac {8}{\sqrt {3}} \cdot n \sigma w \left \lceil \log _{w} {q_{\ell }} \right \rceil \).

Multiplication. If we have two ciphertexts ct_{1} and ct_{2} with \(\left \lVert m_{1} \right \rVert ^{\text {can}}_{\infty } < \nu _{1}\), noise bound B_{1} and \(\left \lVert m_{2} \right \rVert ^{\text {can}}_{\infty } < \nu _{2}\), noise bound B_{2}, respectively, the noise bound B_{mult}=ν_{1}B_{2}+ν_{2}B_{1}+B_{1}B_{2}+B_{ksw}.
In most cases, the parameter selection is determined by the multiplicative depth and the approximation error in rescaling. The approximation error (with about ε bits being “erased” by rescaling) dominates the noise growth of other operations and should be done last (after a multiplication). The only practical exception is when rotations are performed before any multiplications. In this case, the key switching noise may be high if the wbase is large, e.g., comparable to 2^{p} as in the case of CRT decomposition without further digit decomposition of each residue.
Comparison to the RNS variant by Cheon et al. [1]
Both our RNS variant of CKKS and the variant proposed by Cheon et al. [1] work with an RNS basis consisting of nativeinteger primes q_{i} that are close to 2^{p} (with ε bits of precision). In other words, scaling down by 2^{p} is replaced with approximate scaling down by q_{ℓ}. Hence the rescaling approach in both variants is similar. The techniques for the scaling operation itself are different, but the computational complexity of both scaling techniques appears to be the same (requiring ℓ nativeinteger NTTs).
The key switchingprocedure developed in [1] is based on the approach originally proposed for the BrakerskiGentryVaikuntanathan scheme [14], which requires doubling the ciphertext modulus (and roughly doubling the ring dimension). We use the residue/digit decomposition approach originally proposed in [13] and improved in [12]. Our keyswitching technique requires more NTTs but provides better overall performance for relatively “shallow” circuits (our estimates suggest this approach should be faster up to 8 levels or so).
Plaintext encoding
Our solution uses two kinds of plaintext encoding. Initially, X and y are packed in single ciphertexts similar to how it was done in [8]. We denote this as packedmatrix encoding. All matrix products in steps 2 through 8 of Algorithm 3 use the rotationbased SUMROWVEC and SUMCOLVEC procedures from [5]. Later in the algorithm (starting from step 9), the solution switches to singleinteger ciphertexts for X and the vectors and matrices derived from X and y. We call the latter encoding as packedinteger encoding. As a result of this, our matrix operations with the SNPs data (first appearing in step 9) involve only cheap SIMD multiplications and additions of packedinteger and packedrowvector ciphertexts, and do not involve any expensive rotations. All operations before computing on the SNPs data are performed using packedmatrix (single) ciphertexts.
Packedmatrix encoding
The packedmatrix encoding packs a full matrix or vector into a single ciphertext, cloning as many entries as needed to support matrixmatrix and matrixvector products. The cloning makes it possible to minimize the number of computationally expensive rotations in matrixmatrix (vector) products.
We encode/encrypt both X and X^{⊤} to avoid calling transposition in the encrypted domain. We pack \(\mathbf {X} \in \mathbb {R}^{N \times k}\) in a rowwise order, cloning each row k−1 times before going to the next row. Here, we introduce k=K+1 for brevity.
We pack \(\mathbf {X}^{\top } \in \mathbb {R}^{k \times N}\) by taking each element of matrix X (marshalling it in the rowwise order) and cloning it to form a complete row.
Both matrices require N·k^{2} slots.
We pack \(\mathbf {y} \in \mathbb {R}^{N}\) columnwise by cloning yk^{2}−1 times to the right. That is, we have
The resulting vector ρ is represented the same way as y. Both use N·k^{2} slots.
The diagonal matrix W is represented as a vector by extracting the diagonal, and the resulting vector is packed in the same format as ρ.
The SNPs matrix S is encoded either as an array of ciphertexts (when M>n/2) or a single ciphertext (when M≤n/2) without any cloning, i.e., the classical SIMD packing of vectors is used.
Matrices and vectors, such as X and y, can be encoded in a single ciphertext as long as N·k^{2}≤n/2. If this condition does not hold, the packing can be trivially extended to multiple ciphertexts per matrix/vector.
Packedinteger encoding
To support efficient matrix multiplication without rotations, we also encode X as N·k singleinteger ciphertexts. In this case, each entry of X is cloned to all slots of a single ciphertext. We denote such packing of X as X_{1}.
Conversion from packedmatrix to packedinteger encoding
The main bottleneck of our solution is the conversion of vectors from a packedmatrix ciphertext to multiple packedinteger ciphertexts. We have developed and implemented three different methods for performing this conversion. Based on the requirements for performance and scalability, we chose one of these methods for our prototype.
To illustrate the problem and its solutions, we consider the task of converting the packedmatrix singleciphertext encryption of y into N packedinteger ciphertexts. A similar task has to be executed twice in our algorithm for secure GWAS.
Method 1: N⌈logn⌉ rotations
Our first solution can be summarized as follows:

1
Fill all n/2 slots of y by cloning existing N·k^{2} slots. This requires \(\log \left (n/\left (2 \bar {N} \cdot k^{2} \right) \right)\) rotations and additions. The cloning procedure is described in [8]. Here, \(\bar {N} = 2^{\lceil \log N\rceil }\).

2
Run N bit mask multiplications to form N ciphertexts each containing \(n/\left (2 \bar {N}\right)\) cloned values for each component of y. All other slots are zeroed out.

3
Clone existing \(n/\left (2 \bar {N}\right)\) nonzero values to all slots in each of the N ciphertexts. This operation requires N⌈logN⌉ rotations and additions, and is the main bottleneck of the computation.
Method 2: \(\bar {N}\) rotations and ⌈logN⌉ depth increase
The idea of our second solution is to represent the conversion as a binary tree. At each level i of the tree we perform i rotations, 4·i bit mask multiplications, and 2·i additions, getting two output ciphertexts from each input ciphertext. Although this recursive method requires only \(\bar {N}\) rotations, 4\(\bar {N}\) bit mask multiplications, and 2\(\bar {N}\) additions, there is a ⌈logN⌉ depth increase due to bit mask multiplications at each level of the binary tree.
To illustrate this approach, consider a simpler case (the logic would stay the same when we clone y_{i} any number of times):
First rotate by 1 and get
Then multiply both y and Rot_{1}(y) by M_{1}=[101010⋯10] and M_{2}=[010101⋯01], and sum up two possible combinations, yielding
Next compute Rot_{2}(y_{1,1}) and Rot_{2}(y_{1,2}), multiply y_{1,1} and y_{1,2} and their rotations by [110011⋯1100] and [001100⋯0011] for each pair, and sum up four possible combinations. Now there are 4 y_{2,i} items.
We recursively execute this procedure until the end.
Method 3: \(\bar {N}^{2}\) bit mask multiplications and \(\bar {N}\) rotations
Another approach achieving N rotations can be summarized as follows:

1
Fill all n/2 slots of y by cloning existing N·k^{2} slots.

2
Compute \(\bar {N}1\) cheap rotations of the original ciphertext using the hoisting procedure from [15].

3
For each component of y, do \(\bar {N}\) bit mask multiplications (one per rotation) that would extract the component and zero out all other slots.

4
For each component of y, do \(\bar {N}1\) additions of masked ciphertexts.
Although this procedure requires only roughly \(\bar {N}\) cheap rotations, it involves \(\bar {N}^{2}\) bit mask multiplications and additions, which now become the main bottleneck for relatively large values of N.
Comparison of the methods
We implemented all three methods, and carried out both complexity and practical performance comparison.
As N is relatively large (at least 245), \(\bar {N}^{2}\) bit mask multiplications in Method 3 resulted in computation runtimes that are at least 2x3x larger than Method 1 with N⌈logN⌉ rotations. However, Method 3 would be faster for smaller N, e.g., less than 100.
Method 2 is a good option only when the depth increase can be incorporated in the existing circuit without increasing the overall circuit depth. But the scalability of this approach is questionable. The depth increase of ⌈logN⌉ = 8 could not be integrated in the circuit of our solution, and thus we chose Method 1 for our implementation.
Note that in our implementation the depth cost of bit mask multiplication is the same as for homomorphic multiplication, which implies there is room for improvement. Therefore, a more depthefficient bit mask multiplication procedure may result in a significantly better performance for Method 2, possibly superior to that of Method 1.
Minimizing the number of key switching operations
One of the optimization goals for our solution is to reduce the number of key switching operations, which are used both for rotation and relinearization (after homomorphic multiplication). Each such operation has a high computational complexity, i.e., requires ℓ^{2} nativeinteger NTTs. We have optimized our algorithm to minimize the number of key switching operations. For instance, all computations involving encrypted SNPs data require only 16 (k^{2}) key switching operations in total. A great majority of the computations involving encrypted SNPs data use only “cheap” SIMD multiplications and additions, and sparingly rescaling operations.
Multiplications with lazy or no relinearization
In steps 9 through 11 of Algorithm 3, our procedure calls only 16 (k^{2}) relinearizations. In other words, all largedimension SIMD products are performed without relinearization (the ciphertext size is allowed to grow). The procedure calls the relinearization procedure only when multiplying by B_{1} in step 9, which works with the smallest dimension (k) in the chained matrix product. We refer to this deferred relinearization as “lazy” relinearization. Any homomorphic multiplications after this product are performed without a single relinearization, which significantly reduces the runtime of computation.
Use of additions instead of rotations
The packedinteger encoding is introduced in steps 9 through 11 of Algorithm 3 to replace any rotationbased summations over rows/columns with SIMD homomorphic additions. The only places where the rotations are used are to homomorphically convert B,W, and (Wζ^{∗}) from packedmatrix encoding to the packedinteger one. The use of rotationbased summation in the chained product of step 9 would require a substantially larger number of rotations as compared to the conversion of two vectors of size N and one matrix of size k×k.
Minimizing the number of NTTs
Besides key switching, NTTs are used for rescaling. In some cases, expensive rotations can be replaced with hoisted automorphisms from [15], reducing the number of NTTs for multiple rotations of the same ciphertext to the NTT cost of a single rotation. Our solution minimizes the number of rescaling operations and uses hoisted automorphisms where applicable.
Use rescaling sparingly
We use the following techniques to minimize the number of rescaling operations:

When there are homomorphic multiplications followed by aggregation of ciphertexts, such as addition of multiple ciphertexts, we apply rescaling after the aggregation, i.e., we call it once rather than for every homomorphic multiplication.

If there is a benefit in lazy rescaling, e.g., when the number of ciphertexts at the following level is much smaller, we defer rescaling until later. In this case, we have to make sure the depth requirement is not increased, which is true when one of the multiplicands is scaled w.r.t. 2^{p} rather a power of it.

The rescaling operations are not called at the end of computation if skipping them does not increase the multiplicative depth of the circuit.
Hoisted automorphisms
Hoisted automorphisms are useful when multiple rotations of the same ciphertext need to be computed [15]. Our solution encounters this scenario when computing the matrix inversion of H in steps 6 and 7 of Algorithm 3, and hence the hoisted automorphisms are used there in favor of regular rotations.
Minimizing the noise growth and ciphertext modulus
We minimized the noise growth/ciphertext modulus of the computation circuit using the following techniques:

Binary tree multiplication was employed for any chained products of ciphertexts.

Closedform expressions (such as in step 2 of Algorithm 3) were derived to get the maximum benefit from binary tree multiplication.

Binary tree addition for any summation of a large number of ciphertexts was employed to achieve a O(logN) noise growth.

To guarantee that the end result of the computation requires only one nativeinteger polynomials, we multiplied both numerator and denominator by estimated scaling factors (different from 2^{p}). These factors were introduced during bit mask multiplications to avoid any extra depth increase due to this additional scaling.

The maintenance operations of HE, such as key switching and rescaling, were properly ordered to minimize the noise growth. For instance, rescaling was done after the rotations following a multiplication (not before).
Harnessing the CRT ladder
As the circuit evaluation progresses, the number of CRT limbs, i.e., native polynomials in the DoubleCRT structure, gets reduced due to rescaling. For instance, at level ℓ the number of CRT limbs is reduced by L−ℓ as compared to fresh ciphertexts. This provides a speedup in CKKS compared to scaleinvariant schemes, such as BFV. We can further take advantage of the decreasing CRT “ladder” by encrypting plaintexts at the level they are first used and by compressing evaluation keys as the computation progresses. This reduces storage requirements. We also minimize the number of CRT limbs by finding the minimum number of limbs needed for correct result (starting from the end of the computation circuit). Below we provide some examples of how these techniques are applied in our solution.
Encrypt ciphertexts at the level first used
As the SNPs matrix S is first used in step 9 of Algorithm 3 (after 10 levels of computation), we encrypt it using 7 CRT limbs rather than 17 corresponding to the initial ciphertext modulus. This reduces the storage requirements for the SNPs matrix by a factor of 2.4x.
Compress evaluation keys as needed
Same rotation keys are used multiple times throughout the computation. Whenever they are no longer required below a certain level, we compress them to the current level, thus reducing the number of CRT limbs. Note that the rotation keys consume most of the space utilized by public keys in our solution.
Use the lowest number of CRT limbs for ciphertexts
Once the lowest multiplicative depth for the circuit is determined, we choose the actual level for ciphertexts by counting from the end of the circuit (not from the beginning) up to the specific computation. This minimizes the number of CRT limbs used, thus reducing both runtime and storage requirements.
Consider the example of S. If we were to count the level from the beginning of the circuit, we would choose level 8 (to match the level of B_{1}). But we choose 10 instead because the maximum depth of computations from S in step 9 to the end of the circuit is 6. This gives more than 1.5x runtime improvement for the rotations in the conversion from W to W_{1}, which is done immediately before computing W_{1}S. The storage requirement for S is also reduced by roughly a factor of 1.3x.
Matrix inversion
As pointed out earlier, we use Cramer’s rule to compute the matrix inverse of H. The numerator is the adjoint of H while the denominator is the determinant of H. To extract specific components of H, we use cheap rotations (hoisted automorphisms) followed by bit mask multiplications to clear out the values that are not used. As both numerator and denominator contain a lot of common products of the rotations for H, we wrote both of them down in the closed form and compute common products only once. The closed form for the determinant also allows the direct application of binary tree multiplication (3inseries products require a binary depth of 2). The depth cost of these steps is 3 (2 for homomorphic multiplications and 1 for bit mask multiplication).
When computing the determinant and k^{2} components in the adjoint, all homomorphic multiplications are performed without relinearization, and the relinearization is applied at the very end (for each component) after all additions and subtractions are done. This significantly reduces the number of expensive key switching operations when computing the matrix adjoint and determinant.
The procedure for computing the adjoint and determinat also prepares the packedmatrix variant of B for computing ζ^{∗} in step 8 and the packedinteger variant B, i.e., B_{1}, for computing S^{∗} in step 9 by performing appropriate rotations and additions. The final rescaling for the components in the adjoint and determinant is done after all rotations are computed. Otherwise the noise growth in rotations would lead to incorrect results after decryption.
Order of products in matrix chain multiplication
The order of matrix products in matrix chain multiplications has a major effect on the performance of our solution. The two most complex and costly chained matrix products in Algorithm 3 are step 8 (computation of ζ^{∗}) and step 9 (computation of S^{∗}). Typically the matrix chain multiplication problem is an optimization problem that can be solved using dynamic programming. In the case of regular plaintext computations, the goal is usually to minimize the number of element multiplications. In the encrypted solution, additional constraints are introduced, and these constraints can be different depending on the plaintext encoding used, as illustrated below.
In step 8, we work with a chain of single ciphertexts (packed matrix encoding). The constraints for this case can be summarized as follows:

Make sure the outcome of each intermediate product is a single ciphertext. For instance, we cannot have a product where outer dimensions are both N.

The costs of SumRowVec and SumColVec are different. The latter requires a bit mask multiplication, and the number of rotations corresponds either to row or column size. The possible constraints are to minimize the number of rotations and/or minimize the depth of bit mask multiplications.

Minimize the depth of the overall circuit. In other words, the term at highest level should be given special attention. The binary tree multiplication technique should also be properly applied.
In step 9, we work with products of many packedinteger ciphertexts and N SIMDpacked ciphertexts (for each row of matrix S). The guidelines for optimization in this case can be summarized as follows:

Minimize the total number of SIMD multiplications.

Minimize the depth of the overall circuit. In other words, the term at highest level should be given special attention. The binary tree multiplication technique should also be properly applied.
In our solution, the decisions regarding the order of matrix chain multiplication were done by hand. But in a more general case, where the computation circuit is built automatically, one would have to include algorithms for finding the optimal order by solving the appropriate dynamic optimization problem.
Loop parallelization
To benefit from multicore CPU environments, our solution applies loop parallelization at various levels.
At the encryption stage, the parallelization is done for the loop iterating over all individuals (size N, which is at least 245). This implies the encryption runtime should decrease almost linearly with the number of physical cores.
In the computation stage, the following loop parallelizations are applied:

All matrix products in \(\mathbf {X}_{1} \left (\mathbf {B}_{1} \left (\mathbf {X}_{1}^{\top } \, \left (\mathbf {W}_{1} \mathbf {S}\right)\right)\right)\) at step 9 of Algorithm 3 are parallelized over inner dimensions (N or k, depending on the product).

All SIMD products in steps 10 and 11 of Algorithm 3 are parallelized over N.

In matrix inversion, the extraction of k^{2} components of H is parallelized over k^{2}.

In the homomorphic encoding conversion routine of Method 1, the parallelization is applied to the main loop over N.

Loop parallelization is also applied in many places at the level of CKKS and lowerlever ring operations. In the case of NTTs for polynomials in DoubleCRT representation, the parallelization is done over ℓ. In the case of RNS subroutines, the parallelization is applied at the level of polynomial coefficients (dimension n).
Results
Dataset
Our experiments were performed using the training dataset provided by the iDASH 2018 organizers. The training data were extracted from the Personal Genome Project (https://www.personalgenomes.org/us). The dataset includes 245 individuals, 10,643 SNPs, and 3 covariates. We also generated larger datasets for scalability analysis by resampling the original dataset.
Software implementation
We implemented our solution in PALISADE v1.2 [16]. We added our own implementation for the RNS variant of the CKKS scheme to PALISADE. For loop parallelization, we used OpenMP.
Parameter selection
The parameters used are summarized below. According to [11], our parameters correspond to at least 128 bits of security for classical computers.

The size of ciphertext modulus Q_{L} for fresh ciphertexts is 850 bits.

The ring dimension n is 2^{15}=32,768.

The number of CRT limbs in the fresh ciphertext modulus is 17 (L=17), which corresponds to 16 levels in the computation circuit. Each CRT modulus is 50 bits long.

Number of bits p in the plaintext scaling factor of CKKS scheme is 50. For this value of p, the approximation error introduced by each rescaling typically affected up to 25 least significant bits of the encrypted data.

The key switching window matches the size of CRT moduli, i.e., 50 bits.

We use the ternary secret key distribution, i.e., random integers between 1 and 1, as commonly done for BFV.

The error distribution parameter σ is 3.19.
Performance results
Storage requirements
The maximum (initial) storage requirements for the case of N=245; M=10,643; K=3 are summarized in Table 1. The storage requirements take into account that S and X_{1} are first used at ℓ=7 and ℓ=6, respectively. The rotation key size is computed as a sum of space requirements for 16 keys at ℓ=17, 13 at ℓ=12, and 12 at ℓ=9. The relinearization keys are used from the start of the computation (ℓ=17). The sizes of public and secret keys are relatively small: 4.7 and 8.5 MB, respectively.
The encryption storage requirements in practical settings can be reduced by converting homomorphically the encrypted packedmatrix ciphertext X to N packedinteger ciphertexts, i.e., X_{1}, on demand. This can be done as an offline operation, resulting in an approximately 4x reduction in fresh ciphertext size.
Execution time and peak memory utilization
Table 2 reports the runtimes and peak RAM utilization observed for the official iDASH evaluation environment and a 28core server node. The results suggest that it takes about 3.5 minutes and about 10 GB of RAM (all ciphertexts and keys are stored in memory) to evaluate homomorphically the GWAS procedure for 245 individuals, 14,841 SNPs, and 3 covariates on a 4core Amazon instance. The runtime and storage requirements for the case of 1,000 individuals, 131,071 SNPs, and 3 covariates for a modern server computing node (2 x 14 cores) are about 10 minutes and 116 GB, respectively.
Accuracy analysis
We compared the accuracy of the pvalues computed using our HE prototype with a plaintext reference implementation of the semiparallel method proposed by Sikorska et al. [3]. The results for the case of N=245 and M=10,643 are summarized in Fig. 1. The graphs visualize the confusion table when choosing 0.01 as a threshold to classify SNPs as significant or not (depicted as the red lines). It is a loglog plot of the pvalues obtained by the two different approaches. The vertical axes correspond to the semiparallel logistic regression and horizontal axes to the pvalues obtained by the HE computation. The diagonal blue line depicts the case when the two classifiers provide exactly the same pvalue for each input data.
Each quadrant corresponds to one of possible outcomes: true positive (both classify a SNP as significant), false positive (the semiparallel model as not significant and the HE computation as significant), true negative (both classify a SNP as significant) and false negative (the semiparallel model as significant and the HE computation as not significant). The graph shows the true positive rate (TPR), false positive rate (FPR), true negative rate (TPR) and false negative rate (FNR). We use F1 score as a single index to summarize the performance. The graph suggests that the error introduced by our approximation is negligibly small (F1 score of 0.991).
Analysis of our approximations
As described in “Our approximations” section, there are 3 computemodel parameters that affect the approximation error: the highest degree of the Chebyshev polynomials used to approximate the logitfunction, d_{l}; the degree the Taylor expansion of ζ,d_{z}; and the number of iterations, t, for the gradient descent procedure. Clearly, there is a tradeoff between the accuracy of the approximation and the depth of the computation circuit, which determines the computational complexity.
In order to avoid overfitting, we also used other data sets from the Harvard Personal Genome Project [17]. We ran experiments for different conditions reported in the PGP Participant Survey and found that the approximation of ζ has a significant impact on the quality of the results, and is highly sensitive to the choice of cases and disease populations. Therefore, we selected a relatively high degree for the Taylor expansion, d_{z}=8, that provides adequate accuracy for unbalanced populations of up to 10%/90%. Note that the data used for the iDASH competition was relatively balanced.
We found that increasing the number of iterations, t, and the degree d_{l} of the Chebyshev polynomials used to approximate the logitfunction has a relatively minor effect on the accuracy of our solution. As an example, Table 3 shows that the F_{1} score for the pvalue with threshold 0.01 does not significantly change with increase in d_{l}, while the expected computational cost of using a higher depth would be substantial.
Profiling
Table 4 reports the breakdown of runtimes for three different cases. The results for N=245,M=10,643 suggest that the conversion of vectors from the packedmatrix to packedinteger encoding is the bottleneck for the singlethreaded case. However, the conversion procedure parallelizes better (improving by a factor of 15.3x on a 28core machine) than most of the other operations, effectively reducing its contribution from 77% in the singlethreaded experiment to 38% for the 28threaded experiment. The experiments for larger numbers of SNPs imply that the contribution of the conversion procedure further declines as its computational complexity does not depend on M.
As the maximum size of individuals did not exceed 1,024 in our experiments, all operations in Steps 1–8 of Algorithm 3 worked with single ciphertexts, and the runtime of these steps stayed approximately the same for all experiments. At the same time, the contribution of the matrix products involving S (steps 9 through 11) significantly increased (from 15% for N=245,M=10,643 to 68% for N=1,000, M = 131,071).
Discussion
The solution presented in this work was awarded first place (along with another solution from UCSD) in the iDASH’18 competition (Track 2: Secure Parallel Genome Wide Association Studies using Homomorphic Encryption). Hence it represents the state of the art in secure GWAS using homomorphic encryption.
The main limitations of our solution are (1) the need to know the computation and parameters of the semiparallel procedure in advance and (2) the handtuned nature of many optimizations applied to our solution. The first problem can be solved once the bootstrapping for the CKKS scheme becomes more practical. The second challenge can be tackled once automated compilers for homomorphic encryption are developed. Both are open research problems.
Conclusions
The results demonstrate that our solution is able to perform the full GWAS computation homomorphically for 1000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node. Many of the optimizations presented in our paper are generalpurpose and can be applied to solving challenging problems dealing with large datasets in other application domains. The major generalpurpose optimizations include a new RNS variant of the CKKS scheme and multiple methods of homomorphic switching between data encodings.
Availability of data and materials
The iDASH 2018 competition data was only available to registered competition participants.
References
 1
Cheon JH, Han K, Kim A, Kim M, Song Y. A full RNS variant of approximate homomorphic encryption In: Cid C, Jacobson Jr. MJ, editors. Selected Areas in Cryptography – SAC 2018. Cham: Springer: 2019. p. 347–68.
 2
Kim M, Song Y, Li B, Micciancio D. Semiparallel Logistic Regression for GWAS on Encrypted Data. Cryptology ePrint Archive, Report 2019/294. 2019. https://eprint.iacr.org/2019/294.
 3
Sikorska K, Lesaffre E, Groenen PJF, Eilers PHC. Gwas on your notebook: fast semiparallel linear and logistic regression for genomewide association studies. BMC Bioinformatics. 2013; 14(1):166.
 4
Wang X, Tang H, Wang S, Jiang X, Wang W, Bu D, Wang L, Jiang Y, Wang C. iDASH secure genome analysis competition 2017. BMC Med Genet. 2018; 11(4):85.
 5
Han K, Hong S, Cheon JH, Park D. Efficient logistic regression on large encrypted data. IACR Cryptol ePrint Arch. 2018; 2018:662.
 6
Chen H, GiladBachrach R, Han K, Huang Z, Jalali A, Laine K, Lauter K. Logistic regression over encrypted data from fully homomorphic encryption. BMC Med Genet. 2018; 11(4):81.
 7
Kim M, Song Y, Wang S, Xia Y, Jiang X. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Med Inform. 2018; 6(2):19. https://doi.org/10.2196/medinform.8805.
 8
Kim A, Song Y, Kim M, Lee K, Cheon JH. Logistic regression model training based on the approximate homomorphic encryption. BMC Med Genet. 2018; 11:254.
 9
Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in FORTRAN (2Nd Ed.): The Art of Scientific Computing. Book chapters: “Chebyshev Approximation” (§5.8),“Derivatives or Integrals of a ChebyshevApproximated Function” (§5.9), and “Polynomial Approximation from Chebyshev Coefficients” (5.10). New York, NY, USA: Cambridge University Press; 1992.
 10
Cheon JH, Kim A, Kim M, Song Y. Homomorphic encryption for arithmetic of approximate numbers. In: Advances in Cryptology – ASIACRYPT 2017. Cham: Springer: 2017. p. 409–37.
 11
Chase M, Chen H, Ding J, Goldwasser S, et al. Security of homomorphic encryption. Technical report, HomomorphicEncryption.org, Redmond WA. 2017.
 12
Halevi S, Polyakov Y, Shoup V. An improved RNS variant of the BFV homomorphic encryption scheme In: Matsui M, editor. Topics in Cryptology – CTRSA 2019. Cham: Springer: 2019. p. 83–105.
 13
Bajard JC, Eynard J, Hasan MA, Zucca V. A full RNS variant of FV like somewhat homomorphic encryption schemes. In: SAC 2016. Cham: Springer: 2017. p. 423–42.
 14
Gentry C, Halevi S, Smart N. Homomorphic evaluation of the AES circuit. In: “CRYPTO 2012” LNCS vol.7417. Berlin Heidelberg: Springer Berlin Heidelberg: 2012. p. 850–67. http://eprint.iacr.org/2012/099.
 15
Halevi S, Shoup V. Faster homomorphic linear transformations in helib. In: CRYPTO 2018. Cham: Springer: 2018. p. 93–120.
 16
Polyakov Y, Rohloff K, Ryan GW. PALISADE Lattice Cryptography Library. https://git.njit.edu/palisade/PALISADE. Accessed Aug 2018.
 17
The Harvard Personal Genome Project. https://pgp.med.harvard.edu/. Accessed Mar 2019.
Acknowledgements
We gratefully acknowledge the technical assistance with AWS and parallelization by Liron Liptz and Ofer Itzhaki. We also thank the iDASH’18 Organizing Committee for motivating this research study.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 13 Supplement 7, 2020: Proceedings of the 7th iDASH Privacy and Security Workshop 2018. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume13supplement7.
Funding
Research reported in this publication was supported by National Human Genome Research Institute of the National Institutes of Health under award number 1R43HG010123. Publication costs were funded by Duality Technologies, Inc.
Author information
Affiliations
Contributions
All authors contributed equally to this work. All author(s) have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
AG performed his work for the paper as a consultant for Duality Technogies, Inc. All other authors declare they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visithttp://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Blatt, M., Gusev, A., Polyakov, Y. et al. Optimized homomorphic encryption solution for secure genomewide association studies. BMC Med Genomics 13, 83 (2020). https://doi.org/10.1186/s1292002007199
Published:
DOI: https://doi.org/10.1186/s1292002007199
Keywords
 Cryptography
 Homomorphic encryption
 Genomewide association studies