Optimized homomorphic encryption solution for secure genome-wide association studies

Background Genome-Wide Association Studies (GWAS) refer to observational studies of a genome-wide set of genetic variants across many individuals to see if any genetic variants are associated with a certain trait. A typical GWAS analysis of a disease phenotype involves iterative logistic regression of a case/control phenotype on a single-neuclotide polymorphism (SNP) with quantitative covariates. GWAS have been a highly successful approach for identifying genetic-variant associations with many poorly-understood diseases. However, a major limitation of GWAS is the dependence on individual-level genotype/phenotype data and the corresponding privacy concerns. Methods We present a solution for secure GWAS using homomorphic encryption (HE) that keeps all individual data encrypted throughout the association study. Our solution is based on an optimized semi-parallel GWAS compute model, a new Residue-Number-System (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) HE scheme, novel techniques to switch between data encodings, and more than a dozen crypto-engineering optimizations. Results Our prototype can perform the full GWAS computation for 1,000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node (with 28 cores). Our solution for a smaller dataset was awarded co-first place in iDASH’18 Track 2: “Secure Parallel Genome Wide Association Studies using HE”. Conclusions Many of the HE optimizations presented in our paper are general-purpose, and can be used in solving challenging problems with large datasets in other application domains.


Background
Genome-Wide Association Studies (GWAS) refer to observational studies of a genome-wide set of genetic variants across many individuals to see if any genetic variants are associated with a certain trait. When applied to human data, GWAS typically focus on associations between single-nucleotide polymorhisms (SNPs) and a quantitative or dichotomous disease outcome, as well as a Parallel Genome Wide Association Studies using Homomorphic Encryption (HE)" to advance the state of the art in GWAS using HE, which is a non-interactive approach to secure computing. This paper presents our HE-based solution to GWAS. Our solution is based on an optimized GWAS compute model, a new Residue-Number-System (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) HE scheme, novel techniques to switch between data encodings, and more than a dozen crypto-engineering optimizations. The solution can perform the full GWAS computation for 1000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node (with 28 cores).

Related work
Several other RNS variants of the CKKS HE scheme were independently proposed in 2018. These include the work by Cheon et al. [1], the implementation in Microsoft SEAL 3.0 (released in October 2018), and the variants developed by other teams who submitted their GWAS solutions to the iDASH'18 competition, including UCSD [2] and IBM Research.

Semi-parallel approach of Sikorska et al. [3]
Logistic regression is widely used to model binary response data in GWAS. For instance, it can be used to examine the relationship between disease status (control versus real cases) with respect to phenotypes (age, weight, height, etc.) and genotypes (such as SNP variations). Let y i denote the disease status for the i th individual in a sample of size N (y i = 1 if the individual is a disease case, and y i = 0 otherwise), and x i , s i be the corresponding predictor, where x i ∈ R K corresponds to the phenotypes and s i ∈ {0, 1, 2} M to the genotypes of individual i for a set of K phenotypes and M SNPs. The logistic regression model expresses the relationship between y i and the predictor set x i , s i in terms of the conditional probability Pr Y = y i | x i , s i of disease, as: where σ is the logistic function, σ (x) = 1 1+exp (−x) ; θ 0 ∈ R, θ ∈ R K and β ∈ R M are the K + M + 1 parameters to be determined. For the sake of simplicity, we adopt the canonical notation, that is, θ ≡ θ 0 , θ ∈ R K+1 and x i ≡ 1, x i ∈ R K+1 for i = 1, . . . , N.
Assuming that the effect of each SNP is independent of each other, it is possible to formulate it as a set of M independent equations, i.e., decompose the computation into M independent logistic regression cases for K + 1 parameters. Sikorska et al. [3] proposed a "semiparallel" approach to speed up the logistic regression in the above scenario. The goal is to avoid looping over each SNP by using a vectorized formulation, which includes optimized vector and matrix operations, that allows performing multiple identical actions over different data in a single operation.
The method relies on the assumption that the covariant parameters θ are nearly the same for all SNPs. This assumption allows the reformulation of fitting N vectors in R K+1 , followed by a one-step calculation for M SNPs at once. Therefore Sikorska's semi-parallel logistic regression consists of 2 stages: 1 Estimate the coefficients of the clinical covariates, θ ∈ R K+1 ; 2 For each of the M SNPs, estimate the corresponding coefficientsβ and p-value p ∈ R M .
The second stage, the estimation of the SNP-coefficientŝ β, approximates the optimization problem by a single Newton-Raphson iteration, leading tô where X is a matrix in R N×(K+1) whose rows are the vectors x i , i = 1, . . . , N; W ∈ R N×N is a diagonal matrix with ω ii = ρ i (1 − ρ i ) and ρ i = σ x i ·θ (t) for i = 1, . . . , N; Finally, the z-value for each parameter β j , for is the error associated toβ j and C = S W S − XH −1 X WS . A more compact expression of it is where H † denotes the adjoint of H.

Our approximations
To optimize the efficiency of our HE solution, we introduced several approximations to the semi-parallel method of Sikorska et al. [3].

Logistic regression
We found that the gradient descent method is adequate for estimating θ . Starting from an initial θ (0) , the gradient descent method at each iteration t updates the estimation of the regression parameterŝ where α t is the learning rate at the t-th iteration. Our numerical experiments suggest that a single iteration of the gradient descent procedure with α 0 = 0.015 and θ (0) = 0 provides adequate accuracy. For simplicity, we denote α 0 as α in the rest of the paper.

Logistic function approximation
We used Chebyshev polynomials to approximate the logistic function [9]. From the analysis we performed, we found that a degree-1 approximation σ (x) = 0.5 + 0.15625x provides results with sufficient accuracy. Please refer to "Analysis of our approximations" section for further details.

Approximation of ζ
In order to approximate ζ , we considered a Talyor series expansion around p = 1 2 :

Matrix inversion and division
Instead of calculating the inverse of the matrix H, Cramer's rule was used: det(H) , where adj(H) is the adjoint of matrix H and det(H) is its determinant. As the division is an expensive operation, it was deferred to a later stage (after decryption).

p-value calculation
After computing the z-values on the server, the p-value computation is performed on the client as depicted in Algorithm 2.

Full procedure
The approximations described above were used to create an optimized procedure for the server computation (Algorithm 1). Note that line 2 of Algorithm 1 is the closed form for ρ that incorporates the parameter estimation of the logistic regression. Thereforeθ does not appear explicitly in Algorithm 1.
The annotated encrypted procedure is presented in Algorithm 3. It will be referenced throughout the rest of this section.

CKKS scheme
Our solution is based an optimized variant of the Cheon-Kim-Kim-Song scheme [10]. We have developed a Double-Chinese Remainder Theorem (CRT), a.k.a, Residue Number System (RNS), variant of the original scheme. Our variant is based on the same security assumptions as the original scheme, but relies on native 64-bit integer arithmetic instead of multiprecision integer arithmetic for better performance and parallelization.
The original CKKS scheme is formulated for cyclotomic polynomial rings R = Z[ x] / x n + 1 , where n is a ring dimension that is a power of two (CKKS also supports general cyclotomic rings but they are typically less efficient). The current ciphertext modulus is typically defined as Q = 2 , i.e., the scheme works with residue rings R = R/Q R = Z 2 [ x] / x n + 1 . The algorithms are [10]: • SETUP(1 λ ). For an integer L that coresponds to the largest ciphertext modulus level, given the security parameter λ, output the ring dimension n. Set the small distributions χ key , χ err , and χ enc over R for secret, error, and encryption, respectively. • KEYGEN. Sample a secret s ← χ key , a random a → R L , and error e ← χ err . Set the secret key sk ← (1, s) and public key pk ← (b, a) ∈ R 2 L , where b ← −as + e (modQ L ).
The CKKS scheme supports an efficient packing of r (up to n/2) real numbers into a single ciphertext. The encoding and decoding operations are defined as follows: • ENCODE (w, p). For w ∈ R r , output the polynomial m ← φ(2 p · w) ∈ R. • DECODE (m, p). For a plaintext m ∈ R, output the polynomial w ← φ −1 (m/2 p ) ∈ R r .
Here, φ(x) is a certain complex canonical embedding map, which is similar conceptually to inverse Fourier transform.

Our RNS variant of the CKKS scheme
Our CKKS variant performs all operations in RNS. In other words, the power-of-two modulus Q = 2 is replaced with i=1 q i , where q i are same-size prime moduli satisfying q i ≡ 1 mod 2n (for efficient number theoretic transforms (NTT) that convert native-integer polynomials w.r.t. each CRT modulus from coefficient representation to the evaluation one, and vice versa). The primes are chosen to be as close to 2 p as possible to minimize the error introduced by rescaling.
The two major changes in our variant compared to the original CKKS scheme deal with rescaling and key switching. We also made two other minor changes. First, we use the ternary random discrete distribution for χ key and χ enc instead of the sparse distributions as the lattice attacks for this case are better studied, and the ternary distribution is included in the HE standard [11]. Second, we do additional scaling of plaintexts and ciphertexts to support the use of RNS (only native integer arithmetic) during encoding/decoding.

Rescaling in RNS
To efficiently perform rescaling in RNS from Q to Q −1 , we replace the scaling down by 2 p with scaling down by q . We choose all where is kept as small as possible. To minimize the cumulative approximation error growth in deeper computations, we also alternate q i w.r.t. 2 p . For instance, if q 1 < 2 p , then q 2 > 2 p and q 3 < 2 p , etc.
The new rescaling operation to scale down by one level is defined as We derive the procedure for computing q −1 · ct (modQ −1 ) using the CRT scaling technique proposed in [12]. Consider the following CRT representation of a multiprecision integer x ∈ Z Q : where Then we can write After rounding and applying the modulo reduction, the last term is removed yielding The first term can be directly computed in RNS by summing up the products of x i and q −1 (modq i ). For the second term, we precompute the residues of q q * q and multiply them by the corresponding residues of x during rescaling. Then we add the fractional part, which has the residue of x /q , i.e., 0 or 1, for each CRT modulus q i . Note that the fractional part is negligibly small and hence can be excluded from the implementation.
The computational complexity of rescaling is determined by the computation in the second term of (2). We first need to run one native inverse NTT for residues w.r.t. q and then − 1 native NTTs to go back to the evaluation representation. All the computations in the first term of (2) are done directly in evaluation representation. Therefore, each rescaling operation requires native-integer NTTs.
The maximum approximation error introduced by This procedure can be easily generalized to support scaling down by multiple CRT moduli. This case is similar to the first stage of complex scaling in CRT representation described in Section 2.4 of [12].

Key switching
For key switching, we use the CRT decomposition key switching algorithm that was originally proposed in [13] and improved in [12] for the Brakerski/Fan-Vercauteren (BFV) scheme. The advantages of this technique vs. the one used in the original CKKS scheme (initially proposed for the Brakerski-Gentry-Vaikuntanathan scheme in [14]) are that this technique has lower computational complexity for relatively small numbers of levels (up to 8 or so), and does not require an approximately two-fold increase in the ring dimension to support the appropriate lattice security level. Both of these benefits were important for our solution.
The operations of the CKKS scheme that are modified by the key switching procedure are rewritten as: • KSGENRNS sk (s ). For s ∈ R, sample a random a i ← R L and error e i ← χ err . Output the switching key as swk where a (κ) Each key-switching operation requires one inverse NTT ( native-integer NTTs) to switch d 2 (or a (κ) for rotation) from evaluation to coefficient representation and then NTTs ( 2 − native-integer NTTs) to go back to evaluation representation for each CRT component. Hence, the total complexity in terms of native-integer NTTs is 2 .
This key switching procedure also supports a second level of decomposition by extracting base-w digits in each residue using the procedure described in Appendix B.1 of [13].

Noise estimates
We present here heuristic noise estimates for the RNS variant of CKKS using the canonical embedding norm, which corresponds to the infinity norm for the evaluation of a polynomial R at 2n complex roots of unity. For more details on the canonical embedding mapping and norm, the reader is referred to [10]. The main differences between our expressions and those in [10] are due to the use of ternary uniform distribution and a different key switching technique.
• Encoding and Encryption. The bound for fresh where σ is the standard deviation for error distribution. The decoding is correct as long as 2 p > n + 2B clean . • Addition. The bound for homomorphic addition · nσ w log w q . • Multiplication. If we have two ciphertexts ct 1 and In most cases, the parameter selection is determined by the multiplicative depth and the approximation error in rescaling. The approximation error (with about bits being "erased" by rescaling) dominates the noise growth of other operations and should be done last (after a multiplication). The only practical exception is when rotations are performed before any multiplications. In this case, the key switching noise may be high if the w-base is large, e.g., comparable to 2 p as in the case of CRT decomposition without further digit decomposition of each residue.

Comparison to the RNS variant by Cheon et al. [1]
Both our RNS variant of CKKS and the variant proposed by Cheon et al. [1] work with an RNS basis consisting of native-integer primes q i that are close to 2 p (with bits of precision). In other words, scaling down by 2 p is replaced with approximate scaling down by q . Hence the rescaling approach in both variants is similar. The techniques for the scaling operation itself are different, but the computational complexity of both scaling techniques appears to be the same (requiring native-integer NTTs).
The key switching-procedure developed in [1] is based on the approach originally proposed for the Brakerski-Gentry-Vaikuntanathan scheme [14], which requires doubling the ciphertext modulus (and roughly doubling the ring dimension). We use the residue/digit decomposition approach originally proposed in [13] and improved in [12].
Our key-switching technique requires more NTTs but provides better overall performance for relatively "shallow" circuits (our estimates suggest this approach should be faster up to 8 levels or so).

Plaintext encoding
Our solution uses two kinds of plaintext encoding. Initially, X and y are packed in single ciphertexts similar to how it was done in [8]. We denote this as packedmatrix encoding. All matrix products in steps 2 through 8 of Algorithm 3 use the rotation-based SUMROWVEC and SUMCOLVEC procedures from [5]. Later in the algorithm (starting from step 9), the solution switches to singleinteger ciphertexts for X and the vectors and matrices derived from X and y. We call the latter encoding as packed-integer encoding. As a result of this, our matrix operations with the SNPs data (first appearing in step 9) involve only cheap SIMD multiplications and additions of packed-integer and packed-row-vector ciphertexts, and do not involve any expensive rotations. All operations before computing on the SNPs data are performed using packed-matrix (single) ciphertexts.

Packed-matrix encoding
The packed-matrix encoding packs a full matrix or vector into a single ciphertext, cloning as many entries as needed to support matrix-matrix and matrix-vector products. The cloning makes it possible to minimize the number of computationally expensive rotations in matrix-matrix (vector) products.
We encode/encrypt both X and X to avoid calling transposition in the encrypted domain. We pack X ∈ R N×k in a row-wise order, cloning each row k − 1 times before going to the next row. Here, we introduce k = K +1 for brevity.
We pack X ∈ R k×N by taking each element of matrix X (marshalling it in the row-wise order) and cloning it to form a complete row.
Both matrices require N · k 2 slots. We pack y ∈ R N column-wise by cloning y k 2 − 1 times to the right. That is, we have The resulting vector ρ is represented the same way as y. Both use N · k 2 slots.
The diagonal matrix W is represented as a vector by extracting the diagonal, and the resulting vector is packed in the same format as ρ.
The SNPs matrix S is encoded either as an array of ciphertexts (when M > n/2) or a single ciphertext (when M ≤ n/2) without any cloning, i.e., the classical SIMD packing of vectors is used.
Matrices and vectors, such as X and y, can be encoded in a single ciphertext as long as N · k 2 ≤ n/2. If this condition does not hold, the packing can be trivially extended to multiple ciphertexts per matrix/vector.

Packed-integer encoding
To support efficient matrix multiplication without rotations, we also encode X as N · k single-integer ciphertexts. In this case, each entry of X is cloned to all slots of a single ciphertext. We denote such packing of X as X 1 .

Conversion from packed-matrix to packed-integer encoding
The main bottleneck of our solution is the conversion of vectors from a packed-matrix ciphertext to multiple packed-integer ciphertexts. We have developed and implemented three different methods for performing this conversion. Based on the requirements for performance and scalability, we chose one of these methods for our prototype.

Algorithm 3
Annotated HE Computation (all scalars, vectors, and matrices are encrypted except for α and constants) ENCRYPTED INPUTS: X, X , X 1 , y, S ENCRYPTED OUTPUTS: z 2 den , z num 1: α ← 0.015 plaintext constant 2: ρ ← 0.15625α · X X (y − 0.5) + 0.5 ∈ R N adds 3 levels (taking into account the summation depth increase); we use the packed X here instead of X ; D=3.
SIMD squaring in computing S * S * ; adds 2 levels; D=16. 11: z num ← (Wζ * ) 1 S * ∈ R 1×m first product is SIMD multiplication; we use the index 1 here to denote the conversion of the packed-matrix ciphertext into N packed-integer ciphertexts; D=16.
NOTE: HM is homomorphic multiplication; D is current depth; subscript 1 denotes packed-integer encoding.
To illustrate the problem and its solutions, we consider the task of converting the packed-matrix single-ciphertext encryption of y into N packed-integer ciphertexts. A similar task has to be executed twice in our algorithm for secure GWAS.

Method 1: N log n rotations
Our first solution can be summarized as follows: 1 Fill all n/2 slots of y by cloning existing N · k 2 slots.
This requires log n/ 2N · k 2 rotations and additions. The cloning procedure is described in [8].
Here,N = 2 log N . 2 Run N bit mask multiplications to form N ciphertexts each containing n/ 2N cloned values for each component of y. All other slots are zeroed out. 3 Clone existing n/ 2N non-zero values to all slots in each of the N ciphertexts. This operation requires N log N rotations and additions, and is the main bottleneck of the computation.

Method 2:N rotations and log N depth increase
The idea of our second solution is to represent the conversion as a binary tree. At each level i of the tree we perform i rotations, 4 · i bit mask multiplications, and 2 · i additions, getting two output ciphertexts from each input ciphertext. Although this recursive method requires onlyN rotations, 4N bit mask multiplications, and 2N additions, there is a log N depth increase due to bit mask multiplications at each level of the binary tree.
To illustrate this approach, consider a simpler case (the logic would stay the same when we clone y i any number of times): y 1 y 2 y 3 · · · y N−2 y N−1 y N .
We recursively execute this procedure until the end.

Method 3:N 2 bit mask multiplications andN rotations
Another approach achieving N rotations can be summarized as follows: 1 Fill all n/2 slots of y by cloning existing N · k 2 slots. 2 ComputeN − 1 cheap rotations of the original ciphertext using the hoisting procedure from [15]. Although this procedure requires only roughlyN cheap rotations, it involvesN 2 bit mask multiplications and additions, which now become the main bottleneck for relatively large values of N.

Comparison of the methods
We implemented all three methods, and carried out both complexity and practical performance comparison.
As N is relatively large (at least 245),N 2 bit mask multiplications in Method 3 resulted in computation runtimes that are at least 2x-3x larger than Method 1 with N log N rotations. However, Method 3 would be faster for smaller N, e.g., less than 100.
Method 2 is a good option only when the depth increase can be incorporated in the existing circuit without increasing the overall circuit depth. But the scalability of this approach is questionable. The depth increase of log N = 8 could not be integrated in the circuit of our solution, and thus we chose Method 1 for our implementation.
Note that in our implementation the depth cost of bit mask multiplication is the same as for homomorphic multiplication, which implies there is room for improvement. Therefore, a more depth-efficient bit mask multiplication procedure may result in a significantly better performance for Method 2, possibly superior to that of Method 1.

Minimizing the number of key switching operations
One of the optimization goals for our solution is to reduce the number of key switching operations, which are used both for rotation and relinearization (after homomorphic multiplication). Each such operation has a high computational complexity, i.e., requires 2 native-integer NTTs. We have optimized our algorithm to minimize the number of key switching operations. For instance, all computations involving encrypted SNPs data require only 16 (k 2 ) key switching operations in total. A great majority of the computations involving encrypted SNPs data use only "cheap" SIMD multiplications and additions, and sparingly rescaling operations.

Multiplications with lazy or no relinearization
In steps 9 through 11 of Algorithm 3, our procedure calls only 16 (k 2 ) relinearizations. In other words, all large-dimension SIMD products are performed without relinearization (the ciphertext size is allowed to grow). The procedure calls the relinearization procedure only when multiplying by B 1 in step 9, which works with the smallest dimension (k) in the chained matrix product. We refer to this deferred relinearization as "lazy" relinearization. Any homomorphic multiplications after this product are performed without a single relinearization, which significantly reduces the runtime of computation.

Use of additions instead of rotations
The packed-integer encoding is introduced in steps 9 through 11 of Algorithm 3 to replace any rotation-based summations over rows/columns with SIMD homomorphic additions. The only places where the rotations are used are to homomorphically convert B, W, and (Wζ * ) from packed-matrix encoding to the packed-integer one. The use of rotation-based summation in the chained product of step 9 would require a substantially larger number of rotations as compared to the conversion of two vectors of size N and one matrix of size k × k.

Minimizing the number of NTTs
Besides key switching, NTTs are used for rescaling. In some cases, expensive rotations can be replaced with hoisted automorphisms from [15], reducing the number of NTTs for multiple rotations of the same ciphertext to the NTT cost of a single rotation. Our solution minimizes the number of rescaling operations and uses hoisted automorphisms where applicable.

Use rescaling sparingly
We use the following techniques to minimize the number of rescaling operations: • When there are homomorphic multiplications followed by aggregation of ciphertexts, such as addition of multiple ciphertexts, we apply rescaling after the aggregation, i.e., we call it once rather than for every homomorphic multiplication. • If there is a benefit in lazy rescaling, e.g., when the number of ciphertexts at the following level is much smaller, we defer rescaling until later. In this case, we have to make sure the depth requirement is not increased, which is true when one of the multiplicands is scaled w.r.t. 2 p rather a power of it. • The rescaling operations are not called at the end of computation if skipping them does not increase the multiplicative depth of the circuit.

Hoisted automorphisms
Hoisted automorphisms are useful when multiple rotations of the same ciphertext need to be computed [15]. Our solution encounters this scenario when computing the matrix inversion of H in steps 6 and 7 of Algorithm 3, and hence the hoisted automorphisms are used there in favor of regular rotations.

Minimizing the noise growth and ciphertext modulus
We minimized the noise growth/ciphertext modulus of the computation circuit using the following techniques: • Binary tree multiplication was employed for any chained products of ciphertexts. • Closed-form expressions (such as in step 2 of Algorithm 3) were derived to get the maximum benefit from binary tree multiplication.
• Binary tree addition for any summation of a large number of ciphertexts was employed to achieve a O(log N) noise growth. • To guarantee that the end result of the computation requires only one native-integer polynomials, we multiplied both numerator and denominator by estimated scaling factors (different from 2 p ). These factors were introduced during bit mask multiplications to avoid any extra depth increase due to this additional scaling. • The maintenance operations of HE, such as key switching and rescaling, were properly ordered to minimize the noise growth. For instance, rescaling was done after the rotations following a multiplication (not before).

Harnessing the CRT ladder
As the circuit evaluation progresses, the number of CRT limbs, i.e., native polynomials in the Double-CRT structure, gets reduced due to rescaling. For instance, at level the number of CRT limbs is reduced by L − as compared to fresh ciphertexts. This provides a speedup in CKKS compared to scale-invariant schemes, such as BFV. We can further take advantage of the decreasing CRT "ladder" by encrypting plaintexts at the level they are first used and by compressing evaluation keys as the computation progresses. This reduces storage requirements. We also minimize the number of CRT limbs by finding the minimum number of limbs needed for correct result (starting from the end of the computation circuit). Below we provide some examples of how these techniques are applied in our solution.

Encrypt ciphertexts at the level first used
As the SNPs matrix S is first used in step 9 of Algorithm 3 (after 10 levels of computation), we encrypt it using 7 CRT limbs rather than 17 corresponding to the initial ciphertext modulus. This reduces the storage requirements for the SNPs matrix by a factor of 2.4x.

Compress evaluation keys as needed
Same rotation keys are used multiple times throughout the computation. Whenever they are no longer required below a certain level, we compress them to the current level, thus reducing the number of CRT limbs. Note that the rotation keys consume most of the space utilized by public keys in our solution.

Use the lowest number of CRT limbs for ciphertexts
Once the lowest multiplicative depth for the circuit is determined, we choose the actual level for ciphertexts by counting from the end of the circuit (not from the beginning) up to the specific computation. This minimizes the number of CRT limbs used, thus reducing both runtime and storage requirements.
Consider the example of S. If we were to count the level from the beginning of the circuit, we would choose level 8 (to match the level of B 1 ). But we choose 10 instead because the maximum depth of computations from S in step 9 to the end of the circuit is 6. This gives more than 1.5x runtime improvement for the rotations in the conversion from W to W 1 , which is done immediately before computing W 1 S. The storage requirement for S is also reduced by roughly a factor of 1.3x.

Matrix inversion
As pointed out earlier, we use Cramer's rule to compute the matrix inverse of H. The numerator is the adjoint of H while the denominator is the determinant of H. To extract specific components of H, we use cheap rotations (hoisted automorphisms) followed by bit mask multiplications to clear out the values that are not used. As both numerator and denominator contain a lot of common products of the rotations for H, we wrote both of them down in the closed form and compute common products only once. The closed form for the determinant also allows the direct application of binary tree multiplication (3-in-series products require a binary depth of 2). The depth cost of these steps is 3 (2 for homomorphic multiplications and 1 for bit mask multiplication).
When computing the determinant and k 2 components in the adjoint, all homomorphic multiplications are performed without relinearization, and the relinearization is applied at the very end (for each component) after all additions and subtractions are done. This significantly reduces the number of expensive key switching operations when computing the matrix adjoint and determinant.
The procedure for computing the adjoint and determinat also prepares the packed-matrix variant of B for computing ζ * in step 8 and the packed-integer variant B, i.e., B 1 , for computing S * in step 9 by performing appropriate rotations and additions. The final rescaling for the components in the adjoint and determinant is done after all rotations are computed. Otherwise the noise growth in rotations would lead to incorrect results after decryption.

Order of products in matrix chain multiplication
The order of matrix products in matrix chain multiplications has a major effect on the performance of our solution. The two most complex and costly chained matrix products in Algorithm 3 are step 8 (computation of ζ * ) and step 9 (computation of S * ). Typically the matrix chain multiplication problem is an optimization problem that can be solved using dynamic programming. In the case of regular plaintext computations, the goal is usually to minimize the number of element multiplications. In the encrypted solution, additional constraints are introduced, and these constraints can be different depending on the plaintext encoding used, as illustrated below.
In step 8, we work with a chain of single ciphertexts (packed matrix encoding). The constraints for this case can be summarized as follows: • Make sure the outcome of each intermediate product is a single ciphertext. For instance, we cannot have a product where outer dimensions are both N. • The costs of SUMROWVEC and SUMCOLVEC are different. The latter requires a bit mask multiplication, and the number of rotations corresponds either to row or column size. The possible constraints are to minimize the number of rotations and/or minimize the depth of bit mask multiplications. • Minimize the depth of the overall circuit. In other words, the term at highest level should be given special attention. The binary tree multiplication technique should also be properly applied.
In step 9, we work with products of many packedinteger ciphertexts and N SIMD-packed ciphertexts (for each row of matrix S). The guidelines for optimization in this case can be summarized as follows: • Minimize the total number of SIMD multiplications.
• Minimize the depth of the overall circuit. In other words, the term at highest level should be given special attention. The binary tree multiplication technique should also be properly applied.
In our solution, the decisions regarding the order of matrix chain multiplication were done by hand. But in a more general case, where the computation circuit is built automatically, one would have to include algorithms for finding the optimal order by solving the appropriate dynamic optimization problem.

Loop parallelization
To benefit from multi-core CPU environments, our solution applies loop parallelization at various levels.
At the encryption stage, the parallelization is done for the loop iterating over all individuals (size N, which is at least 245). This implies the encryption runtime should decrease almost linearly with the number of physical cores.
In the computation stage, the following loop parallelizations are applied: • All matrix products in X 1 B 1 X 1 (W 1 S) at step 9 of Algorithm 3 are parallelized over inner dimensions (N or k, depending on the product). • All SIMD products in steps 10 and 11 of Algorithm 3 are parallelized over N.
• In matrix inversion, the extraction of k 2 components of H is parallelized over k 2 . • In the homomorphic encoding conversion routine of Method 1, the parallelization is applied to the main loop over N. • Loop parallelization is also applied in many places at the level of CKKS and lower-lever ring operations. In the case of NTTs for polynomials in Double-CRT representation, the parallelization is done over . In the case of RNS subroutines, the parallelization is applied at the level of polynomial coefficients (dimension n).

Dataset
Our experiments were performed using the training dataset provided by the iDASH 2018 organizers. The training data were extracted from the Personal Genome Project (https://www.personalgenomes.org/us). The dataset includes 245 individuals, 10,643 SNPs, and 3 covariates. We also generated larger datasets for scalability analysis by re-sampling the original dataset.

Software implementation
We implemented our solution in PALISADE v1.2 [16]. We added our own implementation for the RNS variant of the CKKS scheme to PALISADE. For loop parallelization, we used OpenMP.

Parameter selection
The parameters used are summarized below. According to [11], our parameters correspond to at least 128 bits of security for classical computers.
• The size of ciphertext modulus Q L for fresh ciphertexts is 850 bits. • The ring dimension n is 2 15 = 32, 768.
• The number of CRT limbs in the fresh ciphertext modulus is 17 (L = 17), which corresponds to 16 levels in the computation circuit. Each CRT modulus is 50 bits long. • Number of bits p in the plaintext scaling factor of CKKS scheme is 50. For this value of p, the approximation error introduced by each rescaling typically affected up to 25 least significant bits of the encrypted data. • The key switching window matches the size of CRT moduli, i.e., 50 bits. • We use the ternary secret key distribution, i.e., random integers between -1 and 1, as commonly done for BFV. • The error distribution parameter σ is 3.19.

Storage requirements
The maximum (initial) storage requirements for the case of N = 245; M = 10, 643; K = 3 are summarized in Table 1. The storage requirements take into account that S and X 1 are first used at = 7 and = 6, respectively. The rotation key size is computed as a sum of space requirements for 16 keys at = 17, 13 at = 12, and 12 at = 9. The relinearization keys are used from the start of the computation ( = 17). The sizes of public and secret keys are relatively small: 4.7 and 8.5 MB, respectively.
The encryption storage requirements in practical settings can be reduced by converting homomorphically the encrypted packed-matrix ciphertext X to N packedinteger ciphertexts, i.e., X 1 , on demand. This can be done as an offline operation, resulting in an approximately 4x reduction in fresh ciphertext size. Table 2 reports the runtimes and peak RAM utilization observed for the official iDASH evaluation environment and a 28-core server node. The results suggest that it takes about 3.5 minutes and about 10 GB of RAM (all ciphertexts and keys are stored in memory) to evaluate homomorphically the GWAS procedure for 245 individuals, 14,841 SNPs, and 3 covariates on a 4-core Amazon instance. The runtime and storage requirements for the case of 1,000 individuals, 131,071 SNPs, and 3 covariates for a modern server computing node (2 x 14 cores) are about 10 minutes and 116 GB, respectively.

Accuracy analysis
We compared the accuracy of the p-values computed using our HE prototype with a plaintext reference implementation of the semi-parallel method proposed by Sikorska et al. [3]. The results for the case of N = 245 and M = 10, 643 are summarized in Fig. 1. The graphs visualize the confusion table when choosing 0.01 as a threshold to classify SNPs as significant or not (depicted as the red lines). It is a log-log plot of the p-values obtained by the two different approaches. The vertical axes correspond to the semi-parallel logistic regression and horizontal axes to the p-values obtained by the HE computation. The diagonal blue line depicts the case when the two classifiers provide exactly the same p-value for each input data.
Each quadrant corresponds to one of possible outcomes: true positive (both classify a SNP as significant), false positive (the semi-parallel model as not significant and the HE computation as significant), true negative (both classify a SNP as significant) and false negative (the semi-parallel model as significant and the HE computation as not significant). The graph shows the true positive rate (TPR), false positive rate (FPR), true negative rate (TPR) and false negative rate (FNR). We use F1 score as a single index to summarize the performance. The graph suggests that the error introduced by our approximation is negligibly small (F1 score of 0.991).

Analysis of our approximations
As described in "Our approximations" section, there are 3 compute-model parameters that affect the approximation error: the highest degree of the Chebyshev polynomials used to approximate the logit-function, d l ; the degree the Taylor expansion of ζ , d z ; and the number of iterations, t, for the gradient descent procedure. Clearly, there is a trade-off between the accuracy of the approximation and the depth of the computation circuit, which determines the computational complexity.
In order to avoid over-fitting, we also used other data sets from the Harvard Personal Genome Project [17]. We ran experiments for different conditions reported in the PGP Participant Survey and found that the approximation of ζ has a significant impact on the quality of the results, and is highly sensitive to the choice of cases and disease populations. Therefore, we selected a relatively high degree for the Taylor expansion, d z = 8, that provides adequate accuracy for unbalanced populations of up to 10%/90%. Note that the data used for the iDASH competition was relatively balanced.  . 1 Accuracy of our encrypted computing prototype w.r.t the plaintext reference implementation [3] We found that increasing the number of iterations, t, and the degree d l of the Chebyshev polynomials used to approximate the logit-function has a relatively minor effect on the accuracy of our solution. As an example, Table 3 shows that the F 1 score for the p-value with threshold 0.01 does not significantly change with increase in d l , while the expected computational cost of using a higher depth would be substantial. Table 4 reports the breakdown of runtimes for three different cases. The results for N = 245, M = 10, 643 suggest that the conversion of vectors from the packedmatrix to packed-integer encoding is the bottleneck for the single-threaded case. However, the conversion procedure parallelizes better (improving by a factor of 15.3x on a 28-core machine) than most of the other operations, effectively reducing its contribution from 77% in the single-threaded experiment to 38% for the 28-threaded Table 3 F 1 score as a function of the degree d l of the Chebyshev polynomials used to approximate the logit-function at d z = 8 and t = 1 As the maximum size of individuals did not exceed 1,024 in our experiments, all operations in Steps 1-8 of Algorithm 3 worked with single ciphertexts, and the runtime of these steps stayed approximately the same for all experiments. At the same time, the contribution of the matrix products involving S (steps 9 through 11) significantly increased (from 15% for N = 245, M = 10, 643 to 68% for N = 1, 000, M = 131,071).

Discussion
The solution presented in this work was awarded first place (along with another solution from UCSD) in the iDASH'18 competition (Track 2: Secure Parallel Genome Wide Association Studies using Homomorphic Encryption). Hence it represents the state of the art in secure GWAS using homomorphic encryption.
The main limitations of our solution are (1) the need to know the computation and parameters of the semiparallel procedure in advance and (2) the hand-tuned nature of many optimizations applied to our solution. The first problem can be solved once the bootstrapping for the CKKS scheme becomes more practical. The second Table 4 Runtime profiling on the 28-core node; time in seconds; numbers in header row denote step #'s in Algorithm 3; numbers in parentheses are for the single-threaded experiment; → denotes the conversion from packed-matrix to packed-integer encoding

Conclusions
The results demonstrate that our solution is able to perform the full GWAS computation homomorphically for 1000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node. Many of the optimizations presented in our paper are general-purpose and can be applied to solving challenging problems dealing with large datasets in other application domains. The major general-purpose optimizations include a new RNS variant of the CKKS scheme and multiple methods of homomorphic switching between data encodings.