Privacy-preserving logistic regression training

Background Logistic regression is a popular technique used in machine learning to construct classification models. Since the construction of such models is based on computing with large datasets, it is an appealing idea to outsource this computation to a cloud service. The privacy-sensitive nature of the input data requires appropriate privacy preserving measures before outsourcing it. Homomorphic encryption enables one to compute on encrypted data directly, without decryption and can be used to mitigate the privacy concerns raised by using a cloud service. Methods In this paper, we propose an algorithm (and its implementation) to train a logistic regression model on a homomorphically encrypted dataset. The core of our algorithm consists of a new iterative method that can be seen as a simplified form of the fixed Hessian method, but with a much lower multiplicative complexity. Results We test the new method on two interesting real life applications: the first application is in medicine and constructs a model to predict the probability for a patient to have cancer, given genomic data as input; the second application is in finance and the model predicts the probability of a credit card transaction to be fraudulent. The method produces accurate results for both applications, comparable to running standard algorithms on plaintext data. Conclusions This article introduces a new simple iterative algorithm to train a logistic regression model that is tailored to be applied on a homomorphically encrypted dataset. This algorithm can be used as a privacy-preserving technique to build a binary classification model and can be applied in a wide range of problems that can be modelled with logistic regression. Our implementation results show that our method can handle the large datasets used in logistic regression training.


Introduction
Logistic regression is a popular technique used in machine learning to solve binary classification problems. It starts with a training phase during which one computes a model for prediction based on previously gathered values for predictor variables (called covariates) and corresponding outcomes. The training phase is followed by a testing phase that assesses the accuracy of the model. To this end, the dataset is split into data for training and data for validation. This validation is done by evaluating the model likelihood of a homeowner defaulting on a mortgage or a credit card transaction being fraudulent.
As all machine learning tools, logistic regression needs sufficient training data to construct a useful model. As the above examples show, the values for the covariates and the corresponding outcomes are typically highly sensitive, which implies that the owners of this data (either people or companies) are reluctant to have their data included in the training set. In this paper, we solve this problem by describing a method for privacy preserving logistic regression training using somewhat homomorphic encryption. Homomorphic encryption enables computations on encrypted data without needing to decrypt the data first. As such, our method can be used to send encrypted data to a central server, which will then perform logistic regression training on this encrypted input data. This also allows to combine data from different data owners since the server will learn nothing about the underlying data.

Related work
Private logistic regression with the aid of homomorphic encryption has already been considered in [1,2], but in a rather limited form: both papers assume that the logistic model has already been trained and is publicly available. This publicly known model is then evaluated on homomorphically encrypted data in order to perform classification of this data without compromising the privacy of the patients. Our work complements these works by executing the training phase for the logistic regression model in a privacy-preserving manner. This is a much more challenging problem than the classification of new data, since this requires the application of an iterative method and a solution for the nonlinearity in the minimization function.
Aono et al. [3] also explored secure logistic regression via homomorphic encryption. However, they shift the computations that are challenging to perform homomorphically to trusted data sources and a trusted client. Consequently, in their solution the data sources need to compute some intermediate values, which they subsequently encrypt and send to the computation server. This allows them to only use an additively homomorphic encryption scheme to perform the second, easier, part of the training process. Finally, they require a trusted client to perform a decryption of the computed coefficients and use these coefficients to construct the cost function for which the trusted client needs to determine the minimum in plaintext space. Their technique is based on a polynomial approximation of the logarithmic function in the cost function and the trusted client applies the gradient descent algorithm as iterative method to perform the minimization of the cost function resulting from the homomorphic computations. Our method does not require the data owners to perform any computations (bar the encryption of their data) and determines the model parameters by executing the minimization directly on encrypted data. Again this is a much more challenging problem.
In [4] Xie et al. construct PrivLogit which performs logistic regression in a privacy-preserving but distributed manner. As before, they require the data owners to perform computations on their data before encryption to compute parts of a matrix used in the logistic regression. Our solution starts from the encrypted raw dataset, not from values that were precomputed by the centers that collect the data. In our solution all computations that are needed to create the model parameters, are performed homomorphically.
Independently and in parallel with our research, Kim et al. [5] investigated the same problem of performing the training phase of logistic regression in the encrypted domain. Their method uses a different approach than ours: firstly, they use a different minimization method (gradient descent) compared to ours (a simplification of the fixed Hessian method), a different approximation of the sigmoid function and a different homomorphic encryption scheme. Their solution is based on a small adaptation of the input values, which reduces the number of homomorphic multiplications needed in the computation of the model. We assumed the dataset would be already encrypted and therefore adaptations to the input would be impossible. Furthermore, they tested their method on datasets that contain a smaller number of covariates than the datasets used in this article.

Contributions
Our contributions in this paper are as follows: firstly, we develop a method for privacy preserving logistic training using homomorphic encryption that consists of a low depth version of the fixed Hessian method. We show that consecutive simplifications result in a practical algorithm, called the simplified fixed Hessian (SFH) method, that at the same time is still accurate enough to be useful. We implemented this algorithm and tested its performance and accuracy on two real life use cases: a medical application predicting the probability of having cancer given genomic data and a financial application predicting the probability that a transaction is fraudulent. Our test results show that in both use cases the model computed is almost as accurate as the model computed by standard logistic regression tools such as the ones present in Matlab.

Logistic regression
Logistic regression can be used to predict the probability that a dependent variable belongs to a class, e.g. healthy or sick, given a set of covariates, e.g. some genomic data.
In this article, we will consider binary logistic regression, where the dependent variable can belong to only two possible classes, which are labelled {±1}. Binary logistic regression is often used for binary classification by setting a threshold for a given class up front and comparing the output of the regression with this threshold. The logistic regression model is given by: where the vector β = (β 0 , . . . , β d ) are the model parameters, y the class label (in our case {±1}) and the vector Because logistic regression predicts probabilities rather than classes, we can generate the model using the log likelihood function. The training of the model starts with a training dataset (X, and corresponding observed class y i ∈ {−1, 1}. The goal is to find the parameter vector β that maximizes the log likelihood function: When the parameters β are determined, they can be used to classify new data vectors in which 0 < τ < 1 is a variable threshold which typically equals 1 2 .

Datasets
As mentioned before, we will test our method in the context of two real life use cases, one in genomics and the other in finance.
The genomic dataset was provided by the iDASH competition of 2017 and consists of 1581 records (each corresponding to a patient) consisting of 103 covariates and a class variable indicating whether or not the patient has cancer. The challenge was to devise a logistic regression model to predict the disease given a training data set of at least 200 records and 5 covariates. However, for scalability reasons the solution needed to be able to scale up to 1000 records with 100 covariates. This genomic dataset consists entirely of binary data.
The financial data was provided by an undisclosed bank that provided anonymized data with the goal of predicting fraudulent transactions. Relevant data fields that were selected are: type of transaction, effective amount of the transaction, currency, origin and destination, fees and interests, etc. This data has been subject to preprocessing by firstly representing the non-numerical values with labels and secondly computing the minimum and maximum for each of the covariates and using these to normalise the data by computing x−x min x max −x min . The resulting financial dataset consists of 20,000 records with 32 covariates, containing floating point values between 0 and 1.

The FV scheme
Our solution is based on the somewhat homomorphic encryption scheme of Fan and Vercauteren [6], which can be used to compute a limited number of additions and multiplications on encrypted data. The security of this encryption scheme is based on the hardness of the ring learning with error problem (RLWE) introduced by Lyubashevsky et al. in [7]. The core objects in the FV scheme are elements of the polynomial ring The plaintext space of the FV scheme is the ring R t for t > 1 a small integer modulus and the ciphertext space is R q × R q for an integer modulus q t. For a ∈ R q , we denote by [ a] q the element in R obtained by applying [ ·] q to all its coefficients a i , with [ a i ] q = a i mod q given by a representative in −q 2 , q 2 . The FV scheme uses two probability distributions on R q : one is denoted by χ key and is used to sample the secret key of the scheme, the other is denoted χ err and will be used to sample error polynomials during encryption. The exact security level of the FV scheme is based on these probability distributions, the degree D and the ciphertext modulus q and can be determined using an online tool developed by Albrecht et al. [8].
Given parameters D, q, t and the distributions χ key and χ err , the core operations are then as follows: • KeyGen: the private key consists of an element s ← χ key and the public key pk = (b, a) is computed as a ← R q uniformly at random and b =[ −(as + e)] q with e ← χ err . • Encrypt(pk, m): given m ∈ R t , sample error polynomials e 1 , e 2 ∈ χ err and u ∈ χ key and compute c 0 = m + bu + e 1 and c 1 = au + e 2 with = q/t , the largest integer smaller than q t . The ciphertext is then c = (c 0 , c 1 ).
• Decrypt(sk, c): computem =[ c 0 + c 1 s] q , divide the coefficients ofm by and round and reduce the result into R t .
Computing the sum of two ciphertexts simply amounts to adding the corresponding polynomials in the ciphertexts. Multiplication, however, requires a bit more work and we refer to [6] for the precise details.
The relation between a ciphertext and the underlying plaintext can be described as [ c 0 +c 1 s] q = m+e, where e is the noise component present in the ciphertext. This also shows that if the noise e grows too large, decryption will no longer result in the original message, and the scheme will no longer be correct. Since the noise present in the resulting ciphertext will grow with each operation we perform homomorphically, it is important to choose parameters that guarantee correctness of the scheme. Knowing the computations that need to be performed up front enables us to estimate the size of the noise in the resulting ciphertext, which permits the selection of suitable parameters.

w-NIBNAF
In order to use the FV scheme, we need to transform the input data into polynomials of the plaintext space R t . To achieve this, our solution makes use of the w-NIBNAF encoding, because this encoding improves the performance of the homomorphic scheme. The w-NIBNAF encoding is introduced in [9] and expands a given number θ with respect to a non-integral base 1 < b w < 2. By replacing the base b w by the variable X, the method encodes any real number θ as a Laurent polynomial: A final step then maps this Laurent polynomial into the plaintext space R t and we refer the reader to [9] for the precise details.
The w-NIBNAF encoding is constructed such that the encoding of a number will satisfy two conditions: the encoding has coefficients in the set {−1, 0, 1} and each set of w consecutive coefficients will have no more than one non-zero coefficient. Both conditions ensure that the encoded numbers are represented by very sparse polynomials with coefficients in the set {−1, 0, 1}, which can be used to bound the size of the coefficients of the result of computations on these representations. In particular, this encoding results in a smaller plaintext modulus t, which improves the performance of the homomorphic encryption scheme. Since larger values for w increase the sparseness of the encodings and hence reduce the size of t even more, one would like to select the value for w to be as large as possible. However, similar to encryption one has to consider a correctness requirement for the encoding. More specifically, decoding of the final polynomial should result in the correct answer, hence the base b w and consequently also the value of w should be chosen with care.

Privacy preserving training of the model Newton-Raphson method
To estimate the parameters of our logistic regression model, we need to compute the parameter vector β that maximizes Eq. (2). Typically, one would differentiate the log likelihood equation with respect to the parameters, set the derivatives equal to zero and solve these equations to find the maximum. The gradient of the log likelihood function l(β), i.e. the vector of its partial derivatives [ ∂l/ ∂β 0 , ∂l/∂β 1 , . . . , ∂l/∂β d ] is given by: In order to estimate the parameters β, this equation will be solved numerically by applying the Newton-Raphson method, which is a method to numerically determine the zeros of a function. The iterative formula of the Newton-Raphson method to calculate the root of a univariate function f (x) is given by: with f (x) the derivative of f (x). Since we now compute with a multivariate objective function l(β), the (k + 1) th iteration for the parameter vector β is given by: with ∇ β l(β) as defined above and H(β) = ∇ 2 β l(β) the Hessian of l(β), being the matrix of its second partial derivatives H i,j = ∂ 2 l/∂β i ∂β j , given by:

Homomorphic logistic regression
The downside of Newton's method is that exact evaluation of the Hessian and its inverse are quite expensive in computational terms. In addition, the goal is to estimate the parameters of the logistic regression model in a privacypreserving manner using homomorphic encryption, which will further increase the computational challenges. Therefore, we will adapt the method in order to make it possible to compute it efficiently in the encrypted domain.
The first step in the simplification process is to approximate the Hessian matrix with a fixed matrix instead of updating it every iteration. This technique is called the fixed Hessian Newton method. In [10], Böhning and Lindsay investigate the convergence of the Newton-Raphson method and show it converges if the Hessian H(β) is replaced by a fixed symmetric negative definite matrix B (independent of β) such that H(β) ≥ B for all feasible parameter values β, where " ≥ " denotes the Loewner ordering. The Loewner ordering is defined for symmetric matrices A, B and denoted as A ≥ B iff their difference A − B is non-negative definite. Given such B, the Newton-Raphson iteration simplifies to Furthermore, they suggest a lower bound specifically for the Hessian of the logistic regression problem, which is defined asH = − 1 4 X T X and demonstrate that this is a good bound. This approximation does not depend on β, consequently it is fixed throughout all iterations and it only needs to be computed once as desired. Since the Hessian is fixed, so is its inverse, which means it only needs to be computed once.
In the second step, we will need to simplify this approximation even more, since inverting a square matrix whose dimensions equal the number of covariates (and thus can be quite large), is nearly impossible in the encrypted domain. To this end, we replace the matrixH by a diagonal matrix for which the method still converges. The entries of the diagonal matrix are simply the sums of the rows of the matrixH, so our new approximationH of the Hessian becomes: To be able to use this approximation as lower bound for the above fixed Hessian method we need to assure ourselves it satisfies the condition H(β) ≥H. As mentioned before we already know from [10] that H(β) ≥ −1 4 X T X, so it is sufficient to show that −1 4 X T X ≥H, which we now prove more generally. Proof By definition of the matrix B, we have that C = A−B has the following entries: for i = j we have C i,j = A i,j and C i,i = − n k=1,k =i A i,k . In particular, the diagonal elements of C are minus the sum of the off-diagonal elements on the i-th row. We can bound the eigenvalues λ i of C by Gerschgorin's circle theorem [11], which states that for every eigenvalue λ of C, there exists an index i such that Note that by construction of C we have that C i,i = j =i |C ij |, and so every eigenvalue λ satisfies |λ − C i,i | < C i,i for some i. In particular, since C i,i ≥ 0, we conclude that λ ≥ 0 for all eigenvalues λ and thus that A ≥ B.
Our approximationH for the Hessian also simplifies the computation of the inverse of the matrix, since we simply need to invert each diagonal element separately. The inverse will be again computed using the Newton-Raphson method: assume we want to invert the number a, then the function f (x) will be equal to 1 x − a and the iteration is given by x k+1 = x k (2 − ax k ). For the Newton-Raphson method to converge, it is important to determine a good start value. Given the value range of the input data and taking into account the dimensions of the training data, we estimate a range of the size of the number we want to invert. This results in an estimation of the order of magnitude of the solution that is expected to be found by the Newton-Raphson algorithm. By choosing the initial value of our Newton-Raphson iteration close to the constructed estimation of the inverse, we can already find an acceptable approximation of the inverse by performing only one iteration of the method.
In the third and final step, we simplify the non-linearity coming from the sigmoid function. Here, we simply use the Taylor series: extensive experiments with plaintext data showed that approximating σ y i β T x i by 1 2 + y i β T x i 4 is enough to obtain good results. The combination of the above techniques finally results in our simplified fixed Hessian (SFH) method given in Algorithm 1. We implemented the SFH algorithm in Matlab and verified the accuracy for a growing number of iterations. One can see from Algorithm 1 that each iteration requires 5 homomorphic multiplications, so performing one iteration is quite expensive. In addition, Table 1 indicates that improving the accuracy significantly requires multiple iterations. We will therefore restrict our experiments to one single iteration. Table 2 shows the confusion matrix of a general binary classifier.

Accuracy of the SFH method
From the confusion matrix, we can compute the true positive rate (TPR) and the false positive rate (FPR) which are given by TPR = #TP #TP + #FN and FPR = #FP #FP + #TN .
By computing the TPR and FPR for varying thresholds 0 ≤ τ ≤ 1, we can construct the receiver operating characteristic curve or ROC-curve. The ROC-curve is constructed by plotting the (FPR, TPR) pairs for each possible value of the threshold τ . In the ideal situation there would exists a point with (FPR, TPR) = (0, 1), which would imply that there exists a threshold for which the model classifies all test data correctly.
The area under the ROC-curve or AUC-value will be used as the main indicator of how well the classifier works. Since our SFH method combines several approximations, we need to verify the accuracy of our model first on unencrypted data and later on encrypted data. For well chosen system parameters, there will be no difference between accuracy for unencrypted vs. encrypted data since all computations on encrypted data are exact.
The first step is performed by comparing our SFH method with the standard logistic regression functionality of Matlab. This is done by applying our method with all its approximations to the plaintext data and comparing the result to the result of the "glmfit" function in Matlab. The function b = glmfit(X, y, distr) returns a vector  b of coefficient estimates for a generalized linear model of the responses y on the predictors in X, using distribution distr. Generalized linear models unify various statistical models, such as linear regression, logistic regression and Poisson regression, by allowing the linear model to be related to the response variable via a link function. We use the "binomial" distribution, which corresponds to the "logit" link function and y a binary vector indicating success or failure to compute the parameters of the logistic regression model with "glmfit". From Figs. 1 and 2 one can see that the SFH method classifies the data approximately as well as "glmfit" in Matlab, in the sense that one can always select a threshold that gives approximately the same true positive rate and false positive rate. One can thus conclude that the SFH method, with all its approximations, performs well compared to the standard Matlab method, which uses much more precise computations. By computing the TPR and FPR for several thresholds, we found that the approximations of our SFH method shifts the model a bit such that we need a slightly larger threshold to get approximately the same TPR and FPR as for the Matlab model. Since the ideal situation would be to end up with a true positive rate of 1 and false positive rate of 0, we see from Fig. 1 that for the genomics dataset both models are performing rather poorly. The financial fraud use case is, however, much more amenable to binary classification as shown in Fig. 2. The main conclusion is that our SFH method performs almost as well as standard methods such as those provided by Matlab.

Implementation details and performance
Our implementation uses the FV-NFLlib software library [12] which implements the FV homomorphic encryption scheme. The system parameters need to be selected taking into account the following three constraints: 1 the security of the somewhat homomorphic FV scheme, 2 the correctness of the somewhat homomorphic FV scheme, 3 the correctness of the w-NIBNAF encoding.
The security of a given set of system parameters can be estimated using the work of Albrecht, Player and Scott [13] and the open source learning with error (LWE) hardness estimator implemented by Albrecht [8]. This program estimates the security of the LWE problem based on the following three parameters: the degree D of the polynomial ring, the ciphertext modulus q and α = √ 2πσ q where σ is the standard deviation of the error distribution χ err . The security estimation is based on the best known attacks for the learning with error problem. Our system parameters are chosen to be q = 2 186 , D = 4096 and σ = 20 (and thus α = √ 2πσ q ) which results in a security of 78 bits.
As explained in the section on the FV scheme, the error in the ciphertext encrypting the result, should be small enough to enable correct decryption. By estimating the infinity norm of the noise we can select parameters that keep this noise under the correctness bound and in particular, we obtain an upper bound t max of the plaintext modulus. Similarly, to ensure correct decoding, the coefficients of the polynomial encoding the result must remain smaller than the size of the plaintext modulus t. This condition results in a lower bound on the plaintext modulus t min .
It turns out that these bounds are incompatible for the chosen parameters, so we have to rely on the Chinese Remainder Theorem to decompose the plaintext space into smaller parts that can be handled correctly. The plaintext modulus t is chosen as a product of small prime numbers t 1 , t 2 , . . . , t n with ∀i ∈ {1, . . . , n} : t i ≤ t max and t = n i=1 t i ≥ t min , where t max is determined by the correctness of the FV scheme and t min by the correctness of the w-NIBNAF decoding. The CRT then gives the following ring isomorphism: and instead of performing the training algorithm directly over R t , we compute with each of the R t i 's by reducing the w-NIBNAF encodings modulo t i . The resulting choices for the plaintext spaces are given in Table 3.
Since we are using the Chinese Remainder Theorem, each record will be encrypted using two (for the financial fraud case) or three (for the genomics case) ciphertexts. As such, a time-memory trade off is possible depending on the requirements of the application. One can choose to save computing time by executing the algorithm for the different ciphertexts in parallel; or one can choose to save memory by computing the result for each plaintext space R t i consecutively and overwriting the intermediate values of the computations in the process.
The memory required for each ciphertext is easy to estimate: a ciphertext consists of 2 polynomials of R q = Z q [ X] /(X D + 1), so its size is given by 2D log 2 q which is ≈ 186kB for the chosen parameter set. Due to the use of the CRT, we require T (with T = 2 or T = 3) ciphertexts to encrypt each record, so the general formula for the encrypted dataset size is given by: with T the number of prime factors used to split the plaintext modulus t and d+1 (resp. N) the number of covariates (resp. records) used in the training set.
The time complexity of our SFH method is also easy to estimate, but one has to be careful to perform the operations in a specific order. If one would naively compute the matrixH by first computingH and subsequently summing each row, the complexity would be O Nd 2 . However, the formula of the k-th diagonal element ofH is given by −1  it is more efficient to first sum all the rows of X and then perform a matrix vector multiplication with complexity O(Nd). This complexity is clearly visible in the tables, more specifically in Tables 4 and 5 for the genomic use case, and  Tables 6 and 7 for the financial use case. All these tables show a linear growth of the computation time for a growing number of records or covariates as expected by the chosen order of the computations in the implementation.
In Tables 4 and 5 we see that often the AUC value of the SFH model is slightly higher than the AUC value of the glmfit model. However, as mentioned before both models perform poorly on this dataset. Since our SFH model contains many approximations we expect it to perform slightly worse than the "glmfit" model. Only slightly worse because Figs. 1 and 2 already showed that the SFH models classifies the data almost as well as the "glmfit" model. This is consistent with the results for the financial dataset shown in Tables 6 and 7, which we consider more relevant than the results of the genomic dataset due to the fact that both models perform better on this dataset.

Discussion
The experiments of this article show promising results for the simple iterative method we propose as an algorithm to compute the logistic regression model. A first natural question is whether this technique is generalizable to other machine learning problems. In [14], Böhning describes how to adapt the lower bound method to make it applicable to multinomial logistic regression, it is likely this adaption will also apply to our SFH technique and hence our SFH technique can most likely also be applied to construct a multinomial logistic regression model. In  The number of testing records is for each row equal to the total number of input records (20,000) minus the number of training records the case of neural networks we can refer to [15]; in order to construct the neural network one needs to rank all the possibilities and only keep the best performing neurons for the next layer. Constructing this ranking homomorphically is not straightforward and not considered at all in our algorithm, hence neural networks will require more complicated algorithms. When we look purely at the performance of the FV homomorphic encryption scheme, we might consider a residue number system (RNS) variant of the FV scheme as described in [16] to further improve the running time of our implementation. One could also consider single instruction multiple data (SIMD) techniques as suggested in [17] or look further into a dynamic rescaling procedure for FV as mentioned in [6]. These techniques will presumably further decrease the running time of our implementation, which would render our solution even more valuable.

Conclusions
The simple, but effective, iterative method presented in this paper allows one to train a logistic regression model on homomorphically encrypted input data. Our method can be used to outsource the training phase of logistic regression to a cloud service in a privacy preserving manner. We demonstrated the performance of our logistic training algorithm on two real life applications using different numeric data types. In both cases, the accuracy of our method is only slightly worse than standard algorithms to train logistic regression models. Finally, the time complexity of our method grows linearly in the number of covariates and the number of training input data points.