High Performance Logistic Regression for Privacy-Preserving Genome Analysis

In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure Multi-Party Computation implementation for training logistic regression models on high dimensional genome data distributed across a local area network.

Abstract-In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns. Training Machine Learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from Machine Learning and cryptography. When collaboratively training Machine Learning models with the cryptographic technique named secure Multi-Party Computation, the price paid for keeping the data of the owners private is an increase in computational cost and runtime. A careful choice of Machine Learning techniques, algorithmic and implementation optimizations are a necessity to enable practical secure Machine Learning over distributed data sets. Such optimizations can be tailored to the kind of data and Machine Learning problem at hand.
Our setup involves secure Two-Party Computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient descent based algorithm for training a logistic regression model, and we break down the algorithm into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao's garbled circuits, and a series of cryptographic engineering optimizations to improve the performance. To the best of our knowledge, we present the fastest existing secure Multi-Party Computation implementation for training logistic regression models on high dimensional genome data distributed across a local area network.
For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 seconds in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition.
Index Terms-Logistic regression, Gradient descent, Machine Learning, Secure Multi-Party Computation, Gene expression data

I. BACKGROUND

A. Introduction
Machine Learning (ML) has many applications in the biomedical domain, such as medical diagnosis and personalized medicine. Biomedical data sets are typically characterized by high dimensionality, i.e. a high number of features such as lab test results or gene expression values, and low sample size, i.e. a small number of training examples corresponding to e.g. patients or tissue samples. Adding to these challenges, valuable training data is often split between parties (data owners) who cannot openly share the data because of privacy regulations and concerns. Due to these concerns, privacy-preserving solutions, using techniques such as secure Multi-Party Computation (MPC), become important so that this data can still be used to train ML models, perform a diagnosis, and in some cases even derive genomic diagnoses [25].
We tackle the problem of training a binary classifier on high dimensional gene expression data held by different data owners, while keeping the training data private. This work is directly inspired by Track 4 of the iDASH 2019 secure genome analysis competition (see footnote 1). The iDASH competition is a yearly international competition in which participants create and implement privacy-preserving protocols for applications with genomic data. The goal is to evaluate the best-known secure methods and to advance new techniques for solving real-world problems in handling genomic data. In the 2019 edition there were a total of four different tracks, where Track 4 invited participants to design MPC solutions for collaborative training of ML models on data originating from multiple data owners. One of the Track 4 competition data sets consists of 470 training examples (records) with 17,814 numeric features, while the other consists of 225 training examples with 12,634 numeric features. An initial 5-fold cross-validation analysis in the clear, i.e. without any encryption, indicated that in both cases logistic regression (LR) models are capable of yielding the level of prediction accuracy expected in the competition, prompting us to investigate MPC-based protocols for secure LR training.
The competition requirements implied the existence of multiple data owners who each send their training example(s) in an encrypted or secret shared form to data processors (computing nodes), as illustrated in Figure 1. The honest-but-curious data processors are not to learn anything about the data as they engage in computations and communications with each other. At the end, they disclose the trained classifier (in our case, the coefficients of the LR model) to the data owners. Since the data processors cannot learn anything about the values in the data set, our protocol is applicable in a wide range of scenarios, independently of how the original data is split by ownership. Our protocol works in scenarios where the data is horizontally partitioned, i.e. when each data owner has different records of the data, such as data belonging to different patients. It also works in scenarios where the data is vertically partitioned, i.e. when each data owner has different features of the data, such as the expression values for different genes.
The main novelty points of our solution for private LR training over a distributed data set are: (i) a new protocol for securely computing the activation function that avoids the use of full-fledged secure comparison protocols; (ii) a novel method for bit decomposing secret shared integers and bundling their instantiations; and (iii) several cryptographic engineering enhancements that, together with the novel protocol for the activation function, gave us, to the best of our knowledge, the fastest privacy-preserving LR implementation when run in local area networks (LANs). In summary, we designed a concrete solution for fast secure training of a binary classifier over gene expression data that meets the strict security requirements of the iDASH 2019 competition. For our largest data set, we train a model that requires over 7 billion secure multiplications and the training completes in about 26.9 seconds in a LAN. This paper significantly expands over a preliminary version of this result [13], presented at a workshop without formal proceedings. In this version we have a formal description of all protocols, security proofs and improved running times.
We first discuss below our work as compared to others. In the Section Methods, we present preliminary information on MPC, describe the secure subprotocols that are building blocks for our secure LR training protocol, and finally describe the protocol itself. In the Section Results we describe details of our implementation and runtime results for the overall protocol and microbenchmarks for our secure activation function protocol. We experimentally compare our solution with the state-of-the-art SecureML approach [28], demonstrating substantial runtime improvements. In the Section Discussion, we note possible future work to improve and extend our results, and finally in the Section Conclusions we present our summary remarks.

B. Related Work
A variety of efforts have previously been made to train LR classifiers in a privacy-preserving way.
One scenario that was considered in previous works [4], [8], [27] is the setting in which a data owner holds the data while another party (the data processor), such as a cloud service, is responsible for the model training. These solutions usually rely on homomorphic encryption, with the data owner encrypting and sending their data to the data processor who performs computations on the encrypted data without having to decrypt it.
When the data is held by multiple data owners, they can either execute an MPC protocol among themselves to train the model, or delegate the computation to a set of data processors that run a MPC protocol. It is the latter setting that we follow in this paper.
Existing MPC approaches to secure LR differ in the numerical optimization algorithms used for LR training and in the cryptographic primitives leveraged [21], [28], [29], [38].
The SPARK protocol [21] uses additive homomorphic encryption (Paillier cryptosystem) and uses Newton-Raphson as the numerical optimization algorithm to find the values of the weights that maximize the log-likelihood. The SPARK protocol can use the actual logistic function without approximating it at the cost of the plaintext data being horizontally partitioned and seen by the data processors. The two protocols from [29] rely on the Newton-Raphson method, both approximate the logistic function, and both use additive secret sharing. The first protocol includes the use of Yao's garbled circuits to compute the approximation of the logistic function, while the second protocol uses a Taylor approximation and Euler's method. The PrivLogit method [38] uses Yao's garbled circuits and Paillier encryption; their protocol uses the Newton-Raphson method and a constant Hessian approximation to speed up computation. However, this protocol relies on the plaintext data being horizontally partitioned and seen by the data processors, which, like the work in [21], would not align with the iDASH 2019 competition requirements. We also point out a protocol secure against active adversaries from SecureNN [35] for computing a ReLu. While we compute a different function (clipped ReLu), we share a similar idea that using the most significant bit of an input can tell us the output of the function.
The work closest to ours is SecureML [28], which was the fastest protocol for privately training LR models based on secure MPC prior to our work. SecureML separates the data owners from the data processors, and uses mini-batch gradient descent. The main novelty points of SecureML are a clipped ReLu activation function, a novel truncation protocol, and a combination of garbled circuits and secret sharing based MPC in order to obtain a good trade-off between communication, computation and round complexities. The SecureML protocol is evaluated on a data set with up to 5,000 features, while, to the best of our knowledge, the existing runtime evaluation of all other approaches for MPC based LR training is limited to 400 features or less [21], [29], [38]. Like our solution, the SecureML protocol is split into an offline and an online phase (the offline phase can be executed before the inputs are known and is responsible for generating multiplication triples). The SecureML solution is based on two servers, while our solution is based on three servers, namely a party who pre-computes so-called multiplication triples in the offline stage, and two parties who actively compute the final result. If we exclude the preprocessing/offline stage from SecureML and exclude the pre-distribution of triples in our solution, we are left with protocols that work in exactly the same setting. We compare the runtime of both solutions in the Section Results, showing that our implementation is substantially faster.
A preliminary version of this work appeared in a workshop without formal proceedings [13]. This paper is a substantially longer and detailed description that includes security proofs, detailed comparison with the state-of-the-art, and improved running times.

A. Logistic Regression
Logistic regression is a common Machine Learning algorithm for binary classification. The training data D consists of training examples d = (x_d, t_d), with x_d = (x_{d,1}, x_{d,2}, ..., x_{d,m}) containing the values of m input attributes for example d, and t_d ∈ {0, 1} the ground truth class label. Each x_{d,i} for i ∈ {1, 2, ..., m} is a real number value.
As illustrated in Figure 2(a), we train a neuron to map the x_d's to the corresponding t_d's, correctly classifying the examples. The neuron computes a weighted sum of the inputs (the values of the weights are learned during training) and subsequently applies an activation function to it, to arrive at the output o_d = f(w_0·x_{d,0} + w_1·x_{d,1} + ··· + w_m·x_{d,m}), which is interpreted as the probability that the class label is 1. Note that, as is common in neural network training, we extend the input attribute vector with a dummy feature x_{d,0} which has value 1 for all x_d's. The traditionally used activation function for LR is the sigmoid function σ(z) = 1/(1 + e^{−z}). Since the sigmoid function σ requires division and evaluation of an exponential function, which are expensive operations to perform in MPC, we approximate it with the activation function ρ from [28], which is shown in Figure 2(b): ρ(z) = 0 if z < −1/2, ρ(z) = z + 1/2 if −1/2 ≤ z ≤ 1/2, and ρ(z) = 1 if z > 1/2.
For training, we use the full gradient descent based algorithm shown in Algorithm 1 to learn the weights for the LR model. On line 3, we choose not to use early stopping (see footnote 2) because in that case the number of iterations would depend on the values in the training data, hence leaking information [29]. Instead, we use a fixed number of iterations during training.
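The training procedure described above can be sketched in the clear as follows. This is a minimal plaintext illustration, not the authors' implementation: Algorithm 1 is not reproduced here, so the update rule Δw_i += η·(t_d − o_d)·x_{d,i}, the zero initialization of the weights, and the names `rho`, `train_lr` and `predict` are assumptions made for the example.

```python
def rho(z):
    """Piecewise-linear stand-in for the sigmoid: 0, z + 1/2, or 1."""
    if z < -0.5:
        return 0.0
    if z > 0.5:
        return 1.0
    return z + 0.5

def train_lr(X, t, eta=0.1, n_iter=50):
    """Full gradient descent with a fixed number of iterations (no early stopping)."""
    m = len(X[0])
    w = [0.0] * (m + 1)                 # w[0] pairs with the dummy feature x_{d,0} = 1
    for _ in range(n_iter):
        delta = [0.0] * (m + 1)
        for x_d, t_d in zip(X, t):
            x_ext = [1.0] + list(x_d)
            o_d = rho(sum(wi * xi for wi, xi in zip(w, x_ext)))
            for i, xi in enumerate(x_ext):
                delta[i] += eta * (t_d - o_d) * xi   # assumed update rule
        # Weights change only after all examples have been processed.
        w = [wi + di for wi, di in zip(w, delta)]
    return w

def predict(w, x_d):
    """Output o_d for a single example, using the learned weights."""
    return rho(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x_d))))
```

Because the loop count is the constant `n_iter`, the running time does not depend on the training values, which is the property the text asks for when it rules out early stopping.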

B. Our scenario
In the scenario considered in this work the data is not held by a single party that performs all the computation, but distributed by the data owners to the data processors in such a way that each data processor does not have any information about the data in the clear. Nevertheless, the data processors would still like to compute a LR model without leaking any other information about the data used for the training. To achieve this goal, we will use techniques from MPC.
Footnote 2 (early stopping): This is a technique that uses a metric, such as the accuracy on a held-out validation data set, to check when a model starts to overfit and will then stop training at that point.
Our setup is illustrated in Figure 1. We have multiple data owners who each hold disjoint parts of the data that is going to be used for the training. This is the most general approach and covers the cases in which the data is horizontally partitioned (i.e. for each training sample d = (x d , t d ), all the data for d is held by one of the data owners), vertically partitioned (for each feature, the values of that feature for all training samples are held by one of the data owners), and even arbitrary partitions. There are two data processors who collaborate to train a LR model using secure MPC protocols, and a trusted initializer (TI) that predistributes correlated randomness to the data processors in order to make the MPC computation more efficient. The TI is not involved in any other part of the execution, and does not learn any data from the data owners or data processors.
We next present the security model that is used and several secure building blocks, so that afterwards we can combine them in order to obtain a secure LR training protocol.

C. Security Model
The security model in which we analyze our protocol is the Universal Composability (UC) framework [5] as it provides the strongest security and composability guarantees and is the gold standard for analyzing cryptographic protocols nowadays. Here we will only give a short overview of the UC framework (for the specific case of two-party protocols), and refer interested readers to the book of Cramer et al. [9] for a detailed explanation.
The main advantage of the UC framework is that the UC composition theorem guarantees that any protocol proven UC-secure can also be securely composed with other copies of itself and of other protocols (even with arbitrarily concurrent executions) while preserving its security. Such a guarantee is very useful since it allows the modular design of complex protocols, and is a necessity for protocols executing in complex environments such as the Internet.
The UC framework first considers a real world scenario in which the two protocol participants (the data processors from Figure 1, henceforth denoted Alice and Bob) interact between themselves and with an adversary A and an environment Z (that captures all activity external to the single execution of the protocol that is under consideration). The environment Z gives the inputs to and gets the outputs from Alice and Bob. The adversary A delivers the messages exchanged between Alice and Bob (thus modeling an adversarial network scheduling) and can corrupt one of the participants, in which case he gains control over it. In order to define security, an ideal world is also considered. In this ideal world, an idealized version of the functionality that the protocol is supposed to perform is defined. The ideal functionality F receives the inputs directly from Alice and Bob, performs the computations locally following the primitive specification and delivers the outputs directly to Alice and Bob. A protocol π executing in the real world is said to UC-realize functionality F if for every adversary A there exists a simulator S such that no environment Z can distinguish between: (1) an execution of the protocol π in the real world with participants Alice and Bob, and adversary A; and (2) an ideal execution with dummy parties (that only forward inputs/outputs), F and S.
This work, like the vast majority of the privacy-preserving machine learning protocols in the literature, considers honest-but-curious, static adversaries. In more detail, the adversary chooses the party that he wants to corrupt before the protocol execution and he also follows the protocol instructions (but tries to learn additional information). We consider the trusted initializer model, in which a trusted initializer functionality F_TI^D pre-distributes correlated randomness to Alice and Bob (see footnote 3). A trusted initializer has often been used to enable highly efficient solutions both in the context of privacy-preserving machine learning [14], [10], [22], [12], [31] as well as in other applications, e.g., [32], [19], [18], [24], [34], [11].
Simplifications: In our proofs the simulation strategy is simple and will be described briefly: all the messages look uniformly random from the recipient's point of view, except for the messages that open a secret shared value to a party, but these can be easily simulated using the output of the respective functionalities. Therefore a simulator S, having the leverage of being able to simulate the trusted initializer functionality F_TI^D in the ideal world, can easily perform a perfect simulation of a real protocol execution, thus making the real and ideal worlds indistinguishable for any environment Z. In the ideal functionalities the messages are public delayed outputs, meaning that the simulator is first asked whether they should be delivered or not (this is due to the modeling that the adversary controls the network scheduling). This fact as well as the session identifications are omitted from our functionalities' descriptions for the sake of readability.

D. Secret Sharing Based Secure Multi-Party Computation
Our MPC solution is based on additive secret sharing over a ring Z_q = {0, 1, ..., q − 1}. When secret sharing a value x ∈ Z_q, Alice and Bob receive shares x_A and x_B, respectively, that are chosen uniformly at random in Z_q with the constraint that x_A + x_B = x mod q. We denote the pair of shares by [[x]]_q. All computations are modulo q and the modular notation is henceforth omitted for conciseness. Note that no information about the secret value x is revealed to either party holding only one share. Linear operations on secret shared values, such as the addition of a public constant c, denoted [[z]]_q ← [[x]]_q + c, can be performed locally by the parties on their own shares. The secure multiplication of secret shared values (i.e., z = xy) cannot be done locally and involves communication between Alice and Bob. To obtain an efficient secure multiplication solution, we use the multiplication triples technique that was originally proposed by Beaver [3]. We use a trusted initializer to pre-distribute the multiplication triples (which are a form of correlated randomness) to Alice and Bob. We use the same protocol π_DMM for secure (matrix) multiplication of secret shared values as in [12], [15] and denote by π_DM the protocol for the special case of multiplication of scalars and by π_IP the protocol for the inner product. As shown in [12], the protocol π_DMM (described in Protocol 2) UC-realizes the distributed matrix multiplication functionality F_DMM in the trusted initializer model.
Functionality F_DMM
F_DMM runs with Alice and Bob and is parametrized by the size q of the ring Z_q and the dimensions (i, j) and (j, k) of the matrices.
Input: Upon receiving a message from Alice/Bob with its shares of [[X]]_q and [[Y]]_q, verify if the share of X is in Z_q^{i×j} and the share of Y is in Z_q^{j×k}. If it is not, abort. Otherwise, record the shares, ignore any subsequent message from that party and inform the other party about the receipt.
Output: Upon receipt of the shares from both parties, reconstruct X and Y from the shares, compute Z = XY and create a secret sharing [[Z]]_q to distribute to Alice and Bob: a corrupt party fixes its share of the output to any chosen matrix and the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
The protocol is parametrized by the size q of the ring Z q and the dimensions (i, j) and (j, k) of the matrices.
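The sharing scheme and the multiplication-triple technique can be illustrated with a short single-process simulation of the scalar case. This is a minimal sketch, not the authors' implementation: both parties and the trusted initializer are simulated in one process, the function names (`share`, `reconstruct`, `trusted_initializer_triple`, `secure_mul`) are hypothetical, and q = 2^64 is assumed to match the 64-bit ring mentioned for the experiments.

```python
import secrets

Q = 2 ** 64  # assumed ring size q

def share(x):
    """Split x into two additive shares with x_A + x_B = x (mod q)."""
    x_a = secrets.randbelow(Q)
    return x_a, (x - x_a) % Q

def reconstruct(x_a, x_b):
    """Open a shared value by adding both shares modulo q."""
    return (x_a + x_b) % Q

def trusted_initializer_triple():
    """The TI samples u, v uniformly and hands out sharings of u, v and w = u*v."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(u), share(v), share(u * v % Q)

def secure_mul(x_sh, y_sh):
    """Beaver-style multiplication of two shared scalars x and y."""
    (u_a, u_b), (v_a, v_b), (w_a, w_b) = trusted_initializer_triple()
    # Each party masks its shares locally; only the masked values d, e are opened.
    d = reconstruct((x_sh[0] - u_a) % Q, (x_sh[1] - u_b) % Q)   # d = x - u
    e = reconstruct((y_sh[0] - v_a) % Q, (y_sh[1] - v_b) % Q)   # e = y - v
    # xy = w + d*v + e*u + d*e; only one party adds the public term d*e.
    z_a = (w_a + d * v_a + e * u_a + d * e) % Q
    z_b = (w_b + d * v_b + e * u_b) % Q
    return z_a, z_b
```

Since u and v are uniformly random and used once, the opened values d and e look uniformly random to each party, which is the intuition behind the simulation argument in the security model.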

E. Converting to Fixed-Point Representation
Each data owner initially needs to convert their training data to integers modulo q so that they can be secret shared. As illustrated in Figure 3, each feature value x ∈ R is converted into a fixed point approximation of x using a two's complement representation for negative numbers. We define this new value as Q(x) ∈ Z_q. This conversion is shown in Equation (1):
Q(x) = 2^λ − ⌊2^a · |x|⌋ if x < 0, and Q(x) = ⌊2^a · x⌋ if x ≥ 0. (1)
Specifically, when we convert Q(x) into its bit representation, we define the first a bits from the right to hold the fractional part of x, the next b bits to represent the non-negative integer part of x, and the most significant bit (MSB) to represent the sign (positive or negative). We define λ to represent the total number of bits such that the ring size q is defined as q = 2^λ. It is important to choose a λ that is large enough to represent the largest number x that can be produced during the LR protocol, and therefore λ should be chosen to be at least 2(a + b) (see Truncation). It is also important to choose a b that is large enough to represent the maximum possible value of the integer part of all x's (this is dependent on the data). This conversion and bit representation is shown in Figure 3.
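The conversion Q(x) can be sketched as follows. This is an illustrative example under stated assumptions: the magnitude is rounded down (floor), a = 12 and λ = 64 are taken from the experimental settings mentioned later in the text, and the names `to_fixed` and `from_fixed` are hypothetical.

```python
import math

def to_fixed(x, a=12, lam=64):
    """Q(x): floor(2^a * |x|), with negative values stored as 2^lam - magnitude."""
    q = 1 << lam
    mag = math.floor(abs(x) * (1 << a))   # assumed rounding: floor of the magnitude
    return (q - mag) % q if x < 0 else mag

def from_fixed(v, a=12, lam=64):
    """Map a ring element back to a real number (MSB set means negative)."""
    q = 1 << lam
    signed = v - q if v >= q // 2 else v
    return signed / (1 << a)
```

For example, with a = 12 the value 1.5 maps to 1.5 · 4096 = 6144, and −1.5 maps to 2^64 − 6144, whose most significant bit is 1.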

F. Truncation
When multiplying numbers that were converted into a fixed point representation with a fractional bits, the resulting product carries 2a fractional bits, i.e. a more than the operands. For example, the fixed point representations of x and y, for x, y > 0, are x · 2^a and y · 2^a, respectively. The multiplication of both these terms results in xy · 2^{2a}, showing that now 2a bits are representing the fractional part, which we must scale back down to xy · 2^a to do any further computations. In our solution, we use the two-party local truncation protocol for fixed point representations of real numbers proposed in [28], which we will refer to as π_trunc. It does not involve any messages between the two parties; each party simply performs an operation on its own local share. This protocol almost always incurs an error of at most a bit flip in the least-significant bit. However, with probability 2^{a+1−λ}, where a is the number of fractional bits, the resulting value is completely random.
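The local truncation idea can be sketched as follows. This is our reading of the share-wise truncation from [28], simulated in one process: the first party shifts its share directly, while the second party negates, shifts and negates again. The names `share`, `trunc_local` and `to_signed` are hypothetical, and a = 12, λ = 64 are assumed.

```python
import secrets

LAM, A = 64, 12
Q = 1 << LAM

def share(x):
    """Additive sharing of x modulo 2^LAM."""
    x_a = secrets.randbelow(Q)
    return x_a, (x - x_a) % Q

def trunc_local(x_a, x_b, a=A):
    """Each party transforms only its own share; no messages are exchanged."""
    z_a = x_a >> a                        # first party: floor(x_A / 2^a)
    z_b = (Q - ((Q - x_b) >> a)) % Q      # second party: -floor((-x_B mod q) / 2^a)
    return z_a, z_b

def to_signed(v):
    """Interpret a ring element as a signed integer (MSB set means negative)."""
    return v - Q if v >= Q // 2 else v
```

With overwhelming probability the opened result differs from the exact truncation by at most one unit in the last place; the rare failure case corresponds to the share pair not wrapping around the modulus in the expected way.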
When this truncation protocol is performed on increasingly large data sets (in our case we run over 7 billion secure multiplications), the probability of an erroneous truncation becomes a real issue, one that was not significant in previous implementations. There are two phases in which truncation is performed: (1) when computing the dot product (inner product) of the current weights vector with a training example in line 7 of Algorithm 1, and (2) when the weight differentials (Δw_i) are adjusted in line 9 of Algorithm 1. If a truncation error occurs during (1), the resulting erroneous value will be pushed into a reasonable range by the activation function and incur only a minor error for that round. If the error occurs during (2), an element of the weights vector will be updated to a completely random ring element and recovery from this error will be impossible. To mitigate this in experiments, we make use of 10-12 bits of fractional precision with a ring size of 64 bits, making the probability of failure 1/2^53 ≤ p ≤ 1/2^51. The number of truncations that need to be performed is also reduced in our implementation by waiting to perform truncation until it is absolutely required. For instance, instead of truncating each result of multiplication between an attribute and its corresponding weight, a single truncation can be performed at the end of the entire dot product.
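The effect of deferring truncation can be illustrated on plain (unshared) fixed-point integers. This is a simplified sketch: it works on signed Python integers rather than on ring shares, the helper names are hypothetical, and `failure_probability` simply evaluates the expression 2^{a+1−λ} quoted in the text.

```python
import math

A = 12                      # fractional bits
SCALE = 1 << A

def fx(x):
    """Fixed-point encoding as a signed Python int (no ring wrap-around here)."""
    return math.floor(x * SCALE)

def dot_truncate_each(w, x):
    """Truncate after every product: one truncation per attribute."""
    return sum((wi * xi) >> A for wi, xi in zip(w, x))

def dot_truncate_once(w, x):
    """Accumulate at scale 2^(2a), then truncate a single time at the end."""
    return sum(wi * xi for wi, xi in zip(w, x)) >> A

def failure_probability(a, lam):
    """Per-truncation failure probability 2^(a+1-lam)."""
    return 2.0 ** (a + 1 - lam)
```

With n attributes the deferred variant performs one truncation instead of n, so both the accumulated rounding error and the number of opportunities for a catastrophic truncation failure shrink by a factor of about n for each dot product.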
Additional error is incurred on the accuracy by the fixed point representation itself. Through cross-validation with an in-the-clear implementation, we determined that 12 bits of fractional precision provide enough accuracy to make the output accuracy indistinguishable between the secure version and the plaintext version.

G. Conversion of Sharings
For efficiency reasons, in some of the steps for securely computing the activation function we use secret sharings over Z_2, while in others we use secret sharings over Z_{2^λ}. Therefore we need to be able to convert between the two types of secret sharings.
We use the two-party protocol from [12] for performing the bit-decomposition of a secret-shared value [[x]]_{2^λ} to shares [[x_i]]_2, where x_λ ··· x_1 is the binary representation of x. It works like the ripple carry adder arithmetic circuit based on the insight that the difference between the sum of the two additive shares held by the parties and an "XOR-sharing" of that sum is the carry vector. As proven in [12], the bit-decomposition protocol π_decomp (described in Protocol 3) UC-realizes the bit-decomposition functionality F_decomp.
Functionality F_decomp
F_decomp runs with Alice and Bob and is parametrized by the bit-length λ of the value x being converted from additive sharings [[x]]_{2^λ} in Z_{2^λ} to additive bitwise sharings [[x_i]]_2 in Z_2 such that x = x_λ ··· x_1.
Input: Upon receiving a message from Alice or Bob with its share of [[x]]_{2^λ}, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct the value x = x_λ ··· x_1 from the shares, and for i ∈ {1, ..., λ} distribute new sharings [[x_i]]_2 of the bit x_i. Before the delivery of the output, the corrupt party fixes its shares of the output to any desired value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraints.

Protocol 3 (π_decomp)
Input: [[x]]_{2^λ}
Output: [[x_i]]_2, where x_λ ··· x_1 is the binary representation of x.
1) All distributed multiplications are over Z_2 and the required correlated randomness is pre-distributed by the trusted initializer.
2) Let a denote Alice's share of x, which corresponds to a bit string a_λ ... a_1. Similarly, let b denote Bob's share of x, which corresponds to a bit string b_λ ... b_1. Define the secret sharings [[y_i]]_2 as the pair of shares (a_i, b_i).

In our implementation we use a highly parallelized and optimized version of the bit-decomposition protocol π_decomp in order to improve the communication efficiency of the overall solution. The optimizations are described in the Appendix.
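The ripple-carry insight behind the bit-decomposition can be checked in the clear. The sketch below is not the secure protocol: it operates on both shares in one process to show that the bits of x = a + b mod 2^λ equal a_i XOR b_i XOR c_i, where c_i is the carry. In the secure version each AND would be a distributed multiplication over Z_2; the carry recurrence and the function names are our own illustrative choices.

```python
def bits(v, lam):
    """Bits of v, least significant first (index 0 corresponds to x_1 in the text)."""
    return [(v >> i) & 1 for i in range(lam)]

def decompose_from_shares(a, b, lam):
    """Recover the bits of x = a + b mod 2^lam with a ripple-carry adder over Z_2."""
    x_bits, carry = [], 0
    for a_i, b_i in zip(bits(a, lam), bits(b, lam)):
        y_i = a_i ^ b_i                  # (a_i, b_i) is already an XOR-sharing of y_i
        x_bits.append(y_i ^ carry)       # x_i = y_i XOR c_i
        # These ANDs are the operations that require secure multiplications.
        carry = (a_i & b_i) ^ (carry & y_i)
    return x_bits
```

The carry chain is inherently sequential in this naive form, which is why a parallelized variant matters for the round complexity of the overall solution.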
The opposite of a secure bit-decomposition is converting from a bit sharing to an additive sharing over a larger ring. In our secure activation function protocol, we require securely converting a bit sharing to an additive sharing in Z_{2^λ}. This is done using the protocol π_2to2^λ from [31] (described in Protocol 4) that UC-realizes the secret sharing conversion functionality F_2to2^λ.
Functionality F_2to2^λ
F_2to2^λ is parametrized by the bit-length λ of the ring in which the output is shared.
Input: Upon receiving a message from Alice/Bob with her/his share of [[x]]_2, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct x, then create and distribute to Alice and Bob the secret sharing [[x]]_{2^λ}. Before the delivery of the output shares, a corrupt party fixes its share of the output to any constant value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
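One common way to realize such a conversion uses the identity x_A XOR x_B = x_A + x_B − 2·x_A·x_B, which needs a single secure multiplication in Z_{2^λ}. The sketch below illustrates that identity; it is an assumption that this matches Protocol 4, which is not reproduced here. The parties and the trusted initializer are simulated in one process and all function names are hypothetical.

```python
import secrets

LAM = 64
Q = 1 << LAM

def share(x):
    """Additive sharing of x modulo 2^LAM."""
    r = secrets.randbelow(Q)
    return r, (x - r) % Q

def mul(x_sh, y_sh):
    """Beaver-triple multiplication with a simulated trusted initializer."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    u_sh, v_sh, w_sh = share(u), share(v), share(u * v % Q)
    d = (x_sh[0] - u_sh[0] + x_sh[1] - u_sh[1]) % Q   # opened value x - u
    e = (y_sh[0] - v_sh[0] + y_sh[1] - v_sh[1]) % Q   # opened value y - v
    return ((w_sh[0] + d * v_sh[0] + e * u_sh[0] + d * e) % Q,
            (w_sh[1] + d * v_sh[1] + e * u_sh[1]) % Q)

def bit_to_ring(bit_a, bit_b):
    """Convert an XOR-sharing (bit_a, bit_b) into an additive sharing mod 2^LAM."""
    xa_sh = (bit_a, 0)          # Alice's bit, trivially shared: Bob's share is 0
    xb_sh = (0, bit_b)          # Bob's bit, trivially shared: Alice's share is 0
    prod = mul(xa_sh, xb_sh)
    # x_A XOR x_B = x_A + x_B - 2 * x_A * x_B, evaluated share-wise.
    return ((xa_sh[0] + xb_sh[0] - 2 * prod[0]) % Q,
            (xa_sh[1] + xb_sh[1] - 2 * prod[1]) % Q)
```

The additions and the multiplication by the public constant 2 are local, so the only communication is the opening of the two masked values inside `mul`.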

H. Secure Activation Function
We propose a new protocol that evaluates ρ from Figure 2(b) directly over additive shares and does not require full secure comparisons, which would have been more expensive. Instead of doing straightforward comparisons between z, 0.5 and −0.5, we derive the result through checking two things: (i) whether z' = z + 1/2 is positive or negative; (ii) whether z' ≥ 1. Both checks can be performed without using a full comparison protocol.
When z' is bit decomposed, the most significant bit is 0 if z' is non-negative and 1 if z' is negative. In fact, if out of the λ bits, the a lowest bits are used to represent the fractional component and the b next bits are used to represent the integer component, then the remaining λ − a − b bits all have the same value as the most significant bit. We will use this fact in order to optimize the protocol by only performing a partial bit-decomposition and deducing whether z' is positive or negative from the (a + b + 1)-th bit.
In the case that z' is negative, the output of ρ is 0. But, if z' is positive, we need to determine whether z' ≥ 1 in order to know if the output of ρ should be fixed to 1 or to z'. A positive z' is such that z' ≥ 1 if and only if at least one of the b bits corresponding to the integer component of the representation of z' is equal to 1; therefore we only need to analyze those b bits to determine if z' ≥ 1.
Our secure protocol π ρ is described in Protocol 5. The AND operation corresponds to multiplications in Z 2 . By the application of De Morgan's law, the OR operation is performed using the AND and negation operations. The successive multiplications can be optimized to only take a logarithmic number of rounds by using well-known techniques.
The activation function protocol π_ρ UC-realizes the activation function functionality F_ρ. The correctness can be checked by inspecting the three possible cases: (i) if z ≥ 1/2, then pos = 1 and geq1 = 1 (since at least one of the bits representing the integer component of z + 1/2 will have a value 1), and the output is thus [[2^a]]_{2^λ} (the fixed-point representation of 1); (ii) if −1/2 ≤ z < 1/2, then pos = 1 and geq1 = 0, and therefore the output will be [[z']]_{2^λ}, which is the fixed-point representation of z + 1/2; (iii) if z < −1/2, then pos = 0 and the output will be a secret sharing representing zero as expected. The security follows trivially from the UC-security of the building blocks used and the fact that no secret sharing is opened.
Functionality F_ρ
Output: Upon receipt of the inputs from both parties, reconstruct z, compute the result of the activation function ρ(z), and then create and distribute to Alice and Bob the secret sharing [[ρ(z)]]_{2^λ} (using the fixed-point representation). Before the delivery of the output shares, a corrupt party fixes its share of the output to any constant value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
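The bit-level logic of the activation function can be checked in the clear on a fixed-point ring element. This is a plaintext sketch, not Protocol 5 itself: the way the flags pos and geq1 are combined into the output, the parameter values a = 12, b = 8, λ = 64, and the name `rho_fixed` are assumptions made for illustration. It also assumes |z'| < 2^b, so that all bits above position a + b copy the sign bit.

```python
def rho_fixed(z_fixed, a=12, b=8, lam=64):
    """Evaluate rho on a fixed-point ring element using only bit tests."""
    q = 1 << lam
    z1 = (z_fixed + (1 << (a - 1))) % q           # z' = z + 1/2 in fixed point
    pos = 1 - ((z1 >> (a + b)) & 1)               # (a+b+1)-th bit copies the sign bit
    none_set = 1
    for i in range(b):                            # AND of the negated integer bits
        none_set &= 1 - ((z1 >> (a + i)) & 1)
    geq1 = 1 - none_set                           # De Morgan: OR = NOT(AND(NOT ...))
    one = 1 << a                                  # fixed-point representation of 1
    # pos and geq1 -> 1; pos and not geq1 -> z'; not pos -> 0
    return (pos * geq1 * one + pos * (1 - geq1) * z1) % q
```

In the secure setting the bit tests operate on Z_2 sharings, and the final selection on Z_{2^λ} sharings after converting the flags, so no comparison protocol is needed.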

I. Secure Logistic Regression Training
We now present our secure LR training protocol that uses a combination of the previously mentioned building blocks.
Notice that in the full gradient descent technique described in Algorithm 1, the only operations that cannot be performed fully locally by the data processors, i.e. on their own local shares, are: • The computation of the inner product in line 7 Protocol 5: Secure Protocol π ρ for Computing the Activation Function ρ. Constraints: all values in Z 2 λ are representations of fixed point approximations of real numbers s.t. the lowest a bits represent the fractional component, the next b bits represent the integer component and λ > a + b. Further, a negative value x is represented as 2 λ − |x|.
Input :

Our secure LR training protocol $\pi_{LR-Training}$ (described in Protocol 6) shows how the secure building blocks described before can be used to securely compute these operations. The inner product is securely computed using $\pi_{IP}$ on line 5, and since this involves multiplication of numbers that are scaled to a fixed-point representation, we truncate the result using $\pi_{trunc}$. The activation function is securely computed using $\pi_\rho$ on line 6. The multiplication of $t_d - o_d$ with $x_{d,i}$ is done using secure multiplication with batching on line 11. Since this also involves multiplication of numbers that are scaled, the result is truncated using $\pi_{trunc}$ in line 14. A slight difference between the full gradient descent technique described in Algorithm 1 and our protocol $\pi_{LR-Training}$ is that, instead of updating $\Delta w_i$ after every evaluation of the activation function, we batch together all activation function evaluations before computing the $\Delta w_i$. Since the activation function requires a bit-decomposition of the input, we can now make use of the efficient batch bit-decomposition protocol batch-$\pi_{decompOPT}$ (see Appendix) within the activation function protocol $\pi_\rho$.
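To see why a truncation follows each fixed-point multiplication, note that the product of two values with $a$ fractional bits carries $2a$ fractional bits. The sketch below illustrates this in the clear; it is not $\pi_{IP}$ or $\pi_{trunc}$, which operate on secret shares. The constant `A` and the helper names `encode`, `decode`, `trunc` and `inner_product` are assumptions introduced for illustration.

```rust
// Assumed for this sketch: A fractional bits, values stored in Z_{2^64}.
const A: u32 = 12;

/// Encode a real number as fixed point; a negative x becomes 2^64 - |x|.
fn encode(x: f64) -> u64 {
    ((x * (1u64 << A) as f64).round() as i64) as u64
}

/// Decode a fixed-point ring element back to a real number.
fn decode(v: u64) -> f64 {
    (v as i64) as f64 / (1u64 << A) as f64
}

/// Cleartext stand-in for truncation: drop A fractional bits, keeping the sign.
fn trunc(v: u64) -> u64 {
    ((v as i64) >> A) as u64
}

/// Inner product of two fixed-point vectors, followed by a single truncation.
fn inner_product(x: &[u64], w: &[u64]) -> u64 {
    let acc = x
        .iter()
        .zip(w)
        .fold(0u64, |acc, (xi, wi)| acc.wrapping_add(xi.wrapping_mul(*wi)));
    trunc(acc)
}
```

Truncating once after the accumulation, rather than after every product, keeps the number of truncations per inner product at one.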
The LR training protocol $\pi_{LR-Training}$ UC-realizes the logistic regression training functionality $\mathcal{F}_{LR-Training}$. The correctness is trivial and the security follows straightforwardly from the UC-security of the building blocks used in $\pi_{LR-Training}$.

Functionality $\mathcal{F}_{LR-Training}$, Output: Upon receipt of the inputs from both parties, locally perform the same computational steps as $\pi_\rho$ using the secret sharings. Let $[[w]]$ be the resulting vector. Before the delivery of the output shares, a corrupt party can fix the shares that it will get, in which case the other shares are adjusted accordingly to still sum to $w$. The output shares are delivered to the parties.

… learning rate $\eta$; number of iterations $n_{iter}$. All secret sharings in the description of this protocol are in $\mathbb{Z}_{2^\lambda}$ and thus we simplify the notation to $[[\cdot]]$.
Output: $[[w]]$ for a vector of weights $w_i$ that minimize the sum of squared errors over the training data.

The following steps describe end-to-end how to securely train a LR classifier:
1) The TI sends the correlated randomness needed for efficient secure multiplication to the data processors.
Note that while our current implementation has the TI continuously sending the correlated randomness, it is possible for the TI to send all correlated randomness as the first step, after which it can leave and not be involved during the rest of the protocol.
…
Each data processor sends their shares to all of the data owners, who can then combine the shares to learn the weights of the LR model.
J. Cryptographic Engineering Optimizations
1) Sockets and Threading: A single iteration of the LR protocol is highly parallelizable in three distinct segments: (1) computing the dot products between the current weights and the data set, (2) computing the activation of each dot product result, and (3) computing the gradient and updating the weights. In each of these phases, a large number of computations are required, but none have dependencies on others. We take advantage of this by completing each of these phases with thread pools that can be configured for the machine running the protocol. We implemented the proposed protocols in Rust; with Rust's ownership concept, it is possible to yield results from threads without message passing or reallocation. Hence, the code is constructed to transfer ownership of results at each phase back to the main thread to avoid as much inter-process communication as possible. Additionally, all threads complete socket communications by computing all intermediate results directly in the socket buffer, which is implemented as a union of a byte array and an unsigned 64-bit integer array. This buffer is allocated on the stack by each thread, which circumvents the need for a shared memory block while also avoiding slower heap memory. In our trials, this configuration reduced running times significantly.
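The ownership hand-off described above can be illustrated with the standard library alone. The following is a minimal sketch under our own assumptions, not the code evaluated in this paper: the function name `dot_products` and the parameter `workers` are hypothetical, the computation is done in the clear, and scoped threads stand in for a configurable thread pool. Each worker returns an owned `Vec<u64>` through its join handle, so no channel or shared buffer is involved.

```rust
use std::thread;

/// Compute one inner product (modulo 2^64) per row, split across `workers` threads.
fn dot_products(rows: Vec<Vec<u64>>, weights: Vec<u64>, workers: usize) -> Vec<u64> {
    // Rows per thread, rounded up; at least 1 so that `chunks` is well defined.
    let chunk = ((rows.len() + workers.max(1) - 1) / workers.max(1)).max(1);
    let weights = &weights;
    thread::scope(|s| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = rows
            .chunks(chunk)
            .map(|block| {
                s.spawn(move || {
                    block
                        .iter()
                        .map(|row| {
                            row.iter().zip(weights.iter()).fold(0u64, |acc, (x, w)| {
                                acc.wrapping_add(x.wrapping_mul(*w))
                            })
                        })
                        .collect::<Vec<u64>>()
                })
            })
            .collect();
        // `join` moves each worker's result vector back to the calling thread.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

Because the phases have no internal dependencies, the same pattern applies to the activation and gradient phases by swapping the per-row closure.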
Further, all modular arithmetic operations are handled implicitly with the Wrapping struct from Rust's standard library (std::num::Wrapping), whose arithmetic operators wrap around on integer overflow. As long as the size of the ring over which the MPC protocols are performed is selected to align with a provided primitive bit width (i.e. 8, 16, 32, 64, 128), the modular reduction is implicit and no remainder needs to be computed explicitly.
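As a small illustration of this point, the sketch below performs additive secret sharing over $\mathbb{Z}_{2^{64}}$ with `std::num::Wrapping`. The function names `share` and `reconstruct` are hypothetical, and the mask is passed in by the caller rather than sampled, so that the example stays free of external crates.

```rust
use std::num::Wrapping;

/// Split `secret` into two additive shares modulo 2^64, using `mask`
/// as the (externally supplied) random share.
fn share(secret: u64, mask: u64) -> (Wrapping<u64>, Wrapping<u64>) {
    let s0 = Wrapping(mask);
    let s1 = Wrapping(secret) - s0; // wraps around instead of panicking on underflow
    (s0, s1)
}

/// Recombine two additive shares; the sum is reduced modulo 2^64 implicitly.
fn reconstruct(s0: Wrapping<u64>, s1: Wrapping<u64>) -> u64 {
    (s0 + s1).0
}
```

No `%` operator appears anywhere: the reduction modulo $2^{64}$ is a consequence of the 64-bit width of the underlying type.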

III. RESULTS
We implemented the protocols from the Methods section in Rust and experimentally evaluated them on the BC-TCGA and GSE2034 data sets of the iDASH 2019 competition. Both data sets contain gene expression data from breast cancer patients, with samples labeled as either normal tissue/non-recurrence (negative) or breast cancer tissue/recurrence tumor (positive) [37]. We trained LR models on both data sets with a learning rate η = 0.001. We use a fixed number of iterations for each data set: 10 iterations for the BC-TCGA data set and 223 iterations for the GSE2034 data set. The accuracy of the resulting models, evaluated with 5-fold cross-validation, is presented in Table I, along with the average runtime for training those models. It is important to note that these are the same accuracies that are obtained when training in the clear, i.e. there is no accuracy loss in the secure version.
We used integer precision b = 15, fractional precision a = 12 and ring size λ = 64 (these choices were made based on …).

A previous version of this implementation was submitted to the iDASH 2019 Track 4 competition. Of the 67 teams who entered Track 4, 9 completed the challenge. Our solution was one of the 3 solutions that tied for first place. Our implementation trained on all of the features for both data sets (no feature engineering was done) and generated a model that gave the highest accuracy, with runtimes that were well within the competition's limit of 24 hours. The implementation presented in the current work is further optimized relative to the iDASH version and achieves far better runtimes.
We note that while SecureML differs from our work in its setup and cryptographic primitives, it shares many similarities with ours and reports a fast runtime, so we find it valuable as a standard to compare against. While SecureML does not originally use a TI to predistribute the multiplication triples, it would be easy to adapt their result to use a TI for that purpose. Therefore, in order to have a fair comparison, we compare our protocol runtime against only their online runtime (thus excluding their offline runtime). We evaluated our implementation's runtime against SecureML's implementation by running their implementation on the same AWS machines using the same data sets (see Table II for runtime comparisons). For both data sets, our online phase runs faster than SecureML's online phase, which trains BC-TCGA in 12.73 seconds and GSE2034 in 49.95 seconds.
We then compare online microbenchmark computation times. For the computation of the activation function, our run of the SecureML code reported around 0.057 ms to 0.059 ms for 1 activation, while our implementation completes 1024 evaluations in around 30 ms (0.029 ms per activation function). This makes our secure activation function implementation nearly twice as fast as SecureML's. Additionally, it eliminates the overhead of switching between Yao gates and additive secret sharing. Furthermore, our activation function runs more efficiently (per evaluation) the more evaluations of it need to be computed, due to the design of the batch bit-decomposition protocol. This is illustrated in Table III, where the calculated runtime per evaluation (runtime divided by number of evaluations) decreases as the number of evaluations increases.

IV. DISCUSSION
Our runtime experiments on securely training a LR model show that it is feasible to train on data that includes a large number of attributes, as is common with genomic data. Given the high dimensionality of the genomic data, an interesting direction for future work would be the design of MPC protocols for privacy-preserving feature reduction. If any kind of feature reduction is used, it would result in a decrease in secure training runtime with a possibility for a slight decrease in the accuracy. We demonstrate this by choosing (in the clear) 54 features of the BC-TCGA data set that were part of the 76-gene signature described in [36]. Training on these 54 features, we get a 5-fold cross-validation accuracy of 98.93% (training on all features produced 99.58%), and the average secure training time (of three runs) is 0.51 seconds, which is about a 2 second decrease from training on all 17,814 features. The genes in the GSE2034 data set are not labeled in a way where we can map them to the 76-gene signature to test the accuracy for a reduced number of features, but we test the runtime of training on 76 attributes and we get an average of 6.71 seconds, which is about a 20 second decrease from training on all 12,634 features. This shows that if feature reduction can be performed, runtimes can be improved while still being able to produce an accurate trained model.
Our main contribution is the proposal of the fastest implementation and protocol for privacy-preserving training of LR models. Our main novelty is the new protocol for privately evaluating the activation function ρ, which can be computed using only additive shares and MPC protocols, without using a protocol for secure comparison. We use ρ as an approximation of the sigmoid function σ since that is what is traditionally used in LR training, but σ is also used as an activation function in neural networks. Therefore, our fast secure protocol for computing ρ can also result in faster neural network training. While training neural networks is out of the scope of this paper, we note that our results can be applicable to those types of ML models as well.

V. CONCLUSIONS
In this paper, we have described a novel protocol for implementing secure training of LR over distributed parties using MPC. Our protocol and implementation present several novel points and optimizations compared to existing work, including: (i) a novel protocol for computing the activation function that avoids the use of full-fledged secure comparison protocols; (ii) a series of cryptographic engineering optimizations to improve the performance.
With our implementation, we can train on the BC-TCGA data set with 17,814 features and 375 samples with 10 iterations in 2.52 seconds, and we can train on the GSE2034 data set with 12,634 features and 179 samples with 223 iterations in 26.90 seconds. A less optimized version of this implementation tied for first place at the iDASH 2019 Track 4 competition when considering accuracy and efficiency. Our solution is particularly efficient for LANs, where we can perform 1024 secure computations of the activation function in about 30 ms. To the best of our knowledge, ours is the fastest protocol for privately training logistic regression models over local area networks.

Optimization of $\pi_{decomp}$
Overview and Previous Work: The functionality $\mathcal{F}_{decomp}$ (described in Section Methods) is easily realized as an adder circuit that takes as inputs each bit of the additive shares of a secret sharing $[[x]]_{2^\lambda}$ in a large ring $\mathbb{Z}_{2^\lambda}$ and outputs an "XOR-sharing" of the secret. Naively, the carry vector of this adder can be obtained with linear communication complexity by means of ripple carry addition, as is described in Protocol 3. However, it is possible to achieve logarithmic communication complexity and even constant complexity [33] (though with worse performance than the logarithmic version for all reasonable bit lengths).
The highest performing realization of $\mathcal{F}_{decomp}$ for realistic bit lengths is based on a speculative adder circuit [12] in which, at each layer, the next set of carry bits is computed twice: once for the case that the previous carry bit is 0 and once for the case that it is 1. This protocol has $\log(\lambda) + 2$ rounds of communication and requires a total data transfer of $4\lambda \log(\lambda) + 6\lambda$ bits.
We propose a new, highly optimised protocol based on a matrix composition network that reduces the number of communication rounds by 1 (or 2, in special cases) and requires a small fraction of the aforementioned data transfer cost.
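Matrix composition network: To sum the binary numbers $a$ and $b$, the $i$-th bit of the sum is given by
$$s_i = a_i \oplus b_i \oplus c_{i-1},$$
where $c_{i-1}$ is the carry out of the previous position. In an alternate view, the carry can be seen to depend on two signals which in turn depend on $a$ and $b$. Generate ($g_i = a_i b_i$) creates a new carry bit at the $i$-th position, and Propagate ($p_i = a_i \oplus b_i$) perpetuates the previous carry bit, if it exists. In this representation, $s_i = p_i \oplus c_{i-1}$ and $c_i = g_i + p_i c_{i-1}$. This sum-of-products form of the expression for $c_i$ lends itself to a matrix representation
$$\begin{pmatrix} c_i \\ 1 \end{pmatrix} = M_i \begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}, \qquad M_i = \begin{pmatrix} p_i & g_i \\ 0 & 1 \end{pmatrix}.$$
When matrices in the form of $M_i$ are composed, the lower entries remain unchanged. This implies that
$$\begin{pmatrix} c_i \\ 1 \end{pmatrix} = M_i M_{i-1} \cdots M_1 \begin{pmatrix} c_0 \\ 1 \end{pmatrix}.$$
Therefore, to compute all $c_i$, it is sufficient to compute the set of all matrix compositions
$$\{\, M_{1,i} = M_i M_{i-1} \cdots M_1 \;:\; 1 \leq i \leq \lambda - 1 \,\}.$$
Note that it is not necessary to compute the $\lambda$-th carry bit because $s_\lambda$ depends on $c_{\lambda-1}$. Treating the carry-in to the 1st bit as the vector $(0, 1)$, all $c_i$ can be derived implicitly from the upper right-hand entry of $M_{1,i}$ (here, $M_{1,i}$ denotes the matrix composed of all matrices $M_1$ through $M_i$, consecutively).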
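From the MPC perspective, this matrix composition requires two $\mathbb{Z}_2$ multiplications, $p_{i+1} p_i$ and $p_{i+1} g_i$, as seen in the equation below. The OR operation (+), which usually requires multiplication in MPC, is reduced to XOR based on the observation that $p_{i+1}$ and $g_{i+1}$ cannot both be true for a given $i$.
$$M_{i+1} M_i = \begin{pmatrix} p_{i+1} & g_{i+1} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} p_i & g_i \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} p_{i+1} p_i & g_{i+1} \oplus p_{i+1} g_i \\ 0 & 1 \end{pmatrix}$$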
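The composition rule can be checked in the clear against ordinary addition. The sketch below is an illustration under our own assumptions, not the secure protocol: it composes the matrices sequentially rather than in a logarithmic-depth network, it indexes bits from 0 rather than 1, and the names `Pg`, `compose` and `add_via_compositions` are hypothetical. Only the top row $(p, g)$ of each matrix is stored, since the bottom row is always $(0, 1)$.

```rust
/// Top row (p, g) of a carry matrix [[p, g], [0, 1]] over Z_2.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pg {
    p: u8,
    g: u8,
}

/// Compose `hi` after `lo`: two AND gates, with XOR taking the place of OR.
fn compose(hi: Pg, lo: Pg) -> Pg {
    Pg { p: hi.p & lo.p, g: hi.g ^ (hi.p & lo.g) }
}

/// Sum of a and b modulo 2^64, with every carry read off a prefix composition.
fn add_via_compositions(a: u64, b: u64) -> u64 {
    let mut sum = 0u64;
    let mut prefix: Option<Pg> = None; // composition of all lower-order matrices
    for i in 0..64 {
        let (ai, bi) = (((a >> i) & 1) as u8, ((b >> i) & 1) as u8);
        let m = Pg { p: ai ^ bi, g: ai & bi };
        // With carry-in vector (0, 1), the carry is the upper-right entry.
        let carry_in = prefix.map_or(0, |q| q.g);
        sum |= ((m.p ^ carry_in) as u64) << i;
        prefix = Some(match prefix {
            Some(prev) => compose(m, prev),
            None => m,
        });
    }
    sum
}
```

The result agrees with `a.wrapping_add(b)`, which confirms that replacing OR by XOR in the upper-right entry does not change the carries.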
The entire set of matrix compositions can be realized in a logarithmic depth network by, at the $i$-th layer, computing all compositions $M_{1,j}$ that require fewer than $2^{i-1}$ compositions. To set up conditions that allow us to minimize the total data transfer, the constraint is added that each $M_{1,j}$ should be the composition of the "largest" matrix from the previous layer, $M_{1,2^{i-2}}$, with the remainder $M_{2^{i-2}+1,j}$. If $M_{2^{i-2}+1,j}$ doesn't exist in the network, it is added recursively following the same set of constraints. Figure 4 shows an example with $\lambda = 17$. This network is hereafter referred to as ComposeNet$_p$, where $p$ is the highest order bit to decompose. The protocol description that follows considers only the case where $p = \lambda$, though the protocol functions the same for any $p \leq \lambda$. For instance, in Protocol 3, when using $\pi_{decomp}$ to find the MSB of a secret, it is sufficient to set $p = a + b + 1$.

… This corresponds to one communication round and $2\lambda$ bits of data transfer.
A call to ComposeNet$_\lambda$ has communication complexity corresponding to the depth of the network, $\lceil \log(\lambda - 1) \rceil$, and $\frac{\lambda}{2}$ multiplications over $\mathbb{Z}_2$ per layer, with fewer on the final layer when $\lambda - 1$ is not a power of 2. However, because the matrices at each node of ComposeNet$_\lambda$ are reused extensively and known not to change value, the Beaver triples used to mask the matrices can be designed to contain redundancies that minimise the data transfer at each layer [28]. By re-using correlated randomness where information leakage is not possible, only $\frac{\lambda}{2} - (2^{i-1} - 1)$ masks need to be transferred at depth $i$, for $i > 0$. At depth 0, there are $\lambda$ masks, one for each matrix. Each matrix mask is 2 bits (one for each of the Propagate and Generate bits), so the total data transfer is
$$2\lambda + 2 \sum_{i=1}^{\lceil \log(\lambda-1) \rceil - 1} \left( \frac{\lambda}{2} + 1 - 2^{i-1} \right)$$
bits. The recombination phase after ComposeNet$_\lambda$ is computed has only local computations and thus contributes nothing to the complexity.
Combining all phases, we see that $\pi_{decompOPT}$ has a communication cost of $\lceil \log(\lambda - 1) \rceil + 1$ rounds and a total data transfer cost of
$$4\lambda + 2 \sum_{i=1}^{\lceil \log(\lambda-1) \rceil - 1} \left( \frac{\lambda}{2} + 1 - 2^{i-1} \right)$$
bits. Comparing with the speculative adder's performance, the number of communication rounds is decreased by 1 in all cases and by 2 in the case that $\lambda - 1$ is a power of 2. The total data transfer is roughly $\frac{1}{3}$ of that of the previous work at $\lambda = 8, 16$. For higher bit lengths, the ratio quickly converges to near $\frac{1}{4}$.

Implementation and Batching: ComposeNet$_\lambda$ can be implemented efficiently as a set of index pairs that correspond to the positions of the Propagate and Generate bits that need to be combined at each layer. Once per layer, all products $p_{i+1} p_i$, $p_{i+1} g_i$ can be computed in a single call to $\pi_{DM}$ by taking the bitwise product between the concatenations $p_{i+1} \| p_{i+1}$ and $p_i \| g_i$ and splitting the result.
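The data transfer costs of the two approaches can be tabulated directly from their closed-form expressions. The sketch below assumes base-2 logarithms rounded up and an even $\lambda$ of at least 4; the function names are hypothetical and introduced only for this illustration.

```rust
/// ceil(log2(n)) for n >= 1.
fn ceil_log2(n: u32) -> u32 {
    if n <= 1 { 0 } else { 32 - (n - 1).leading_zeros() }
}

/// Speculative adder: 4 * lambda * log(lambda) + 6 * lambda bits.
fn speculative_bits(lambda: u32) -> u32 {
    4 * lambda * ceil_log2(lambda) + 6 * lambda
}

/// Optimised decomposition: 4 * lambda plus twice the sum, over depths
/// i = 1 .. ceil(log(lambda - 1)) - 1, of (lambda / 2 + 1 - 2^(i - 1)) bits.
fn decomp_opt_bits(lambda: u32) -> u32 {
    let depth = ceil_log2(lambda - 1);
    let sum: u32 = (1..depth).map(|i| lambda / 2 + 1 - (1 << (i - 1))).sum();
    4 * lambda + 2 * sum
}
```

Under these assumptions, $\lambda = 8$ gives 46 versus 144 bits and $\lambda = 16$ gives 104 versus 352 bits (ratios of about 0.32 and 0.30), while $\lambda = 64$ gives 524 versus 1920 bits (about 0.27).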
Extending to the case that many values need to be bit decomposed at the same time (as in Protocol 6), a vector of inputs can be decomposed "in parallel" by taking vertical slices over the Generate and Propagate bits of each element and re-packing them into a transposed form. In this way, each layer of ComposeNet$_\lambda$ can operate on a vector of matrices (represented as two lists of bit slices) to produce a vector of matrix compositions. This method has no effect on the number of rounds of communication, and the total data transfer scales linearly with the length of the input vector.
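The transposed packing can be sketched as follows. This is an illustration in the clear under our own assumptions: the batch is limited to 64 values so that each slice fits in one `u64`, and the names `bit_slices` and `compose_sliced` are hypothetical. In the secure protocol the two AND operations would be secure multiplications on shares rather than local operations.

```rust
/// Transpose up to 64 values into bit slices: bit k of slices[i] is bit i of values[k].
fn bit_slices(values: &[u64]) -> [u64; 64] {
    assert!(values.len() <= 64, "this sketch packs at most 64 values per slice");
    let mut slices = [0u64; 64];
    for (k, v) in values.iter().enumerate() {
        for (i, slice) in slices.iter_mut().enumerate() {
            *slice |= ((*v >> i) & 1) << k;
        }
    }
    slices
}

/// Compose (p_hi, g_hi) after (p_lo, g_lo) for every packed value at once.
fn compose_sliced(p_hi: u64, g_hi: u64, p_lo: u64, g_lo: u64) -> (u64, u64) {
    (p_hi & p_lo, g_hi ^ (p_hi & g_lo))
}
```

One call to `compose_sliced` performs the same matrix composition for all values in the batch, which is why batching leaves the number of rounds unchanged.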