Logistic regression
Logistic regression (LR) is a common machine learning algorithm for binary classification. The training data D consists of training examples \(d = (\varvec{x}_d,t_d)\) in which \(\varvec{x}_d=\langle x_{d,1},x_{d,2},\ldots ,x_{d,m}\rangle\) is an m-dimensional numerical vector, containing the values of m input attributes for example d, and \(t_d\in \{0,1\}\) is the ground truth class label. Each \(x_{d,i}\) for \(i \in \{ 1,2,\ldots ,m\}\) is a real number.
As illustrated in Fig. 2a, we train a neuron to map the \(\varvec{x}_d\)’s to the corresponding \(t_d\)’s, correctly classifying the examples. The neuron computes a weighted sum of the inputs (the values of the weights are learned during training) and subsequently applies an activation function f to it, to arrive at the output \(o_d = f(w_0\cdot x_{d,0}+w_1\cdot x_{d,1}+\cdots +w_m\cdot x_{d,m})\), which is interpreted as the probability that the class label is 1. Note that, as is common in neural network training, we extend the input attribute vector with a dummy feature \(x_{d,0}\) which has value 1 for all \(\varvec{x}_d\)’s. The traditionally used activation function for LR is the sigmoid function \(\sigma (z)=\frac{1}{1+e^{-z}}\). Since the sigmoid function \(\sigma\) requires division and evaluation of an exponential function, which are expensive operations to perform in MPC, we approximate it with the activation function \(\rho\) from [2], which is shown in Fig. 2b.
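To make the approximation concrete, the following Rust sketch evaluates both activation functions in the clear. The piecewise definition of \(\rho\) used here (0 below \(-1/2\), \(z+1/2\) in between, 1 from \(1/2\) upward) is read off from the case analysis given later for the secure protocol; the function names are illustrative assumptions and are not taken from the implementation described in this paper.

```rust
/// Standard logistic sigmoid, shown only for comparison.
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Piecewise-linear approximation rho (assumed definition, see lead-in).
fn rho(z: f64) -> f64 {
    if z < -0.5 {
        0.0
    } else if z < 0.5 {
        z + 0.5
    } else {
        1.0
    }
}

/// Neuron output o_d = rho(w . x), where x[0] is the dummy feature fixed to 1.
fn neuron_output(w: &[f64], x: &[f64]) -> f64 {
    let z: f64 = w.iter().zip(x).map(|(wi, xi)| wi * xi).sum();
    rho(z)
}
```

Both functions agree at \(z=0\) and saturate towards 0 and 1, but \(\rho\) needs only additions and comparisons, which is what makes it attractive for MPC.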
For training, we use the full gradient descent based algorithm shown in Algorithm 1 to learn the weights for the LR model. On line 3, we choose not to use early stopping (see Footnote 2) because in that case the number of iterations would depend on the values in the training data, hence leaking information [9]. Instead, we use a fixed number of iterations during training.
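As a plaintext point of reference, full gradient descent with a fixed iteration count can be sketched as follows. Algorithm 1 is not reproduced in this section, so the sketch assumes the usual accumulate-then-update form (\(\Delta w_i\) accumulates \((t_d-o_d)\cdot x_{d,i}\) over all examples, then \(w_i\) is moved by \(\eta \cdot \Delta w_i\)); the name `train` and the zero initialization of the weights are illustrative assumptions.

```rust
/// Piecewise-linear activation assumed for rho: clamp(z + 1/2, 0, 1).
fn rho(z: f64) -> f64 {
    (z + 0.5).clamp(0.0, 1.0)
}

/// Full-batch gradient descent with a fixed, public number of iterations.
/// Each row of `xs` starts with the dummy feature 1.0; `xs` must be non-empty.
fn train(xs: &[Vec<f64>], ts: &[f64], eta: f64, n_iter: usize) -> Vec<f64> {
    let m = xs[0].len();
    let mut w = vec![0.0; m]; // assumed initialization
    for _ in 0..n_iter {
        // No data-dependent early stopping: the loop always runs n_iter times.
        let mut delta = vec![0.0; m];
        for (x, &t) in xs.iter().zip(ts) {
            let z: f64 = w.iter().zip(x).map(|(wi, xi)| wi * xi).sum();
            let o = rho(z);
            for i in 0..m {
                delta[i] += (t - o) * x[i];
            }
        }
        for i in 0..m {
            w[i] += eta * delta[i];
        }
    }
    w
}
```

Because the loop bound is a public parameter, the running time reveals nothing about the training data.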
Our scenario
In the scenario considered in this work the data is not held by a single party that performs all the computation, but distributed by the data owners to the data processors in such a way that no data processor has any information about the data in the clear. Nevertheless, the data processors would still like to compute an LR model without leaking any other information about the data used for the training. To achieve this goal, we will use techniques from MPC.
Our setup is illustrated in Fig. 1. We have multiple data owners who each hold disjoint parts of the data that is going to be used for the training. This is the most general approach and covers the cases in which the data is horizontally partitioned (i.e. for each training sample \(d=(\varvec{x}_d,t_d)\), all the data for d is held by one of the data owners), vertically partitioned (for each feature, the values of that feature for all training samples are held by one of the data owners), and even arbitrary partitions. There are two data processors who collaborate to train an LR model using secure MPC protocols, and a trusted initializer (TI) that predistributes correlated randomness to the data processors in order to make the MPC computation more efficient. The TI is not involved in any other part of the execution, and does not learn any data from the data owners or data processors.
We next present the security model that is used and several secure building blocks, so that afterwards we can combine them in order to obtain a secure LR training protocol.
Security model
The security model in which we analyze our protocol is the universal composability (UC) framework [12] as it provides the strongest security and composability guarantees and is the gold standard for analyzing cryptographic protocols nowadays. Here we will only give a short overview of the UC framework (for the specific case of two-party protocols), and refer interested readers to the book of Cramer et al. [13] for a detailed explanation.
The main advantage of the UC framework is that the UC composition theorem guarantees that any protocol proven UC-secure can also be securely composed with other copies of itself and of other protocols (even with arbitrarily concurrent executions) while preserving its security. Such a guarantee is very useful since it allows the modular design of complex protocols, and is a necessity for protocols executing in complex environments such as the Internet.
The UC framework first considers a real world scenario in which the two protocol participants (the data processors from Fig. 1, henceforth denoted Alice and Bob) interact between themselves and with an adversary \(\mathcal {A}\) and an environment \(\mathcal {Z}\) (that captures all activity external to the single execution of the protocol that is under consideration). The environment \(\mathcal {Z}\) gives the inputs to and gets the outputs from Alice and Bob. The adversary \(\mathcal {A}\) delivers the messages exchanged between Alice and Bob (thus modeling an adversarial network scheduling) and can corrupt one of the participants, in which case he gains control over it. In order to define security, an ideal world is also considered. In this ideal world, an idealized version of the functionality that the protocol is supposed to perform is defined. The ideal functionality \(\mathcal {F}\) receives the inputs directly from Alice and Bob, performs the computations locally following the primitive specification and delivers the outputs directly to Alice and Bob. A protocol \(\pi\) executing in the real world is said to UC-realize functionality \(\mathcal {F}\) if for every adversary \(\mathcal {A}\) there exists a simulator \(\mathcal {S}\) such that no environment \(\mathcal {Z}\) can distinguish between: (1) an execution of the protocol \(\pi\) in the real world with participants Alice and Bob, and adversary \(\mathcal {A}\); and (2) an ideal execution with dummy parties (that only forward inputs/outputs), \(\mathcal {F}\) and \(\mathcal {S}\).
This work, like the vast majority of the privacy-preserving machine learning protocols in the literature, considers honest-but-curious, static adversaries. In more detail, the adversary chooses the party that he wants to corrupt before the protocol execution, and he follows the protocol instructions (but tries to learn additional information).
Setup assumptions and the trusted initializer model
Secure two-party computation is impossible to achieve without further assumptions. We consider the trusted initializer model, in which a trusted initializer functionality \(\mathcal {F}^{\mathcal {D}}_{\mathsf {TI}}\) pre-distributes correlated randomness to Alice and Bob. A trusted initializer has often been used to enable highly efficient solutions both in the context of privacy-preserving machine learning [14,15,16,17,18] and in other applications, e.g., [19,20,21,22,23,24].
If a trusted initializer is not desirable, the computing parties can “emulate” such a trusted party by using computational assumptions in an offline phase in association with a suitable setup assumption, as done e.g. in SecureML [2] (see Footnote 3). Even with such a different technique to realize the offline phase, the online phase of our protocols would remain the same. The novelties of our work are in the online phase, and can be used in combination with any standard technique for the offline phase, such as the TI assumption (as we do in our implementation), or the computational assumptions made in SecureML. Our solution for the online phase leads to substantially better runtimes than SecureML, as we document in the “Results” section.
Simplifications. In our proofs the simulation strategy is simple and will be described briefly: all the messages look uniformly random from the recipient’s point of view, except for the messages that open a secret shared value to a party, and these can be easily simulated using the output of the respective functionalities. A simulator \(\mathcal {S}\), having the leverage of being able to simulate the trusted initializer functionality \(\mathcal {F}^{\mathcal {D}}_{\mathsf {TI}}\) in the ideal world, can therefore easily perform a perfect simulation of a real protocol execution, making the real and ideal worlds indistinguishable for any environment \(\mathcal {Z}\). In the ideal functionalities the messages are public delayed outputs, meaning that the simulator is first asked whether they should be delivered or not (this models the fact that the adversary controls the network scheduling). This fact, as well as the session identifiers, is omitted from our functionalities’ descriptions for the sake of readability.
Secret sharing based secure multi-party computation
Our MPC solution is based on additive secret sharing over a ring \(\mathbb {Z}_{q} = \{0,1,\ldots ,q-1\}\). When secret sharing a value \(x \in \mathbb {Z}_{q}\), Alice and Bob receive shares \(x_A\) and \(x_B\), respectively, that are chosen uniformly at random in \(\mathbb {Z}_{q}\) with the constraint that \(x_A + x_B = x \bmod q\). We denote the pair of shares by \(\llbracket x\rrbracket _q\). All computations are modulo q and the modular notation is henceforth omitted for conciseness. Note that no information about the secret value x is revealed to a party holding only one share. The secret shared value can be revealed/opened to each party by combining both shares. Some operations on secret shared values can be computed locally with no communication. Let \(\llbracket x\rrbracket _q\), \(\llbracket y\rrbracket _q\) be secret shared values and c be a constant. Alice and Bob can perform the following operations locally:
- Addition (\(z=x+y\)): Each party locally adds its local shares of x and y in order to obtain a share of z. This will be denoted by \(\llbracket z\rrbracket _q \leftarrow \llbracket x\rrbracket _q+\llbracket y\rrbracket _q\).
- Subtraction (\(z=x-y\)): Each party locally subtracts its local share of y from that of x in order to obtain a share of z. This will be denoted by \(\llbracket z\rrbracket _q\leftarrow \llbracket x\rrbracket _q-\llbracket y\rrbracket _q\).
- Multiplication by a constant (\(z=cx\)): Each party multiplies its local share of x by c to obtain a share of z. This will be denoted by \(\llbracket z\rrbracket _q\leftarrow c\llbracket x\rrbracket _q\).
- Addition of a constant (\(z=x+c\)): Alice adds c to her share \(x_A\) of x to obtain \(z_A\), while Bob sets \(z_B=x_B\). This will be denoted by \(\llbracket z\rrbracket _q\leftarrow \llbracket x\rrbracket _q + c\).
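The local operations listed above can be sketched in Rust for the special case \(q=2^{64}\), where reduction modulo q coincides with wrapping 64-bit arithmetic. This is a minimal illustration under stated assumptions: the struct `Shared` holds both shares side by side purely for readability (in a deployment each party would hold only its own field), and the random mask `r` must be supplied by the caller from a uniform source.

```rust
/// A pair of additive shares over Z_q with q = 2^64: `a` is Alice's, `b` is Bob's.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Shared {
    a: u64,
    b: u64,
}

/// Split x using a mask r that the caller draws uniformly at random.
fn share(x: u64, r: u64) -> Shared {
    Shared { a: r, b: x.wrapping_sub(r) }
}

/// Open the secret by combining both shares.
fn open(s: Shared) -> u64 {
    s.a.wrapping_add(s.b)
}

/// [[z]] <- [[x]] + [[y]]: each party adds its own shares.
fn add(x: Shared, y: Shared) -> Shared {
    Shared { a: x.a.wrapping_add(y.a), b: x.b.wrapping_add(y.b) }
}

/// [[z]] <- [[x]] - [[y]]: each party subtracts its own shares.
fn sub(x: Shared, y: Shared) -> Shared {
    Shared { a: x.a.wrapping_sub(y.a), b: x.b.wrapping_sub(y.b) }
}

/// [[z]] <- c [[x]]: each party scales its own share.
fn mul_const(c: u64, x: Shared) -> Shared {
    Shared { a: c.wrapping_mul(x.a), b: c.wrapping_mul(x.b) }
}

/// [[z]] <- [[x]] + c: only Alice adds the public constant.
fn add_const(x: Shared, c: u64) -> Shared {
    Shared { a: x.a.wrapping_add(c), b: x.b }
}
```

None of these functions needs a value held by the other party, which is why they incur no communication.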
The secure multiplication of secret shared values (i.e., \(z=xy\)) cannot be done locally and involves communication between Alice and Bob. To obtain an efficient secure multiplication solution, we use the multiplication triples technique that was originally proposed by Beaver [35]. We use a trusted initializer to pre-distribute the multiplication triples (which are a form of correlated randomness) to Alice and Bob. We use the same protocol \(\pi _{\mathsf {DMM}}\) for secure (matrix) multiplication of secret shared values as in [17, 36] and denote by \(\pi _{\mathsf {DM}}\) the protocol for the special case of multiplication of scalars and \(\pi _{\mathsf {IP}}\) for the inner product. As shown in [17] the protocol \(\pi _{\mathsf {DMM}}\) (described in Protocol 2) UC-realizes the distributed matrix multiplication functionality \(\mathcal {F}_{\mathsf {DMM}}\) in the trusted initializer model.
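For scalars, the standard Beaver-triple multiplication can be sketched as below. This is the textbook technique and is not claimed to match Protocol 2 line by line; the struct and function names are illustrative assumptions, both shares are kept in one struct only for readability, and the opening of the masked values d and e stands in for the single round of communication.

```rust
/// Additive shares over Z_{2^64}; `a` is Alice's share, `b` is Bob's.
#[derive(Clone, Copy)]
struct Shared {
    a: u64,
    b: u64,
}

fn share(x: u64, r: u64) -> Shared {
    Shared { a: r, b: x.wrapping_sub(r) }
}

fn open(s: Shared) -> u64 {
    s.a.wrapping_add(s.b)
}

/// Multiply [[x]] and [[y]] using a triple ([[u]], [[v]], [[w]]) with w = u * v
/// that was pre-distributed by the trusted initializer.
fn beaver_mul(x: Shared, y: Shared, u: Shared, v: Shared, w: Shared) -> Shared {
    // Each party masks its shares locally; d = x - u and e = y - v are opened.
    let d = open(Shared { a: x.a.wrapping_sub(u.a), b: x.b.wrapping_sub(u.b) });
    let e = open(Shared { a: y.a.wrapping_sub(v.a), b: y.b.wrapping_sub(v.b) });
    // xy = (d + u)(e + v) = w + d*v + e*u + d*e.
    // The public term d*e is added by Alice only.
    let za = w
        .a
        .wrapping_add(d.wrapping_mul(v.a))
        .wrapping_add(e.wrapping_mul(u.a))
        .wrapping_add(d.wrapping_mul(e));
    let zb = w
        .b
        .wrapping_add(d.wrapping_mul(v.b))
        .wrapping_add(e.wrapping_mul(u.b));
    Shared { a: za, b: zb }
}
```

Since u and v are uniformly random and used only once, the opened values d and e reveal nothing about x and y.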
Converting to fixed-point representation
Each data owner initially needs to convert their training data to integers modulo q so that they can be secret shared. As illustrated in Fig. 3, each feature value \(x \in \mathbb {R}\) is converted into a fixed point approximation of x using a two’s complement representation for negative numbers. We define this new value as \(Q(x) \in \mathbb {Z}_q\). This conversion is shown in Eq. (1):
$$\begin{aligned} Q(x) = {\left\{ \begin{array}{ll} 2^\lambda - \left\lfloor 2^a \cdot |x| \right\rfloor & \text{ if } x < 0 \\ \left\lfloor 2^a \cdot x \right\rfloor & \text{ if } x \ge 0 \end{array}\right. } \end{aligned} \tag{1}$$
Specifically, when we convert Q(x) into its bit representation, we define the first a bits from the right to hold the fractional part of x, and the next b bits to represent the non-negative integer part of x, and the most significant bit (MSB) to represent the sign (positive or negative). We define \(\lambda\) to represent the total number of bits such that the ring size q is defined as \(q=2^\lambda\). It is important to choose a \(\lambda\) that is large enough to represent the largest number x that can be produced during the LR protocol, and therefore \(\lambda\) should be chosen to be at least \(2(a+b)\) (see Truncation). It is also important to choose a b that is large enough to represent the maximum possible value of the integer part of all x’s (this is dependent on the data). This conversion and bit representation is shown in Fig. 3.
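Eq. (1) can be sketched in Rust for \(\lambda =64\), where \(2^\lambda -v\) modulo \(2^\lambda\) is exactly the wrapping negation of v. The constant `A = 12` and the function names `encode` and `decode` are assumptions chosen for illustration; the sketch does not check that the integer part of the input fits in the b bits reserved for it.

```rust
const A: u32 = 12; // number of fractional bits (assumed value)

/// Q(x) from Eq. (1) with q = 2^64.
fn encode(x: f64) -> u64 {
    let mag = (x.abs() * (1u64 << A) as f64).floor() as u64;
    if x < 0.0 {
        mag.wrapping_neg() // 2^64 - mag, reduced modulo 2^64
    } else {
        mag
    }
}

/// Inverse mapping: read the ring element as a two's complement integer.
fn decode(v: u64) -> f64 {
    (v as i64) as f64 / (1u64 << A) as f64
}
```

For example, 1.5 is encoded as \(1.5\cdot 2^{12}=6144\), and \(-1.5\) as \(2^{64}-6144\); values that are not multiples of \(2^{-12}\) lose at most \(2^{-12}\) in the round trip.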
Truncation
When multiplying two numbers that were converted into a fixed point representation with a fractional bits, the resulting product has 2a bits representing the fractional part, i.e. a bits more than its factors. For example, the fixed point representations of x and y, for \(x, y > 0\), are \(x\cdot 2^a\) and \(y\cdot 2^a\), respectively. The multiplication of these terms results in \(xy\cdot 2^{2a}\), showing that now 2a bits are representing the fractional part, which we must scale back down to \(xy\cdot 2^a\) to do any further computations. In our solution, we use the two-party local truncation protocol for fixed point representations of real numbers proposed in [2] that we will refer to as \(\pi _{\mathsf {trunc}}\). It does not involve any messages between the two parties: each party simply performs an operation on its own local share. This protocol almost always incurs an error of at most a bit flip in the least-significant bit. However, with probability \(2^{a +1-\lambda }\), where a is the number of fractional bits, the resulting value is completely random.
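A local truncation of this kind can be sketched as follows for \(q=2^{64}\). The sketch follows the local truncation technique of [2] as we understand it (one party shifts its share, the other negates, shifts and negates again); the exact formulation of \(\pi _{\mathsf {trunc}}\) may differ, and the function names and the constant `A = 12` are illustrative assumptions.

```rust
const A: u32 = 12; // number of fractional bits (assumed value)

/// Alice's local step: drop the A lowest bits of her share.
fn trunc_alice(share: u64) -> u64 {
    share >> A
}

/// Bob's local step: negate, shift, negate again (all modulo 2^64).
fn trunc_bob(share: u64) -> u64 {
    (share.wrapping_neg() >> A).wrapping_neg()
}
```

As a usage example, take the product of the encodings of 3 and 2, i.e. \(6\cdot 2^{24}\), shared with the mask `0x1234_5678_9ABC_DEF0`: adding `trunc_alice` of the first share to `trunc_bob` of the second reconstructs \(6\cdot 2^{12}\). For other masks the result may be off by one in the least-significant bit, and for a small fraction of masks it is wrong altogether, as stated above.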
When this truncation protocol is performed on increasingly large data sets (in our case we run over 7 billion secure multiplications), the probability of an erroneous truncation becomes a real issue, one that was not significant in previous implementations. There are two phases in which truncation is performed: (1) when computing the dot product (inner product) of the current weights vector with a training example in line 7 of Algorithm 1, and (2) when the weight differentials (\(\Delta w_i\)) are adjusted in line 9 of Algorithm 1. If a truncation error occurs during (1), the resulting erroneous value will be pushed into a reasonable range by the activation function and incur only a minor error for that round. If the error occurs during (2), an element of the weights vector will be updated to a completely random ring element and recovery from this error will be impossible. To mitigate this in experiments, we make use of 10–12 bits of fractional precision with a ring size of 64 bits, making the probability of failure of a single truncation \(\frac{1}{2^{53}} \le p \le \frac{1}{2^{51}}\). The number of truncations that need to be performed is also reduced in our implementation by deferring truncation until it is absolutely required. For instance, instead of truncating each result of multiplication between an attribute and its corresponding weight, a single truncation can be performed at the end of the entire dot product.
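Both points of this paragraph can be checked with a small plaintext sketch: the failure exponent \(a+1-\lambda\) for the stated parameters, and a fixed-point dot product that accumulates at scale \(2^{2a}\) and rescales once. The sketch works on signed integers in the clear rather than on shares, and the function names and `A = 12` are illustrative assumptions.

```rust
const A: u32 = 12; // number of fractional bits (assumed value)

/// Exponent e such that a single truncation fails with probability 2^e.
fn failure_exponent(a: u32, lambda: u32) -> i32 {
    a as i32 + 1 - lambda as i32
}

/// Fixed-point dot product: products are summed at scale 2^(2A) and the
/// accumulated value is rescaled once, instead of once per product.
fn dot_fixed(w: &[i64], x: &[i64]) -> i64 {
    let acc = w
        .iter()
        .zip(x)
        .map(|(wi, xi)| wi.wrapping_mul(*xi))
        .fold(0i64, |s, p| s.wrapping_add(p));
    acc >> A // single truncation at the end
}
```

With m attributes, deferring the truncation replaces m truncations per dot product by one, which lowers the number of opportunities for a truncation failure accordingly.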
Additional error is incurred on the accuracy by the fixed point representation itself. Through cross-validation with an in-the-clear implementation, we determined that 12 bits of fractional precision provide enough accuracy to make the output accuracy indistinguishable between the secure version and the plaintext version.
Conversion of sharings
For efficiency reasons, in some of the steps for securely computing the activation function we use secret sharings over \(\mathbb {Z}_2\), while in others we use secret sharings over \(\mathbb {Z}_{2^{\lambda }}\). Therefore we need to be able to convert between the two types of secret sharings.
We use the two-party protocol from [17] for performing the bit-decomposition of a secret-shared value \(\llbracket x\rrbracket _{2^{\lambda }}\) to shares \(\llbracket x_i\rrbracket _{2}\), where \(x_\lambda \cdots x_1\) is the binary representation of x. It works like the ripple carry adder arithmetic circuit based on the insight that the difference between the sum of the two additive shares held by the parties and an “XOR-sharing” of that sum is the carry vector. As proven in [17], the bit-decomposition protocol \(\pi _{\mathsf {decomp}}\) (described in Protocol 3) UC-realizes the bit-decomposition functionality \(\mathcal {F}_{\mathsf {decomp}}\).
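The carry-vector insight can be illustrated in the clear: each bit of \(x_A+x_B\) equals the XOR of the corresponding bits of \(x_A\) and \(x_B\) and the carry into that position. The sketch below is a plaintext illustration only and is not the protocol \(\pi _{\mathsf {decomp}}\); in a secure version the XORs would be local additions over \(\mathbb {Z}_2\) and each AND would be a secure multiplication over \(\mathbb {Z}_2\). The function name is an illustrative assumption.

```rust
/// Recomputes (xa + xb) mod 2^64 bit by bit as xa_i XOR xb_i XOR c_i,
/// where c is the ripple-carry vector.
fn sum_via_carries(xa: u64, xb: u64) -> u64 {
    let mut carry = 0u64; // carry into the current bit position
    let mut out = 0u64;
    for i in 0..64u32 {
        let a = (xa >> i) & 1;
        let b = (xb >> i) & 1;
        out |= (a ^ b ^ carry) << i;
        // Carry out of position i: set when at least two of (a, b, carry) are 1.
        carry = (a & b) ^ (carry & (a ^ b));
    }
    out
}
```

Applied to the two additive shares of a value x, the function returns x itself, so the output bits are exactly the bits that the decomposition has to produce in shared form.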
In our implementation we use a highly parallelized and optimized version of the bit-decomposition protocol \(\pi _{\mathsf {decomp}}\) in order to improve the communication efficiency of the overall solution. The optimizations are described in the Appendix.
The opposite of a secure bit-decomposition is converting from a bit sharing to an additive sharing over a larger ring. In our secure activation function protocol, we need to securely convert a bit sharing to an additive sharing in \(\mathbb {Z}_{2^{\lambda }}\). This is done using the protocol \(\pi _{\mathsf {2to2^\lambda }}\) from [18] (described in Protocol 4) that UC-realizes the secret sharing conversion functionality \(\mathcal {F}_{\mathsf {2to2^\lambda }}\).
Secure activation function
We propose a new protocol that evaluates \(\rho\) from Fig. 2b directly over additive shares and does not require full secure comparisons, which would have been more expensive. Instead of doing straightforward comparisons between z, 0.5 and \(-0.5\), we derive the result through checking two things: (i) whether \(z'=z+ 1/2\) is positive or negative; (ii) whether \(z' \ge 1\). Both checks can be performed without using a full comparison protocol.
When \(z'\) is bit decomposed, the most significant bit is 0 if \(z'\) is non-negative and 1 if \(z'\) is negative. In fact, if out of the \(\lambda\) bits, the a lowest bits are used to represent the fractional component and the b next bits are used to represent the integer component, then the remaining \(\lambda -a-b\) bits all have the same value as the most significant bit. We will use this fact in order to optimize the protocol by only performing a partial bit-decomposition and deducing whether \(z'\) is non-negative or negative from the \((a+b+1)\)-th bit.
In the case that \(z'\) is negative, the output of \(\rho\) is 0. But, if \(z'\) is non-negative, we need to determine whether \(z' \ge 1\) in order to know if the output of \(\rho\) should be fixed to 1 or to \(z'\). A non-negative \(z'\) is such that \(z' \ge 1\) if and only if at least one of the b bits corresponding to the integer component of the representation of \(z'\) is equal to 1; therefore we only need to analyze those b bits to determine if \(z' \ge 1\).
Our secure protocol \(\pi _\rho\) is described in Protocol 5. The AND operation corresponds to multiplications in \(\mathbb {Z}_{2}\). By the application of De Morgan’s law, the OR operation is performed using the AND and negation operations. The successive multiplications can be optimized to only take a logarithmic number of rounds by using well-known techniques.
The activation function protocol \(\pi _\rho\) UC-realizes the activation function functionality \(\mathcal {F}_{\rho }\). The correctness can be checked by inspecting the three possible cases: (i) if \(z \ge 1/2\), then \(\mathsf {pos}=1\) and \(\mathsf {geq1}=1\) (since at least one of the bits representing the integer component of \(z+1/2\) will have a value 1), and the output is thus \(\llbracket 2^a\rrbracket _{2^{\lambda }}\) (the fixed-point representation of 1); (ii) if \(-1/2 \le z < 1/2\), then \(\mathsf {pos}=1\) and \(\mathsf {geq1}=0\), and therefore the output will be \(\llbracket z'\rrbracket _{2^{\lambda }}\), which is the fixed-point representation of \(z+1/2\); (iii) if \(z<-1/2\), then \(\mathsf {pos}=0\) and the output will be a secret sharing representing zero as expected. The security follows trivially from the UC-security of the building blocks used and the fact that no secret sharing is opened.
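The bit-level logic behind this case analysis can be sketched in the clear on a fixed-point value. This is a plaintext illustration of the two checks and the final selection, not Protocol 5 itself: in the secure version the bits are secret shared over \(\mathbb {Z}_2\), each AND is a secure multiplication, and \(\mathsf {pos}\) and \(\mathsf {geq1}\) are converted to \(\mathbb {Z}_{2^{\lambda }}\) before the selection. The constants `A = 12`, `B = 19` and the function name are illustrative assumptions.

```rust
const A: u32 = 12; // fractional bits (assumed value)
const B: u32 = 19; // integer bits (assumed value)

/// rho on a fixed-point input z in Z_{2^64}, using only bit tests.
fn rho_fixed(z: u64) -> u64 {
    let zp = z.wrapping_add(1u64 << (A - 1)); // z' = z + 1/2
    // pos = NOT of the (a+b+1)-th bit, which equals the sign bit.
    let pos = 1 ^ ((zp >> (A + B)) & 1);
    // geq1 = OR of the b integer bits, written with AND and NOT (De Morgan).
    let mut all_zero = 1u64;
    for i in A..A + B {
        all_zero &= 1 ^ ((zp >> i) & 1);
    }
    let geq1 = 1 ^ all_zero;
    // Select 0, z', or the fixed-point representation of 1.
    pos * (geq1 * (1u64 << A) + (1 - geq1) * zp)
}
```

For instance, the encoding of 0.25 (1024) maps to 3072, the encoding of 0.75; the encoding of 3 maps to 4096, the encoding of 1; and the encoding of \(-2\) maps to 0.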
Secure logistic regression training
We now present our secure LR training protocol that uses a combination of the previously mentioned building blocks.
Notice that in the full gradient descent technique described in Algorithm 1, the only operations that cannot be performed fully locally by the data processors, i.e. on their own local shares, are:
- The computation of the inner product in line 7
- The activation function \(\rho\) in line 7
- The multiplication of \(t_d-o_d\) with \(x_{d,i}\) in line 9
Our secure LR training protocol \(\pi _{\mathsf {LR-Training}}\) (described in Protocol 6) shows how the secure building blocks described before can be used to securely compute these operations. The inner product is securely computed using \(\pi _{\mathsf {IP}}\) on line 5, and since this involves multiplication on numbers that are scaled to a fixed-point representation, we truncate the result using \(\pi _{\mathsf {trunc}}\). The activation function is securely computed using \(\pi _\rho\) on line 6. The multiplication of \(t_d-o_d\) with \(x_{d,i}\) is done using secure multiplication with batching on line 11. Since this also involves multiplication on numbers that are scaled, the result is truncated using \(\pi _{\mathsf {trunc}}\) in line 14. A slight difference between the full gradient descent technique described in Algorithm 1 and our protocol \(\pi _{\mathsf {LR-Training}}\), is that instead of updating \(\Delta w_i\) after every evaluation of the activation function, we batch together all activation function evaluations before computing the \(\Delta w_i\). Since the activation function requires a bit-decomposition of the input, we can now make use of the efficient batch bit-decomposition protocol batch-\(\pi _{\mathsf {decompOPT}}\) (see Appendix) within the activation function protocol \(\pi _\rho\).
The LR training protocol \(\pi _{\mathsf {LR-Training}}\) UC-realizes the logistic regression training functionality \(\mathcal {F}_{\mathsf {LR-Training}}\). The correctness is trivial and the security follows straightforwardly from the UC-security of the building blocks used in \(\pi _{\mathsf {LR-Training}}\).
The following steps describe end-to-end how to securely train an LR classifier:
1. The TI sends the correlated randomness needed for efficient secure multiplication to the data processors. Note that while our current implementation has the TI continuously sending the correlated randomness, it is possible for the TI to send all correlated randomness as the first step, after which it can leave and need not be involved in the rest of the protocol.
2. Each data owner converts the values in the set of training examples D that it holds to a fixed-point representation as described in Eq. (1). Each value is then split into two shares, which are sent to data processor 1 and data processor 2, respectively.
3. Each data processor receives the shares of data from the data owners. They now have secret sharings \((\llbracket \varvec{x}_d\rrbracket , \llbracket t_d\rrbracket )\) of the set of training examples D. The learning rate \(\eta\) and number of iterations \(n_{iter}\) are predetermined and public to both data processors.
4. The data processors collaborate to train the LR model. They both follow the secure LR training protocol \(\pi _{\mathsf {LR-Training}}\).
5. At the end of the protocol, each data processor will hold shares of the model’s weights \(\llbracket w_i\rrbracket\). Each data processor sends its shares to all of the data owners, who can then combine the shares to learn the weights of the LR model.
Cryptographic engineering optimizations
Sockets and threading
A single iteration of the LR protocol is highly parallelizable in three distinct segments: (1) computing the dot products between the current weights and the data set, (2) computing the activation of each dot product result, and (3) computing the gradient and updating the weights. In each of these phases, a large number of computations are required, but none have dependencies on others. We take advantage of this by completing each of these phases with thread pools that can be configured for the machine running the protocol. We implemented the proposed protocols in Rust; with Rust’s ownership concept, it is possible to yield results from threads without message passing or reallocation. Hence, the code is constructed to transfer ownership of results at each phase back to the main thread to avoid as much inter-process communication as possible. Additionally, all threads complete socket communications by computing all intermediate results directly in the socket buffer by implementing the buffer as a union of byte array and unsigned 64-bit integer array. This buffer is allocated on the stack by each thread which circumvents the need for a shared memory block while also avoiding slower heap memory. The implementation of this configuration reduced running times significantly based on our trials.
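The buffer layout described above can be sketched as a Rust `union` whose two fields view the same stack memory either as ring elements or as raw bytes. This is a hypothetical reconstruction for illustration, not the code of the implementation: the names `SocketBuf` and `mask_into_buffer`, the buffer length `WORDS`, and the choice of computing \(x-u\) as the intermediate result are all assumptions. The bytes appear in the machine's native byte order.

```rust
const WORDS: usize = 8; // hypothetical buffer length in 64-bit words

/// Buffer viewable either as ring elements or as raw bytes for the socket.
#[repr(C)]
union SocketBuf {
    words: [u64; WORDS],
    bytes: [u8; WORDS * 8],
}

/// Computes x - u (mod 2^64) directly in the buffer and returns the byte view
/// that would be handed to the socket write.
fn mask_into_buffer(x: &[u64; WORDS], u: &[u64; WORDS]) -> [u8; WORDS * 8] {
    // The buffer lives on the calling thread's stack.
    let mut buf = SocketBuf { words: [0; WORDS] };
    unsafe {
        for i in 0..WORDS {
            buf.words[i] = x[i].wrapping_sub(u[i]);
        }
        // Reading the other view is unsafe in general; here it is sound
        // because every bit pattern is a valid u8.
        buf.bytes
    }
}
```

The point of the design is that no serialization pass is needed between computing the ring elements and sending them.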
Further, all modular arithmetic operations are handled implicitly with the Wrapping struct from Rust's standard library (std::num::Wrapping), whose arithmetic operators wrap around on integer overflow instead of treating it as an error. As long as the size of the ring over which the MPC protocols are performed is selected to align with a provided primitive bit width (i.e. 8, 16, 32, 64, or 128 bits), no explicit remainder computation is needed with this construction.
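A minimal sketch of this use of `std::num::Wrapping` for the ring \(\mathbb {Z}_{2^{64}}\) is given below; the helper names are illustrative assumptions.

```rust
use std::num::Wrapping;

/// Local share addition in Z_{2^64}: overflow wraps around, which is
/// exactly reduction modulo 2^64.
fn add_shares(x: Wrapping<u64>, y: Wrapping<u64>) -> Wrapping<u64> {
    x + y
}

/// Local multiplication of a share by a public constant in Z_{2^64}.
fn mul_by_const(c: u64, x: Wrapping<u64>) -> Wrapping<u64> {
    Wrapping(c) * x
}
```

With plain `u64` operands the same additions would panic on overflow in debug builds, so the wrapper (or the equivalent `wrapping_add` and `wrapping_mul` methods) makes the intended modular semantics explicit.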