The Sharemind platform
Our SMC protocols are implemented on top of the Sharemind® platform [14]. The platform provides a distributed virtual machine (VM) that must be installed at each of the computing servers. The machine interprets the description of a privacy-preserving application (in our case, the deduplication), where the cryptographic details are abstracted away. From the application developer’s point of view, different pieces of data are merely labeled as “public” or “private”. The private data will never be learned by any of the servers, unless it is deliberately declassified (i.e. opened to all servers) by the application. While the public values are stored at each server in plain, the private values are stored in a secret-shared manner, preventing a single server from learning their values. Underneath, the virtual machine executes cryptographic protocols to perform operations on private secret-shared data, which in general requires interaction between the servers. The underlying cryptographic protocols have been proven to be composable [15], meaning that the applications do not need to undergo any additional security proofs. Only deliberate declassification of values needs to be justified. This also concerns the deduplication that we present in this paper.
The main protocol set of Sharemind [16], denoted shared3p, is based on additive sharing among three parties. The private representation of a value u∈R from a finite ring R is denoted by ⟦u⟧, and is defined as ⟦u⟧=(⟦u⟧_{1},⟦u⟧_{2},⟦u⟧_{3}), where the share ⟦u⟧_{i}∈R is held by the ith server. The shares are random elements of R, with the constraint that they sum up to u. Hence, anyone who obtains only one or two shares of u cannot infer any information about u. In [16], the authors have presented protocols for a number of basic arithmetic, relational and logical operations, such as addition, multiplication, comparisons, etc., transforming shares of the inputs into shares of the output. The supported rings are \(\mathbb {Z}_{2^{n}}\) and \(\mathbb {Z}_{2}^{n}\), and several different rings may be in use simultaneously in the same application. There are special protocols for converting shares between different rings. This basic set of operations has been extended in numerous follow-up papers [17–22].
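As an illustration of this sharing scheme, here is a minimal cleartext sketch in Python, assuming the ring \(\mathbb {Z}_{2^{64}}\); it shows the structure of the shares only, not Sharemind’s actual protocols:

```python
import secrets

MOD = 2**64  # the ring Z_{2^64}, one of the rings supported by shared3p

def share(u, n_parties=3):
    """Split u into additive shares that sum to u modulo 2^64."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    shares.append((u - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Recover u by summing all shares in the ring."""
    return sum(shares) % MOD

shares = share(1234567)
assert reconstruct(shares) == 1234567
# Any one or two shares are uniformly random and reveal nothing about u.
```

Each individual share (and any pair of shares) is uniformly distributed, which is exactly why one passively corrupted party learns nothing.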
The protocols of shared3p are secure against one passively corrupted party. Our deduplication application is built on top of them. In the rest of this section, we present all algorithms that we used in our application. In the loops, we use foreach and while to denote parallelizable loops, whilst for loops are sequential. Parallelization is important because all non-linear operations on private values incur latency from the message exchange between the computation servers. We denote vectors as \(\vec {x} = \langle {x_{1},\ldots,x_{n}}\rangle \).
The first solution: output all results once in the end
In our first solution, the results of the computation are only available at the end, after all clients (the health centers) have uploaded the hashes of their records. The outline of the process is the following. In the first phase, the SMC servers collect input data from the clients, without performing any duplicate detection on it. When the uploads have ceased (e.g. the deadline or some other trigger event has happened), the servers stop collecting the data. They run the deduplication algorithm on all data collected so far, and give each client its personal output. If a record is duplicated, then the client that first uploaded it will not be notified that it is a duplicate. All other clients that have uploaded the same record will receive a notification about it.
Cryptographic building blocks
We fix two cryptographic functions: the hash function H:{0,1}^{∗}→{0,1}^{η}, and a block cipher \(E:\{0,1\}^{\eta }\times \mathcal {K}\rightarrow \{0,1\}^{\eta }\), where \(\mathcal {K}\) is the set of possible keys for E. The block size of E should be the same as the output length of H, so that we can apply E directly to the output of H. The challenge is that we will apply E to secret-shared values, and hence the block cipher E has to be easily computable under SMC. We have picked AES-128 as E; a privacy-preserving implementation of AES-128 is already available in Sharemind [19]. There exist newer, possibly more efficient implementations of AES [23], as well as proposals for SMC-optimized block ciphers [24], which have not been implemented on Sharemind yet and could potentially speed the computation up.
In our solution, we have taken η=128. We let H be the composition of the SHA-256 cryptographic hash function and a universal one-way hash function family (UOWHF) [25].
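For concreteness, the hashing pipeline can be sketched as follows. The keyed multiply-add post-processing here is a purely illustrative stand-in for the UOWHF of [25], and the key pair (UOW_A, UOW_B) plays the role of the random parameters that the servers generate and the clients fetch:

```python
import hashlib
import secrets

ETA = 128  # hash output length eta, in bits

# Random parameters of the keyed post-processing step.  A multiply-add
# over Z_{2^eta} with an odd multiplier is used here only as an
# illustrative stand-in for the actual UOWHF construction of [25].
UOW_A = secrets.randbelow(2 ** ETA) | 1   # random odd multiplier
UOW_B = secrets.randbelow(2 ** ETA)       # random additive offset

def H(record: bytes) -> int:
    """SHA-256 first, then the keyed function, giving an eta-bit hash."""
    digest = int.from_bytes(hashlib.sha256(record).digest(), "big")
    return (UOW_A * digest + UOW_B) % 2 ** ETA
```

In the protocol, the clients would query the servers for these parameters at startup, as described in Algorithm 1.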
Computation on the client side
The behaviour of a client is described in Algorithm 1. At startup, each client queries the Sharemind servers for the random parameters of the UOWHF H, which the servers generate in the beginning. Each client takes the input records as soon as they become available, picks the necessary attributes from each record, applies H to them, and uploads the hashes to the Sharemind servers in a secret-shared manner. The Sharemind API readily supports that. At this point, no server learns the exact values of the hashes, since they are all secret-shared.
In the end, when the servers have finished the computation, the client queries them for its personal result. The servers respond with the shares of the output, which is a vector of booleans of the same length as the number of records from this client, indicating whether each record is a duplicate. The client reconstructs the result vector. Again, the Sharemind API already has support for that.
Computation on the server side
We describe the work of the servers in phases.
Startup.
This short phase is given in Algorithm 2. The servers privately generate the following random values: the parameters of the UOWHF H, which will be sent to the clients upon request, and the key ⟦K⟧ of the block cipher E, used later for encrypting the uploaded hashes.
All these values are generated in such a way that they remain secretshared among the servers, and no server actually learns them. Sharemind API supports such shared randomness generation, and we denote the corresponding functionality as random.
The servers initialize a public variable cnt ←0. It will be used for indexing the clients’ uploads.
Upload
During the upload phase, two different activity threads are essentially taking place in parallel.
One thread is the actual acceptance of data from the clients, described in Algorithm 3. The hashes from the clients are stored into a private array ⟦v⟧, and the corresponding client identities into a public array s under the same indices. In principle, this algorithm could be invoked several times for the same client, if it intends to split up the upload of its data. Several clients may want to connect to the servers at the same time, so we make use of Sharemind’s database support to avoid race conditions on cnt.
The other thread of activity is encrypting the elements ⟦v_{i}⟧. For each i, the servers evaluate E on ⟦v_{i}⟧ in a privacy-preserving way, using the same key ⟦K⟧ for each encryption, and take the first 64 bits of the result, as described in Algorithm 4. The only reason why we take only 64 of the 128 bits is that the largest ring that Sharemind currently supports is \(\mathbb {Z}_{2^{64}}\). There are no problems with taking only 64 bits, as long as there are no collisions. With the envisioned amounts of data (around 10^{7}≈2^{23.5} records in total), collisions are unlikely; the birthday paradox puts their probability at around 2^{−64+2·23.5}=2^{−17}. We note that the communication complexity also decreases with the bit width of the data types.
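The collision estimate above is a one-line birthday-bound computation:

```python
from math import log2

records = 10 ** 7        # envisioned total number of records
bits = 64                # bits kept from each 128-bit AES ciphertext

# Birthday bound: with n values drawn from a 2^64 space, a collision
# occurs with probability on the order of n^2 / 2^64.
collision_exp = 2 * log2(records) - bits
print(f"collision probability ~ 2^{collision_exp:.1f}")   # ~ 2^-17.5
```

This matches the 2^{−17} order of magnitude quoted in the text.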
In our implementation, the evaluation of E can take place either immediately after the data upload, or it can run in a separate thread, waiting until more data arrives so as to process it all in parallel. In any case, the application of E is parallelized, batching the hashes that are waiting to be processed.
Computation of results.
When the upload has finished and the ciphertext ⟦c_{i}⟧ has been computed for each i∈{0,…,cnt−1}, the final results are computed as in Algorithm 5. First of all, all computed ciphertexts are privately shuffled, and then declassified. Since they are encrypted with a block cipher whose key remains secret-shared, this only reveals the “duplication pattern”, i.e. how many values occur exactly n times. The private shuffle [26] in the shared3p protocol set is highly efficient: the amount of communication is linear in the size of the shuffled vector, and the number of communication rounds is constant. It also allows the inverse shuffle to be applied easily, which is as efficient as the shuffle itself.
At this point, the servers could already mark the duplicates with the boolean value 1 and the remaining values with 0. After the obtained vector is privately shuffled back, the servers can return the secret-shared bits to the clients, so that the ith client learns the bits indicating which of its own records are duplicates. The problem is that the servers need to notify all clients except the first one, but because of the shuffling they do not know which entry belongs to the first client. They cannot declassify the client indices ⟦s_{i}⟧ either, since that would partially undo the shuffling.
In order to determine which value belongs to the first client, the servers run Algorithm 6. This algorithm is applied to each set of shuffled entries that have identical values, to determine the minimum amongst them. The minimum should be labeled false, since it is not considered a duplicate, and all other elements should be labeled true. The idea behind this recursive algorithm is the following. The inputs are split into pairs, and a comparison is performed within each pair. The element that is larger is definitely not the minimum of the entire set, so we can immediately write b_{i}:=true. The indices of all elements that turned out to be smaller are stored into m_{i}, and the whole algorithm is applied again to the elements indexed by m_{i}. The procedure is repeated until there is only one element left, which is the minimum, so the algorithm returns false in the one-element case.
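The tournament idea can be sketched in the clear as follows (iteratively rather than recursively, over public distinct values; in the private version, only the outcomes of the pairwise comparisons would be revealed):

```python
def mark_duplicates(values):
    """Tournament-style marking: only the minimum ends up False, every
    other position True.  Assumes the values are pairwise distinct, as
    the shuffled positions in the protocol are."""
    marks = [True] * len(values)
    indices = list(range(len(values)))   # candidates for the minimum
    while len(indices) > 1:
        survivors = []
        if len(indices) % 2 == 1:        # odd element advances unchallenged
            survivors.append(indices[-1])
        for a, b in zip(indices[0::2], indices[1::2]):
            # the larger of the pair is definitely not the minimum
            survivors.append(a if values[a] < values[b] else b)
        indices = survivors
    marks[indices[0]] = False            # sole survivor is the minimum
    return marks

assert mark_duplicates([42, 7, 99]) == [True, False, True]
```

All comparisons within one round are independent, so they can run in parallel, which matters for the round complexity discussed earlier.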
Although Algorithm 6 works with private values, it reveals the ordering between certain elements of \(\vec {t}\). Because of the random shuffle, the ordering of this vector is completely random, so it does not disclose any information. No server learns from the output more than it would from a random permutation.
The second solution: output results immediately
Our second solution considers the case where the client is immediately notified which of its record hashes have already been uploaded by some previous client. This task is more difficult than the previous one. We cannot simply declassify the encrypted hashes when they arrive, since that would leak which pairs of centers have intersecting sets of records.
The second solution has the same components E and H as the first solution, constructed in the same manner. Again, the computation servers agree on the key ⟦K⟧ in the beginning, as well as the parameters of H.
Computation on the client side
The clients behave in much the same manner as in the first solution, according to Algorithm 1. The only technical difference is that the servers react to the upload immediately, so the client most likely does not interrupt the session with the servers, and stays connected until it receives the result.
Computation on the server side
Assume that the servers already store T encrypted hashes (zero or more), provided by the previous clients. These hashes have been stored as public values z_{1},…,z_{T}, where z_{i} equals the first 64 bits of E(K,v_{i}), and duplicates among them have already been removed. Since ⟦K⟧ remains private, z_{1},…,z_{T} are computationally indistinguishable from T uniformly random values, so it is safe to make them public.
The 64-bit values z_{1},…,z_{T} are kept in 2^{B} buckets, numbered from 0 to 2^{B}−1. In our implementation, B=16. Each bucket is just a vector of values. Each z_{i} is assigned to the bucket B_{j}, where j is equal to the first B bits of z_{i}.
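The bucket assignment is a single shift; the sketch below assumes that the “first B bits” are the most significant bits of the 64-bit value:

```python
B = 16                                   # bucket-index bits, as in the text

def bucket_index(z: int, width: int = 64) -> int:
    """The first (most significant) B bits of z select its bucket."""
    return z >> (width - B)

buckets = [[] for _ in range(2 ** B)]    # 2^16 buckets, each a vector
z = 0xFFFF_0000_0000_0001
buckets[bucket_index(z)].append(z)
assert bucket_index(z) == 0xFFFF
```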
Now suppose that a center has uploaded the hashes ⟦v_{1}⟧,…,⟦v_{t}⟧. The servers need to check whether these hashes have occurred before. The computation starts by encrypting each ⟦v_{i}⟧. Let \(\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket \) be the results of encryption, computed as \(\llbracket {z^{\prime }_{i}}\rrbracket =\mathsf {{leftbits}_{64}}(E(\llbracket {K}\rrbracket,\llbracket {v_{i}}\rrbracket))\). We cannot immediately declassify them, since that would leak which pairs of centers have intersecting sets of records. Instead, we use privacy-preserving comparison.
We do not want to simply invoke the private comparison protocol for each z_{i} and \(\llbracket {z^{\prime }_{j}}\rrbracket \), because we consider their number to be too large. Indeed, as we are aiming to handle ca. 10 million records, this method would cause us to compare each pair of records, leading to ca. 5·10^{13} invocations of the comparison protocol. An ℓ-bit comparison requires slightly more network communication than an ℓ-bit multiplication [16], with the latter requiring each computation server to send and receive two ℓ-bit values [17]. If ℓ=64 and there are 5·10^{13} operations, then each server has to send out and receive at least 6·10^{15} bits, which on a 100 Mb/s network (specified in the conditions of the competition task) would take almost two years.
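The two-year figure follows from a quick back-of-envelope computation, using the numbers from the text:

```python
pairs = 5 * 10 ** 13          # ~ (10^7)^2 / 2 pairwise comparisons
bits_per_op = 2 * 64          # each server sends (and receives) two 64-bit values
link_bps = 100e6              # 100 Mb/s network from the competition task

seconds = pairs * bits_per_op / link_bps   # time to push all outgoing traffic
years = seconds / (365 * 24 * 3600)        # roughly two years
```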
We reduce the number of comparisons in the following manner. Let ⟦z^{′}⟧ be one of the private values \(\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket \); all t values are handled in parallel. The comparison of ⟦z^{′}⟧ with z_{1},…,z_{T} is described in Algorithm 7. In this algorithm, we let N be the maximum size of a bucket. We denote the jth element of the ith bucket by B_{i,j}. We assume that each bucket has exactly N elements, adding special dummy elements if necessary.
The characteristic vector of an element \(x\in \mathbb {Z}_{M}\) is a vector \(\langle {b_{0},\ldots,b_{M-1}}\rangle \in \mathbb {Z}_{2}^{M}\), where b_{x}=1 and all other elements are equal to 0. The shared3p protocol set has a simple and efficient protocol for computing characteristic vectors, described in [27]. The protocol turns a private value into a private characteristic vector. The characteristic vector of leftbits_{16}(⟦z^{′}⟧) marks the index of the bucket to which ⟦z^{′}⟧ belongs, and the expression \(\bigoplus _{i=0}^{2^{B}-1}\llbracket {b_{i}}\rrbracket \cdot \mathbf {B}_{i,j}\) returns exactly the jth element of that bucket, which we denote ⟦y_{j}⟧. This way, the values ⟦y_{1}⟧,…,⟦y_{N}⟧ are the privately represented content of the bucket into which z^{′} would belong. The private comparison \(\llbracket {z^{\prime }}\rrbracket \stackrel {?}{=}\llbracket {y_{j}}\rrbracket \) is performed for all j, thus comparing ⟦z^{′}⟧ against each element of that bucket. Finally, \(\llbracket {b}\rrbracket =\bigvee _{j=1}^{N}\llbracket {c_{j}}\rrbracket \) is the private OR of all comparisons, which tells whether there has been at least one match. The private bits ⟦b_{1}⟧,…,⟦b_{t}⟧ resulting from applying Algorithm 7 to all \(\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket \) are returned to the client.
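The selection-and-compare structure of Algorithm 7 can be sketched in the clear as follows; parameters are small for readability, and dummy padding elements are assumed to be distinct from all real values:

```python
def lookup(z_prime, buckets, B, N, width):
    """Cleartext sketch of Algorithm 7: select one bucket via a
    characteristic vector, compare against its N (padded) elements, OR."""
    idx = z_prime >> (width - B)                       # leftbits_B(z')
    b = [1 if i == idx else 0 for i in range(2 ** B)]  # characteristic vector
    y = [0] * N
    for i in range(2 ** B):                            # y_j = XOR_i b_i * B_ij
        if b[i]:
            for j in range(N):
                y[j] ^= buckets[i][j]
    c = [z_prime == y_j for y_j in y]                  # the N equality tests
    return any(c)                                      # OR of c_1..c_N

B, N, width = 8, 2, 16                   # small illustrative parameters
buckets = [[0] * N for _ in range(2 ** B)]   # 0 plays the dummy element here
buckets[0xAB][0] = 0xABCD                # 0xABCD lands in bucket 0xAB
assert lookup(0xABCD, buckets, B, N, width)
assert not lookup(0x1234, buckets, B, N, width)
```

In the private version, the characteristic vector and the equality tests run on shares, and only the final bit per uploaded hash is returned to the client.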
After returning the answer to the client, the buckets have to be updated with \(z^{\prime }_{1},\ldots,z^{\prime }_{t}\). It is safe to declassify \(\llbracket {z^{\prime }_{i}}\rrbracket \) if b_{i}=0, since in this case \(\llbracket {z^{\prime }_{i}}\rrbracket \) cannot be correlated to any of the z_{j} and is indistinguishable from a random value. However, we cannot immediately declassify the vector \(\llbracket {\vec {b}}\rrbracket \), because the positions of duplicated elements may give away information about the input data. Since we are allowed to leak the total number of duplicated entries per client, we can proceed as described in Algorithm 8:

1. randomly shuffle 〈⟦b_{1}⟧,…,⟦b_{t}⟧〉 and \(\langle {\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket }\rangle \), using the same permutation;

2. declassify \(\llbracket {\vec {b}}\rrbracket \);

3. declassify those \(\llbracket {z^{\prime }_{i}}\rrbracket \) where b_{i}=0, and add these \(z^{\prime }_{i}\) to the respective buckets.
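These steps can be sketched in the clear as follows; the Fisher–Yates shuffle stands in for Sharemind’s private shuffle, and the bucket conventions are as before:

```python
import secrets

def update_buckets(b, z, buckets, B=16, width=64):
    """Sketch of Algorithm 8: permute the marks b and ciphertexts z with
    the same random permutation, reveal b, and insert only fresh values."""
    perm = list(range(len(b)))
    for i in range(len(perm) - 1, 0, -1):   # Fisher-Yates shuffle
        j = secrets.randbelow(i + 1)
        perm[i], perm[j] = perm[j], perm[i]
    b = [b[i] for i in perm]                # step 1: shuffle both vectors
    z = [z[i] for i in perm]                #         with the same permutation
    for b_i, z_i in zip(b, z):              # step 2: declassify the marks
        if b_i == 0:                        # step 3: fresh value, safe to open
            buckets[z_i >> (width - B)].append(z_i)
    return sum(b)                           # only the duplicate count leaks
```

After the shuffle, the positions of the revealed marks are uniformly random, so only their sum (the per-client duplicate count) carries information.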
In our implementation, ⟦z^{′}⟧ is shared over \(\mathbb {Z}_{2}^{64}\), hence taking the first 16 bits of it is a local operation, resulting in a value in \(\mathbb {Z}_{2}^{16}\). The characteristic vector protocol in [27] is easily adaptable to compute the characteristic vectors of elements of \(\mathbb {Z}_{2}^{B}\), and its result is a vector over \(\mathbb {Z}_{2}\) with length 2^{B}. The computation of the characteristic vector requires communication of two elements of \(\mathbb {Z}_{2}^{B}\) and one element of \(\mathbb {Z}_{2}^{2^{B}}\) (in total, not per party).
The computation of ⟦y_{j}⟧ in Algorithm 7 is again a local operation. The computations of ⟦c_{j}⟧ and their disjunction are straightforward using the protocols available in the shared3p set of Sharemind.
Speeding up local computations
In Alg. 7, the computation of ⟦y_{j}⟧ is local. Nevertheless, in practice it takes up a major part of the entire effort of the protocol. If we knew something about the magnitude of z^{′}, we could compute \(\bigoplus \) not over all buckets, but only over a subset of them, covering only the range into which z^{′} is guaranteed to fall, up to a negligible error.
A bucket is defined by the most significant bits of elements that it contains. If we take any two buckets, then all elements in one of them will be strictly smaller than all elements in the other one. If we sort \(\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket \) in ascending order (Sharemind has efficient protocols for sorting private values [28]), we know that the first elements more likely belong to the buckets with “smaller” bits, and the last elements more likely belong to buckets with “larger” bits. We can estimate these probabilities more precisely.
As the key ⟦K⟧ is secret, and the hashes ⟦v_{1}⟧,…,⟦v_{t}⟧ are all different, the values \(\llbracket {z^{\prime }_{1}}\rrbracket,\ldots,\llbracket {z^{\prime }_{t}}\rrbracket \) can be treated as mutually independent, uniformly random elements of \(\mathbb {Z}_{2}^{64}\). After sorting them, their likely ranges can be derived from the order statistics as follows.
Let \(\mathcal {P}\) be a discrete probability distribution over values x_{1},x_{2},…, such that the probability mass of x_{i} is p_{i}. Let X_{1},…,X_{n} be random variables sampled from \(\mathcal {P}\), and let \(X^{\prime }_{1},\ldots,X^{\prime }_{n}\) be obtained after sorting X_{1},…,X_{n} in ascending order. We have \(\text {Pr}[X^{\prime }_{j} \leq x_{i}] = \sum _{k=j}^{n}{{n}\choose {k}}P_{i}^{k} \cdot (1 - P_{i})^{n-k}\), where \(P_{i} = \text {Pr}[X_{1} \leq x_{i}] = \sum _{k=1}^{i} p_{k}\). This quantity comes from summing up the probabilities of all possible combinations where at least j of the n variables are at most x_{i}. For a fixed j, this expression is the upper tail of the binomial distribution B(n,P_{i}).
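The order-statistic formula can be checked numerically; the values of n, j and the quantile P below are illustrative:

```python
from math import comb

def prob_jth_at_most(n, j, P):
    """Pr[X'_j <= x], where P = Pr[X <= x] for a single draw: the event
    that at least j of the n samples land at or below x."""
    return sum(comb(n, k) * P ** k * (1 - P) ** (n - k)
               for k in range(j, n + 1))

# Sanity check against the minimum of n draws: Pr[min <= x] = 1 - (1-P)^n.
n, P = 10, 0.3
assert abs(prob_jth_at_most(n, 1, P) - (1 - (1 - P) ** n)) < 1e-12
```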
In our case, \(\mathcal {P}\) is the distribution over AES ciphertexts, i.e. p_{i}=2^{−128} for all i. The sorted ciphertexts \(z^{\prime }_{i}\) are instances of the random variables \(X^{\prime }_{i}\). We want to find m_{i} and M_{i} such that \(\text {Pr}[z^{\prime }_{i}< m_{i}]\leq \varepsilon \) and \(\text {Pr}[z^{\prime }_{i}>M_{i}]\leq \varepsilon \), where ε is the desired error probability. Since we are dealing with a binomial distribution, we can use e.g. Hoeffding’s inequality
$$\text{Pr}[z^{\prime}_{i} \leq m_{i}] \leq \exp\left(-2\frac{(n \cdot P_{i} - m_{i})^{2}}{n}\right)$$
and Chernoff’s inequality
$$\text{Pr}[z^{\prime}_{i} \leq m_{i}] \leq \exp\left(-\frac{1}{2P_{i}}\cdot\frac{(n\cdot P_{i}-m_{i})^{2}}{n}\right)\enspace,$$
where \(\exp (x)=e^{x}\) for Euler’s number e. We can solve the equation \(\epsilon = \exp (-2\frac {(n \cdot P_{i} - m_{i})^{2}}{n})\) if \(P_{i} \leq \frac {1}{4}\), and \(\epsilon = \exp (-\frac {1}{2P_{i}}\frac {(n \cdot P_{i} - m_{i})^{2}}{n})\) if \(P_{i} \geq \frac {1}{4}\), obtaining the value for m_{i}. The value for M_{i} can be obtained analogously, since \(\text {Pr}[z^{\prime }_{i} \geq x_{i}] = \text {Pr}[-z^{\prime }_{i} \leq -x_{i}]\).
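Solving the Hoeffding bound for m_{i} is mechanical; a sketch, with n draws, error probability ε, and P_{i} the target quantile:

```python
from math import sqrt, log

def hoeffding_lower_bound(n, P, eps):
    """Solve eps = exp(-2 * (n*P - m)^2 / n) for the root m < n*P."""
    return n * P - sqrt(n * log(1 / eps) / 2)

# With n = 10000 sorted draws and eps = 2^-40, how far below its expected
# position n*P = 5000 can the value at quantile P = 0.5 fall?
m = hoeffding_lower_bound(10_000, 0.5, 2.0 ** -40)   # about 4628
```

The deviation is only a few hundred positions out of 10000, which is what makes the range restriction of the previous section worthwhile for large t.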
By default, we use ε=2^{−40} as the probability of error. As the total number of hashes is expected to be around 1000·10000≈2^{23.5} and we have two bounds to try for each hash, the probability of making a bounds check error during the whole run is not more than 2·2^{−40+23.5}=2^{−15.5}, which we consider acceptable, and which is also similar to errors due to the collisions in the first 64 bits of AES output.
The usefulness of these bounds increases together with t. If t=100 (and ε=2^{−40}), then we gain little, as the ranges [m_{i},M_{i}] still cover around half of the whole range. If t=10000, then the sorted values are localized much more tightly: the ranges [m_{i},M_{i}] cover less than 1/10 of the whole range.
Alternative comparison
The communication costs of Algorithm 7 may be further reduced, ultimately turning them into a constant (assuming that B is constant), albeit with a further increase in the costs of local computation.
Consider the bucket B_{i} with elements B_{i,1},…,B_{i,N}. The value z^{′} is an element of B_{i} iff it is a root of the polynomial \(\mathbf {P}_{i}(x)=\prod _{j=1}^{N}(x-\mathbf {B}_{i,j})\). The polynomial is considered over the field \(\mathbb {F}_{2^{64}}\). The elements of this field are 64-bit strings and their addition is bitwise exclusive or. Hence, an additive sharing over \(\mathbb {F}_{2^{64}}\) is at the same time also a sharing over \(\mathbb {Z}_{2}^{64}\) and vice versa.
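The membership identity can be checked in the clear. The sketch below implements \(\mathbb {F}_{2^{64}}\) arithmetic with carry-less multiplication; the irreducible polynomial x^{64}+x^{4}+x^{3}+x+1 is a common low-weight choice and an assumption of this sketch:

```python
# F_{2^64} via carry-less multiplication; the reduction polynomial
# x^64 + x^4 + x^3 + x + 1 is assumed here, not taken from the paper.
IRRED = (1 << 64) | 0b11011

def gf_mul(a: int, b: int) -> int:
    """Multiplication in F_{2^64} (Russian-peasant style)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 64:          # reduce as soon as degree 64 is reached
            a ^= IRRED
    return r

def poly_from_roots(roots):
    """Coefficients (low degree first) of prod_j (x - B_ij); in
    characteristic 2, subtraction coincides with XOR-addition."""
    coeffs = [1]
    for root in roots:
        shifted = [0] + coeffs                       # x * p(x)
        scaled = [gf_mul(root, c) for c in coeffs] + [0]
        coeffs = [s ^ t for s, t in zip(shifted, scaled)]
    return coeffs

def poly_eval(coeffs, x):
    acc = 0
    for c in reversed(coeffs):                       # Horner's rule
        acc = gf_mul(acc, x) ^ c
    return acc

bucket = [0xDEAD_BEEF, 0x1234, 0xCAFE]
coeffs = poly_from_roots(bucket)
assert poly_eval(coeffs, 0x1234) == 0                # member of the bucket
assert poly_eval(coeffs, 0x5678) != 0                # not a member
```

Since \(\mathbb {F}_{2^{64}}\) is a field, the product of nonzero factors is nonzero, so the polynomial vanishes exactly on the bucket elements.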
Let P_{i,0},…,P_{i,N} be the coefficients of the polynomial P_{i}. It does not make sense to compute P_{i}(⟦z^{′}⟧) in a straightforward way, because this would involve N−1 private multiplications for each bucket. A better way is given in Algorithm 9.
In Algorithm 9, ⟦d_{i}⟧ is computed as a scalar product of the private vector of the powers of z^{′} with the public vector of the coefficients of P_{i}. The powers of ⟦z^{′}⟧ have to be computed only once for all buckets. These powers are computed with the help of the usual multiplication protocol of Sharemind, but working over \(\mathbb {F}_{2^{64}}\). The local operations in this protocol are multiplications in \(\mathbb {F}_{2^{64}}\), which are relatively more expensive than ordinary bitwise operations. In our implementation we use the NTL library [29] for binary field operations. The computation of all ⟦d_{i}⟧ involves many multiplications in this field, so even though the computation does not require any communication between the parties, it is quite heavy on the local side.
As the computation of ⟦z^{′}^{2}⟧,…,⟦z^{′}^{N}⟧ is done over a field, we can push its heaviest part to the precomputation phase, leaving just a single private multiplication to be done during runtime. The method is described in ([30], Algorithm 1). The precomputation consists of generating a random invertible element \(\llbracket {r}\rrbracket \in \mathbb {F}_{2^{64}}^{*}\) together with its inverse ⟦r^{−1}⟧ and computing ⟦r^{2}⟧,…,⟦r^{N}⟧. During the runtime, one computes ⟦z^{′}⟧·⟦r^{−1}⟧ and declassifies it. For an exponent k, the private value ⟦z^{′}^{k}⟧ is then found as ⟦z^{′}^{k}⟧=(z^{′}·r^{−1})^{k}·⟦r^{k}⟧, which can be computed locally, without interaction.
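The masked-powers trick is independent of the particular field; the sketch below runs over a prime field Z_{p} for readability (the actual protocol works over \(\mathbb {F}_{2^{64}}\), and the shares-based declassification is only simulated):

```python
import secrets

p = 2 ** 61 - 1                    # a Mersenne prime, purely illustrative

def precompute(N):
    """Offline phase: random invertible r, its inverse, and r^0..r^N."""
    r = secrets.randbelow(p - 1) + 1
    r_inv = pow(r, -1, p)          # modular inverse (Python 3.8+)
    r_pows = [pow(r, k, p) for k in range(N + 1)]
    return r_inv, r_pows

def powers(z, r_inv, r_pows, N):
    """Online phase: one private multiplication z * r^-1 is declassified
    as w; every power z^k = w^k * r^k is then a local computation."""
    w = (z * r_inv) % p            # in the protocol, this value is opened
    return [(pow(w, k, p) * r_pows[k]) % p for k in range(1, N + 1)]

r_inv, r_pows = precompute(5)
z = 123_456_789
assert powers(z, r_inv, r_pows, 5) == [pow(z, k, p) for k in range(1, 6)]
```

Opening w = z·r^{−1} is safe because r is uniformly random and invertible, so w itself is uniformly distributed and independent of z.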
In binary fields, including \(\mathbb {F}_{2^{64}}\), squaring of additively shared values does not require communication between the servers. This can be used to speed up precomputations. In ([30], Algorithm 4) it is shown how to reduce the communication cost of computing ⟦r^{2}⟧,…,⟦r^{N}⟧ to that of approximately \(\sqrt {N}\) multiplications.
The polynomialbased comparison method is also amenable to the order statistic related speedup described in the previous section. Both comparison methods have been implemented in our second solution.