Privately computing set-maximal matches in genomic data
BMC Medical Genomics volume 13, Article number: 72 (2020)
Abstract
Background
Finding long matches in large aligned deoxyribonucleic acid (DNA) sequences is a problem of great interest. A paradigmatic application is the identification of distant relatives via large common subsequences in DNA data. However, because genomic data is sensitive, performing such computations without security safeguards might compromise the privacy of the individuals involved.
Methods
The secret sharing technique enables the computation of matches while respecting the privacy of the inputs of the parties involved. This method requires interaction that depends on the circuit depth needed for the computation.
Results
We design a new depth-optimized algorithm for computing set-maximal matches between a database of aligned genetic sequences and the DNA of an individual while respecting the privacy of both the database owner and the individual. We then implement and evaluate our protocol.
Conclusions
Using modern cryptographic techniques, difficult genomic computations are performed in a privacy-preserving way. We enrich this research area by proposing a privacy-preserving protocol for set-maximal matches.
Background
The abundance of human genomic data in recent years paves the way towards answering very important questions about human nature, such as identifying the genomes responsible for particular illnesses. Simultaneously, the extremely sensitive nature of this data imposes strict restrictions on its use. Fortunately, there is a variety of cryptographic techniques that allow us to create useful yet privacy-preserving systems for computation on genomic data (see [1] for a summary of various techniques). Even though it is theoretically possible to perform every computation in a private way, the generic techniques do not necessarily preserve the efficiency and the accuracy of the original algorithm. Thus, constructing practical privacy-preserving protocols has become a very active area of research.
Quantifying the similarity of genomic sequences is a fundamental problem in genome informatics and there exist numerous proposals for tackling this problem. In this work, we focus on privately computing set-maximal matches on genomic data as a way to identify similarity. Initially, research on genomic matching focused on finding efficient and accurate algorithms (e.g. [2–9]). More recently, the issue of privacy has emerged and hence new approaches have been proposed. Freedman et al. [10] consider the problem of secure keyword search in a database by relying on a connection to oblivious evaluation of pseudorandom functions. Various other works focus on problems more specific to genomic data. For instance, Jha et al. [11] develop a protocol for securely computing edit distance of DNA sequences, and Blanton et al. [12] propose a protocol for outsourcing DNA search via finite automata to multiple computational servers. Baldi et al. [13] focus on similar applications, such as paternity testing, and use techniques for private set operations. He et al. [14] construct a protocol that identifies whether two individuals are relatives without revealing any other information about their genomes. More closely related to our work, Shimizu et al. [15] propose a protocol for privately computing set-maximal matches between a database and an individual starting from a genomic position known only to the individual. Even though we also focus on computing set-maximal matches, our problem is more general since our scheme outputs all set-maximal matches without leaking their locations to the client.
The efforts to construct secure yet efficient and practical protocols for genome analysis have also been reinforced by the establishment of the Integrating Data for Analysis, Anonymization and SHaring (iDASH) privacy and security workshop [16]. Each year, the workshop poses a set of tasks to evaluate the employment of cryptographic techniques for real-world challenges in genome analysis.
In this work, we propose a novel protocol for privately measuring similarity between a database of DNA sequences and a query DNA sequence by identifying the set-maximal matches between the database and the query. This problem was also the competition task on the secure multiparty computation track in the iDASH workshop for 2018 [16].
Methods
Our goal is to design a protocol for computing the set-maximal matches between a database and a query. We assume that all variable sites are biallelic; namely each site has a value in {0,1}. Before reviewing the main cryptographic tools used in this work, we explain our notation and give the formal definition of set-maximal matches.
Notation If M is a matrix, then M_{j,i} denotes the element of the jth row and ith column. We denote by a bold lowercase letter (e.g. x,y) a sequence of biallelic sites; the ith site of x is denoted by x[i]. Namely, \(\mathbf {x} = \left (\mathbf {x}[1],\mathbf {x}[2],\dots, \mathbf {x}[n] \right)\). For simplicity, we use the notation x[i_{1},i_{2}] to denote the substring \((\mathbf {x}[i_{1}], \dots, \mathbf {x}[i_{2}])\). A bold uppercase letter (e.g. Y) represents a database of genomic sequences; the jth tuple of the database Y (i.e. the genomic data of the jth individual in the database) is denoted using a bold lowercase letter as y_{j}. We call n the number of sites in our genomic sequences and m the number of sequences in the database.
We use the Big O notation for describing the limiting behavior of the running time of an algorithm. We say that f(x)=O(g(x)) if and only if there exists a constant c such that for large enough x, f(x)≤cg(x).
Definition Let Y be a database and x be a query sequence, then a substring x[i_{1},i_{2}] with i_{1}≤i_{2} is a set-maximal match between x and Y if there exists a j such that:

1. x[i_{1},i_{2}]=y_{j}[i_{1},i_{2}],
2. x[i_{1}−1]≠y_{j}[i_{1}−1] and x[i_{2}+1]≠y_{j}[i_{2}+1],
and for any j^{′}≠j, there exists no interval [i_{1}^{′},i_{2}^{′}] such that x[i_{1}^{′},i_{2}^{′}]=y_{j^{′}}[i_{1}^{′},i_{2}^{′}] and [i_{1},i_{2}] is a strict subset of [i_{1}^{′},i_{2}^{′}].
The first two properties ensure that the match is locally maximal, namely that it cannot be extended on either side, and the last property is satisfied when the match is not strictly contained in another match with a different database entry. Sometimes, it is useful to keep only long enough matches; in this case, the definition has an extra parameter called the threshold.
Definition A match is a set-maximal match between x and Y with threshold t if it is a set-maximal match between x and Y and additionally its length is more than or equal to t.
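The two definitions above can be checked directly on small inputs. The following is a minimal brute-force sketch in Python, intended only as a plaintext reference and not as the protocol of this work; the function name `set_maximal_matches`, the 0-indexed positions and the treatment of positions outside the sequence as mismatches are assumptions made for illustration.

```python
def set_maximal_matches(x, Y, t=1):
    """Return sorted (j, i1, i2) triples, 0-indexed and inclusive."""
    n = len(x)

    def runs(y):
        # Locally maximal matching intervals between x and y (conditions 1 and 2).
        out, start = [], None
        for i in range(n + 1):
            same = i < n and x[i] == y[i]
            if same and start is None:
                start = i
            elif not same and start is not None:
                out.append((start, i - 1))
                start = None
        return out

    all_runs = [runs(y) for y in Y]
    result = []
    for j, rj in enumerate(all_runs):
        for (a, b) in rj:
            if b - a + 1 < t:
                continue  # shorter than the threshold
            # Last condition: no other row matches x on a strictly larger interval.
            contained = any(
                c <= a and b <= d and (c, d) != (a, b)
                for k, rk in enumerate(all_runs) if k != j
                for (c, d) in rk
            )
            if not contained:
                result.append((j, a, b))
    return sorted(result)
```

For example, with x = 1110 and the database rows 1111 and 0110, both rows contribute a match of length three, and neither interval contains the other.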
Cryptographic tools
Secure computation allows multiple parties to compute a joint function while preserving the privacy of their individual inputs. There are feasibility results [17, 18] that allow us to compile any computation into a secure one. Unfortunately, these methods do not preserve the efficiency of the original algorithm. Hence, it is often essential to design novel algorithms that take into account the special nature of secure computation protocols.
Security model. Let P_{0} and P_{1} be two parties holding inputs x_{0} and x_{1} respectively. They are interested in jointly computing a function f(x_{0},x_{1}). A protocol between the two parties is called secure (or privacy-preserving) if it leaks nothing about the inputs x_{0} (to P_{1}) and x_{1} (to P_{0}) apart from the output value f(x_{0},x_{1}) (and the length of the inputs). There are two well-studied adversarial models, the semi-honest and the malicious. A semi-honest adversary observes all communication between the computing parties (and tries to learn information about the inputs), but is not allowed to deviate from the protocol. On the contrary, a malicious adversary is allowed to arbitrarily deviate from the protocol in order to try to learn extra information about the inputs. In both cases, the adversary is assumed to be computationally bounded. In this work, the two involved parties P_{0} and P_{1} are the client and the server and our protocol is secure in the semi-honest model.
In this work, we focus on the Goldreich-Micali-Wigderson (GMW) method [17] for secure computation with boolean secret shares. This method gives a way to privately compute XOR (⊕), AND (∧) and NOT (¬) gates in the semi-honest model. Since all computations can be expressed by a circuit containing only these three types of gates, boolean sharing allows us to perform every computation in a private fashion. We briefly describe how operations are performed in boolean sharing and some of the available optimizations and implementations.
Intuitively, secret sharing of a value x is a split of x in many parts such that each part does not reveal any information about x, but the knowledge of all the parts allows the recovery of x. The GMW framework suggests a specific way to share values such that it is possible to perform any operation on them. Namely, let us assume that there are two parties and each party knows a share of x, then using GMW they can end up with shares of any function of x.
Sharing of a bit b. The boolean shares of a bit b are two bits 〈b〉_{0} and 〈b〉_{1} such that 〈b〉_{0}⊕〈b〉_{1}=b.
Reconstruction of a bit b. If parties P_{0} and P_{1} have the shares 〈b〉_{0} and 〈b〉_{1} respectively, then they reconstruct the bit b by exchanging shares and computing 〈b〉_{0}⊕〈b〉_{1}.
Computing XOR privately. Assume that party P_{i} knows the secret shares 〈x〉_{i} and 〈y〉_{i} of the bits x and y respectively. Then, party P_{i} computes a share of x⊕y by locally computing 〈x〉_{i}⊕〈y〉_{i}.
Computing AND privately. Assume that party P_{i} knows the secret shares 〈x〉_{i} and 〈y〉_{i} of the bits x and y respectively. The shares of x∧y are evaluated using a precomputed multiplication triple (〈a〉_{i},〈b〉_{i},〈c〉_{i}) of the bits a,b,c such that a∧b=c. Initially, P_{i} computes the shares 〈e〉_{i}=〈a〉_{i}⊕〈x〉_{i} and 〈f〉_{i}=〈b〉_{i}⊕〈y〉_{i}. Then, both parties P_{0} and P_{1} reconstruct the values e and f. Finally, P_{i}’s new share is equal to (i∧e∧f)⊕(f∧〈a〉_{i})⊕(e∧〈b〉_{i})⊕〈c〉_{i}.
Computing NOT privately. Assume that party P_{i} knows the secret share 〈x〉_{i} of the bit x. Then, party P_{i} computes a share of ¬x by locally computing 〈x〉_{i}⊕i.
We remark that the computation of the multiplication triples does not depend on the actual computation or input, so it can be done in advance during a precomputation phase which requires interaction between the parties. The multiplication triples can be computed using the cryptographic primitive of random oblivious transfer [19]. Additionally, we note that computing XOR and NOT gates is done locally and does not require any interaction. On the other hand, an AND operation requires one round of interaction in order to reconstruct the values e and f. Therefore, since we can compute many AND operations in parallel, the number of rounds required for computing a function f is proportional to the AND-depth of its circuit representation.
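The gate rules above can be exercised in a few lines. The sketch below simulates both parties in a single Python process, so it illustrates only the share arithmetic and offers no security; the function names are hypothetical, and `make_triple` is a trusted-dealer stand-in for the oblivious-transfer precomputation, which is an assumption of this example.

```python
import secrets

def share(b):
    """Split bit b into two XOR shares (share of P_0, share of P_1)."""
    s0 = secrets.randbits(1)
    return s0, s0 ^ b

def reconstruct(s0, s1):
    return s0 ^ s1

def xor_gate(xs, ys):
    # Local: each party XORs the two shares it holds.
    return xs[0] ^ ys[0], xs[1] ^ ys[1]

def not_gate(xs):
    # Party P_i XORs its share with i, so only P_1 flips its share.
    return xs[0] ^ 0, xs[1] ^ 1

def make_triple():
    # Stand-in for the precomputation phase (trusted dealer; assumption).
    a, b = secrets.randbits(1), secrets.randbits(1)
    return share(a), share(b), share(a & b)

def and_gate(xs, ys, triple):
    a, b, c = triple
    # Each party masks its shares; e and f are then opened (one round).
    e = reconstruct(a[0] ^ xs[0], a[1] ^ xs[1])
    f = reconstruct(b[0] ^ ys[0], b[1] ^ ys[1])
    # New share of P_i: (i AND e AND f) XOR (f AND a_i) XOR (e AND b_i) XOR c_i.
    return tuple((i & e & f) ^ (f & a[i]) ^ (e & b[i]) ^ c[i] for i in (0, 1))
```

Only `and_gate` opens values (e and f), which is why the number of communication rounds follows the AND-depth of the circuit rather than its size.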
Since the introduction of GMW, various optimizations have been proposed [19–21]. Our implementation is based on the ABY framework [22], which provides semi-honest security. Apart from GMW on boolean shares, this framework provides implementations of two other well-studied methods for secure computation, secure computation using arithmetic shares and Yao’s Garbled Circuits. The ABY framework is suitable for mixed protocols, since it allows for very efficient conversion between the different secure computation methods. Even though our solution is not a mixed protocol, we use ABY since it includes all known optimizations of GMW and it allows the composition of our protocol with others that are potentially more efficient if implemented using another method of secure computation.
Two-party secure protocols using the GMW framework. If party P_{0} has input x_{0} and party P_{1} has input x_{1}, then they can privately compute a function f, which is given in the form of a circuit containing XOR, AND and NOT gates, as follows:

1. Party P_{i} secret shares x_{i} by sending a uniformly random binary string r_{i} with length equal to its input to P_{1−i} and setting its own shares equal to x_{i}⊕r_{i}, where ⊕ denotes the bitwise XOR operation.
2. The two parties compute the function f gate-by-gate as described above and end up with boolean shares of the output.
3. The two parties exchange shares and reconstruct the output of the function by computing the XOR of their shares with the shares received from the other party.
By slightly modifying the above protocol, it is possible to achieve selective reconstruction of the output, in which case only one of the parties learns the output. For instance, if only the client should learn the output, then at the third step the server sends its shares and the client sends nothing. In this case, the client has enough information to recover the output, whereas the server does not learn the output.
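As a small illustration of the three protocol steps with selective reconstruction, the sketch below computes the position-wise equality bits ¬(x⊕y) of two bit strings, a circuit that needs only local gates. Both parties are simulated in one process; the function names and the choice of the server as P_{1} (the party that applies the NOT flip) are assumptions made for this example.

```python
import secrets

def share_input(bits):
    """Step 1: the owner keeps x XOR r and sends the random string r."""
    r = [secrets.randbits(1) for _ in bits]
    kept = [b ^ ri for b, ri in zip(bits, r)]
    return kept, r

def equality_bits_selective(client_x, server_y):
    # Step 1: each party shares its own input.
    x_client, x_server = share_input(client_x)  # shares of the query x
    y_server, y_client = share_input(server_y)  # shares of the database row y
    # Step 2: NOT(x XOR y) gate by gate; XOR is local, and only P_1 flips for NOT.
    out_client = [a ^ b for a, b in zip(x_client, y_client)]
    out_server = [a ^ b ^ 1 for a, b in zip(x_server, y_server)]
    # Step 3 (selective): only the server sends its output shares,
    # so the client alone can reconstruct the result.
    return [a ^ b for a, b in zip(out_client, out_server)]
```

In this simulation the return value is what the client would hold after the last step; the server's view consists only of the random string it received and its own output shares.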
Results
The GMW framework allows us to transform any computation given in the form of a circuit into a privacy-preserving one against semi-honest adversaries. Therefore, we focus on designing a depth-optimized circuit with XOR, AND and NOT gates to compute set-maximal matches. Then, using the generic protocol described in the previous section, we have a secure protocol for set-maximal matches that requires at most as many rounds of interaction as the depth of the circuit.
Problem definition
We give an efficient and depth-optimized algorithm for computing set-maximal matches. More specifically, the problem specification is as follows:
Input: A genomic database Y containing m sequences, each of size n, a query genomic sequence x of size n, and a threshold value t.
Output: A matrix M of size m×n such that the element M_{j,i} is equal to the length of the match between x and y_{j} ending at position i if the match is set-maximal with threshold t and 0 otherwise.
We note that the output as described above leaks the position of set-maximal matches. Because of the very sensitive nature of genomic sequences, it is beneficial to hide even this information. Therefore, we slightly modify the above problem so that the list of set-maximal matches is revealed after applying a random permutation to each row of the output.
Input: A genomic database Y containing m sequences, each of size n, a query genomic sequence x of size n, a threshold value t and m permutations \((\pi _{k})_{k \in \{1,\dots, m\}}\).
Output: Let M be a matrix of size m×n such that the element M_{j,i} is equal to the length of the match between x and y_{j} ending at position i if the match is set-maximal with threshold t and 0 otherwise. The output is the permutation of each row k of M according to π_{k}.
We observe that the output still reveals the lengths of set-maximal matches for each index. This information could be hidden by applying a random permutation on all the entries of M, instead of each row. However, this would likely reduce the applicability of the computation, since this information seems central for certain applications.
In the secure protocol, the database Y and the permutations \((\pi _{k})_{k \in \{1,\dots, m\}}\) are the input of the server, the query x is the input of the client and the threshold t is known to both parties. To avoid leakage to the server, the protocol has selective reconstruction to the client.
Algorithm description
We describe our algorithm. We include a sample execution in Fig. 1.

1. We compute a matrix M of size m×n such that M_{j,i} is equal to the length of the match between the query x and the database entry y_{j} ending at position i (Fig. 1b).
2. We set M_{j,i} to 0 if the match of x and y_{j} ending at position i is below the threshold t (Fig. 1c).
3. We compute a matrix L of size m×n such that L_{j,i} is 0 if there is a j^{′}≠j such that \(\mathbf {M}_{j',i} > \mathbf {M}_{j,i}\) and 1 otherwise (Fig. 1d).
4. We compute K such that K_{j,i} is 0 if there is a j^{′} (possibly equal to j) such that \(\mathbf {L}_{j',i}= \mathbf {L}_{j',i+1} = 1\). Namely, there exists a match that is extended to position i+1 (Fig. 1e).
5. We set M_{j,i}←M_{j,i}K_{j,i}L_{j,i} and we permute the row M_{k} according to permutation π_{k} (Fig. 1f).
After the steps described above, the matrix M contains the correct output:

The output M contains the correct length of matches computed in step 1.

The output M contains no matches of length less than t, since all such matches have been removed in step 2.

The output M does not contain matches strictly contained in another match. If a match is contained in another larger match, then either there is a match with a preceding starting point or a match with a succeeding ending point or both. In the first and third cases, this match is excluded in step 3, since there is another database entry with larger value in the corresponding positions of M. In the second case, the match is excluded in step 4, since there is another extendable match in the corresponding positions.

The output M contains only locally maximal matches. After step 1, M contains the length of matches from their starting position, so it is not possible for a match to be extendable toward a previous position. Additionally, from step 4, M does not contain matches extendable towards the next position.
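The five steps can be mirrored by a plaintext reference computation, which is convenient for testing a circuit implementation on small inputs. The sketch below is not the secure protocol and is not taken from the authors' code; the function name and the 0-indexed positions are hypothetical. Two choices are assumptions of this sketch: a row counts as a witness for extension in step 4 only if its entry at position i+1 is nonzero (so that a column consisting only of zeros is not read as an extension), and a permutation π is applied as new_row[k] = row[π[k]].

```python
def plaintext_steps(x, Y, t, perms=None):
    m, n = len(Y), len(x)
    # Step 1: M[j][i] = length of the match between x and Y[j] ending at i.
    M = [[0] * n for _ in range(m)]
    for j in range(m):
        for i in range(n):
            if x[i] == Y[j][i]:
                M[j][i] = (M[j][i - 1] if i > 0 else 0) + 1
    # Step 2: zero out matches shorter than the threshold t.
    M = [[v if v >= t else 0 for v in row] for row in M]
    # Step 3: L[j][i] = 1 iff no other row has a larger entry in column i.
    col_max = [max(M[j][i] for j in range(m)) for i in range(n)]
    L = [[int(M[j][i] == col_max[i]) for i in range(n)] for j in range(m)]
    # Step 4: K[i] = 0 iff some row is maximal at i and at i+1.
    # K does not depend on the row, so it is stored once per column.
    # Assumption: the witness row must also be nonzero at i+1.
    K = [1] * n
    for i in range(n - 1):
        if any(L[j][i] and L[j][i + 1] and M[j][i + 1] > 0 for j in range(m)):
            K[i] = 0
    # Step 5: keep only surviving entries, then permute each row.
    out = [[M[j][i] * K[i] * L[j][i] for i in range(n)] for j in range(m)]
    if perms is not None:
        out = [[row[p] for p in perm] for row, perm in zip(out, perms)]
    return out
```

On x = 1110 with database rows 1111 and 0110 and t = 1, the unpermuted output has a single nonzero entry per row: length 3 ending at the third position for the first row and length 3 ending at the fourth position for the second row.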
We now describe how to implement each of these steps using boolean circuits with AND, XOR and NOT gates in a depth-optimized way.
Compute the length of matches: First, we compute an auxiliary matrix B^{(0)} such that \(\mathbf {B}^{(0)}_{j,i} = 1\) if x[i]=y_{j}[i]. Namely, \(\mathbf {B}^{(0)}_{j,i} = \lnot (\mathbf {x}[i] \oplus \mathbf {y}_{j}[i])\). We observe that B^{(0)} indicates whether a match has length greater than or equal to one. Using B^{(0)}, we compute whether a match has length greater than or equal to two by setting \(\mathbf {B}_{j,i}^{(1)} \leftarrow \mathbf {B}_{j,i}^{(0)} \wedge \mathbf {B}_{j,i-1}^{(0)}\). More generally, if B^{(k)} indicates whether a match has length more than 2^{k} or not, then it can be updated to indicate if the length of a match is more than 2^{k+1} by setting \(\mathbf {B}_{j,i}^{(k+1)} \leftarrow \mathbf {B}_{j,i}^{(k)} \wedge \mathbf {B}_{j,i-(2^{k}-1)}^{(k)}\).
Concurrently, in each iteration we compute a bound on the length of each match by setting M^{(0)}=B^{(0)} and \(\mathbf {M}_{j,i}^{(k+1)} \leftarrow \mathbf {B}_{j,i-(2^{k}-1)}^{(k)} \mathbf {M}_{j,i-(2^{k}-1)}^{(k)} +\mathbf {M}_{j,i}^{(k)} \). We observe that this computation gives the actual length of a match if it is less than or equal to 2^{k} and returns the lower bound of 2^{k} otherwise. Hence, after ⌈log(n)⌉ iterations, we set M=M^{(⌈log(n)⌉)}.
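The doubling idea can be illustrated on a single database row in plaintext. The sketch below is one way to realize it and is not the authors' circuit: it shifts by 2^{k} positions at level k and gates the added term with the indicator at position i, an indexing chosen so that the first level looks at the neighbour at i−1 as described above; the indices printed in the text may be offset differently. The function name is hypothetical.

```python
def match_lengths_by_doubling(b0):
    """b0[i] = 1 iff x[i] == y[i]; returns the match length ending at each i."""
    n = len(b0)
    B = list(b0)  # B[i] at level k: the match ending at i has length >= 2^k
    M = list(b0)  # M[i] at level k: min(match length ending at i, 2^k)
    rounds = (n - 1).bit_length()  # ceil(log2(n)), computed without floats
    for k in range(rounds):
        s = 1 << k
        shifted_B = [B[i - s] if i >= s else 0 for i in range(n)]
        shifted_M = [M[i - s] if i >= s else 0 for i in range(n)]
        # One layer of ANDs (for B) and one addition (for M) per level;
        # every position can be processed in parallel.
        M = [M[i] + B[i] * shifted_M[i] for i in range(n)]
        B = [B[i] & shifted_B[i] for i in range(n)]
    return M
```

Each level doubles the window that B certifies, so the loop runs ⌈log(n)⌉ times regardless of the data, which is what a circuit with fixed depth requires.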
The addition is computed using a depth-optimized adder, which has AND-depth proportional to the logarithm of the bit length of the numbers returned [23]. Namely, the AND-depth of the adder is O(log(logn)). Overall, the AND-depth of the length computation is O(log(n) log(logn)).
Remove matches with length below the threshold: Let t_{k}≡t (mod 2^{k}) for \(k \in \left \{1, \dots, \lceil \log (n)\rceil \right \}\) and \((b_{\lceil \log (n)\rceil }, \dots, b_{1})\) be the bit decomposition of t, where b_{⌈log(n)⌉} is the most significant bit and b_{1} is the least significant bit. Let T^{(k)} be m×n matrices that indicate candidate matches; initially \(\mathbf {T}_{j,i}^{(0)} = 1\) for all i and j. Similarly to the length computation, we define the auxiliary matrix B^{(0)} that initially indicates whether a database and query position match or not. In each iteration, we update B^{(k)} as in the length computation; namely, we set \(\mathbf {B}_{j,i}^{(k)} \leftarrow \mathbf {B}_{j,i}^{(k-1)} \wedge \mathbf {B}_{j,i-(2^{k} - 1)}^{(k-1)} \). In the kth iteration if b_{k}=1, we update \(\mathbf {T}^{(k)}_{j,i} \leftarrow \mathbf {B}_{j,i - (t_{k} - 1)}^{(k)}\mathbf {T}^{(k-1)}_{j,i}\), otherwise T^{(k)}=T^{(k−1)}. Finally, we set \(\mathbf {M}_{j,i} \leftarrow \mathbf {M}_{j,i} \mathbf {T}_{j,i}^{(\lceil \log (n) \rceil)}\).
This computation intuitively splits the t positions preceding a specific position i into parts of increasing powers of two, then it iteratively checks whether each of these parts is a match. Splitting a number into increasing powers of two is equivalent to computing its bit decomposition. Even though the AND-depth of this step is O(log(n)), it can be performed in parallel to the length computation where we also use the same auxiliary matrix B. So, this step does not increase the depth of the circuit.
Remove matches contained in other matches: We first compute the maximum length of a match for each position i across all the database entries and then perform an equality check between M_{j,i} and the maximum for position i to compute each L_{j,i}. Each such maximum is computed using the D&C comparison circuit [24] in AND-depth O(log(logn) log(m)). The equality check can be implemented with a depth-optimized circuit in O(log(logn) logm) depth in which the comparison of the bits across the database entries is performed in parallel.
Remove extendable matches: A match between x and y_{j} is extendable at position i if B_{j,i+1}=1. So, in order to remove the extendable matches, we compute \(\mathbf {K}_{j,i}= \lnot \max _{j'}\left \{\mathbf {L}_{j',i} \wedge \mathbf {L}_{j',i+1}\right \}\) that indicates whether an extendable match exists. Finally, we update M_{j,i}←K_{j,i}∧L_{j,i}∧M_{j,i}. Using the D&C comparison, this operation requires O(log(logn) logm) AND-depth.
Permute matches: Finally, to remove the information regarding the position of matches, we permute each row k of the matrix M according to a permutation π_{k}, which is given as an input. All permutations are performed in parallel and require AND-depth O(log(n)) using the Waksman permutation network [25].
The total AND-depth of the above circuit is O(log(n)+ log(logn) log(m)). Since in practice n≫m, the depth is effectively O(log(n)).
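The threshold test from the description above, which splits the t positions ending at i into blocks whose sizes are the powers of two in the bit decomposition of t, can be sketched in plaintext for one database row as follows. This is an illustrative reading and not the authors' circuit: bits are indexed from 0 here (the text indexes them from 1), the window indicator is doubled with a shift of 2^{k}, and the function name is hypothetical.

```python
def threshold_mask(b0, t):
    """T[i] = 1 iff the t positions ending at i all match (assumes t >= 1)."""
    n = len(b0)
    B = list(b0)   # B[i] at level k: the 2^k positions ending at i all match
    T = [1] * n    # candidate indicator, initially all ones
    offset = 0     # t mod 2^k: number of positions already checked
    k = 0
    while (1 << k) <= t:
        s = 1 << k
        if t & s:
            # Bit k of t is set: check the block of 2^k positions ending at i - offset.
            T = [T[i] & (B[i - offset] if i >= offset else 0) for i in range(n)]
            offset += s
        # Double the window certified by B for the next level.
        B = [B[i] & (B[i - s] if i >= s else 0) for i in range(n)]
        k += 1
    return T
```

Multiplying the length matrix entrywise by this mask zeroes every match shorter than t. Since t is public, the branch on its bits does not depend on secret data, and the loop reuses the same doubling of B as the length computation.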
We now present an optimization that reduces the output size and the computational cost of the permutations. Even though this does not offer an asymptotic optimization, it is definitely helpful in the experimental efficiency of the protocol.
Output size reduction and efficient permutations: Before the permutation step, the above circuit outputs a matrix M that contains all set-maximal matches with threshold t. We note that the threshold t guarantees that there exist no set-maximal matches ending at positions with distance less than t. In other words, in each row of the matrix M there is at most one nonzero element for every t positions. By combining every t columns of M into one equal to their XOR, we reduce the size of the output by a factor of t without losing the desired information about set-maximal matches.
After the output reduction, we need to permute each row of a matrix of size m×n/t. Therefore, the Waksman permutation network has depth only O(log(n/t)).
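The column-combining step can be written in a few lines for a single row. The sketch below is a plaintext illustration with a hypothetical function name; it relies on the property stated above that each block of t consecutive positions holds at most one nonzero entry, so the XOR of a block equals that entry (or 0).

```python
from functools import reduce
from operator import xor

def reduce_row(row, t):
    """XOR together each block of t consecutive entries of one output row."""
    return [reduce(xor, row[s:s + t], 0) for s in range(0, len(row), t)]
```

For a row of length n this yields ⌈n/t⌉ entries, which is the input size seen by the permutation network after the reduction.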
Experimental evaluation
We have implemented our protocol for secure computation of set-maximal matches using the ABY framework to evaluate its efficiency. We use the network configuration included in ABY for the communication between the two parties and the default method for precomputing multiplication triples via oblivious transfer. Then, we build the computation circuit gate-by-gate and let ABY handle the sharing procedures and secure computation. The problem has three parameters: n, the size of the genomic sequences, m, the size of the database, and t, the threshold of set-maximal matches. We run a series of simulations on a single machine with 16 GB RAM and a quad-core 2.8 GHz CPU that simulates both the server and the client to evaluate the efficiency of the protocol with respect to these parameters.
In our implementation, the server has as input the database, which defines the parameters n and m, and the threshold value t. Similarly, the client’s input is a query of size n, the database size m and the threshold value t. We note that because of the nature of the GMW protocol both parties need to know the values of all three parameters n, m and t. We evaluate how our protocol scales as n increases for database size m=10,100,1000 and threshold t=1 and t=2^{k}, where k is the bit length of n. Even though the threshold value is an input to the protocol, we plot only two values for clarity of exposition. The two values represent the best and worst values in terms of efficiency. The threshold value 2^{k} maximizes the efficiency gain from the output reduction optimization. On the contrary, when the threshold value t=1, this optimization is not in use, and hence this value corresponds to the worst case running time. In Fig. 2, we observe that for large enough values of n there is indeed an improvement of both the running time and the depth due to the output reduction technique.
Figure 2a, c, and e show the running time of the protocol for three different values of m. For small n and m, the running time essentially depends on the precomputation phase, but when m and n are large enough the running time for a given m depends almost linearly on n. We observe that for small values of n, there is almost no difference in the running time for the two threshold values. In this case, the output is reduced by a small factor when t=2^{k}, whereas the precomputation time increases, since the computation circuit needs to include the output reduction. Hence, for small values of n, there is no improvement in the efficiency from the output reduction optimization.
In Fig. 2b, d, and f, we notice that the depth depends mainly on log(n), which is what was expected by the analysis in the previous section.
Discussion
Because of the developments in efficiently acquiring DNA data, computing on genomic data has gained a lot of attention in recent years. At the same time, progress on cryptographic techniques has allowed us to design numerous protocols for private computation that work well in practice. The connection of these two areas of research has led to fascinating directions and applications.
We make progress in an important problem lying in the intersection of these areas concerning the similarity of genomic data. Our motivating application is that of identifying relatives without compromising the privacy of the genomic data of the individuals involved.
Conclusions
We construct an efficient algorithm for computing set-maximal matches, which is compatible with secure computation approaches. More specifically, our algorithm is designed carefully so that it remains efficient when compiled in the GMW framework, which offers a generic way to perform secure computation in the semi-honest model of security.
We implement and evaluate our algorithm using the ABY framework. Our algorithm runs for relatively large datasets and the behavior of the running time and the rounds of interaction is compatible with our theoretical analysis. The ABY framework is a very general framework that offers many capabilities. Unfortunately, this generality hurts the efficiency of our protocol, so it would be beneficial for the practical efficiency of our scheme to implement it using a tailored secure computation protocol, which is more lightweight and contains only the parts necessary for our protocol.
This work extends an exciting line of research that combines cryptographic techniques for secure computation with efficient and accurate algorithms for genomic analysis. Our main contribution is on securely finding set-maximal matches between a database and a query sequence.
Availability of data and materials
The code used during the current study is available from the corresponding author on reasonable request.
References
1. Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed N. Privacy-preserving techniques of genomic data–a survey. Brief Bioinform. 2017; 20(3):1–9.
2. Lipman D, Pearson W. Rapid and sensitive protein similarity searches. Science. 1985; 227(4693):1435–41.
3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
4. Kent WJ. Blat - the blast-like alignment tool. Genome Res. 2002; 12:656–64.
5. Ma B, Tromp J, Li M. Patternhunter: faster and more sensitive homology search. Bioinformatics. 2002; 18(3):440–5.
6. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25(14):1754–60.
7. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10(3):25.
8. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010; 11(5):473–83.
9. Durbin R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics. 2014; 30(9):1266–72.
10. Freedman MJ, Ishai Y, Pinkas B, Reingold O. Keyword search and oblivious pseudorandom functions. In: Proceedings Theory of Cryptography, Second Theory of Cryptography Conference, TCC 2005, February 10-12, 2005. Cambridge: Springer Berlin Heidelberg: 2005. p. 303–24.
11. Jha S, Kruger L, Shmatikov V. Towards practical privacy for genomic computation. In: 2008 IEEE Symposium on Security and Privacy (sp 2008). IEEE: 2008. p. 216–30.
12. Blanton M, Aliasgari M. Secure outsourcing of DNA searching via finite automata. In: Conference on Data and Applications Security (DBSec). Berlin: Springer: 2010. p. 49–64.
13. Baldi P, Baronio R, De Cristofaro E, Gasti P, Tsudik G. Countering gattaca: Efficient and secure testing of fully-sequenced human genomes. In: Proceedings of the 18th ACM Conference on Computer and Communications Security. CCS ’11. New York: ACM: 2011. p. 691–702.
14. He D, Furlotte NA, Hormozdiari F, Joo JWJ, Wadia A, Ostrovsky R, Sahai A, Eskin E. Identifying genetic relatives without compromising privacy. Genome Res. 2014; 24(4):664–72.
15. Shimizu K, Nuida K, Rätsch G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics. 2016; 32:1652–61.
16. iDASH. 2018. http://www.humangenomeprivacy.org/2018/. Accessed 17 June 2019.
17. Goldreich O, Micali S, Wigderson A. How to play any mental game. In: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing. STOC ’87. New York: ACM: 1987. p. 218–29.
18. Yao AC. Protocols for secure computations. In: Proceedings of the 23rd Annual Symposium on Foundations of Computer Science. SFCS ’82. Washington: IEEE Computer Society: 1982. p. 160–4.
19. Asharov G, Lindell Y, Schneider T, Zohner M. More efficient oblivious transfer extensions. J Cryptol. 2017; 30(3):805–58.
20. Ishai Y, Kilian J, Nissim K, Petrank E. Extending oblivious transfers efficiently. In: Boneh D, editor. Advances in Cryptology - CRYPTO 2003. Berlin, Heidelberg: Springer: 2003. p. 145–61.
21. Schneider T, Zohner M. GMW vs. Yao? Efficient secure two-party computation with low depth circuits. In: Sadeghi AR, editor. Financial Cryptography and Data Security. Berlin: Springer: 2013. p. 275–92.
22. Demmler D, Schneider T, Zohner M. ABY - A framework for efficient mixed-protocol secure two-party computation. In: 22nd Annual Network and Distributed System Security Symposium, NDSS 2015, February 8-11. San Diego: Internet Society: 2015.
23. Ladner RE, Fischer MJ. Parallel prefix computation. J ACM. 1980; 27(4):831–8.
24. Garay J, Schoenmakers B, Villegas J. Practical and secure solutions for integer comparison. In: Public Key Cryptography. Berlin: Springer: 2007. p. 330–42.
25. Waksman A. A permutation network. J ACM. 1968; 15(1):159–63.
Acknowledgements
The authors would like to thank the iDASH workshop organizers and the reviewers for the helpful comments.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 13 Supplement 7, 2020: Proceedings of the 7th iDASH Privacy and Security Workshop 2018. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-13-supplement-7.
Funding
Publication costs were funded by Microsoft Research.
Author information
Contributions
KS was a major contributor in designing the algorithm, implementing the protocol and writing the manuscript. EG contributed in the protocol implementation and in examining the correctness of the algorithm. HC suggested using the Waksman permutation networks in the last step of the protocol and the optimization for the output size reduction. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
This work was conducted when KS was an intern at Microsoft Research and EG and HC were employed by Microsoft Research.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sotiraki, K., Ghosh, E. & Chen, H. Privately computing set-maximal matches in genomic data. BMC Med Genomics 13 (Suppl 7), 72 (2020). https://doi.org/10.1186/s12920-020-0718-x
DOI: https://doi.org/10.1186/s12920-020-0718-x