Recent improvements in genomic data sharing have given researchers and clinicians access to data from millions of individuals and the ability to compare across them. This development has made genetic variant interpretation easier and, in some cases, has aided the treatment of rare diseases such as certain cancer types [1]. Most large organisations, e.g., the Broad Institute in the U.S., BGI in China, and the Wellcome Trust Sanger Institute in the UK, have an interest in making DNA data easier to access so that their researchers can treat patients individually. However, twelve years after the completion of the Human Genome Project, the tremendous growth of genomic data has exceeded the capacity of the containers built to hold it. Genomic and clinical data are generally still collected by disease, by institution, or by country. More importantly, current data sharing privacy requirements do not necessarily protect an individual’s identity within and across institutions and countries. Furthermore, data are often stored in incompatible file formats, and no standardized tools or analytical methods are in place [1–3].
To meet this need for a global genomic and clinical data repository system, the Global Alliance for Genomics and Health (GA4GH) has created a federated data ecosystem called the Beacon data network, which aims to make searching genomic data as simple as searching the World Wide Web. Since the project’s launch in the middle of 2015, the beacon network has grown to 23 organizations covering over 250 genomic datasets. The datasets served through beacons can be queried individually or in aggregate via the Beacon Network, a federated search engine (http://www.beacon-network.org) [1]. The Beacon Project thus aims to simplify data sharing through a web service (a beacon) that provides only allele-presence information. Users can query institutional beacons for information about the genomic data available at the institution. For example, a user could ask a beacon whether it holds a genome with a specific nucleotide at a given position, and the beacon would respond either yes or no [4]. Because they provide only allele-presence information, beacons were assumed to be safe from attacks that require allele frequencies.
Although the beacon network was set up to share data and protect patient privacy simultaneously, it could potentially leak phenotype and membership information about an individual [4]. There is currently no cap on the number of queries a user can make to a beacon. Recently, Shringarpure and Bustamante showed that anonymous patients whose DNA data are shared via the beacon network can be re-identified [5]. If an attacker has access to a victim’s DNA, he or she can query different beacons to see whether the victim is in a dataset. They further demonstrated that it is possible to infer whether or not the victim is affected by a certain condition or disease [5]. Anonymous beacons are therefore inherently insecure and open to re-identification attacks. For brevity, we refer to this attack as the Bustamante attack throughout the rest of the paper.
Very recently, some solutions [6, 7] have been proposed based on different policies governing access to the beacon service. However, these solutions disrupt the quintessential features of the beacon service: fast access to genomic data and open access for the research community. Access controls are highly necessary for human genomic data where phenotype or other sensitive information about a disease is disclosed. The beacon service, however, provides only aggregate yes or no answers, which lead the researcher to a decision about a dataset’s relevance to his or her research. Therefore, we propose two solutions based on privacy-preserving techniques, which fit well with the beacon service and mitigate the possibility of identifying an individual in the dataset.
In this article, we explain the ‘Bustamante Attack’ [5] on genomic beacon services and propose two privacy-preserving solutions. The contributions of this article are twofold: a) understanding the statistical formulation and soundness of the attack, and b) analyzing lightweight privacy-preserving solutions to mitigate the attack. The main contributions of our work are as follows:
- We present the statistical and mathematical model of the attack in a simplified form. This helps us analyze different and more realistic parameters in the original attack framework, exploit some of its weaknesses, and justify our solutions accordingly.
- We show the steps required for any data owner to calculate the risk involved in sharing genomic data through a beacon service.
- We propose two easy-to-implement, lightweight privacy-preserving solutions that preserve the applicability of the beacon service as well as the privacy of the participants.
- We provide extensive experiments on synthetic data (generated according to [5]) to evaluate the privacy-utility trade-off of our proposed methods, which will help the future development of privacy-preserving techniques against this attack model.
Beacon service for genomic data
A beacon is a web search engine developed by the Global Alliance for Genomics and Health (GA4GH) that gives genomic data owners and research institutes a way to share genomic data easily while maintaining patients’ privacy (Fig. 1). It is a genetic mutation sharing platform that allows any user to query an institution’s databases to determine whether they contain a genetic variant of interest while keeping all other sequence data obscured. A query in this search engine is defined by three parameters: chromosome number, position on that chromosome, and target nucleotide (A/T/G/C). A beacon answer is either true or false, denoting the presence or absence of that nucleotide at that position on the target chromosome. In other words, it only answers yes or no to questions such as: Do you have any genomes with an ‘A/T/G/C’ at position ‘Y’ on chromosome ‘Z’? This allows a researcher to target specific datasets that are relevant to his or her research. The service also helps a clinician check whether a mutation found in one of her patients is also present in others without actually having access to their genomes [8].
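The query model described above can be sketched in a few lines of Python. This is only an illustration: `BeaconQuery` and `ToyBeacon` are hypothetical names introduced here, the in-memory storage is an assumption, and none of this reflects the actual GA4GH implementation or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconQuery:
    """One beacon query: chromosome, position, and target nucleotide."""
    chromosome: str
    position: int
    nucleotide: str  # one of "A", "T", "G", "C"


class ToyBeacon:
    """Hypothetical in-memory beacon; not the GA4GH implementation."""

    def __init__(self, genomes):
        # genomes: iterable of dicts mapping (chromosome, position) to the
        # set of nucleotides that individual carries at that site.
        self._present = set()
        for genome in genomes:
            for (chromosome, position), nucleotides in genome.items():
                for nucleotide in nucleotides:
                    self._present.add((chromosome, position, nucleotide))

    def query(self, q: BeaconQuery) -> bool:
        """Return True if at least one stored genome has the nucleotide."""
        return (q.chromosome, q.position, q.nucleotide.upper()) in self._present


# Two made-up individuals.
beacon = ToyBeacon([
    {("1", 10_000): {"A"}, ("2", 5_000): {"C", "T"}},
    {("1", 10_000): {"G"}},
])
print(beacon.query(BeaconQuery("1", 10_000, "G")))  # True
print(beacon.query(BeaconQuery("2", 5_000, "A")))   # False
```

The beacon stores only the union of observed (chromosome, position, nucleotide) triples, so a response reveals presence in the dataset but not which individual carries the allele.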
Beacons are easy for large-scale organizations to implement when it comes to sharing genomic data. They also save researchers a tremendous amount of time in tracking down useful data for their work [9, 10]. Unlike large centralized data repositories, a beacon network is distributed across many databases around the world, which are virtually connected through software interfaces that allow continuous authorised access. This federated data ecosystem allows each organization to retain control of its data within its own legal jurisdiction [1]. The shared Genomics API in the beacon framework makes it easy to query all beacons at once and ensures that the GA4GH team can quickly add new beacons to the network.
Bustamante attack on beacon service
A recent study by Shringarpure and Bustamante [5] developed a likelihood-ratio test that uses only allele-presence information to predict whether the genome of an individual is present in a beacon database. The study suggested that beacons are susceptible to re-identification attacks and can thus be exploited to invade genetic privacy. Since a beacon database includes data with known phenotype information such as cancer, autism, or other diseases, re-identification also potentially discloses phenotype information about an individual whose genomic data are present in the beacon [11]. Through simulations, the authors demonstrated that with just 5000 queries it was possible to identify someone, and even their relatives, in a beacon consisting of 1000 individuals. They found that re-identification is possible even in the presence of sequencing errors and variant-calling differences. They also demonstrated that, in a beacon constructed from 65 European individuals from the 1000 Genomes Project, membership can be detected with just 250 SNPs [5].
In this section, we briefly introduce the Bustamante attack and analyze its statistical methods. The goal of the attack is to determine whether a genomic sequence g belongs to a specific database with the help of the beacon service. To answer this question, the authors considered two hypotheses:
1. Null hypothesis \(H_{0}\): the query individual is not in the beacon service.
2. Alternative hypothesis \(H_{1}\): the query individual is in the beacon service.
To determine which hypothesis is correct, the adversary is allowed an unlimited number of queries to the beacon service. The adversary queries specific positions where the query individual has an alternate allele to see whether the beacon also contains an individual with the same allele. The responses of the beacon service therefore form a sequence \(x_{1},\ldots,x_{n}\) of yes or no answers. If we encode yes and no as ‘1’ and ‘0’ respectively, the answer sequence R is a binary vector. For example, if the query individual is in the database, we would get yes (or 1) for every query. However, if there are genome sequencing errors, we might get some wrong answers as well. This error rate is denoted by δ and is also taken into account by the attack [5].
There is another case to consider, in which multiple individuals in the database have the same allele. This is why the attacker needs to use the likelihood ratio of the two hypotheses about whether the individual is in the dataset. For a database of N genomes, the log-likelihood of the response sequence R under hypothesis \(H_{j}\), \(j\in\{0,1\}\), can be computed as follows:
$$ L_{H_{j}}(R)= \sum\limits_{i=1}^{n}\left[x_{i}\log P(x_{i}=1|H_{j})+(1-x_{i})\log P(x_{i}=0|H_{j})\right] $$
where n is the number of queries and \(x_{i}\) is the response of the beacon to the i-th query. \(x_{i}=1\) denotes that the queried allele is present in the database, which can come either from the target genome or from any of the other N−1 genomes. \(x_{i}\) is 0 only when the queried allele is not present in any of the N genomes.
In [5], the authors show, under some simplifying assumptions, that if the query individual is not in the beacon database, the number of yes responses \(\sum_{i=1}^{n}x_{i}\) follows a Binomial\((n,1-D_{N})\) distribution; otherwise it follows a Binomial\((n,1-\delta D_{N-1})\) distribution. In terms of the expected number of yes responses θ, the hypotheses can therefore be rewritten as follows:
1. Null hypothesis \(H_{0}\): \(\theta=\theta_{0}=n(1-D_{N})\).
2. Alternative hypothesis \(H_{1}\): \(\theta=\theta_{1}=n(1-\delta D_{N-1})\).
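To make the two hypotheses concrete, the following sketch simulates the number of yes responses under each one and compares it with the expected counts \(\theta_{0}\) and \(\theta_{1}\). The values chosen for `d_n`, `d_n_minus_1`, and `delta` are made up for illustration and are not taken from [5]; independent responses are assumed, in line with the binomial model.

```python
import random


def simulate_yes_count(n_queries, p_yes, rng):
    """Number of 'yes' answers among n independent beacon responses."""
    return sum(1 for _ in range(n_queries) if rng.random() < p_yes)


# Made-up illustrative values, not taken from [5].
n_queries = 1000
d_n, d_n_minus_1, delta = 0.30, 0.32, 1e-3

rng = random.Random(0)  # fixed seed so the run is repeatable

# H0: individual not in the beacon, P(yes) = 1 - D_N.
count_h0 = simulate_yes_count(n_queries, 1.0 - d_n, rng)
# H1: individual in the beacon, P(yes) = 1 - delta * D_{N-1}.
count_h1 = simulate_yes_count(n_queries, 1.0 - delta * d_n_minus_1, rng)

theta_0 = n_queries * (1.0 - d_n)
theta_1 = n_queries * (1.0 - delta * d_n_minus_1)
print("H0: simulated", count_h0, "expected", theta_0)
print("H1: simulated", count_h1, "expected", theta_1)
```

With a small error rate δ, almost every query is answered yes under \(H_{1}\), whereas under \(H_{0}\) a noticeable fraction of queries is answered no. This gap is what the likelihood-ratio test exploits.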
Therefore, we have:
$$ L_{H_{0}}(R)=\sum\limits_{i=1}^{n}\left[x_{i}\log(1-D_{N})+(1-x_{i})\log(D_{N})\right] $$
(1)
and for the alternative hypothesis,
$$ L_{H_{1}}(R)=\sum\limits_{i=1}^{n}\left[x_{i}\log(1-\delta D_{N-1})+(1-x_{i})\log(\delta D_{N-1})\right] $$
(2)
where \(D_{N}\) is the probability that none of the N individuals in the beacon has the specified allele at the queried position, and \(D_{N-1}\) is the same probability for the other N−1 individuals (all individuals except the query individual).
In essence, of the two log-likelihoods \(L_{H_{0}}(R)\) and \(L_{H_{1}}(R)\), the one belonging to the correct hypothesis will be the larger. Therefore, we compute \(\Lambda =L_{H_{0}}(R)-L_{H_{1}}(R)\), and the value of Λ indicates which hypothesis is true.
From Eqs. 1 and 2, the log-likelihood-ratio statistic can be rewritten as
$$\begin{array}{*{20}l} \Lambda &= L_{H_{0}}(R)- L_{H_{1}}(R) \\ &=n\log\left(\frac{D_{N}}{\delta D_{N-1}}\right)+ \log\left(\frac{\delta D_{N-1} (1-D_{N})}{D_{N} (1-\delta D_{N-1})}\right) \sum\limits_{i=1}^{n}x_{i} \\ &=nB+C\sum\limits_{i=1}^{n}x_{i} \end{array} $$
(3)
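As a numerical sanity check, the sketch below evaluates Λ in two ways: directly as the difference of Eqs. 1 and 2, and through the closed form \(nB+C\sum x_{i}\) of Eq. 3. The response vector and the values of `d_n`, `d_n_minus_1`, and `delta` are made-up assumptions used only for illustration.

```python
import math


def log_likelihood(responses, p_yes):
    """Log-likelihood of 0/1 responses when each is 'yes' with prob p_yes."""
    return sum(
        x * math.log(p_yes) + (1 - x) * math.log(1.0 - p_yes)
        for x in responses
    )


def llr_direct(responses, d_n, d_n_minus_1, delta):
    """Lambda = L_H0(R) - L_H1(R), computed from Eqs. 1 and 2."""
    l_h0 = log_likelihood(responses, 1.0 - d_n)
    l_h1 = log_likelihood(responses, 1.0 - delta * d_n_minus_1)
    return l_h0 - l_h1


def llr_closed_form(responses, d_n, d_n_minus_1, delta):
    """Lambda = n*B + C*sum(x_i), computed from Eq. 3; also returns B, C."""
    n = len(responses)
    b = math.log(d_n / (delta * d_n_minus_1))
    c = math.log(
        (delta * d_n_minus_1 * (1.0 - d_n))
        / (d_n * (1.0 - delta * d_n_minus_1))
    )
    return n * b + c * sum(responses), b, c


# Made-up illustrative values.
responses = [1, 1, 0, 1, 1, 1, 0, 1]
d_n, d_n_minus_1, delta = 0.30, 0.32, 1e-3

direct = llr_direct(responses, d_n, d_n_minus_1, delta)
closed, b, c = llr_closed_form(responses, d_n, d_n_minus_1, delta)
print(direct, closed)  # the two values agree up to rounding
print(b, c)            # B is positive and C is negative for these values
```

For these values C is negative, so Λ decreases as the number of yes responses grows.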
For any distribution, a threshold t can be fixed such that the null hypothesis is rejected if Λ<t and accepted otherwise. The attacker needs to decide on an appropriate threshold for a specific beacon dataset before launching the attack. Suppose a false-positive rate α is given. Based on this value and the statistical properties of the beacon, the threshold \(t_{\alpha}\) is determined such that \(Pr(\Lambda < t_{\alpha}|H_{0}) = \alpha\). From Eq. 3,
$$\begin{array}{@{}rcl@{}} &Pr\left(nB+C\sum_{i=1}^{n}x_{i}<t_{\alpha}|H_{0}\right)=\alpha \\ &Pr\left(\sum_{i=1}^{n}x_{i}>\frac{t_{\alpha}-nB}{C}|H_{0}\right)=\alpha~(C~is~negative) \\ &Pr\left(\sum_{i=1}^{n}x_{i}>t'_{\alpha}|H_{0}\right)=\alpha \end{array} $$
(4)
In the attack, instead of calculating Λ and comparing it with the threshold \(t_{\alpha}\), \(\sum _{i=1}^{n}x_{i}\) is computed and compared with \(t^{\prime }_{\alpha }=(t_{\alpha}-nB)/C\) to make the decision. This threshold \(t^{\prime }_{\alpha }\) is used to decide whether the null or the alternative hypothesis is correct; in other words, it dictates whether the individual is judged to be present in the beacon database. To make the decision, the adversary sums the responses \(x_{i}\) from the beacon to obtain \(\sum x_{i}\). The null hypothesis is rejected if \(\sum x_{i}>t^{\prime }_{\alpha }\), which leads to the conclusion that the query individual is present in the beacon and the attack is successful.
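The decision rule can be sketched as follows, using the binomial model for the yes-count under \(H_{0}\). Because the count is discrete, an exact tail probability of α is generally not attainable. As an assumption of this sketch, it picks the smallest integer threshold whose tail probability does not exceed α. The function names and the example values are hypothetical.

```python
import math


def binomial_upper_tail(n, p, t):
    """Pr(S > t) for S ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(t + 1, n + 1)
    )


def count_threshold(n, d_n, alpha):
    """Smallest integer t' with Pr(sum x_i > t' | H0) <= alpha."""
    p_yes_h0 = 1.0 - d_n  # probability of a 'yes' under the null hypothesis
    for t in range(n + 1):
        if binomial_upper_tail(n, p_yes_h0, t) <= alpha:
            return t
    return n  # reached only if alpha is negative


def in_beacon(responses, d_n, alpha):
    """Reject H0 (declare membership) when the yes-count exceeds t'."""
    return sum(responses) > count_threshold(len(responses), d_n, alpha)


# Made-up illustrative values: 50 queries, D_N = 0.3, alpha = 0.05.
t_prime = count_threshold(50, 0.30, 0.05)
print(t_prime, binomial_upper_tail(50, 0.70, t_prime))
print(in_beacon([1] * 50, 0.30, 0.05))             # all queries answered yes
print(in_beacon([1] * 35 + [0] * 15, 0.30, 0.05))  # yes-count near theta_0
```

Summing the tail term by term is adequate for small n; for the thousands of queries considered in [5], a statistics library or a normal approximation would be a more practical choice.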
To calculate \(D_{N}\), the authors assumed that the adversary has some knowledge of the distribution of allele frequencies at the query positions. Specifically, according to [5], the alternate allele frequencies f of all SNPs observed in the population are assumed to follow a beta distribution: \(f\sim \beta(a^{\prime},b^{\prime})\), where \(a=a^{\prime}+1\) and \(b=b^{\prime}+1\), and \((a^{\prime},b^{\prime})\) can be precomputed from the genomic dataset on which the beacon service is running. The adversary then needs \(n\sim N^{a^{\prime }+1}\) queries to decide whether the target individual is present in the database. The value \(D_{N}\) can be approximated as,
$$ D_{N}\approx \frac{\Gamma(a+b)}{\Gamma(b)(2N+a+b)^{a}} $$
(5)
For the derivation and proof of the above formula, see [5]. We will need \(t^{\prime }_{\alpha }\) and \(D_{N}\) for further analysis in the upcoming section, as these parameters dictate the attack.
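Eq. 5 is straightforward to evaluate numerically. The sketch below works in log space with `math.lgamma` to avoid overflow for large arguments. The function name `approx_d_n` and the values of \(a^{\prime}\) and \(b^{\prime}\) are assumptions made for illustration; in practice they would be estimated from the dataset behind the beacon.

```python
import math


def approx_d_n(n_individuals, a, b):
    """Eq. 5: D_N ~ Gamma(a+b) / (Gamma(b) * (2N + a + b)**a)."""
    log_d = (
        math.lgamma(a + b)
        - math.lgamma(b)
        - a * math.log(2 * n_individuals + a + b)
    )
    return math.exp(log_d)


# Made-up beta parameters (a', b'); a = a' + 1 and b = b' + 1 as in the text.
a_prime, b_prime = 0.2, 2.0
a, b = a_prime + 1.0, b_prime + 1.0

for n_individuals in (100, 1000, 10000):
    print(n_individuals, approx_d_n(n_individuals, a, b))
```

Under this approximation \(D_{N}\) shrinks as the beacon grows, so the yes probabilities \(1-D_{N}\) and \(1-\delta D_{N-1}\) of the two hypotheses both approach 1 and more queries are needed to tell them apart.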