In this paper we describe the solution our team submitted to the second track of the iDASH Privacy & Security Workshop 2017 competition [1]. Before describing the solution itself, we introduce some background and related work. We then describe the competition problem more formally, together with a typical use case.
Related works
DNA is the molecule that stores the genetic instructions used by living organisms for their growth, development and functioning. DNA molecules are organized in chains which form the genome. Studying the human genome has many practical applications in the medical, social and legal fields, among others. Any two individuals share about 99.9% of their genomic DNA, and the remaining 0.1% accounts for the differences between them. The vast majority of these differences take the form of single-nucleotide polymorphisms (SNPs). A SNP is a substitution of one base pair at a certain location when compared to a reference genome. SNP variations are studied in order to track disease genes or heritable traits.
One important genomic application is the search for the most significant SNPs in a dataset whose samples are labeled as case or control, where significance is ranked according to the statistical χ² test.
For example, this application can be used to detect genome differences between a group of people who have a disease and another group who do not. The most significant SNPs are presumed to influence disease susceptibility.
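To make the ranking concrete, the following is a minimal Python sketch of a χ² score computed on a 2×2 table of allele counts (case/control versus alternate/reference allele), followed by a top-k selection. The table layout, the function names `chi2_2x2` and `top_k_snps`, and the tie-breaking rule are illustrative assumptions and are not taken from the competition specification.

```python
def chi2_2x2(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared statistic for a 2x2 contingency table.

    Assumed layout (hypothetical, for illustration only):
                 alt allele   ref allele
        case         a            b
        control      c            d
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_k_snps(counts, k):
    """Return the identifiers of the k SNPs with the largest statistic.

    counts maps a SNP identifier to its four allele counts
    (case_alt, case_ref, control_alt, control_ref).
    Ties are broken by identifier so that the output is deterministic.
    """
    scored = [(chi2_2x2(*c), snp_id) for snp_id, c in counts.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [snp_id for _, snp_id in scored[:k]]


if __name__ == "__main__":
    # Made-up counts, used only to exercise the functions.
    example = {
        "rsA": (20, 0, 0, 20),
        "rsB": (10, 10, 10, 10),
        "rsC": (15, 5, 5, 15),
    }
    print(top_k_snps(example, 2))
```

A SNP whose alternate allele is equally frequent in both groups scores 0, while a SNP present only in the case group receives the largest score, which matches the intuition that highly ranked SNPs separate the two groups.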
Genome sequencing cost decreases each year [2]. More and more genome data is available for full scale medical research [3]. Cloud storage and computing is a straightforward solution to the challenge of storing and processing huge amounts of genomic data [4]. However, outsourcing genomic data to an untrusted cloud environment can be difficult or even impossible because of privacy and confidentiality concerns [5, 6]. Many research works [7–9] study the inference of sensitive personal information (e.g. person identity and appearance, disease condition) from genomic data.
Homomorphic encryption is a solution which can ensure genomic data privacy while still allowing computations to be performed on the data. The homomorphic property of group-based cryptography was stated in [10]. The first fully homomorphic encryption scheme (supporting both addition and multiplication) was introduced by Gentry in [11]. Since then, several authors have proposed new and more efficient homomorphic encryption schemes [12–14]. The most recent one [15] is able to execute a 2-input Boolean gate in less than 13 milliseconds. As a side note, this encryption scheme was used by 2 teams in the third track of the iDASH 2017 competition [1]. From an application point of view, the authors of [16–20] introduced and discussed the use of homomorphic encryption for genomic data processing (e.g. genetic association, logistic regression, genomic medicine). Secure multi-party computation protocols can also be used to provide private genomic data analysis [21–23]. The main issue with these solutions is the performance bottleneck when they are applied to large-scale genomic data computations.
Hardware-assisted privacy-preserving solutions, such as Intel Software Guard Extensions (SGX), make it possible to close the performance gap of purely cryptographic solutions (e.g. homomorphic and functional encryption, multi-party computation protocols, etc.). Intel SGX makes it possible to instantiate diverse cryptographic concepts in practice without a huge overhead. Secure genomic computations using Intel SGX have been studied in many research works: rare disease analysis [24], genomic queries [25], etc. The second track of the 2017 iDASH competition was to perform a whole-genome variant search in a multi-party context.
Overview of Intel SGX
Intel’s Software Guard Extensions (SGX) were first introduced in 2015 with the Skylake micro-architecture. The aim of this extension is to provide a Trusted Execution Environment (TEE) in which applications can protect critical code and data against malicious privileged system code (operating system, hypervisor, BIOS, etc.). The trusted part of the application is called an enclave in SGX terminology. The key point is that enclave code and data are in the clear inside the CPU perimeter, but are encrypted outside of it. Figure 1 illustrates the execution of an application using SGX. SGX is built on three components:
- 17 new CPU instructions,
- a Memory Encryption Engine (MEE) to encrypt/decrypt on the fly,
- a MEE buffer of 128MB, of which 96MB are available to the application.
More information on Intel SGX can be found in the white paper [26] and in a detailed description [27]. Possible use cases of SGX applications are secure remote computation, secure web browsing, digital rights management, etc.
Although at first sight Intel SGX appears to allow applications to be executed securely on encrypted data, particular attention should be paid to the manner in which applications are implemented. Existing works [28–30] present side-channel attacks (cache timing, page faults, memory access patterns) on SGX enclaves. These attacks succeed in recovering secrets (e.g. the secret key of an encryption algorithm) from applications executed inside an enclave. They are possible because of the information which leaks during application execution, and this leakage depends strongly on how the application was implemented.
VCF file format
The Variant Call Format (VCF) is a text file format used for storing genome variations. Compared to other file formats which store a lot of redundant data (as mentioned earlier, 99.9% of the genome is shared between individuals), a VCF file tracks only the differences from a reference genome. In this work we assume that VCF files contain only SNP differences. A sample VCF file (first 8 lines) is given below:
##real id in 1000genome project: HG00253
#CHROM POS ID REF ALT QUAL FILTER TYPE
1 13110 rs540538026 G A 100 PASS heterozygous
1 13116 rs62635286 T G 100 PASS heterozygous
1 13118 rs200579949 A G 100 PASS heterozygous
1 14930 rs75454623 A G 100 PASS heterozygous
1 15211 rs78601809 T G 100 PASS homozygous
1 18849 rs533090414 C G 100 PASS homozygous
A VCF file contains meta-information lines (starting with two “#” symbols), one header line (starting with a single “#” symbol) and then one data line per SNP. Each SNP data line contains exactly 8 fields. The first 5 fields are: chromosome identifier (CHROM), position within the chromosome (POS), unique SNP identifier (ID), reference (REF) and alternate (ALT) bases. We assume that the chromosome and position fields are integers and that the SNP identifier is a string. The reference and alternate bases are distinct symbols from the set {A,C,G,T,N}. The last field (TYPE) indicates whether the SNP is heterozygous or homozygous. One important property of VCF files is that SNPs are sorted in increasing order by chromosome and position.
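The format rules above can be sketched as a small Python parser. This is an illustrative sketch only, not the implementation used in our solution: the names `Snp`, `parse_vcf` and `SAMPLE_VCF` are hypothetical, fields are assumed to be separated by whitespace, and the sample is a shortened copy of the excerpt shown earlier.

```python
from typing import Iterable, List, NamedTuple


class Snp(NamedTuple):
    chrom: int
    pos: int
    snp_id: str
    ref: str
    alt: str
    homozygous: bool


BASES = set("ACGTN")
TYPES = ("heterozygous", "homozygous")


def parse_vcf(lines: Iterable[str]) -> List[Snp]:
    """Parse the simplified 8-field VCF variant described in the text."""
    snps: List[Snp] = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, meta-information ("##") and the header ("#").
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 8:
            raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
        chrom, pos, snp_id, ref, alt, _qual, _filter, snp_type = fields
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"invalid REF/ALT bases: {line!r}")
        if snp_type not in TYPES:
            raise ValueError(f"invalid TYPE field: {line!r}")
        snp = Snp(int(chrom), int(pos), snp_id, ref, alt,
                  snp_type == "homozygous")
        # Enforce the sort order by (chromosome, position).
        if snps and (snp.chrom, snp.pos) < (snps[-1].chrom, snps[-1].pos):
            raise ValueError(f"SNPs are not sorted at: {line!r}")
        snps.append(snp)
    return snps


SAMPLE_VCF = """\
##real id in 1000genome project: HG00253
#CHROM POS ID REF ALT QUAL FILTER TYPE
1 13110 rs540538026 G A 100 PASS heterozygous
1 13116 rs62635286 T G 100 PASS heterozygous
1 15211 rs78601809 T G 100 PASS homozygous
"""

if __name__ == "__main__":
    for snp in parse_vcf(SAMPLE_VCF.splitlines()):
        print(snp)
```

Checking the sort order while parsing is a design choice of this sketch: it lets later processing steps rely on the ordering property without re-sorting the data.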