My sister's keeper?: genomic research and the identifiability of siblings

Background Genomic sequencing of SNPs is increasingly prevalent, though the amount of familial information these data contain has not been quantified. Methods We provide a framework for measuring the risk to siblings of a patient's SNP genotype disclosure, and demonstrate that sibling SNP genotypes can be inferred with substantial accuracy. Results Extending this inference technique, we determine that a very low number of matches at commonly varying SNPs is sufficient to confirm sib-ship, demonstrating that published sequence data can reliably be used to derive sibling identities. Using HapMap trio data, at SNPs where one child is homozygous major, with a minor allele frequency ≤ 0.20 (N = 452684, 65.1%), we achieve 91.9% inference accuracy for sibling genotypes. Conclusion These findings demonstrate that substantial discrimination and privacy risks arise from use of inferred familial genomic data.


Background
Genomic data are increasingly integrated into clinical environments, stored in genealogical and medical records [1,2] and shared with the broader research community [3,4] without full appreciation of the extent to which these commodity level measurements may disclose the health risks or even identity of family members. While siblings, on average, share half of their contiguous chromosomal segments, well over half of a sibling's allelic values can be inferred using only population-specific allele frequency data and the genotypes of another sib. The informed consent process for research and clinical genomic data transmission must therefore include rigorous treatment of accurately quantified disclosure risks for all who will be impacted by such activity.
It is remarkably easy to positively identify a person with fewer than 40 independent, commonly varying SNPs, using a physical sample or a copy of those values [5]. As DNA sequences cannot be revoked or changed once they are released, any disclosure of such data poses a life-long privacy risk. Unlike conventional fingerprints, which provide little direct information about patients or relatives, SNP genotypes may encode phenotypic characteristics, which can link sequences to people [6]. Despite these privacy issues [7,8], use of genetic sequencing is increasing in both forensics [9] and clinical medicine. The recent genetic fingerprinting provision in the renewal of the federal Violence Against Women Act [10], alone, may result in one million new sequenced individuals each year, markedly increasing the number of available links between identities and genotypes. This genetic fingerprinting has an impact on people beyond those directly sequenced: genetic testing partially reveals the genotypes of siblings and other family members.
At each locus in a child's genome, each parent transmits only one of his or her two chromosomes. If we have the genotype of one child and would like to use it to help infer the genotype of a sibling, we consider both the parental alleles that are known because they were transmitted to the first child and the parental alleles that were not transmitted and are therefore unknown. We assume that the unknown parental alleles are drawn from a reference population, such as one of the HapMap populations. Now consider the genotype of the inferred sibling (the second child). With probability 0.25, the sibling receives the same two chromosomes transmitted to the first child, in which case the two children have the same genotype. With probability 0.25, the inferred sibling receives both previously untransmitted chromosomes, in which case the sibling has the same genotype distribution as the reference population. With the remaining probability 0.5, exactly one of the same chromosomes is transmitted, so one allele is the same and the other is drawn from the population.
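The three transmission cases described above can be sketched as a mixture of genotype distributions. This is a minimal illustration, not the authors' code: the function names are hypothetical, `p` is assumed to be the population frequency of allele 'A', and Hardy-Weinberg proportions are assumed for alleles drawn from the reference population.

```python
def population_distribution(p):
    """Hardy-Weinberg genotype probabilities for an 'A' allele frequency of p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def one_shared_distribution(sib1, p):
    """Sib 2 genotype distribution when exactly one chromosome is shared with Sib 1."""
    q = 1.0 - p
    dist = {"AA": 0.0, "Aa": 0.0, "aa": 0.0}
    # The shared allele is either of Sib 1's two alleles with equal chance;
    # the other allele is drawn from the reference population.
    for shared in sib1:
        for other, freq in (("A", p), ("a", q)):
            genotype = "".join(sorted(shared + other))  # 'A' sorts before 'a'
            dist[genotype] += 0.5 * freq
    return dist


def sib2_distribution(sib1, p):
    """Mix the three cases with weights 0.25 (both shared), 0.5 (one), 0.25 (none)."""
    pop = population_distribution(p)
    one = one_shared_distribution(sib1, p)
    return {
        g: 0.25 * (1.0 if g == sib1 else 0.0) + 0.5 * one[g] + 0.25 * pop[g]
        for g in pop
    }
```

For instance, `sib2_distribution("aa", 0.8)` corresponds to a minor allele frequency of 0.2 with a homozygous minor first sibling.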

Methods
To quantify the risk of SNP disclosure to relatives, we demonstrate a model for inferring sibling genotypes using proband SNP data and population-specific allele frequency databases, such as the HapMap [10,11]. We also evaluate the probability that two people, in a selected pool of individuals, are siblings given a match at an independent subset of SNPs, and show that this number can be made remarkably low with appropriate SNP selection.

Enhanced ability to infer sibling genotypes
First, consider the case where one sibling's genotype is known to be 'AA', and the goal is to determine the probability that a second sibling's genotype will also be 'AA' at that locus. Because there is additional knowledge (the familial relationship between the two sibs), the prior probability of the second sib carrying a specific genotype at a selected SNP will be altered under the new constraint. A conditional probability expression that sums over the nine possible parental genotypic combinations (for example, maternal genotype 'Aa' with paternal genotype 'AA') at a single SNP, each denoted as i, can be used:
$$p(\mathrm{Sib}_2\,AA \mid \mathrm{Sib}_1\,AA) = \sum_{i=1}^{9} p(\mathrm{Sib}_2\,AA \mid \mathrm{parents}_i)\; p(\mathrm{parents}_i \mid \mathrm{Sib}_1\,AA)$$
where Sib 1 AA and Sib 2 AA refer to Sib 1 and Sib 2 genotypes 'AA' at a selected SNP, respectively, and parents i is the i-th parental genotypic combination.
With unknown parental genotypes, we would calculate p(Sib 2 AA) considering all nine possible parental genotype combinations, but knowledge that Sib 1 has genotype 'AA' allows exclusion of any parental combination where either parent has genotype 'aa', as that would require the transmission of at least one copy of the 'a' allele to Sib 1, provided that non-paternity and new mutations are excluded. HapMap SNP population frequencies, p and q, for each selected SNP, can be used to calculate the probability of each parental combination, i. Once these values have been calculated, the genotype of the first sibling eliminates the incompatible parental genotypic candidates (Figs. 1A-C), and the remaining probabilities are normalized.
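The elimination and normalization steps can be illustrated by enumerating the nine parental combinations directly. This is a sketch under stated assumptions: parental genotypes are assumed to be independent draws at Hardy-Weinberg proportions, mutation and non-paternity are ignored, and all function names are hypothetical.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")


def hardy_weinberg(p):
    """Genotype probabilities for an 'A' allele frequency of p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def child_distribution(mother, father):
    """Mendelian genotype probabilities for a child of the given parents."""
    transmit_a_major = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}  # chance of passing 'A'
    m, f = transmit_a_major[mother], transmit_a_major[father]
    return {"AA": m * f, "Aa": m * (1 - f) + (1 - m) * f, "aa": (1 - m) * (1 - f)}


def compatible_parents(sib1):
    """Parental combinations that could have produced the Sib 1 genotype."""
    return [
        pair for pair in product(GENOTYPES, repeat=2)
        if child_distribution(*pair)[sib1] > 0.0
    ]


def sib2_given_sib1(sib1, p):
    """Posterior genotype distribution for Sib 2 given only Sib 1 and p."""
    prior = hardy_weinberg(p)
    # Unnormalized weight of combination i: p(parents_i) * p(Sib 1 | parents_i).
    weights = {
        (mo, fa): prior[mo] * prior[fa] * child_distribution(mo, fa)[sib1]
        for mo, fa in compatible_parents(sib1)
    }
    total = sum(weights.values())  # normalizing constant, equal to p(Sib 1 genotype)
    posterior = dict.fromkeys(GENOTYPES, 0.0)
    for (mo, fa), weight in weights.items():
        for genotype, prob in child_distribution(mo, fa).items():
            posterior[genotype] += (weight / total) * prob
    return posterior
```

With a homozygous major first sibling, `compatible_parents("AA")` retains four of the nine combinations, matching the five eliminations described for that case.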

Measuring the information content of Sibling genotype data
When calculating the probability of a specific Sib 2 genotype given a known Sib 1 genotype, it is possible to directly measure the benefit of the proband genotype information in improving Sib 2 inferences. This involves measuring the difference between the prior Hardy-Weinberg probability for the genotype, given only population frequencies, and the posterior probability, as calculated by the conditional expression above. To measure the information content provided by the first sibling's genotype, we propose the use of a likelihood ratio test statistic, comparing models where two individuals are known to be siblings versus two individuals that are known to be unrelated. There are a total of nine possible likelihood ratios, Λ Ind1, Ind2 genotypes, one for each of the possible combinations of individual genotypes, each of the form:
$$\Lambda_{\mathrm{Ind}_1,\,\mathrm{Ind}_2\ \mathrm{genotypes}} = \frac{p(\mathrm{Ind}_2\ \mathrm{genotype} \mid \mathrm{Ind}_1\ \mathrm{genotype},\ \mathrm{siblings})}{p(\mathrm{Ind}_2\ \mathrm{genotype} \mid \mathrm{Ind}_1\ \mathrm{genotype},\ \mathrm{unrelated})}$$
The denominator becomes p(Ind 2 genotype), which is either p², 2pq, or q². This is intuitive; when considering two unrelated individuals, the probability that the second has a specific genotype can only be identified using the population frequencies for that genotype. The numerator is the posterior probability expression derived in Table 1, also in terms of p and q. The log of this likelihood ratio can then be used as a statistic for measuring relatedness, depending only on the SNP allele frequency and the Sib 1 genotype (Fig. 2).
The allele frequency, p, that maximizes this statistic can then be found numerically for each Λ Ind1, Ind2 genotypes expression, to identify which allele frequencies and conditions are most informative for genotypic inferences. These results are reported in Table 2.
Figure 1. (a-c) Refining mechanism for homozygous major SNPs: when the first sibling is homozygous major (a), homozygous minor (b), or heterozygous (c) at a given SNP, this constrains the possible parental genotypes; in the first case, five of nine parental genotypic combinations can be eliminated (crossed boxes). Using HapMap CEPH SNP population frequencies, p and q, the probability frequencies are populated for the remaining squares, and normalized. The probability that subsequent sibs will be homozygous major, heterozygous, or homozygous minor can then be calculated using the probabilities that parents would contribute specific allelic values. (d) For each of 30 HapMap CEPH trios, the Sib 1 genotype and the SNP population frequencies are used (without the parent genotypes) to infer p('AA'), p('Aa'), and p('aa') for subsequent siblings. Those probabilities are then validated against those that would be expected given only the parental genotypes at each SNP.
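The log likelihood ratio and its numerical maximization can be sketched as follows. This is an illustration only: the closed-form conditional probabilities in `sib2_given_sib1` are derived here from the 0.25/0.5/0.25 transmission model under Hardy-Weinberg assumptions rather than copied from Table 1, the function names are hypothetical, and a simple grid search stands in for whatever numerical method was actually used.

```python
import math


def hardy_weinberg(p):
    """Genotype probabilities for an 'A' allele frequency of p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def sib2_given_sib1(sib1, p):
    """Closed-form p(Sib 2 genotype | Sib 1 genotype); derived, not from Table 1."""
    q = 1.0 - p
    table = {
        "AA": {"AA": (1 + p) ** 2 / 4, "Aa": q * (1 + p) / 2, "aa": q * q / 4},
        "Aa": {"AA": p * (1 + p) / 4, "Aa": (1 + p * q) / 2, "aa": q * (1 + q) / 4},
        "aa": {"AA": p * p / 4, "Aa": p * (1 + q) / 2, "aa": (1 + q) ** 2 / 4},
    }
    return table[sib1]


def log_likelihood_ratio(ind1, ind2, p):
    """log of p(Ind 2 | Ind 1, siblings) / p(Ind 2 | unrelated), for 0 < p < 1."""
    numerator = sib2_given_sib1(ind1, p)[ind2]
    denominator = hardy_weinberg(p)[ind2]
    return math.log(numerator / denominator)


def most_informative_frequency(ind1, ind2, grid):
    """Grid value of p with the largest log likelihood ratio."""
    return max(grid, key=lambda p: log_likelihood_ratio(ind1, ind2, p))
```

A grid such as `[i / 100 for i in range(50, 100)]` restricts the search to 'A' being the major allele with a minor allele frequency of at least 0.01; the choice of grid is itself an assumption and affects where the maximum is reported.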

Confirming sib-ship with two non-matching sets of SNP genotypes
The above inference technique can be extended to confirm sib-ship in two non-matching samples of SNP sequence data. Given a set of matches at M independent loci from a pool of N individuals, an expanded form of Bayes' theorem can be used to calculate p(sibs|match at M loci) directly:
$$p(\mathrm{sibs} \mid \mathrm{match}) = \frac{p(\mathrm{match} \mid \mathrm{sibs})\, p(\mathrm{sibs})}{p(\mathrm{match} \mid \mathrm{sibs})\, p(\mathrm{sibs}) + p(\mathrm{match} \mid\, !\mathrm{sibs})\, p(!\mathrm{sibs})}$$
p(match|!sibs) can be calculated for each SNP using the population frequency; it is the probability that two unrelated individuals in the population would share the same genotype, 'AA', 'Aa', or 'aa'. The expression p(match|!sibs) is effectively the same as p(match) as long as the sample pool, N, is large enough, as the probability of sib-ship is very low in a large pool. For three different pool sizes (N = 100,000; 10,000,000; 6,000,000,000), we have created a sib-ship probability surface that varies with the number of matched SNPs and the MAF of those SNPs (Fig. 3) and published supporting values for these probabilities in Table 3.
For SNPs that commonly vary in the population, only a small number of genotypic matches is required to confirm sib-ship.
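A rough sketch of this Bayesian calculation follows. Several pieces are assumptions made for illustration rather than details taken from the text: all M matched SNPs are taken to be independent with the same minor allele frequency q, a "match" means identical genotypes, the sibling match probability is derived from the 0.25/0.5/0.25 transmission model, and the prior p(sibs) is left as a caller-supplied parameter (for example 1/N for a pool of size N).

```python
def match_given_unrelated(q):
    """Probability that two unrelated individuals share a genotype at one SNP."""
    p = 1.0 - q
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def match_given_sibs(q):
    """Probability that two siblings share a genotype at one SNP."""
    p = 1.0 - q
    genotype_probs = (p * p, 2 * p * q, q * q)
    # p(Sib 2 has the same genotype | Sib 1 genotype), for AA, Aa, aa in turn.
    same_given_sib1 = ((1 + p) ** 2 / 4, (1 + p * q) / 2, (1 + q) ** 2 / 4)
    return sum(g * s for g, s in zip(genotype_probs, same_given_sib1))


def p_sibs_given_match(m, q, prior):
    """Posterior probability of sib-ship after matches at m independent SNPs."""
    like_sibs = match_given_sibs(q) ** m
    like_unrelated = match_given_unrelated(q) ** m
    evidence = like_sibs * prior + like_unrelated * (1.0 - prior)
    return like_sibs * prior / evidence
```

Evaluating `p_sibs_given_match(m, q, 1 / n_pool)` over a range of `m` and `q` gives a surface of the kind described above, though the values depend on the assumed prior and will not necessarily reproduce Table 3.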

Modeling a series of SNP inferences using a binomial distribution
A binomial distribution can be used to represent a series of sibling genotypic inferences, such as the probability of correct inferences at 50 SNP loci, if each inference meets specific criteria. Independent inferences can be treated as a random variable with probability p of success, as long as independent SNPs are selected, with the same MAF and Sib 1 genotype.
$$p(k, n, p) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}$$
where p(k, n, p) refers to the probability that k correct inferences were made out of n attempted inferences when the probability of success for each inference attempt is p.
The error reduction depends only on the allele frequencies, and at all frequencies, the error is reduced, improving the quality of genotypic inference.
This measure will enable those who attempt to infer SNP genotypes to calculate the probability of matching at a subset of independent SNPs.
The cumulative binomial measures the probability of reaching up to k successes in n trials with probability p of success at each attempt:
$$F(k, n, p) = \sum_{i=0}^{k} \binom{n}{i}\, p^{i} (1 - p)^{n - i}$$
If n guesses are considered (i.e. n SNPs are genotyped and used for sib inference), F(k, n, p) is the probability that at most k of those will be correct, and 1 - F(k - 1, n, p) is the probability that at least k will be correct.
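The binomial quantities in this section can be computed directly with the standard library. The sketch below uses hypothetical function names and treats the cumulative form as the probability of at most k successes, with the upper tail obtained by complement.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k correct inferences out of n attempts."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """Probability of at most k correct inferences out of n attempts."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


def at_least(k, n, p):
    """Probability of at least k correct inferences out of n attempts."""
    return 1.0 - binomial_cdf(k - 1, n, p)
```

For example, `at_least(45, 50, 0.919)` would give the chance of 45 or more correct inferences at 50 independent loci if each inference succeeded with the 91.9% accuracy quoted in the abstract; applying that single accuracy figure to every locus is a simplification.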

Validation of SNP genotype inference using HapMap trio data
We then empirically infer sibling genotypic sequences from HapMap trio child genotypes using the above technique. At 700,000 SNP loci on chromosomes 2, 4, and 7, in each of 30 HapMap CEPH trios, the known genotypes of the trio child, Sib 1, are combined with the CEPH and global HapMap SNP allele frequencies to produce genotypic inferences for a hypothetical sib, Sib 2, at these loci. The inference method produces three genotypic probabilities for Sib 2 (or subsequent siblings): p(Sib 2 AA|Sib 1 genotype), p(Sib 2 Aa|Sib 1 genotype), and p(Sib 2 aa|Sib 1 genotype) for each SNP, which we call the SNP probability vector.
The ability to correctly infer a sibling genotype from a trio child genotype can be validated by comparing whether the best estimated genotype, using only the sibling genotype and population frequencies, matches the best estimated genotype using the parental genotypic data (Fig. 1D). We do this by comparing the plural (largest) value in the SNP probability vector with the plural value in the SNP probability vector that would be expected given the parental genotypes and Mendelian inheritance.
Figure 2. Log likelihood ratio test statistic for sibling inferences: for each Sib 1 genotype, the log likelihood ratio for each possible Sib 2 inference is shown versus MAF. These charts describe how informative the Sib 1 genotype is when inferring each Sib 2 genotype.

Deriving propensity to disease from sibling SNP data
Additionally, sibling SNP data can be used to quantify an individual's disease propensity through genotypic inference, without that individual's actual sequence data. For example, the likelihood ratio test statistic above may also be used to describe relative risk, using a multiplicative model.
For example, the relative risk of Sib 2 Aa, carrying one copy of the disease allele 'a', given information from the Sib 1 aa genotype, is:
$$\Gamma_{Aa \mid \mathrm{Sib}_1 aa} = \frac{p(\mathrm{Sib}_2\,Aa \mid \mathrm{Sib}_1\,aa)}{p(Aa)}$$
In this example, at MAF = 0.01, the relative risk of genotype 'Aa' is 25.25, given information that Sib 1 carries genotype 'aa' at that locus. However, at MAF = 0.5, the relative risk of genotype 'Aa' is 0.75, given information that Sib 1 carries genotype 'aa', indicating that the probability of genotype 'Aa' is reduced at this MAF. This may seem counterintuitive, as the risk of carrying a disease allele is actually higher at this MAF, but the probability that Sib 2 carries genotype 'Aa' is lower than in the control population, while the relative risk of carrying the disease allele with genotype 'aa' is higher.
At MAF 0.5, Γ aa|Sib1aa is 2.25, demonstrating that it is more likely that a disease allele will be carried by Sib 2 in genotype 'aa' than in the control population given the Sib 1 genotype.
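The relative risk figures quoted above can be reproduced with a short calculation. The sketch assumes that 'a' is the disease allele with frequency q equal to the MAF, and that the conditional probabilities follow from the 0.25/0.5/0.25 transmission model under Hardy-Weinberg proportions; the function names are hypothetical.

```python
def sib2_given_sib1_aa(q):
    """p(Sib 2 genotype | Sib 1 is 'aa') for a disease-allele frequency of q."""
    p = 1.0 - q
    return {"AA": p * p / 4, "Aa": p * (1 + q) / 2, "aa": (1 + q) ** 2 / 4}


def relative_risk_given_sib1_aa(sib2, q):
    """Ratio of the sibling-conditioned probability to the population probability."""
    p = 1.0 - q
    population = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    return sib2_given_sib1_aa(q)[sib2] / population[sib2]


print(relative_risk_given_sib1_aa("Aa", 0.01))  # approximately 25.25
print(relative_risk_given_sib1_aa("Aa", 0.5))   # 0.75
print(relative_risk_given_sib1_aa("aa", 0.5))   # 2.25
```

Under these assumptions the 'Aa' ratio simplifies to (1 + q) / (4q), which makes the dependence on MAF explicit: it exceeds 1 only when q is below 1/3.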
The explicit probability of developing a disease is also altered. If an individual with genotype 'Aa' at a specific locus has a probability p_d of developing a disease by age a, and that individual has a probability p_s of having that genotype given his sibling's genotype at that locus, his probability of developing that disease by age a is p_s · p_d. This can easily be extended to multiple independent loci, important for diseases in which a set of common or rare variants dictates disease likelihood [12,13]. SNPs have been the focus of our analysis because they are clinically informative and supported by a wealth of allele frequency data; however, there are other genomic data types which should be considered in a rigorous privacy and propensity analysis, including copy number variant and mutation data.

Discussion
These findings demonstrate that substantial discrimination and privacy concerns arise from use of inferred familial genomic data. While the Genetic Information Nondiscrimination Act of 2008 (GINA, H.R. 493), recently passed into law, would mitigate the threat of direct discriminatory action by employers or insurers [14], there will continue to be other uses of genomic data that pose privacy risks, including the use of genetic testing in setting life, disability, and long-term care insurance premiums [15]. Familial genotypic sequences can be used to assist in forensic or criminal investigations for indirect identification of genotype, increasing the number of people who may be identified [16,17]. Similarly, Freedom of Information Act (FOIA) [18] requests related to federally funded genome-wide association studies could potentially be used to identify research participants and their family members. Clinically, choosing the detail and type of disease propensity information that must be disclosed to patients and their potentially affected family members is also under debate [19,20].
Figure 3. In a sample pool of size N, provided below, the probability that two individuals are siblings given a match at a subset of SNPs is charted as a function of M, the number of independent SNPs that they match at, and the minor allele frequency, q.
Quantifying the information content of disclosed genomic data will add clarity to the informed consent process when a patient shares genotypic data for research use. For research investigations, it is conceivable that a subject would want to limit the impact of her genomic disclosure on her family members, or be asked to have a discussion with specific family members before proceeding. Providing subjects with different levels of genomic anonymity based on their sequence data, along with an estimate of the probability of re-identification and familial impact for each of those anonymity levels, will allow patients to trade off altruistically motivated sharing [21] against privacy considerations, especially when they volunteer to share all the variants in their genome [22].
While the inference accuracy rates are very high, particularly for inferences where Sib 1 has a homozygous major genotype, we would like to caution that some of these findings are not always highly informative. For example, if the MAF is 0.01, where 99% of the alleles in the population are the major allele, the prior probability of a homozygous major genotype is 0.99*0.99 ≅ 0.98. If Sib 1 has a homozygous major genotype, the posterior probability of observing a homozygous major genotype in another sibling is (1/4 + 1/4*0.99*0.99 + 1/2*0.99) ≅ 0.99. In this case, the difference between prior and posterior probabilities is only 0.01, and knowledge of the Sib 1 genotype provides very little information, as most accuracy comes from the allele frequency in the population.
However, homozygous minor genotypes are much more informative. With a MAF of 0.2, the probability of Sib 2 having a homozygous minor genotype, given only the reference population, is 0.04. Given that Sib 1 has a homozygous minor genotype, Sib 2 will have a homozygous minor genotype with probability (1/4 + 1/4*0.2*0.2 + 1/2*0.2) = 0.36, which is quite different from the prior probability of 0.04.
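Both worked examples follow the same pattern, which can be written compactly. The factored form below is a derivation added here for illustration, using the transmission weights 1/4, 1/2, and 1/4 from the Background, with q the minor and p the major allele frequency.

```latex
\begin{align*}
p(\mathrm{Sib}_2\,aa \mid \mathrm{Sib}_1\,aa)
  &= \tfrac{1}{4}\cdot 1 + \tfrac{1}{2}\,q + \tfrac{1}{4}\,q^{2}
   = \left(\tfrac{1+q}{2}\right)^{2}, \\
q = 0.2:\quad
  & \left(\tfrac{1.2}{2}\right)^{2} = 0.36
    \quad\text{versus a prior of } q^{2} = 0.04, \\
p(\mathrm{Sib}_2\,AA \mid \mathrm{Sib}_1\,AA)
  &= \tfrac{1}{4}\cdot 1 + \tfrac{1}{2}\,p + \tfrac{1}{4}\,p^{2}
   = \left(\tfrac{1+p}{2}\right)^{2}, \\
p = 0.99:\quad
  & \left(\tfrac{1.99}{2}\right)^{2} \approx 0.990
    \quad\text{versus a prior of } p^{2} \approx 0.980 .
\end{align*}
```

The gain over the prior is therefore large when the matching homozygous genotype is rare in the population and small when it is already nearly certain.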
One limitation of this study is that the population-based estimates for MAF rely on the HapMap study population sizes, which, at present, are small, though these types of sources will continue to expand. For example, the CEPH population contains 90 participants, so each trio child contributes 1/90th of the allele frequency data used in the study. This approach also depends on the independence of the loci considered, and would need to be adapted for SNPs that are in linkage disequilibrium. Extending this study to include linked SNP loci is possible, using the haplotype block information for HapMap populations that is available. To ensure that SNPs are independent, linkage data from the HapMap population can be used to confirm independence, and SNPs that are far from one another may be selected. Additionally, this approach does not consider the possibility of genotypic errors, which may be common on some platforms. An adjustment using a binomial probability distribution could be used to account for possible errors.
Figure 4. Fraction of correct Sib2 inferences: the fraction of Sib2 SNPs that can be correctly identified when Sib1 is (a) homozygous major or (b) heterozygous. Each line represents use of distinct data: inclusion or exclusion of Sib1 genotypes, and use of population-specific or global allele frequency data. Without Sib1 genotypes, homozygous major inferences would always be incorrect at MAF ≥ 0.33 and heterozygous inferences would always be incorrect at MAF ≤ 0.33. At many allele frequencies, use of Sib1 genotypes dramatically improves Sib2 inferences.

Conclusion
Technologies for sequencing large numbers of SNPs are rapidly dropping in cost, which will help realize the promise of personalized medicine but also poses substantial personal and familial privacy risks. While electronic storage and transmission of genetic tests is not yet a common component of medical record data, these tests will soon be stored in electronic medical records and personally controlled health records [23]. This creates a need for improved informed consent models and access control mechanisms for genomic data. The increasingly common practice of electronically publishing research-related SNP data requires a delicate balance between the enormous potential benefits of shared genomic data through NCBI and other resources, and the privacy rights of both sequenced individuals and their family members.

Inferring sibling genotypic sequences from HapMap trio children
Here, we explore a specific example of sibling genotypic inference in greater depth, considering the case where one sibling's genotype is known to be 'AA', and the goal is to determine the probability that the second sibling's genotype will also be 'AA' at that locus. The conditional probability expression that sums over the nine possible parental genotypic combinations (for example, maternal genotype 'Aa' with paternal genotype 'AA') at a single SNP, with each specific parental genotypic combination denoted as i, can be used:
$$p(\mathrm{Sib}_2\,AA \mid \mathrm{Sib}_1\,AA) = \sum_{i=1}^{9} p(\mathrm{Sib}_2\,AA \mid \mathrm{parents}_i)\; p(\mathrm{parents}_i \mid \mathrm{Sib}_1\,AA)$$
Without the sibling constraint, the Hardy-Weinberg probability of genotype 'AA' is p². The constraint increases the probability to p² + pq + (q²/4), increasing inference accuracy by pq + (q²/4).
The remaining entries in the probability vector, p(Sib 2 Aa|Sib 1 AA), and p(Sib 2 aa|Sib 1 AA), can then be calculated just as we have done for p(Sib 2 AA|Sib 1 AA) above. Again, these probabilities have been generated without any actual knowledge of the parent genotypes. If the Sib 1 genotype were instead 'Aa' or 'aa', the above technique can similarly be used (with a different combination of possible parental genotypes) to calculate the two other probability vectors, [p(Sib 2 AA|Sib 1 Aa), p(Sib 2 Aa|Sib 1 Aa), p(Sib 2 aa|Sib 1 Aa)] and [p(Sib 2 AA|Sib 1 aa), p(Sib 2 Aa|Sib 1 aa), p(Sib 2 aa|Sib 1 aa)].
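For reference, the three probability vectors can be written in closed form. These expressions are derived here from the 0.25/0.5/0.25 transmission model under Hardy-Weinberg proportions (with p + q = 1) and are offered as a cross-check rather than as a transcription of Table 1; each row sums to 1, and the first entry of the first row equals p² + pq + (q²/4).

```latex
\begin{align*}
\bigl[\,p(AA),\ p(Aa),\ p(aa)\,\bigr] \mid \mathrm{Sib}_1\,AA
  &= \left[\ \frac{(1+p)^{2}}{4},\ \ \frac{q\,(1+p)}{2},\ \ \frac{q^{2}}{4}\ \right] \\[4pt]
\bigl[\,p(AA),\ p(Aa),\ p(aa)\,\bigr] \mid \mathrm{Sib}_1\,Aa
  &= \left[\ \frac{p\,(1+p)}{4},\ \ \frac{1+pq}{2},\ \ \frac{q\,(1+q)}{4}\ \right] \\[4pt]
\bigl[\,p(AA),\ p(Aa),\ p(aa)\,\bigr] \mid \mathrm{Sib}_1\,aa
  &= \left[\ \frac{p^{2}}{4},\ \ \frac{p\,(1+q)}{2},\ \ \frac{(1+q)^{2}}{4}\ \right]
\end{align*}
```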

Validating the sibling genotype probability vector using parental genotypic data
To validate the results of the refining strategy on inferring the second sibling genotype, the authentic parental genotypes are used to create the probability vector p('AA'), p('Aa'), p('aa') at the SNP being evaluated, for the children the pair would be expected to have. For each of the trio pairs at each of the SNPs being tested, the probability vector was calculated.

Error reduction calculation
The error reduction measurement identifies the extent to which inference error is reduced. For example, where we are trying to infer the probability that Sib 2 has genotype 'AA' at a specific SNP, we calculate the absolute value of the difference between our best inference, which uses population-specific allele frequency data and the Sib 1 genotype, and the Hardy-Weinberg probability for Sib 2 to have genotype 'AA': |p(Sib 2 AA|Sib 1 genotype) - p(Sib 2 AA)|. This value is specifically the improvement to the probability value from the new data, when inferring the specific event that Sib 2 will have genotype 'AA' and Sib 1 will have the specific genotype in question.
Any change to p(Sib 2 AA) must also correspond with the opposite change in the sum of p(Sib 2 Aa) and p(Sib 2 aa). To accurately represent the overall error reduction by Sib 1 genotype, with any of three possible Sib 2 genotypes, the average of the three values is measured. For example, where the Sib 1 genotype is 'AA', the overall average improvement (and error reduction) is the average of |p(Sib 2 AA) - p(Sib 2 AA|Sib 1 AA)|, |p(Sib 2 Aa) - p(Sib 2 Aa|Sib 1 AA)|, and |p(Sib 2 aa) - p(Sib 2 aa|Sib 1 AA)|.
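The averaged error reduction can be sketched as below. The conditional probabilities are again the closed forms derived from the 0.25/0.5/0.25 transmission model under Hardy-Weinberg proportions, which is an assumption of this illustration, and the function names are hypothetical.

```python
def hardy_weinberg(p):
    """Prior genotype probabilities for an 'A' allele frequency of p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def sib2_given_sib1(sib1, p):
    """Posterior genotype probabilities for Sib 2 given the Sib 1 genotype."""
    q = 1.0 - p
    table = {
        "AA": {"AA": (1 + p) ** 2 / 4, "Aa": q * (1 + p) / 2, "aa": q * q / 4},
        "Aa": {"AA": p * (1 + p) / 4, "Aa": (1 + p * q) / 2, "aa": q * (1 + q) / 4},
        "aa": {"AA": p * p / 4, "Aa": p * (1 + q) / 2, "aa": (1 + q) ** 2 / 4},
    }
    return table[sib1]


def error_reduction(sib1, p):
    """Average absolute change in the three Sib 2 genotype probabilities."""
    prior = hardy_weinberg(p)
    posterior = sib2_given_sib1(sib1, p)
    return sum(abs(posterior[g] - prior[g]) for g in prior) / 3
```

For a homozygous major Sib 1 at p = 0.8, the three absolute differences are 0.17, 0.14, and 0.03, so the average is roughly 0.113.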

Scoring metric for calculating correct fraction of inferences
To ascertain whether the inferences are helpful for producing correct answers, a scoring metric was used to calculate the fraction of correct SNP inferences, in our empirical inference validation study. For each SNP inference, the scoring metric provides a full point when the plural entry in the inference vector, (the maximum of p('AA'), p('Aa'), and p('aa'), and thus the predicted sib genotype), matches the plural entry in the parental validation vector (the empirical most likely genotype). Given the parental genotype values, it is possible, and not infrequent, that a validation probability vector has two matching plural values, for example, if p('AA') = p('Aa') = 0.5. When this is the case, one half point was awarded if the plural value in the inference vector matched one of the two validation choices, to signify that one of the two equally likely candidates was chosen.
There are some conditions that arise from use of a simple scoring metric, where it becomes difficult to score well. For example, a heterozygous Sib 1 will likely result in a 0.5 score for inferences. A score of 1 point would be possible if one parent had a genotype of 'AA' and the other had genotype 'aa', making the probability that the parents would have a child with genotype 'Aa' equal 1. Most remaining parental combinations would not result in the probability of child genotype 'Aa' equal to 1, and would likely result in only a half point. These values can be adjusted using machine learning techniques or more robust decision making algorithms, but those are out of the scope of this work.
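The scoring rule described in this section can be sketched as follows. The function names and the dictionary representation of the probability vectors are hypothetical, and the handling of a tie within the inference vector (the first maximal genotype in dictionary order is used) is an assumption, since the text only discusses ties in the validation vector.

```python
def validation_vector(mother, father):
    """Mendelian child genotype probabilities from the true parental genotypes."""
    transmit_a_major = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}  # chance of passing 'A'
    m, f = transmit_a_major[mother], transmit_a_major[father]
    return {"AA": m * f, "Aa": m * (1 - f) + (1 - m) * f, "aa": (1 - m) * (1 - f)}


def score_inference(inferred, validation, tol=1e-9):
    """Return 1, 0.5, or 0 points for a single SNP inference."""
    predicted = max(inferred, key=inferred.get)  # plural entry of the inference
    top = max(validation.values())
    plural = [g for g, v in validation.items() if abs(v - top) <= tol]
    if predicted not in plural:
        return 0.0
    # Half credit when the validation vector has two equally likely genotypes.
    return 1.0 if len(plural) == 1 else 0.5


def fraction_correct(pairs):
    """Mean score over an iterable of (inferred, validation) vector pairs."""
    scores = [score_inference(inferred, validation) for inferred, validation in pairs]
    return sum(scores) / len(scores)
```

As an example of the half-point case, parents 'AA' and 'Aa' give a validation vector with p('AA') = p('Aa') = 0.5, so an inference of either genotype scores 0.5.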