Refining multivariate disease phenotypes for high chip heritability
© Sun et al.; 2015
Published: 23 September 2015
Statistical genetics shows that the success of both genetic association studies and genomic prediction methods is positively associated with the heritability of the trait used in the analysis. Identifying highly heritable components of a complex disease can thus enhance genetic studies of the disease. Existing heritable component analysis methods use data from related individuals to compute linearly-combined traits to maximize heritability. Recent advances in acquiring genome-wide markers have enhanced heritability estimation using genotypic data from apparently unrelated individuals, which is referred to as the chip heritability. Novel statistical models are thus needed to identify disease components (subtypes) with high chip heritability.
We propose an optimization approach to identify highly heritable components of a complex disease as a function of multiple clinical variables. The heritability of the components is estimated directly from unrelated individuals using their genome-wide single nucleotide polymorphisms. The proposed approach can also model the fixed effects due to covariates, such as age and race, so that the derived traits have high chip heritability after correcting for fixed effects. A new sequential quadratic programming algorithm is developed to efficiently solve the proposed optimization problem.
The proposed algorithm was validated both in simulations and in the analysis of a real-world dataset that was aggregated from genetic studies of cocaine, opioid, and alcohol dependence. Simulation studies demonstrated that the proposed approach could identify the hypothesized component from multiple synthesized features. A case study on cocaine dependence (CD) identified a quantitative trait that achieved chip heritability of 0.86 estimated using a cross-validation process. This quantitative trait corresponded to the likelihood of an individual's membership in a CD subtype. Clinical analysis showed that the subtype enclosed individuals who reported heavy use of cocaine but few withdrawal symptoms.
Extensive experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed approach as a means to find meaningful disease components with high chip heritability.
Identifying genetic variation that underlies complex diseases has important implications in medicine. To date, genome-wide association studies (GWAS) have had limited success in dissecting the genetic etiology of complex diseases. For instance, very few associations identified for substance use disorders at a genome-wide significant level have been replicated [1–3]. Complex disorders are often characterized by multiple disease indicators. For example, to diagnose whether a patient has a lifetime drug dependence disorder, clinicians interview the patient to understand his or her drug use behaviors, the negative consequences of the drug use, the treatment history and other co-occurring medical conditions. All of these clinical variables are used to arrive at a diagnosis of dependence on a certain drug. There is substantial variation in these variables in the disease population, and these variables also present different levels of heritability, i.e., some are more genetically influenced than others. This phenotypic heterogeneity diminishes evidence of genetic association. Statistical genetics also shows that the success of most gene discovery studies is positively associated with the heritability of the trait used in the association analysis. Hence, identifying more homogeneous and highly heritable components of a complex disease could enhance the association analysis.
The ability to translate genotype information into a quantitative prediction of disease phenotypes is important for precision medicine. Genomic prediction methods that predict a phenotype based on genome-wide single nucleotide polymorphisms (SNPs) may provide a suitable analytic tool [7, 8]. These methods expand the traditional single-marker-regression-based GWAS model for detecting few variants of large effect to multi-marker predictive models with many variants of small effect. The predictive ability of genomic prediction methods relies on several factors, especially trait heritability. Identifying highly heritable components of a complex disease could therefore also improve the utility of genomic prediction methods for predicting subtypes (defined by the components) of the disease.
Because the success of both association analysis and genomic prediction is dependent on the trait heritability, heritability can be a valid target for refining multivariate disease phenotypes. The narrow-sense heritability h2 is defined by the percentage of phenotypic variance that is due to additive genetic effects. The broad-sense heritability H2 is defined as the overall genetic contribution to the phenotypic variation. The heritability of a quantitative trait is commonly estimated from related individuals in pedigrees. Recent advances in acquiring dense genome-wide genetic markers have enhanced heritability estimation from apparently unrelated individuals using their genome-wide SNPs. The SNP-based heritability, often referred to as the chip heritability, is defined as the portion of the phenotypic variation that can be explained by the genotyped genetic markers. It has been argued that estimating h2 from unrelated individuals has an advantage over traditional pedigree-based methods because the estimated chip h2 corresponds only to the causal-variant heritability that is tagged by the genotyped SNPs [8, 12].
Phenotype refinement is an important but underdeveloped genetics research area. Unsupervised cluster analysis or latent class analysis has been commonly used to partition a study population into subgroups based on clinical variables [13–18]. This approach can create subgroups of individuals that differ in clinical symptoms and features, but may have limited utility in genetic analysis. Because genetic data are not used during the creation of the subgroups, the resultant subtypes (subgroups) are not guaranteed to have high heritability, and hence may not be informative for genetic association.
More relevant to this present work, a number of prior methods identify the principal components of clinical data that are heritable, and characterize the components by linear combinations of clinical variables [19–23]. Thus, these methods are often called heritable component analysis. All existing methods decompose the variance of clinical data into two components: the variance due to additive genetic effects estimated from pedigrees; and the variance due to other effects (residuals). Then, they solve a generalized eigen-decomposition problem to identify the linear combination of the clinical variables that maximizes the ratio of additive-genetic variance versus the residual variance, thus leading to high heritability of the resultant linearly combined trait. Nearly all of these methods use pedigree-based heritability estimation (an exception is ), and all assume a genetic model that is based on a single causal variant, an assumption that is commonly violated for complex diseases.
Although the latest heritable component analysis method is effective and computationally efficient, a fundamental question is how much heritability of the derived trait can be explained by the genotyped SNPs. Because GWAS and genomic predictions mainly utilize the genotyped SNPs, the utility of the derived trait may be limited by a low chip heritability. Thus, novel statistical models are needed to directly target high chip heritability. In this paper, we propose an approach to identify the components of a multivariate disease phenotype that maximize the chip h2. To estimate the chip heritability of a given trait, the latest methods use the restricted maximum likelihood (REML) method, which assumes that the trait follows a mixed effect model with random genetic effects and fixed effects due to covariates, such as age, sex and race [8, 12]. To identify a trait of high chip h2, we need to solve the inverse problem of (chip) heritability estimation. In other words, we search for a trait (e.g., a linearly-combined trait) whose chip heritability is high when estimated using the REML method. Directly solving the inverse problem leads to a quadratic optimization problem that can be solved efficiently via a sequential quadratic programming algorithm. We validated the proposed approach in simulations as well as in the analysis of a real-world dataset that was aggregated from genetic studies of cocaine, opioid, and alcohol dependence. Our experimental results demonstrated the effectiveness and generalizability of the proposed approach.
The proposed statistical model
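where the additive genetic variance $\sigma_g^2$ and the residual variance $\sigma_\varepsilon^2$ can be estimated by the REML method [24, 11]. The chip heritability estimated on the m causal variants is computed as $h^2 = \sigma_g^2 / \sigma_p^2$, where $\sigma_p^2 = \sigma_g^2 + \sigma_\varepsilon^2$ is the total phenotypic variance. Because the causal variants of y are usually unknown for a trait, recent research has proposed to estimate a genetic relationship matrix (GRM) using genome-wide SNPs [8, 12].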
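As an illustration of the two quantities named above, the following minimal sketch computes a GRM from allele counts and the heritability ratio from two variance components. It assumes the widely used standardization in which each SNP is centered by twice its sample allele frequency and scaled by its binomial standard deviation; the function names `grm` and `chip_h2` are hypothetical and not taken from the paper or from GCTA.

```python
import math


def grm(genotypes):
    """Genetic relationship matrix A = W W^T / m.

    genotypes: n x m list of lists with 0/1/2 minor-allele counts.
    Each SNP column is standardized with its sample allele frequency p:
    w_ij = (x_ij - 2p) / sqrt(2p(1 - p)). Monomorphic SNPs are skipped.
    """
    n, m = len(genotypes), len(genotypes[0])
    columns = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2.0 * n)
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic marker carries no information
        sd = math.sqrt(2.0 * p * (1.0 - p))
        columns.append([(x - 2.0 * p) / sd for x in col])
    if not columns:
        raise ValueError("no polymorphic SNPs")
    used = len(columns)
    return [[sum(c[i] * c[k] for c in columns) / used for k in range(n)]
            for i in range(n)]


def chip_h2(sigma_g2, sigma_e2):
    """Heritability as the genetic share of the total phenotypic variance."""
    return sigma_g2 / (sigma_g2 + sigma_e2)


toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
A = grm(toy_genotypes)
```

Because every standardized SNP column sums to zero, each row of the resulting matrix also sums to zero, which is a quick sanity check on an implementation.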
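Given data on y, C and Z, the variance components $\sigma_g^2$ (additive genetic) and $\sigma_\varepsilon^2$ (residual) are obtained by maximizing the log likelihood of observing the trait values, i.e., the REML log likelihood. The chip heritability of a trait y is computed using the resultant optimal $\sigma_g^2$ and $\sigma_\varepsilon^2$.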
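For reference, the REML log likelihood of a linear mixed model is usually written in the following textbook form. This is a sketch under the assumption that the phenotypic covariance is $V = A\sigma_g^2 + I\sigma_\varepsilon^2$ with $A$ the GRM and $C$ the covariate matrix; the exact expression and equation numbering used by the authors may differ.

```latex
\begin{aligned}
V &= A\,\sigma_g^2 + I\,\sigma_\varepsilon^2, \\
P &= V^{-1} - V^{-1} C \left(C^\top V^{-1} C\right)^{-1} C^\top V^{-1}, \\
\log L &= -\tfrac{1}{2}\left( \log\lvert V\rvert + \log\lvert C^\top V^{-1} C\rvert + y^\top P\, y \right) + \text{const}.
\end{aligned}
```

Only the last term depends on the trait values y, and it is quadratic in y. This is consistent with the later statement that substituting y = Xw leads to a quadratic optimization problem in w.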
In our study, however, we solve the inverse problem of the above estimation model. A definitive quantitative trait y is not known beforehand but needs to be derived from a set of known clinical variables. Let X be the n×d data matrix of d clinical variables x for the same n subjects as in Z. A trait y is defined by a linear function $y = w^\top x$, where w is the vector of combination coefficients. Correspondingly, the vector of trait values is y = Xw. Unlike the heritability estimation process, which finds the best values of the additive genetic variance $\sigma_g^2$ and the residual variance $\sigma_\varepsilon^2$ to maximize the likelihood of observing the values of y, the inverse problem searches for the best w so as to form a trait y that maximizes the likelihood (or equivalently the log likelihood) of observing a large heritability, i.e., a large $\sigma_g^2$ but a small $\sigma_\varepsilon^2$. For simplicity and easy interpretation of the resultant model, here we only consider linear models, but the proposed method can be easily extended to construct non-linear models through kernel mapping.
where λ is a hyper-parameter that needs to be tuned, and two scaling terms are included to pre-balance the two terms in the objective function. The value of λ can either be chosen by users according to domain knowledge or determined using a cross-validation process, as done in our experiments. According to learning theory, minimizing the data-dependent term alone corresponds to empirical risk minimization, whereas minimizing the objective in Eq.(10) corresponds to structural risk minimization, which improves the generalizability of the resultant model. There are many different ways to define R(w). The L2 vector norm, defined by $\|w\|_2 = (\sum_{j=1}^{d} w_j^2)^{1/2}$, is a common choice. The L1 vector norm, defined by $\|w\|_1 = \sum_{j=1}^{d} |w_j|$, can be a better choice when model sparsity is required to select variables for use in the model. In more complicated applications where variables may be grouped and feature selection among groups is expected, a structured regularizer, such as the group lasso, can be used: $R(w) = \sum_k \|w_{G_k}\|_2$, where $G_k$ contains the indices of variables belonging to group k.
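The three regularizers mentioned above can be written in a few lines. The sketch below uses their standard definitions; the function names and the representation of groups as lists of indices are assumptions made for illustration.

```python
import math


def l2_norm(w):
    """Euclidean norm of the weight vector."""
    return math.sqrt(sum(x * x for x in w))


def l1_norm(w):
    """Sum of absolute weights; encourages sparse solutions."""
    return sum(abs(x) for x in w)


def group_lasso(w, groups):
    """Sum over groups of the L2 norm of each group's weights.

    groups: list of index lists, one list per variable group.
    """
    return sum(math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)


w_example = [3.0, -4.0, 0.0]
```

With `w_example`, the L2 norm is 5, the L1 norm is 7, and the group lasso with groups `[[0, 1], [2]]` is 5, because the second group contributes nothing. A group whose weights are all zero drops out of the model entirely, which is the group-level selection effect described in the text.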
The algorithm we describe next, although designed for Problem (11), can be modified to solve Problem (10) with other forms of the regularizer.
where f denotes the objective function, the g's denote the constraints, and e = 2d + 1 indicates the number of constraints in that group. It is straightforward to show that Eq.(12) is equivalent to Eq.(11) in the sense that at optimality w = u − v = γ(1 : d) − γ(d + 1 : 2d). When Eq.(12) reaches optimality, at least one of the two components $u_i$ and $v_i$ at any i-th position of the two vectors will be 0. Otherwise, by setting $\hat{u}_i = u_i - v_i$ and $\hat{v}_i = 0$ if $u_i \ge v_i$, or $\hat{u}_i = 0$ and $\hat{v}_i = v_i - u_i$ if $u_i < v_i$, we obtain a better solution with $\hat{u}$ and $\hat{v}$ than (u, v). Therefore, at optimality, $|w_i| = u_i + v_i$ for every i. Then, Eq.(12) becomes exactly the same as Eq.(11).
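The variable-splitting argument can be checked numerically. The sketch below, with hypothetical helper names, writes w as the difference of two non-negative vectors and shows that cancelling any overlap between them leaves w unchanged while reducing the sum of their entries to the L1 norm of w.

```python
def split(w):
    """Write w = u - v with u, v >= 0 and no overlap between them."""
    u = [max(x, 0.0) for x in w]
    v = [max(-x, 0.0) for x in w]
    return u, v


def cancel_overlap(u, v):
    """Remove the common part of u_i and v_i; u - v is unchanged."""
    u_hat = [max(a - b, 0.0) for a, b in zip(u, v)]
    v_hat = [max(b - a, 0.0) for a, b in zip(u, v)]
    return u_hat, v_hat


u, v = [3.0, 1.0], [1.0, 4.0]          # a feasible but wasteful split
u_hat, v_hat = cancel_overlap(u, v)     # the improved split
w = [a - b for a, b in zip(u, v)]
```

Here w is [2, -3] for both splits, but the sum of entries drops from 9 to 5, which equals the L1 norm of w. Any solver that penalizes the sum of u and v therefore has no reason to keep both components positive at the same position.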
We summarize the proposed algorithm in Algorithm 1. It has been proved that an SQP-based algorithm can converge to a local minimizer of the optimization problem (12).
Algorithm 1 A sequential quadratic programming approach to solving Problem (11)
We validated the proposed approach in both simulations and the analysis of a real-world data set that was aggregated from multiple genetic studies of cocaine dependence (CD).
Cocaine use and related behaviors data
We used the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA) dataset aggregated from genetic studies of drug dependence to evaluate the proposed algorithm. The SSADDA subjects were recruited from multiple sites, including the University of Connecticut Health Center, Yale University School of Medicine, the University of Pennsylvania School of Medicine, McLean Hospital and the Medical University of South Carolina. All subjects participated using procedures approved by the institutional review board at each participating site. There were 6,621 subjects genotyped with a total of 1,140,420 SNPs genome-wide. Among the subjects, 2,674 were stratified into the African American population using STRUCTURE software v2.3, and only these subjects were used in our experiments to avoid spurious findings due to population structure. We removed 537 subjects who had other family members in the data so that the GRM was computed for unrelated individuals.
1. Input: Z, C, X, λ.
2. Initialize γ with u = 1, v = 0.
3. Initialize the Lagrange multipliers α = 1.
4. Evaluate f, ∇f, ∇g_i and the Hessian of the Lagrangian with the current γ and α.
5. Solve Problem (13) to obtain the search directions for γ and α.
6. Perform a line search to find the step size s.
7. Update γ and α as in Eq.(14). Repeat steps 4-7 until γ reaches a fixed point.
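To make the evaluate, solve, update loop of a sequential quadratic programming method concrete, here is a deliberately simplified sketch on a toy problem that is not the paper's Problem (11): minimize x1 + x2 subject to the single equality constraint x1² + x2² = 2. It takes full Newton steps on the KKT system and omits both the line search and the inequality constraints that the algorithm above handles. All names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sqp_toy(x, lam, iters=20):
    """Minimize x1 + x2 subject to c(x) = x1^2 + x2^2 - 2 = 0."""
    for _ in range(iters):
        grad_f = [1.0, 1.0]
        c = x[0] ** 2 + x[1] ** 2 - 2.0
        grad_c = [2.0 * x[0], 2.0 * x[1]]
        h = -2.0 * lam  # Hessian of L = f - lam * c equals h * I
        # Quadratic subproblem written as its KKT linear system:
        # H p - grad_c * lam_new = -grad_f,  grad_c . p = -c
        kkt = [[h, 0.0, -grad_c[0]],
               [0.0, h, -grad_c[1]],
               [grad_c[0], grad_c[1], 0.0]]
        rhs = [-grad_f[0], -grad_f[1], -c]
        p1, p2, lam = solve_linear(kkt, rhs)
        x = [x[0] + p1, x[1] + p2]  # full step, no line search
    return x, lam


x_opt, lam_opt = sqp_toy([-1.5, -0.5], -0.5)
```

Started near the solution, the iterates approach x = (-1, -1) with multiplier -0.5. A production implementation would add a line search on a merit function, as in step 6, to make convergence robust to poor starting points.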
A series of data cleaning steps were performed to ensure the quality of genotypic markers. Markers that met any of the following conditions were excluded: low call rate (<98% of subjects received values for the marker), G/C and A/T markers (to avoid strand issues), deviation from Hardy-Weinberg equilibrium at $p < 10^{-8}$, significant cohort calling discrepancy, and being monomorphic. We also removed non-autosomal markers, so that only markers from the 22 autosomal chromosomes were used in the analysis. After these data cleaning steps, 690,864 SNPs remained. Genetic relationship was estimated for each pair of subjects by the genome-wide complex trait analysis (GCTA) software using all 690,864 SNPs. We then excluded 385 subjects whose relatedness to some subjects was greater than 0.025 (corresponding to the relatedness of second cousins). The remaining sample, 1,752 subjects, was used in our analysis.
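The marker filters and the relatedness cutoff described above can be sketched as two small functions. This is an illustration only: the function names and argument layout are hypothetical, the cohort calling discrepancy check is omitted, and the greedy subject selection is an assumption that need not match the procedure GCTA applies internally.

```python
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def keep_marker(call_rate, alleles, hwe_p, minor_allele_count, autosomal=True):
    """Return True if a marker passes the quality filters in the text."""
    if call_rate < 0.98:
        return False                      # low call rate
    if frozenset(alleles) in AMBIGUOUS_PAIRS:
        return False                      # A/T or G/C strand ambiguity
    if hwe_p < 1e-8:
        return False                      # Hardy-Weinberg deviation
    if minor_allele_count == 0:
        return False                      # monomorphic
    return autosomal                      # drop non-autosomal markers


def unrelated_subset(grm, cutoff=0.025):
    """Greedily keep subjects whose relatedness to all kept ones is <= cutoff."""
    kept = []
    for i in range(len(grm)):
        if all(grm[i][j] <= cutoff for j in kept):
            kept.append(i)
    return kept


toy_grm = [[1.0, 0.03, 0.0],
           [0.03, 1.0, 0.01],
           [0.0, 0.01, 1.0]]
kept_subjects = unrelated_subset(toy_grm)
```

In the toy matrix, subjects 0 and 1 exceed the cutoff, so the greedy pass keeps subjects 0 and 2. The result of a greedy pass depends on subject order, which is one reason a dedicated tool is preferable for real data.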
All subjects were interviewed with a computer-assisted assessment system called the SSADDA, which consists of survey questions designed for cocaine use and related behaviors. All subjects reported having used cocaine in their lifetime. The responses to those questions in the SSADDA led to the definition of thirteen important cocaine-use-related variables, based on which a diagnosis of CD was determined. There were seven binary variables as listed below, which represent the seven cocaine dependence criteria in DSM-IV.
F1 - tolerance to cocaine;
F2 - withdrawal from cocaine;
F3 - using cocaine in larger amounts or over longer period than intended;
F4 - persistent desire or unsuccessful efforts to cut down or control cocaine use;
F5 - great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine;
F6 - gave up or reduced important social, occupational, or recreational activities because of cocaine use;
F7 - cocaine use despite knowledge of persistent or recurrent physical or psychological problems likely to have been caused or exacerbated by cocaine.
In our experiments, positive responses to the seven variables were coded as 1 and negative responses were coded as 0. We also included six continuous variables in the analysis as listed below:
F8 - number of cocaine symptoms endorsed;
F9 - age when first used cocaine;
F10 - age when last used cocaine;
F11 - age when first diagnosed with DSM-IV cocaine dependence;
F12 - age when last diagnosed with DSM-IV cocaine dependence;
F13 - transition time in years between the first cocaine use and the first cocaine dependence diagnosis.
All these variables were normalized to the range of [0, 1] in the analysis.
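Scaling a variable to the range [0, 1] is typically done with min-max normalization. The sketch below assumes that convention, since the text does not state the exact formula; the function name is hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant variable: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


ages_first_use = [18, 24, 30]
scaled = min_max(ages_first_use)
```

Binary variables coded 0/1 are left unchanged by this transformation, so the same function can be applied to all thirteen variables.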
Following the same design principle used in the simulations for testing chip heritability in , we used the real-life genotypic data in the CD study but synthesized phenotypic data. We simulated quantitative traits based on the mixed-effect linear model shown in Eq.(1). We first synthesized a dataset that contained 5 phenotypic features, all of which were created with moderate to high heritability, and were used to form a quantitative trait of very high heritability reaching 0.8. We then added irrelevant features, which varied mainly due to covariates, to create five other simulated datasets. These datasets consisted of 10, 20, 30, 40 and 50 features where only the first 5 of them were used in the model of the final trait. These datasets were used to determine whether the proposed algorithm could identify the right features for use in the model.
To synthesize features with genetic effects, we randomly picked 2,000 of the 690,864 SNPs in the cocaine use data set and used them as the causal variants of these features. The random effect coefficient $u_j$ associated with each of the 2,000 markers was generated independently by sampling from the standard normal distribution N(0, 1). The residual component $\varepsilon_i$ for each individual was drawn from the normal distribution of mean 0 and variance $\mathrm{var}(z_i u)(1/h^2 - 1)$, where $z_i$ is the i-th row of Z, var(·) is the sample variance of a random variable and h2 is the heritability of the feature. To synthesize features with no genetic effects, we ignored the term Zu in Eq.(1) and created ε by randomly sampling from the standard normal distribution. To further synthesize features with fixed covariate effects, we used sex and age of the individuals in the CD study as the covariates, and arbitrarily set their effects, i.e., the β coefficients, to 0.2 and 0.5.
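The residual variance formula above guarantees the target heritability, because var(g) / (var(g) + var(g)(1/h² − 1)) = h². The sketch below simulates one such feature. It is a simplification: genotypes are randomly generated allele counts rather than real SNP data, they are not standardized, fixed effects are omitted, and all names are hypothetical.

```python
import random
import statistics


def residual_variance(genetic_values, h2):
    """Residual variance that yields heritability h2 for these genetic values."""
    return statistics.variance(genetic_values) * (1.0 / h2 - 1.0)


def simulate_trait(genotypes, h2, rng):
    """Return (trait values, genetic values) for one simulated feature."""
    m = len(genotypes[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(m)]           # SNP effects
    g = [sum(z * b for z, b in zip(row, u)) for row in genotypes]
    sd_e = residual_variance(g, h2) ** 0.5
    y = [gi + rng.gauss(0.0, sd_e) for gi in g]           # add residuals
    return y, g


rng = random.Random(0)
# 2,000 simulated subjects, 50 causal SNPs, allele frequency 0.3
genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(50)]
             for _ in range(2000)]
y, g = simulate_trait(genotypes, 0.5, rng)
realized_h2 = statistics.variance(g) / statistics.variance(y)
```

The realized ratio fluctuates around the target because the sampled residuals have a sample variance and a sample covariance with g that differ slightly from their expected values.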
We evaluated the proposed method in two different experimental settings:
Setting 1: This setting assumed that there were no covariate effects in the quantitative trait. The five relevant features were simulated as follows. We used the procedure described in the above paragraph to create four features with h2 equal to 0.2: x1, · · · , x4. Then we simulated the final quantitative trait y1 with h2 = 0.8 using the same procedure. A five-entry weight vector was created with arbitrary values, such as w = [0.22, 0.67, 0.60, 0.30, 0.22], used in our experiments. Then, the fifth feature was directly computed as $x_5 = (y_1 - \sum_{i=1}^{4} w_i x_i)/w_5$. By simulating the data in this way, we knew that there was at least one linear combination of the five features in the data that would result in a composite trait (i.e., y1) with h2 of 0.8. Hence, if our approach worked, it should at least find this linear combination, provided that no other combination gave an even higher h2. Note that the heritability of the fifth feature had to be estimated empirically, but given how it was created, there were genetic effects in this feature.
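The construction of the fifth feature can be verified directly: solving y1 = w1 x1 + ... + w5 x5 for x5 and substituting back must reproduce y1. The sketch below uses the weight vector quoted in the text with made-up feature values; the function name is hypothetical.

```python
def fifth_feature(y, features, w):
    """Solve y = sum_k w_k x_k for x_5, given x_1..x_4.

    features: list of four equally long lists (x_1 to x_4)
    w: five weights; w[4] must be non-zero
    """
    n = len(y)
    return [(y[i] - sum(w[k] * features[k][i] for k in range(4))) / w[4]
            for i in range(n)]


w = [0.22, 0.67, 0.60, 0.30, 0.22]
x1_to_x4 = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.9], [0.7, 0.6]]  # toy values
y1 = [1.0, 2.0]                                               # toy trait
x5 = fifth_feature(y1, x1_to_x4, w)
# Recombine all five features with the same weights
recombined = [sum(w[k] * x1_to_x4[k][i] for k in range(4)) + w[4] * x5[i]
              for i in range(len(y1))]
```

`recombined` matches `y1` up to floating-point rounding, which is the property the simulation relies on.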
We then created 45 other features that had no genetic effects, and added a certain number of these features to the original 5 features to create 5 other datasets. Hence, there were in total 6 datasets for 1,752 subjects with 5, 10, 20, 30, 40 and 50 features. We used this set of data (i.e., the discovery set) in training to retrieve the combination models. Then we repeated the above procedure to create another independent set of data (i.e., the validation set) to validate the resultant models.
Setting 2: This setting assumed that the two covariates, sex and age, had fixed effects on the features and the final trait. We generated 5 features by adding fixed effects to the 5 useful features created in Setting 1. Because fixed effects do not change h2 of a trait, we computed a composite trait y2 using the same pre-specified weight vector w that was used in Setting 1. We then created 45 other features with only covariate effects using the procedure described earlier. Five other datasets were generated consisting of 10, 20, 30, 40 and 50 features. Note that the optimal weight vector for these datasets should have zero entries for all features except the first 5 features that were synthesized. Similarly, a discovery suite of the six datasets and a validation suite were synthesized using the same procedure for training and validation, respectively.
We estimated the chip h2 of the features created in the synthetic datasets using GCTA software. The four features synthesized with a pre-specified h2 = 0.2 had empirical chip h2 values 0.2 ± 0.01 in these datasets. The chip h2 estimate of the fifth feature was 0.57 in the discovery set and 0.6 in the validation set. Because fixed effects do not affect trait heritability, the five relevant features and the final quantitative traits had the same empirical chip h2 in Settings 1 and 2. The features simulated with no genetic effects had h2 estimates that ranged from 0 to 0.05, and most of these features had estimates less than $10^{-5}$.
The proposed analyses
We first validated the proposed approach in a variety of experiments with the synthetic data. Then we applied our approach to the real-life cocaine use data to identify important components or subtypes of the disease defined by linear combinations of clinical features. Such a combination can be used to define a disease subtype because it produces a quantitative trait for each individual, which amounts to the membership likelihood of the individual in a subtype. Because the actual causal variants were known for synthetic data, we calculated the GRM of the individuals directly using the causal variants. In the case study for CD, because the real causal variants were unknown, we followed the commonly-used procedure in the literature on chip heritability estimation  and computed the GRM using all 690,864 SNPs that remained in the data. All of the reported chip heritability was estimated using GCTA software.
Tuning of the hyperparameter: For both the simulation and the CD case study, we performed 10 times three-fold cross validation (CV) to help determine a proper value of λ. At each fold of the CV, a linear model was derived by running the proposed method on 2/3 of the data in the dataset, and then tested on the remaining 1/3 of the data. The cross validated h2 of the derived trait was estimated using only subjects in the remaining 1/3 of the data which was not used to train the model. We ran the same CV process for each pre-specified choice of λ (the choices we used are reported in the results section) and chose the λ value that gave a trait of the highest cross validated h2 for each experimental setting.
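The tuning procedure can be sketched as a generic repeated three-fold cross validation. In this sketch `fit` and `score` are hypothetical callables supplied by the user: `fit(train_indices, lam)` would run the proposed optimization on the training subjects, and `score(model, test_indices)` would return the chip h2 of the derived trait on the held-out subjects (for example by calling an external heritability estimator). Neither is implemented here.

```python
import random


def three_fold_splits(n, rng):
    """Yield (train, test) index lists for one random three-fold partition."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::3] for k in range(3)]
    for k in range(3):
        train = [i for j in range(3) if j != k for i in folds[j]]
        yield train, folds[k]


def tune_lambda(n, lambdas, fit, score, repeats=10, seed=0):
    """Return the lambda with the highest total held-out score."""
    rng = random.Random(seed)
    totals = {lam: 0.0 for lam in lambdas}
    for _ in range(repeats):
        for train, test in three_fold_splits(n, rng):
            for lam in lambdas:
                totals[lam] += score(fit(train, lam), test)
    return max(lambdas, key=lambda lam: totals[lam])


# Toy stand-ins: the "model" is lambda itself, and the score peaks at 0.01
grid = [i * 0.002 for i in range(21)]
best = tune_lambda(30, grid,
                   fit=lambda train, lam: lam,
                   score=lambda model, test: -(model - 0.01) ** 2)
```

Every candidate λ is evaluated on the same random partitions, so differences in the totals reflect the hyperparameter rather than the split.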
Evaluation metrics: We reported and investigated the CV performance (including the mean values and standard deviations of the validation h2 obtained in the CV process described above) for each λ choice in each experiment. Once the best value of λ was chosen through the cross validation for a dataset, we applied the proposed approach with the best λ to the entire data in the dataset to derive a quantitative trait. The chip heritability of this trait was estimated using the separately-synthesized validation datasets in simulations and by another cross validation process in the case study. In other words, in simulations, we estimated the validation chip h2 using the trait values computed by the linear model on the newly-synthesized validation samples. In the CD case study, we computed the trait h2 using SNPs of 2/3 of the subjects randomly sampled from the dataset and repeated the random sampling 10 times to report the averaged h2 value. We named this process the evaluation CV. Moreover, besides the heritability as a major evaluation metric, we also measured the effectiveness of our approach by comparing the derived trait models and the linear model implanted in the simulated data. We calculated the squared difference between the learned weights $\hat{w}$ and the true weights w, i.e., $\|\hat{w} - w\|_2^2$, and the mean of squared residuals, and reported the values in plots. Additional evaluation steps were conducted in the case study to clinically interpret the resultant quantitative trait (see the later paragraph).
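The two model-comparison metrics can be computed as follows. The squared weight difference follows directly from the text. For the mean of squared residuals, this sketch assumes the residuals are taken between the trait values predicted by the learned weights and the implanted trait values, which is an interpretation rather than something the text states. Function names are hypothetical.

```python
def squared_weight_difference(w_hat, w_true):
    """Squared Euclidean distance between learned and true weights."""
    return sum((a - b) ** 2 for a, b in zip(w_hat, w_true))


def mean_squared_residual(X, w_hat, y_true):
    """Mean squared difference between X w_hat and the implanted trait."""
    predicted = [sum(x * w for x, w in zip(row, w_hat)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(predicted, y_true)) / len(y_true)


X_toy = [[1.0, 0.0], [0.0, 1.0]]
weight_gap = squared_weight_difference([1.0, 2.0], [1.0, 0.0])
residual = mean_squared_residual(X_toy, [1.0, 2.0], [1.0, 0.0])
```

Because heritability is unchanged when a trait is multiplied by a constant, learned and true weights may need to be put on a common scale before these metrics are compared across runs.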
Comparison: The validated chip h2 of our derived traits was compared with that of all quantitative features in the data in both simulations and the CD case study. In each of the experiments, our derived trait was also compared with the commonly-used disease phenotype, often referred to as a symptom count, which was the quantitative trait created by equally weighted aggregation of all features in the data. Given that no prior method existed to identify heritable disease components using the genome-wide SNPs, on the real-life cocaine use data, we compared our approach with a recently published method that aimed to derive linearly-combined traits using pedigrees of related individuals. This comparison considered whether a pedigree-based heritable component analysis method can identify a disease component with a chip heritability comparable to that found by our approach. As multi-member families were included in the original cocaine use dataset, i.e., a superset of the sample used by our approach, it was feasible to apply the pedigree-based method to derive a trait. Then we computed the trait values on the unrelated individuals used by our approach to compare the chip h2 of the two approaches using the evaluation CV. Note that the prior pedigree-based approach was actually given an advantage because it used the superset of 2,674 subjects (unrelated individuals were treated as one-member pedigrees) to derive the trait, whereas our approach used only the 1,752 unrelated individuals.
Clinical interpretation: It is very important to understand the clinical implications of the quantitative trait (or an empirical subtype) derived by our approach from the aggregated CD study data. From prior work [17, 31], we identified three key steps to ensure the clinical validity of an empirical subtype. We first examined the specific features selected by our approach for use in the model. Second, we studied the distribution (or histogram) of the quantitative scores among the 1,752 subjects. From the distribution plot, we examined whether there were obvious subgroups of the scores. Third, the subgroups of subjects were characterized and compared on 11 of the most important clinical variables reflecting cocaine use and related behaviors including both the features selected and those not selected for use in the linear model. The individuals receiving very high or very low values of the quantitative trait may show the most representative features of the subtype.
We pre-specified 21 different λ values ranging from 0 to 0.04 with step size 0.002 for use in the cross-validation tuning process. The validation or test h2 for each λ choice was plotted for each of the six datasets in Figure 1 (for Setting 1 where data were generated without covariate effects) and Figure 2 (for Setting 2 where data were generated with covariate effects). The mean, median, and the standard deviations of the test h2 values in the cross validation were plotted for each tested λ. These two figures show that our approach could identify components (quantitative traits) with test h2 estimate of ~ 0.8, which was the heritability of the implanted heritable component (the simulated true model), for all datasets even with many irrelevant features in some of the datasets for both settings. This result demonstrates that our approach identified highly heritable disease components and could successfully correct for fixed covariate effects.
A case study of cocaine dependence
Figure 8 shows the distribution of the trait values (i.e., the membership scores) of the subjects. It shows that based on the scores, the samples can be partitioned into four subgroups. There were 250 subjects (14.27% of total) in Group 1, which had a mean score of -2.22. Group 2 consisted of 323 subjects and comprised 18.44% of the entire sample set. Its mean score was -0.8. Group 3 was the largest one and consisted of 821 subjects (46.86% of the sample). The mean score of this group was -0.2. Group 4 was the smallest, comprised of 237 individuals (13.53% of the sample), with a mean score of 1.22.
Characteristics of the three subject groups on important clinical variables related to cocaine use.
Tolerance to cocaine
Withdrawal from cocaine
Using cocaine in larger amounts or over longer period than intended
Persistent desire or unsuccessful efforts to cut down cocaine use
Great amount of time spent in activities related to cocaine
Gave up or reduced important activities because of cocaine use
Cocaine use despite knowledge of problems likely caused by cocaine
Number of CD criteria endorsed
Age when first used cocaine
Age of onset of DSM-IV cocaine dependence
Transition time in years from first cocaine use to first CD diagnosis
We developed an approach to identify composite traits from multivariate phenotypes that are highly heritable, as estimated using genome-wide SNPs. The trait we derived is in the form of a linear combination of variables related to the phenotype, that is y = Xw. A quadratic optimization problem was formulated, in which optimal w was sought to optimize the log likelihood for estimating variance components in REML. In this formulation, variance components are set to their ideal values with the additive genetic variance component equal to 1 and other components equal to 0. To avoid overfitting, we incorporated a regularization term in our formulation. An efficient algorithm based on the sequential quadratic programming framework was developed to solve the proposed optimization problem. We evaluated the proposed approach on both synthetic and real world data. The empirical results demonstrate the effectiveness of our approach as a means to identify traits with much higher chip h2 than commonly-used disease phenotypes.
In this paper, the pairwise genetic relationship among subjects was estimated from genome-wide SNPs. However, it can also be estimated from SNPs restricted to a specific region, such as on a particular chromosome or in genes related to a pathway, to explore the genetic architecture of a trait. When SNPs within a specific region are used, the trait resulting from the proposed approach will achieve the maximized genetic variance component corresponding to this region. In an application, such as substance dependence, there are known pathways involved, so it may be of utility to determine whether there is a composite trait, the variance of which can be largely explained by the variants within the pathways. This will be a future application of our approach.
This work was supported by NIH grant R01DA037349, NSF grant DBI-1356655, and IIS-1447711. Jinbo Bi was also supported by NSF grants IIS-1320586 and IIS-1407205. Henry R. Kranzler was also supported by NIH grants R01DA030976, R01AA021164, R01AA023192, and R01CA184315.
We thank Joel Gelernter, M.D. from Yale University who was instrumental in recruiting, characterizing, and genotyping the subjects in the SSADDA dataset used here. Kathleen Brady, M.D., Ph.D. of the Medical University of South Carolina, Roger Weiss, M.D., of McLean Hospital and Harvard Medical School, and David Oslin, M.D., of the University of Pennsylvania Perelman School of Medicine oversaw study recruitment at their respective sites. That work was supported by NIH grants R01AA011330, R01AA017535, R01DA012690, R01DA018432, and R01DA012849.
Publication costs for this article were funded by NIH grant R01DA037349.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 3, 2015: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.