 Research
 Open Access
 Published:
Refining multivariate disease phenotypes for high chip heritability
BMC Medical Genomics volumeÂ 8, ArticleÂ number:Â S3 (2015)
Abstract
Background
Statistical genetics shows that the success of both genetic association studies and genomic prediction methods is positively associated with the heritability of the trait used in the analysis. Identifying highly heritable components of a complex disease can thus enhance genetic studies of the disease. Existing heritable component analysis methods use data from related individuals to compute linearlycombined traits to maximize heritability. Recent advances in acquiring genomewide markers have enhanced heritability estimation using genotypic data from apparently unrelated individuals, which is referred to as the chip heritability. Novel statistical models are thus needed to identify disease components (subtypes) with high chip heritability.
Methods
We propose an optimization approach to identify highly heritable components of a complex disease as a function of multiple clinical variables. The heritability of the components is estimated directly from unrelated individuals using their genomewide single nucleotide polymorphisms. The proposed approach can also model the fixed effects due to covariates, such as age and race, so that the derived traits have high chip heritability after correcting for fixed effects. A new sequential quadratic programming algorithm is developed to efficiently solve the proposed optimization problem.
Results
The proposed algorithm was validated both in simulations and the analysis of a realworld dataset that was aggregated from genetic studies of cocaine, opoid, and alcohol dependence. Simulation studies demonstrated that the proposed approach could identify the hypothesized component from multiple synthesized features. A case study on cocaine dependence (CD) identified a quantitative trait that achieved chip heritability of 0.86 estimated using a crossvalidation process. This quantitative trait corresponded to the likelihood of an individual's membership in a CD subtype. Clinical analysis showed that the subtype enclosed individuals who reported heavy use of cocaine but few withdrawal symptoms.
Conclusions
Extensive experiments on both synthetic and realworld data demonstrate the effectiveness of the proposed approach as a means to find meaningful disease components with high chip heritability.
Introduction
Identifying genetic variation that underlies complex diseases has important implications in medicine. To date, genomewide association studies (GWAS) have had limited success in dissecting the genetic etiology of complex diseases. For instance, very few associations identified for substance use disorders at a genomewide significant level have been replicated [1â€“3]. Complex disorders are often characterized by multiple disease indicators. For example, to diagnose whether a patient has a lifetime drug dependence disorder, clinicians interview the patient to understand his or her drug use behaviors, the negative consequences of the drug use, the treatment history and other cooccuring medical conditions. All of these clinical variables are used to arrive at a diagnosis of dependence on a certain drug [4]. There is substantial variation in these variables in the disease population, and these variables also present different levels of heritability, i.e., some are more genetically influenced than others. This phenotypic heterogeneity diminishes evidence of genetic association. Statistical genetics also shows that the success of most gene discovery studies is positively associated with the heritability of the trait used in the association analysis [5]. Hence, identifying more homogeneous and highly heritable components of a complex disease could enhance the association analysis.
The ability to translate genotype information into a quantitative prediction of disease phenotypes is important for precision medicine [6]. Genomic prediction methods that predict a phenotype based on genomewide single nucleotide polymorphisms (SNPs) may provide a suitable analytic tool [7, 8]. These methods expand the traditional singlemarkerregressionbased GWAS model for detecting few variants of large effect to multimarker predictive models with many variants of small effect. The predictive ability of genomic prediction methods relies on several factors, especially trait heritability [9]. If we identify higly heritable components of a complex disease, it could also improve the utility of genomic prediction methods to predict subtypes (defined by the components) of the disease.
Because the success of both association analysis and genomic prediction is dependent on the trait heritability, heritability can be a valid target for refining multivariate disease phenotypes. The narrowsense heritability h^{2} is defined by the percentage of phenotypic variance that is due to additive genetic effects [10]. The broadsense heritability H^{2} is defined as the overall genetic contribution to the phenotypic variation. The heritability of a quantitative trait is commonly estimated from related individuals in pedigrees. Recent advances in acquiring dense genomewide genetic markers have enhanced heritability estimation from apparently unrelated individuals using their genomewide SNPs. The SNPbased heritability, often referred to as the chip heritability, is defined as the portion of the phenotypic variation that can be explained by the genotyped genetic markers [11]. It has been argued that estimating h^{2} from unrelated individuals has an advantage over traditional pedigreebased methods because the estimated chip h^{2} corresponds only to the causalvariant heritability that is tagged by the genotyped SNPs [8, 12].
Phenotype refinement is an important but underdeveloped genetics research area. Unsupervised cluster analysis or latent class analysis has been commonly used to partition a study population into subgroups based on clinical variables [13â€“18]. This approach can create subgroups of individuals that differ in clinical symptoms and features, but may have limited utility in genetic analysis. Because genetic data are not used during the creation of the subgroups, the resultant subtypes (subgroups) are not guaranteed to have high heritability, and hence may not be informative for genetic association.
More relevant to this present work, a number of prior methods identify the principal components of clinical data that are heritable, and characterize the components by linear combinations of clinical variables [19â€“23]. Thus, these methods are often called heritable component analysis. All existing methods decompose the variance of clinical data into two components: the variance due to additive genetic effects estimated from pedigrees; and the variance due to other effects (residuals). Then, they solve a generalized eigendecomposition problem to identify the linear combination of the clinical variables that maximizes the ratio of additivegenetic variance versus the residual variance, thus leading to high heritability of the resultant linearly combined trait. Nearly all of these methods use pedigreebased heritability estimation (an exception is [23]), and all assume a genetic model that is based on a single causal variant, an assumption that is commonly violated for complex diseases.
Although the latest heritable component analysis method [23] is effective and computationally efficient, a fundamental question is how much heritability of the derived trait can be explained by the genotyped SNPs. Because GWAS and genomic predictions mainly utilize the genotyped SNPs, the utility of the derived trait may be limited by a low chip heritability. Thus, novel statistical models are needed to directly target high chip heritability. In this paper, we propose an approach to identify the components of a multivariate disease phenotype that maximizes the chip h^{2}. To estimate the chip heritability of a given trait, the latest methods use the restricted maximum likelihood (REML) method, which assumes that the trait follows a mixed effect model with random genetic effects, and fixed effects due to covariates, such as age, sex and race [8, 12]. To identify a trait of high chip h^{2}, we need to solve the inverse problem of (chip) heritability estimation. In other words, we now search for a trait (e.g., a linearlycombined trait) so that its chip heritability is high when estimated using the REML method. Directly solving the inverse problem leads to a quadratic optimization problem that can be optimized efficiently via a sequential quadratic programming algorithm. We validated the proposed approach in simulations as well as in the analysis of a realworld dataset that was aggregated from genetic studies of cocaine, opioid, and alcohol dependence. Our experimental results demonstrated the effectiveness and generalizability of the proposed approach.
Methods
The proposed statistical model
Given a set of n subjects, we denote their trait values of a quantitative trait y by a vector y of length n. We use a matrix Z_{ nÃ—m } to represent their standardized genotypic data at m genetic markers, and C_{ nÃ—p } to represent their data on p covariates. The matrix Z is calculated from the genotypic data as follows. Let f_{ j } be the frequency of reference allele at the jth genetic variant, r_{ ij } be the number of copies of reference allele that the ith subject has at the jth locus. The standardized genotype z_{ ij } is calculated as {r}_{ij}\mathsf{\text{2}}{f}_{j}/\sqrt{\mathsf{\text{2}}{f}_{j}\left(\mathsf{\text{1}}{f}_{j}\right)}[8]. The chip heritability estimation method assumes the following mixedeffect linear model [8, 12] that characterizes how a phenotype is related to genotypes and covariates:
where Îµis a vector of length n, which specifies residual effects. In Eq.(1), all covariates create fixed effects (fixed Î²) on the phenotype whereas genetic effects are random (random u). Assume that u and Îµfollow Gaussian distributions: u ~ N (0, I{{\mathrm{\xcf\u0192}}_{e}}^{\mathsf{\text{2}}}) and Îµ ~ N (0, I{{\mathrm{\xcf\u0192}}_{u}}^{\mathsf{\text{2}}}). Then, the covariance of y between individuals, denoted by Î©_{ nÃ—n }, can be calculated as:
Let {{\mathrm{\xcf\u0192}}_{g}}^{\mathsf{\text{2}}} be the phenotypic variance attributable to all of the m genetic causal variants. Then, we have {\mathrm{\xcf\u0192}}_{g}^{2}=m{\mathrm{\xcf\u0192}}_{u}^{2}. Let G = ZZ^{T} /m, which is referred to as the genetic relationship matrix (GRM) among subjects determined by the causal variants. Then Eq. (2) can be rewritten as:
where{{\mathrm{\xcf\u0192}}_{g}}^{\mathsf{\text{2}}} and{{\mathrm{\xcf\u0192}}_{e}}^{\mathsf{\text{2}}} can be estimated by the REML method [24, 11]. The chip heritability estimated on the m causal variants is computed as{{h}_{p}}^{2}={{\mathrm{\xcf\u0192}}_{g}}^{2}/{{\mathrm{\xcf\u0192}}_{e}}^{2}, where {{\mathrm{\xcf\u0192}}_{p}}^{2}={{\mathrm{\xcf\u0192}}_{g}}^{2}+{{\mathrm{\xcf\u0192}}_{e}}^{2} is the total phenotypic variance. Because the causal variants of y are usually unknown for a trait, recent research has proposed to estimate a GRM using genomewide SNPs [8, 12].
The main idea of REML for estimating the variance components is to first eliminate the fixed effect due to covariates from the observed values of y and then estimate the variance components from the random effect part. The REML finds n basis vectors represented by columns of a matrix L_{ nÃ—n }. This matrix has two submatrices L = [L_{1} L_{2}] with L_{1} of size n Ã— p and L_{2} of size n Ã— (n âˆ’ p). The two submatrices satisfy {L}_{1}^{T}C={I}_{p\xc3\u2014p}, and {L}_{2}^{T}C=0. Let\stackrel{\xcc\u0192}{\mathbf{\text{y}}}={\mathbf{\text{L}}}^{T}\mathbf{\text{y}}, {\stackrel{\xcc\u0192}{y}}_{1}={\mathbf{\text{L}}}_{1}^{T}\mathbf{\text{y}} and {\mathrm{\xe1\xbb\xb8}}_{\mathsf{\text{2}}}={L}_{2}^{T}y. It can be derived that á»¹ follows the following multivariate Gaussian distribution given the multivariate Gaussian assumption of y:
We have {\stackrel{\xcc\u0192}{y}}_{2}~N\left(0,{\mathbf{\text{L}}}_{2}^{T}\text{\xce\xa9}{\mathbf{\text{L}}}_{2}\right) and the conditional distribution:
Then, the log likelihood of á»¹ can be decomposed into:
Where â„“_{2} is not a function of the fixedeffect parameter Î². The two variance components, i.e.,{\mathrm{\xcf\u0192}}_{g}^{2} and{\mathrm{\xcf\u0192}}_{e}^{2} can be estimated by maximizing l_{2}, and there is no additional information in â„“_{1} for estimating the variance components. Once {\mathrm{\xcf\u0192}}_{g}^{2} and {\mathrm{\xcf\u0192}}_{e}^{2} are estimated, a generalized least squares estimate of Î²can be obtained as:
The second log likelihood component â„“_{2} is calculated as (after removing constant):
It has been shown in an early work [25] that when {L}_{1}^{T}C={I}_{p\xc3\u2014p} and {L}_{2}^{T}C=0, we have
Substituting these equations into the calculation of â„“_{2} yields:
where P={\text{\xce\xa9}}^{1}{\text{\xce\xa9}}^{1}C{\left({C}^{T}{\text{\xce\xa9}}^{1}C\right)}^{1}{C}^{T}{\text{\xce\xa9}}^{1} and Î²can be obtained by:
Given data on y, C and Z, {\mathrm{\xcf\u0192}}_{g}^{2} and {\mathrm{\xcf\u0192}}_{e}^{2} are obtained by maximizing the log likelihood of observing the trait values \mathrm{\xe2\u201e\u201c}\left({\mathrm{\xcf\u0192}}_{g}^{2},{\mathrm{\xcf\u0192}}_{e}^{2};y\right) which corresponds to maximizing {\mathrm{\xe2\u201e\u201c}}_{2}\left({\mathrm{\xcf\u0192}}_{g}^{2},{\mathrm{\xcf\u0192}}_{e}^{2};{\stackrel{\xcc\u0192}{\mathbf{\text{y}}}}_{2}\right)[11]. The chip heritability of a trait y is computed using the resultant optimal {\mathrm{\xcf\u0192}}_{g}^{2} and {\mathrm{\xcf\u0192}}_{e}^{2}.
In our study, however, we solve the inverse problem of the above estimation model. A definitive quantitative trait y is not known beforehand but needs to be derived from a set of known clinical variables. Let X_{ nÃ—d } be the data matrix of d clinical variables x for the same n subjects as in Z. A trait y is defined by a linear function of y = w^{â”¬}x where w is the vector of combination coefficients. Correspondingly, the trait values y = Xw. Unlike the heritability estimation process that finds the best values of {\mathrm{\xcf\u0192}}_{g}^{2} and {\mathrm{\xcf\u0192}}_{e}^{2} to maximize the likelihood of observing the values of y, the inverse problem searches for the best w so to form a trait y that maximizes the likelihood, (or equivalently the log likelihood \mathrm{\xe2\u201e\u201c}\left({\mathrm{\xcf\u0192}}_{g}^{2},{\mathrm{\xcf\u0192}}_{e}^{2};y,C,Z\right)), of observing a large heritability, i.e., a large {\mathrm{\xcf\u0192}}_{g}^{2} but small {\mathrm{\xcf\u0192}}_{e}^{2}. For simplicity and easy interpretation of the resultant model, here we only consider linear models, but the proposed method can be easily extended to construct nonlinear models through kernel mapping [26].
Notice that the highest possible heritability of a trait y is 1 when {\mathrm{\xcf\u0192}}_{g}^{2}=1 and {\mathrm{\xcf\u0192}}_{e}^{2}=0. We hence propose to formulate an optimization problem, in which we search for the optimal w that maximizes the log likelihood \mathrm{\xe2\u201e\u201c}\left({\mathrm{\xcf\u0192}}_{g}^{2},{\mathrm{\xcf\u0192}}_{e}^{2};y,C,Z\right) (or equivalently, {\mathrm{\xe2\u201e\u201c}}_{2}\left({\mathrm{\xcf\u0192}}_{g}^{2},{\mathrm{\xcf\u0192}}_{e}^{2};{\stackrel{\xcc\u0192}{\mathbf{\text{y}}}}_{2},\right)) of observing {\mathrm{\xcf\u0192}}_{g}^{2}=1 and {\mathrm{\xcf\u0192}}_{e}^{2}=0. According to Eq.(3), the covariance matrix Î© = G when {\mathrm{\xcf\u0192}}_{g}^{2}=1 and {\mathrm{\xcf\u0192}}_{e}^{2}=0. We substitute the values of these parameters into the log likelihood Eq.(4), and remove any constant terms. The resultant maximization problem is equivalent to the following minimization problem:
where P is calculated as:
When {\mathrm{\xcf\u0192}}_{g}^{2}=1 and {\mathrm{\xcf\u0192}}_{e}^{2}=0, we have {\mathrm{\xcf\u0192}}_{p}^{2}=1 because {\mathrm{\xcf\u0192}}_{p}^{2}={\mathrm{\xcf\u0192}}_{g}^{2}+{\mathrm{\xcf\u0192}}_{e}^{2} This requires to impose a constraint to the optimization problem so that the total phenotypic variance that is due to either genetic or environmental effect should be scaled to1. An estimate of {\mathrm{\xcf\u0192}}_{p}^{2} can be obtained by calculating the sample variance after correcting for the covariate effects as {\widehat{\mathrm{\xcf\u0192}}}_{p}^{2}=\frac{1}{n}{\left(XwC\mathrm{\xce\xb2}\right)}^{T}\left(XwC\mathrm{\xce\xb2}\right). Since Î²can be estimated according to Eq.(5), by substituting the Î²value, {\widehat{\mathrm{\xcf\u0192}}}_{p}^{2} can be computed by
where J=IC{\left({C}^{T}{\text{\xce\xa9}}^{1}C\right)}^{1}{C}^{T}{\text{\xce\xa9}}^{1}. To further simplify the notation, denote
Then
Combining the objective function and the constraint together, the proposed optimization problem is formulated as:
According to statistical learning theory [26], only maximizing the training heritability (by minimizing Eq.(9)), the resultant model may overfit the training data X. If overfitting occurs, the optimal w of Eq.(9) may correspond to a trait that has high heritability on the data that is used to train the linear model, but when the model is applied to a new sample, the trait has low heritability. In order to prevent overfitting and identify a trait with high heritability that can generalize, we incorporate a regularizer R(w) in our formulation (9). The optimization problem becomes:
where Î» is a hyperparameter and needs to be tuned, and \frac{1}{n}and \frac{1}{d}are included to prebalance the two items in the objective function. The value of Î» can either be chosen by users according to domain knowledge or determined using a crossvalidation process as done in our experiments. According to learning theory [26], minimizing \frac{1}{n}{w}^{T}\left({X}^{T}PX\right)w corresponds to empirical risk minimization, whereas minimizing the objective in Eq.(10) corresponds to structural risk minimization that improves the generalizability of the resultant model. There are many different ways to define R(w) [23]. The L_{2} vector norm defined by \left\rightw{}_{2}^{2}={\xe2\u02c6\u2018}_{i}{w}_{i}^{2} is a common choice. The L_{1} vector norm defined by \left\rightw{}_{1}={\xe2\u02c6\u2018}_{i}{w}_{i} can be a better choice when model sparsity is required to select variables for use in the model. In more complicated applications where variables may be grouped and feature selection among groups is expected, a structured regularizer, such as the group lasso {\left\rightw\left\right}_{2,1}=\underset{K=1}{\overset{L}{\xe2\u02c6\u2018}}\sqrt{{\xe2\u02c6\u2018}_{i\xe2\u02c6\u02c6{\mathcal{G}}_{k}}{w}_{i}^{2}}, can be used where {\mathcal{G}}_{k} contains the indices of variables belonging to a group k.
Optimization algorithm
In this paper, we use the L_{1} norm penalty \left\rightw{}_{1} to be R(w), and develop an efficient algorithm to solve the resultant optimization problem as follows:
The algorithm we will describe next, although is designed for Problem (11), can be modified to solve Problem (10) that may take another form of the regularizers.
Due to the use of the {\xe2\u02c6\xa5w\xe2\u02c6\xa5}_{1} norm, the objective function in Problem (11) is not continuously differentiable and a gradient decent type of approach cannot be applied directly. A well known strategy to overcome this obstacle is to decompose w into two parts: w = u âˆ’ v, where both u and v are vectors of the same size as that of w, and all the components in u and v are required to be nonnegative (i.e., u â‰¥ 0, and v â‰¥ 0). BecauseXw=XuXv, we denote Î³= [u^{T} , v^{T} ]^{T} , H = [X , âˆ’ X], and then we have Xw = H Î³. By the change of variables, Problem (11) can be equivalently rewritten as:
where f denotes the objective function, g's denote the constraints, and e = 2d + 1, indicating the number of constraints in that group. It is straightforward to show that Eq.(12) is equivalent to Eq.(11) in the sense that at optimality w = u âˆ’ v =Î³(1 : d) âˆ’ Î³(d + 1 : 2d). When Eq.(12) reaches optimality, at least one of the two components u_{ i } and v_{ i } at ny ith position of the two vectors will be 0. Otherwise, by setting {\mathrm{\AA \copyright}}_{i}={u}_{i}{v}_{i} and {\mathrm{\xe1\xb9\xbd}}_{i}=0 if u_{ i } â‰¥ v_{ i }, or {\mathrm{\AA \copyright}}_{i}=0 and {\mathrm{\xe1\xb9\xbd}}_{i}={v}_{i}{u}_{i}if u_{ i } < v_{ i }, we obtain a better solution with{\mathrm{\AA \copyright}}_{i} and {\mathrm{\xe1\xb9\xbd}}_{i} than (u, v). Therefore, at optimality, \underset{i=1}{\overset{2d}{\xe2\u02c6\u2018}}{\mathrm{\xce\xb3}}_{i}=\underset{i=1}{\overset{d}{\xe2\u02c6\u2018}}{u}_{i}+{v}_{i}=\underset{i=1}{\overset{d}{\xe2\u02c6\u2018}}\left{w}_{i}\right={\xe2\u02c6\xa5\mathbf{\text{w}}\xe2\u02c6\xa5}_{1} Then, Eq.(12) becomes exactly the same as Eq.(11).
Eq.(12) is not a convex problem because of the quadratic equality constraint. However, it can be efficiently solved using a sequential quadratic programming (SQP) algorithm [27] because both of the objective and constraints are either in a quadratic or a linear form. The gradient of the objective and constraint functions with respect to Î³ can be calculated as:
Let Î±be the Lagrange multipliers, the Lagrangian function of this problem can be written as:
and the Hessian of the Lagrangian function with respect to Î³is computed as:
We iteratively search for the optimal solution to Eq.(12). In the tth iteration, we have the iterates Î³_{ t } and Î±_{ t }, and we first solve the following quadratic program to find the moving direction for Î³and Î±,
The optimal solution \widehat{p} to the problem (13) will give the next moving direction for Î³, along which the objective of Problem (12) can be decreased. Let\widehat{\mathbf{\text{q}}} be the optimal Lagrange multipliers of the problem (13) corresponding to \widehat{\mathbf{\text{p}}}. The next moving direction of Î±is calculated as\widehat{\mathbf{\text{q}}}{\mathrm{\xce\pm}}_{t}. After the moving directions are computed, we then employ a line search method described in [27] to find the optimal searching step size s and update Î³and Î±as follows:
We summarize the proposed algorithm in Algorithm 1. It has been proved that a SQP based algorithm can converge to a local minimizer \widehat{\mathrm{\xce\xb3}} of the optimization problem (12) [28].
Algorithm 1 A sequential quadratic programming approach to solving Problem (11)
Data sets
We validated the proposed approach in both simulations and the analysis of a realworld data set that was aggregated from multiple genetic studies of cocaine dependence (CD).
Cocaine use and related behaviors data
We used the SemiStructured Assessment for Drug Dependence and Alcoholism (SSADDA) dataset aggregated from genetic studies of drug dependence to evaluate the proposed algorithm. The SSADDA subjects were recruited from multiple sites, including the University of Connecticut Health Center, Yale University School of Medicine, the University of Pennsylvania School of Medicine, McLean Hospital and the Medical University of South Carolina. All subjects participated using procedures approved by the institutional review board at each participating site. There were 6,621 subjects genotyped with a total of 1,140,420 SNPs genomewide. Among the subjects, 2,674 were stratified into the African American population using STRUCTURE software v2.3 [29], and only these subjects were used in our experiments to avoid spurious findings due to population structure. We removed 537 subjects who had other family members in the data so the GRM was computed for unrelated individuals.
Input: Z, C, X, Î»
Output: Î³
 1.

2.
Initialize Î³with u = 1, v = 0.

3.
Initialize the Lagrange multipliers Î±= 1.

4.
Evaluate f, \xe2\u02c6\u2021f, âˆ‡g_{ i } and {\xe2\u02c6\u2021}^{\mathsf{\text{2}}}\mathcal{L}with the current Î³and Î±.

5.
Solve Problem (13) to obtain \widehat{\mathbf{\text{p}}} and \widehat{q}.

6.
Perform a line search to find the searching step size s.

7.
Update Î³and Î±as in Eq.(14). Repeat 47 until Î³reaches a fixed point.
A series of data cleaning steps were performed to ensure the quality of genotypic markers. Markers that meet any of the following conditions were excluded: low call rate (<98% subjects received values for the marker), G/C and A/T markers (to avoid strand issues), deviation from HardyWeinberg equilibrium at p <10^{âˆ’8}, significant cohort calling discrepancy and being monomorphic. We also removed nonautosomal markers, so that only markers from the 22 autosomal chromosomes were used in the analysis. After these data cleaning steps, 690,864 SNPs remained. Genetic relationship was estimated for each pair of subjects by the genomewide complex trait analysis (GCTA) software [11] using all 690,864 SNPs. We then excluded 385 subjects whose relatedness to some subjects was greater than 0.025 (corresponding to the relatedness of second cousins). The remaining sample, 1,752 subjects, was used in our analysis.
All subjects were interviewed with a computerassisted assessment system called the SSADDA [4], which consists of survey questions designed for cocaine use and related behaviors. All subjects were reported to have used cocaine in their lifetime. The responses to those questions in the SSADDA led to the definition of thirteen important cocaine use related variables, based on which a diagnosis of CD was determined. There were seven binary variables as listed below, which represent the seven cocaine dependence criteria in DSMIV.

F1  tolerance to cocaine;

F2  withdrawal from cocaine;

F3  using cocaine in larger amounts or over longer period than intended;

F4  persistent desire or unsuccessful efforts to cut down or control cocaine use;

F5  great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine;

F6  gave up or reduced important social, occupational, or recreational activities because of cocaine use;

F7  cocaine use despite knowledge of persistent or recurrent physical or psychological problems likely to have been caused or exacerbated by cocaine. In our experiments, positive responses to the seven variables were coded by 1 and negative responses were coded by 0. We also included six continuous variables in the analysis as listed below:

F8  number of cocaine symptom endorsed;

F9  age when first used cocaine;

F10  age when last used cocaine;

F11  age when first diagnosed with DSMIV cocaine dependence;

F12  age when last diagnosed with DSMIV cocaine dependence;

F13  transition time in years between the first cocaine use and the first cocaine dependence diagnosis.
All these variables were normalized to the range of [0, 1] in the analysis.
Synthetic data
Following the same design principle used in the simulations for testing chip heritability in [8], we used the reallife genotypic data in the CD study but synthesized phenotypic data. We simulated quantitative traits based on the mixedeffect linear model shown in Eq.(1). We first synthesized a dataset that contained 5 phenotypic features, all of which were created with moderate to high heritability, and were used to form a quantitative trait of very high heritability reaching 0.8. We then added irrelevant features, which varied mainly due to covariates, to create five other simulated datasets. These datasets consisted of 10, 20, 30, 40 and 50 features where only the first 5 of them were used in the model of the final trait. These datasets were used to determine whether the proposed algorithm could identify the right features for use in the model.
To synthesize features with genetic effects, we randomly picked 2,000 of the 690,864 SNPs in the cocaine use data set and used them as the causal variants of these features. The random effect coefficient u_{ j } associated with each of the 2,000 markers was generated independently by sampling from the standard normal distribution N (0, 1). The residual component Îµ_{ i } for each individual was drawn from the normal distribution of mean 0 and variance var(z_{ i }u)(1/h^{2} âˆ’ 1) where z_{ i } is the ith row of Z, var(Â·) is the sample variance of a random variable and h^{2} is the heritability of the feature. To synthesize features with no genetic effects, we ignored the term Zu in Eq.(1) and created Îµ by randomly sampling from the standard normal distribution. To further synthesize features with fixed covariate effects, we used sex and age of the individuals in the CD study as the covariates, and arbitrarily set their effects, i.e., the Î² coefficients, to 0.2 and 0.5.
We evaluated the proposed method in two different experimental settings:
Setting 1: This setting assumed that there were no covariate effects in the quantitative trait. The five relevant features were simulated as follows. We used the procedure described in the above paragraph to create four features with h^{2} equal to 0.2: x_{1}, Â· Â· Â· , x_{4}. Then we simulated the final quantitative trait y_{1} with h^{2} = 0.8 using the same procedure. A fiveentry weight vector was created with arbitrary values, such as w = [0.22, 0.67, 0.60, 0.30, 0.22], used in our experiments. Then, the fifth feature was directly computed as {x}_{5}=\left({y}_{1}\underset{i=1}{\overset{4}{\xe2\u02c6\u2018}}{w}_{i}{x}_{i}\right)/{w}_{5}. By simulating the data in this way, we knew that there was at least one linear combination of the five features in the data that would result in a composite trait (i.e., y_{1}) with h^{2} of 0.8. Hence, if our approach worked, it should at least find this linear combination if there was no any other one that gave even higher h^{2}. Note that the heritability of the fifth feature had to depend on the empirical estimation, but given how it was created, there were genetic effects in this feature.
We then created 45 other features that had no genetic effects, and added a certain number of these features to the original 5 features to create 5 other datasets. Hence, there were in total 6 datasets for 1,752 subjects with 5, 10, 20, 30, 40 and 50 features. We used this set of data (i.e., the discovery set) in training to retrieve the combination models. Then we repeated the above procedure to create another independent set of data (i.e., the validation set) to validate the resultant models.
Setting 2: This setting assumed that the two covariates, sex and age, had fixed effects to the features and the final trait. We generated 5 features by adding fixed effects to the 5 useful features created in Setting 1. Because fixed effects do not change h^{2} of a trait, we computed a composite trait y_{2} using the same prespecified weight vector w that was used in Setting 1. We then created 45 other features with only covariate effects using the procedure described early on. Five other datasets were generated consisting of 10, 20, 30, 40 and 50 features. Note that the optimal weight vector for these datasets should have zero entries for all features except the first 5 features that were synthesized. Similarly, a discovery suite of the six datasets and another suite of them were synthesized using the same procedure for training and validation, respectively.
We estimated the chip h^{2} of the features created in the synthetic datasets using GCTA software. The four features synthesized with a prespecified h^{2} = 0.2 had empirical chip h^{2} values 0.2 Â± 0.01 in these datasets. The chip h^{2} estimate of the fifth feature was 0.57 in the discovery set and 0.6 in the validation set. Because fixed effects do not affect trait heritability, the five relevant features and the final quantitative traits had the same empirical chip h^{2} in Settings 1 and 2. The features simulated with no genetic effects had h^{2} estimates that ranged from 0 to 0.05, and most of these features had estimates less than 10^{âˆ’5}.
The proposed analyses
We first validated the proposed approach in a variety of experiments with the synthetic data. Then we applied our approach to the reallife cocaine use data to identify important components or subtypes of the disease defined by linear combinations of clinical features. Such a combination can be used to define a disease subtype because it produces a quantitative trait for each individual, which amounts to the membership likelihood of the individual in a subtype. Because the actual causal variants were known for synthetic data, we calculated the GRM of the individuals directly using the causal variants. In the case study for CD, because the real causal variants were unknown, we followed the commonlyused procedure in the literature on chip heritability estimation [30] and computed the GRM using all 690,864 SNPs that remained in the data. All of the reported chip heritability was estimated using GCTA software.
Tuning of the hyperparameter: For both the simulation and the CD case study, we performed 10 times threefold cross validation (CV) to help determine a proper value of Î». At each fold of the CV, a linear model was derived by running the proposed method on 2/3 of the data in the dataset, and then tested on the remaining 1/3 of the data. The cross validated h^{2} of the derived trait was estimated using only subjects in the remaining 1/3 of the data which was not used to train the model. We ran the same CV process for each prespecified choice of Î» (the choices we used are reported in the results section) and chose the Î» value that gave a trait of the highest cross validated h^{2} for each experimental setting.
Evaluation metrics: We reported and investigated the CV performance (including the mean values and standard deviations of the validation h^{2} obtained in the CV process described above) for each Î» choice in each experiment. Once the best value of Î» was chosen through the cross validation for a dataset, we applied the proposed approach with the best Î» to the entire data in the dataset to derive a quantitative trait. The chip heritability of this trait was estimated using the separatelysynthesized validation datasets in simulations and by another cross validation process in the case study. In other words, in simulations, we estimated the valiation chip h^{2} using the trait values computed by the linear model on the newlysynthesized validation samples. In the CD case study, we computed the trait h^{2} using SNPs of 2/3 of the subjects randomly sampled from the dataset and repeated the random sampling 10 times to report the averaged h^{2} value. We named this process the evaluation CV. Moreover, besides the heritability as a major evaluation metric, we also measured the effectiveness of our approach by comparing the derived trait models and the linear model implanted in the simulated data. We calculated the squared difference between the learned weights \widehat{\mathbf{\text{w}}} and the true weights w, i.e., \mathsf{\text{SE}}\left(\mathbf{\text{w}}\right)=\phantom{\rule{2.36043pt}{0ex}}\left\right\mathbf{\text{w}}\widehat{w}{}_{2}^{2} and the mean of squared residuals SE\left(\mathbf{\text{y}}\right)=\left(1/n\right)\underset{i=1}{\overset{n}{\xe2\u02c6\u2018}}{\left({y}_{i}{x}_{i}\hat{\mathbf{\text{w}}}\right)}^{2}, and reported the values in plots. Additional evaluation steps were conducted in the case study to clinically interpret the resultant quantitative trait (see the later paragraph).
Comparison: The validated chip h^{2} of our derived traits was compared with that of all quantitative features in the data in both simulations and the CD case study. In each of the experiments, our derived trait was also compared with the commonlyused disease phenotype, often referred to as a symptom count, which was the quantitative trait created by equal weighted aggregation of all features in the data. Given that no prior method existed to identify heritable disease components using the genomewide SNPs, on the reallife cocaine use data, we compared our approach with a recently published method [23] that aimed to derive linearlycombined traits using pedigrees of related individuals. This comparison considered whether a pedigreebased heritable component analysis method can identify a disease component with a chip heritability comparable to that found by our approach. As multimember families were included in the original cocaine use dataset, i.e., a superset of the sample used by our approach, it was feasible to apply the method in [23] to derive a trait. Then we computed the trait values on the unrelated individuals used by our approach to compare the chip h^{2} of the two approaches using the evaluation CV. Note that the prior pedigreebased approach was actually given a favor because it used the superset of 2,674 subjects (unrelated individuals were treated as onemember pedigrees) to derive the trait in comparison with our approach that used only the 1,752 unrelated individuals.
Clinical interpretation: It is very important to understand the clinical implications of the quantitative trait (or an empirical subtype) derived by our approach from the aggregated CD study data. From prior work [17, 31], we identified three key steps to ensure the clinical validity of an empirical subtype. We first examined the specific features selected by our approach for use in the model. Second, we studied the distribution (or histogram) of the quantitative scores among the 1,752 subjects. From the distribution plot, we examined whether there were obvious subgroups of the scores. Third, the subgroups of subjects were characterized and compared on 11 of the most important clinical variables reflecting cocaine use and related behaviors including both the features selected and those not selected for use in the linear model. The individuals receiving very high or very low values of the quantitative trait may show the most representative features of the subtype.
Results
Simulations
We prespecified 21 different Î» values ranging from 0 to 0.04 with step size 0.002 for use in the crossvalidation tuning process. The validation or test h^{2} for each Î» choice was plotted for each of the six datasets in Figure 1 (for Setting 1 where data were generated without covariate effects) and Figure 2 (for Setting 2 where data were generated with covariate effects). The mean, median, and the standard deviations of the test h^{2} values in the cross validation were plotted for each tested Î». These two figures show that our approach could identify components (quantitative traits) with test h^{2} estimate of ~ 0.8, which was the heritability of the implanted heritable component (the simulated true model), for all datasets even with many irrelevant features in some of the datasets for both settings. This result demonstrates that our approach identified highly heritable disease components and could successfully correct for fixed covariate effects.
In addition, both Figures 1 and 2 show that there was overfitting of the learned models to the training data when Î» was too small, especially when the number of irrelevant features grew to 25, 35 and 45. The larger the number of irrelevant features in the data, the more severe was the overfitting seen when Î» was small. On the other hand, when Î» was too large, underfitting could occur, so the test h^{2} showed a peak in most of the plotted curves. The best choice of Î» for the six datasets (sorted from the smallest number of features in the data to the largest number of features) were 0, 0, 0, 0.002, 0.004 and 0.004 for Setting 1, and 0, 0, 0, 0,002, 0.004 and 0.006 for Setting 2.
Then Î» was set to the optimal value for each of the six datasets, and we reran the algorithm to generate the final quantitative trait from each dataset. We then compared the chip h^{2} between these derived traits and the commonly used disease traits, such as individual features and the equally weighted aggregation of individual features. The results are shown in Figure 3 where all h^{2} values were estimated using only the validation datasets. Results from both settings are shown. For all the datasets and for both settings, our approach could recover the quantitative traits of h^{2} close to 0.8. When the number of irrelevant features in the data increased, the h^{2} values of the derived traits decreased as expected. Typically, when more irrelevant features were included in the data, the learning problem became more challenging. Because covariates do not affect h^{2} estimate when their effect is properly corrected during the estimation, we did not differentiate the two settings when discussing and comparing the h^{2} values of individual features and the aggregated traits. Recall that in our simulation, the highest chip h^{2} of individual features was 0.6 in validation. Figure 3 also included this most heritable feature for comparison. Among the traits derived by equally weighted aggregation, the one developed from the 5feature dataset reached the highest h^{2} (= 0.66). As expected, chip h^{2} of these traits decreased along with the increasing number of irrelevant features. The trait developed from the 50feature dataset had the lowest h^{2} (= 0.28). These results demonstrate that our approach could identify quantitative traits that are more heritable (i.e., with high chip heritability) than those commonly used. The squared difference (error) between learned weights and the optimal (implanted) weights: SE(w) together with the squared error between the derived traits and simulated traits: SE(y) are presented in Figure 4. The results from both settings are provided. We observed that clearly when the number of irrelevant features increased, the noisier data made the learning problem more difficult, and SE(y) and SE(w) increased in both of the settings.
A case study of cocaine dependence
In this study, we used the same prespecified Î» values in the simulations. In all expriments, we used age, sex and the first three principal components of the GRM as covariates. The test h^{2}s of all traits derived for each Î» choice in the CV process are plotted in Figure 5. It shows that there was overfitting when Î» was too small as well. The trait derived with Î» = 0.004 achieved the highest crossvalidated h^{2} on average. We thus derived a model by running our approach with Î» = 0.004 and all the 1,752 subjects in the data. We examined this model and the resultant quantitative trait as discussed in the proposed analyses section.
The weights that each variable received in the model are shown in Figure 6. Of the 13 clinical variables, five (F8  F12) received a zero coefficient, and were completely ruled out from the model. Variable F13 had an coefficient close to 0 (< 10^{âˆ’5}), thus its impact on the resultant trait was minimal. Variables with significant coefficients in the model are the seven cocaine criteria: F3  using cocaine in larger amounts or over longer period than intended, F2  withdrawal from cocaine, and F5  great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine, had the highest impact on the trait. Both F3 and F5 had negative weights, which indicated that positive response to these two variables would lower the score or value of this trait. In contrast, F2 had a positive coefficient, which indicated that a positive response to this variable would increase the score. Variables F4  persistent desire or unsuccessful efforts to cut down or control cocaine use, and F1  tolerance to cocaine, had limited impact on the trait.
The cross validated h^{2} estimate of the trait derived by the proposed approach is 0.87 (with a standard error of 0.13). For comparison, we ran the approach proposed in [23], which identifies heritable components with pedigrees as genetic inputs and its formulation also had a hyperparameter Î». For fair comparison, we first ran 10 times cross validation to choose a proper value for Î». With this Î», we developed a trait using the entire African American sample set. We then estimated the trait h^{2} using the exact same setting (i.e., GRM and covariates) as for the traits derived by the proposed approach. We estimated the h^{2} for all six continuous variables in the data using the same setting. All of the h^{2} values are plotted in Figure 7 together with the trait derived by the proposed approach. The figure clearly shows that the trait derived by the proposed approach had the highest h^{2} among all compared traits, and was significantly higher than that of the trait derived using the prior approach [23]. Note that one of the continuous variables in the cocaine use data was the counting of CD criteria. It was defined as the number of positive responses to the seven CD criteria, i.e., a quantitative trait resulting from a linear combination of F1F7 with equal weights. This trait was reported to be a better trait for genomewide association analysis than the binary CD diagnosis [1]. The CD diagnosis had a value of h^{2} close to zero when estimated using our data. These results demonstrate the effectiveness of our approach in identifying disease components with high chip heritability from complex multivariate phenotypes. It is worth noting that the trait with the second highest h^{2} estimate was the one derived using the prior approach [23]. This implies that (1) both of these two methods (i.e., the proposed method and the one previously reported [23]) can identify heritable components with high chip heritability; but (2) the proposed method outperforms as it directly maximizes heritability using genetic variants.
Figure 8 shows the distribution of the trait values (i.e., the membership scores) of the subjects. It shows that based on the scores, the samples can be partitioned into four subgroups. There were 250 subjects (14.27% of total) in Group 1, which had a mean score of 2.22. Group 2 consisted of 323 subjects and comprised 18.44% of the entire sample set. Its mean score was 0.8. Group 3 was the largest one and consisted of 821 subjects (46.86% of the sample). The mean score of this group was 0.2. Group 4 was the smallest, comprised of 237 individuals (13.53% of the sample), with a mean score of 1.22.
To understand the clinical implications of the derived trait, we characterized the four groups using 11 important clinical variables, including the 7 CD criteria (F1F7), the total number of CD criteria endorsed (F8), age of first cocaine use (F9), age when first diagnosed with DSMIV CD (F11), and the transition time in years from first cocaine use to first DSMIV CD diagnosis (F13). The results are summarized in Table 1. Only 5.6% of the subjects in Group 1 had experienced cocaine withdrawal symptoms (F2), despite the face that this group contained heavy cocaine users (as shown by other variables). Most of the subjects (99.6%) in this group reported using cocaine in larger amounts or over a longer period than indended (F3). Moreover, this group had the highest percentage of subjects (96%) who spent a great amount of time in activities related to cocaine (F5). Group 2 had the lowest percentage of subjects (23.6%) with tolerance to cocaine (F1), but the highest percentage of subjects (92.92%) who used cocaine despite knowledge that problems were likely caused by cocaine (F7). Subjects in Group 3 endorsed all of the seven CD criteria with similar percentages (87.04% was the lowest and 90.60% was the highest among the seven criteria). Group 4 had the lowest percentage of subjects who endorsed F3, F5 and F7. Group 4 also had the highest percentage of subjects with persistent desire or unsuccessful attempts to cut down their cocaine use (F4). Group 1 and Group 2 had a similar transition time from first cocaine use to first CD diagnosis (F13): 8.01 years for Group 1 and 8.07 years for Group 2. These were significantly shorter than the transition time in Groups 3 and 4, which were 12.15 years and 15.19 years, respectively.
Conclusion
We developed an approach to identify composite traits from multivariate phenotypes that are highly heritable, as estimated using genomewide SNPs. The trait we derived is in the form of a linear combination of variables related to the phenotype, that is y = Xw. A quadratic optimization problem was formulated, in which optimal w was sought to optimize the log likelihood for estimating variance components in REML. In this formulation, variance components are set to their ideal values with the additive genetic variance component {\mathrm{\xcf\u0192}}_{g}^{2} equal to 1 and other components equal to 0. To avoid overfitting, we incorporated a regularization term in our formulation. An efficient algorithm based on the sequential quadratic programming framework was developed to solve the proposed optimization problem. We evaluated the proposed approach on both synthetic and real world data. The empirical results demonstrate the effectiveness of our approach as a means to identify traits with much higher chip h^{2} than commonlyused disease phenotypes.
In this paper, the pairwise genetic relationship among subjects was estimated from genomewide SNPs. However, it can also be estimated from SNPs restricted to a specific region, such as on a particular chromosome or in genes related to a pathway, to explore the genetic architecture of a trait. When SNPs within a specific region are used, the trait resulting from the proposed approach will achieve the maximized genetic variance component corresponding to this region. In an application, such as substance dependence, there are known pathways involved, so it may be of utility to determine whether there is a composite trait, the variance of which can be largely explained by the variants within the pathways. This will be a future application of our approach.
References
Gelernter J, Sherva R, Koesterer R, Almasy L, Zhao H, Kranzler HR, Farrer L: Genomewide association study of cocaine dependence and related traits: Fam53b identified as a risk gene. Mol Psychiatry. 2013
Gelernter J, Kranzler HR, Sherva R, Koesterer R, Almasy L, Zhao H, Farrer LA: Genomewide association study of opioid dependence: multiple associations mapped to calcium and potassium pathways. Biol Psychiatry. 2014, 76 (1): 6674. 10.1016/j.biopsych.2013.08.034.
Treutlein J, Rietschel M: Genomewide association studies of alcohol dependence and substance use disorders. Curr Psychiatry Rep. 2011, 13 (2): 14755. 10.1007/s1192001101764.
PierucciLagha A, Gelernter J, Chan G, Arias A, Cubells JF, Farrer L, Kranzler HR: Reliability of dsmiv diagnostic criteria using the semistructured assessment for drug dependence and alcoholism (ssadda). Drug Alcohol Depend. 2007, 91 (1): 8590. 10.1016/j.drugalcdep.2007.04.014.
Balding DJ, Bishop MJ, Cannings C: Handbook of Statistical Genetics. 2007, John Wiley & Sons, Chichester, England; Hoboken, NJ, 3
de los Campos G, Gianola D, Allison DB: Predicting genetic predisposition in humans: the promise of wholegenome markers. Nature Reviews Genetics. 2010, 11 (12): 880886. 10.1038/nrg2898.
Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001, 157 (4): 18191829.
Yang J, Montgomery GW, Goddard ME, Visscher PM, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG: Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010, 42 (7): 56510.1038/ng.608.
Makowsky R, Pajewski NM, Klimentidis YC, Vazquez AI, Duarte CW, Allison DB, de los Campos G: Beyond missing heritability: prediction of complex traits. PLoS Genetics. 2011, 7 (4): 100205110.1371/journal.pgen.1002051.
Hill WG, Wray NR: Heritability in the genomics eraconcepts and misconceptions. Nature Reviews Genetics. 2008, 9 (4): 255266. 10.1038/nrg2322.
Yang J, Lee SH, Goddard ME, Visscher PM: Gcta: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011, 88 (1): 7682. 10.1016/j.ajhg.2010.11.011.
Speed D, Hemani G, Johnson MR, Balding DJ: Improved heritability estimation from genomewide SNPs. American Journal of Human Genetics. 2012, 91 (6): 10111021. 10.1016/j.ajhg.2012.10.010.
Kranzler HR, Wilcox M, Weiss RD, Brady K, Hesselbrock V, Rounsaville B, Farrer L, Gelernter J: The validity of cocaine dependence subtypes. Addict Behav. 2008, 33 (1): 4153. 10.1016/j.addbeh.2007.05.011.
Bi J, Gelernter J, Sun J, Kranzler HR: Comparing the utility of homogeneous subtypes of cocaine use and related behaviors with dsmiv cocaine dependence as traits for genetic association analysis. Am J Med Genet B Neuropsychiatr Genet. 2014, 165B (2): 14856.
Sun J, Bi J, Chan G, Oslin D, Farrer L, Gelernter J, Kranzler HR: Improved methods to identify stable, highly heritable subtypes of opioid use and related behaviors. Addictive Behaviors. 2012
Gelernter J, Panhuysen C, Wilcox M, Hesselbrock V, Rounsaville B, Poling J, Weiss R, Sonne S, Zhao H, Farrer L, Kranzler HR: Genomewide linkage scan for opioid dependence and related traits. American Journal of Human Genetics. 2006, 78 (5): 759769. 10.1086/503631.
Babor TF, Caetano R: Subtypes of substance dependence and abuse: implications for diagnostic classification and empirical research. Addiction (Abingdon, England). 2006, 101: 10410.
Hu VW, Addington A, Hyman A: Novel autism subtypedependent genetic variants are revealed by quantitative trait and subphenotype association analyses of published gwas data. PloS ONE. 2011, 6 (4): 1906710.1371/journal.pone.0019067.
Ott J, Rabinowitz D: A principalcomponents approach based on heritability for combining phenotype information. Hum Hered. 1999, 49 (2): 10611. 10.1159/000022854.
Wang Y, Fang Y, Jin M: A ridge penalized principalcomponents approach based on heritability for highdimensional data. Hum Hered. 2007, 64 (3): 18291. 10.1159/000102991.
Klei L, Luca D, Devlin B, Roeder K: Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol. 2008, 32 (1): 919. 10.1002/gepi.20257.
Oualkacha K, Labbe A, Ciampi A, Roy MA, Maziade M: Principal components of heritability for high dimension quantitative traits and general pedigrees. Statistical Applications in Genetics and Molecular Biology. 2012, 11 (2):
Sun J, Bi J, Kranzler HR: Quadratic optimization to identify highly heritable quantitative traits from complex phenotypic features. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD '13. 2013, ACM, New York, NY, USA, 811819.
Patterson HD, Thompson R: Recovery of interblock information when block sizes are unequal. Biometrika. 1971, 58 (3): 545554. 10.1093/biomet/58.3.545.
Verbyla AP: A conditional derivation of residual maximum likelihood. Australian Journal of Statistics. 1990, 32 (2): 227230. 10.1111/j.1467842X.1990.tb01015.x. doi:10.1111/j.1467842X.1990.tb01015.x
Vapnik VN: An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999, 10 (5): 988999. 10.1109/72.788640.
Nocedal J, Wright SJ: Numerical Optimization. 2006, Springer, New York, 2
Nocedal J, Wright SJ: Numerical Optimization. 2006, Springer, New York
Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 94559.
Yang J, Benyamin B, Visscher PM: Common snps explain a large proportion of the heritability for human height. Nat Genet. 2010, 42 (7): 5659. 10.1038/ng.608.
Hesselbrock VM, Hesselbrock MN: Are there empirically supported and clinically useful subtypes of alcohol dependence?. Addiction. 2006, 101 (Suppl 1): 97103.
Acknowledgements
This work was supported by NIH grant R01DA037349, NSF grant DBI1356655, and IIS1447711. Jinbo Bi was also supported by NSF grants IIS1320586 and IIS1407205. Henry R. Kranzler was also supported by NIH grants R01DA030976, R01AA021164, R01AA023192, and R01CA184315.
We thank Joel Gelernter, M.D. from Yale University who was instrumental in recruiting, characterizing, and genotyping the subjects in the SSADDA dataset used here. Kathleen Brady, M.D., Ph.D. of the Medical University of South Carolina, Roger Weiss, M.D., of McLean Hospital and Harvard Medical School, and David Oslin, M.D., of the University of Pennsylvania Perelman School of Medicine oversaw study recruitment at their respective sites. That work was supported by NIH grants R01AA011330, R01AA017535, R01DA012690, R01DA018432, and R01DA012849.
Publication costs for this article were funded by NIH grant R01DA037349.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 3, 2015: Selected articles from the IEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors Jiangwen Sun and Jinbo Bi declare that they have no competing interests. Henry R Kranzler has served as a consultant or advisory board member for the following companies: Alkermes, Lilly, Lundbeck, Otsuka, and Pfizer. He is a member of the Alcohol Clinical Trials Group of the American Society of Clinical Psychopharmacology, which is supported by Abbvie, Alkermes, Ethypharm, Lilly, Lundbeck, and Pfizer.
Authors' contributions
JB, HRK and JS designed the overall study of finding heritable disease subtypes. JS and JB designed the new algorithm discussed in this paper. JS implemented the algorithm and performed the data analyses. HRK provided the aggregated drug dependence data and helped in interpreting the results. JS and JB wrote the first draft of this manuscript.
Rights and permissions
Open Access Â This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sun, J., Kranzler, H.R. & Bi, J. Refining multivariate disease phenotypes for high chip heritability. BMC Med Genomics 8 (Suppl 3), S3 (2015). https://doi.org/10.1186/175587948S3S3
Published:
DOI: https://doi.org/10.1186/175587948S3S3
Keywords
 phenotypegenotype association analysis
 chip heritability
 quadratic optimization
 heritable component analysis