Refining multivariate disease phenotypes for high chip heritability
 Jiangwen Sun^{1},
 Henry R Kranzler^{2} and
 Jinbo Bi^{1}Email author
Published: 23 September 2015
Abstract
Background
Statistical genetics shows that the success of both genetic association studies and genomic prediction methods is positively associated with the heritability of the trait used in the analysis. Identifying highly heritable components of a complex disease can thus enhance genetic studies of the disease. Existing heritable component analysis methods use data from related individuals to compute linearlycombined traits to maximize heritability. Recent advances in acquiring genomewide markers have enhanced heritability estimation using genotypic data from apparently unrelated individuals, which is referred to as the chip heritability. Novel statistical models are thus needed to identify disease components (subtypes) with high chip heritability.
Methods
We propose an optimization approach to identify highly heritable components of a complex disease as a function of multiple clinical variables. The heritability of the components is estimated directly from unrelated individuals using their genomewide single nucleotide polymorphisms. The proposed approach can also model the fixed effects due to covariates, such as age and race, so that the derived traits have high chip heritability after correcting for fixed effects. A new sequential quadratic programming algorithm is developed to efficiently solve the proposed optimization problem.
Results
The proposed algorithm was validated both in simulations and the analysis of a realworld dataset that was aggregated from genetic studies of cocaine, opoid, and alcohol dependence. Simulation studies demonstrated that the proposed approach could identify the hypothesized component from multiple synthesized features. A case study on cocaine dependence (CD) identified a quantitative trait that achieved chip heritability of 0.86 estimated using a crossvalidation process. This quantitative trait corresponded to the likelihood of an individual's membership in a CD subtype. Clinical analysis showed that the subtype enclosed individuals who reported heavy use of cocaine but few withdrawal symptoms.
Conclusions
Extensive experiments on both synthetic and realworld data demonstrate the effectiveness of the proposed approach as a means to find meaningful disease components with high chip heritability.
Keywords
Introduction
Identifying genetic variation that underlies complex diseases has important implications in medicine. To date, genomewide association studies (GWAS) have had limited success in dissecting the genetic etiology of complex diseases. For instance, very few associations identified for substance use disorders at a genomewide significant level have been replicated [1–3]. Complex disorders are often characterized by multiple disease indicators. For example, to diagnose whether a patient has a lifetime drug dependence disorder, clinicians interview the patient to understand his or her drug use behaviors, the negative consequences of the drug use, the treatment history and other cooccuring medical conditions. All of these clinical variables are used to arrive at a diagnosis of dependence on a certain drug [4]. There is substantial variation in these variables in the disease population, and these variables also present different levels of heritability, i.e., some are more genetically influenced than others. This phenotypic heterogeneity diminishes evidence of genetic association. Statistical genetics also shows that the success of most gene discovery studies is positively associated with the heritability of the trait used in the association analysis [5]. Hence, identifying more homogeneous and highly heritable components of a complex disease could enhance the association analysis.
The ability to translate genotype information into a quantitative prediction of disease phenotypes is important for precision medicine [6]. Genomic prediction methods that predict a phenotype based on genomewide single nucleotide polymorphisms (SNPs) may provide a suitable analytic tool [7, 8]. These methods expand the traditional singlemarkerregressionbased GWAS model for detecting few variants of large effect to multimarker predictive models with many variants of small effect. The predictive ability of genomic prediction methods relies on several factors, especially trait heritability [9]. If we identify higly heritable components of a complex disease, it could also improve the utility of genomic prediction methods to predict subtypes (defined by the components) of the disease.
Because the success of both association analysis and genomic prediction is dependent on the trait heritability, heritability can be a valid target for refining multivariate disease phenotypes. The narrowsense heritability h^{2} is defined by the percentage of phenotypic variance that is due to additive genetic effects [10]. The broadsense heritability H^{2} is defined as the overall genetic contribution to the phenotypic variation. The heritability of a quantitative trait is commonly estimated from related individuals in pedigrees. Recent advances in acquiring dense genomewide genetic markers have enhanced heritability estimation from apparently unrelated individuals using their genomewide SNPs. The SNPbased heritability, often referred to as the chip heritability, is defined as the portion of the phenotypic variation that can be explained by the genotyped genetic markers [11]. It has been argued that estimating h^{2} from unrelated individuals has an advantage over traditional pedigreebased methods because the estimated chip h^{2} corresponds only to the causalvariant heritability that is tagged by the genotyped SNPs [8, 12].
Phenotype refinement is an important but underdeveloped genetics research area. Unsupervised cluster analysis or latent class analysis has been commonly used to partition a study population into subgroups based on clinical variables [13–18]. This approach can create subgroups of individuals that differ in clinical symptoms and features, but may have limited utility in genetic analysis. Because genetic data are not used during the creation of the subgroups, the resultant subtypes (subgroups) are not guaranteed to have high heritability, and hence may not be informative for genetic association.
More relevant to this present work, a number of prior methods identify the principal components of clinical data that are heritable, and characterize the components by linear combinations of clinical variables [19–23]. Thus, these methods are often called heritable component analysis. All existing methods decompose the variance of clinical data into two components: the variance due to additive genetic effects estimated from pedigrees; and the variance due to other effects (residuals). Then, they solve a generalized eigendecomposition problem to identify the linear combination of the clinical variables that maximizes the ratio of additivegenetic variance versus the residual variance, thus leading to high heritability of the resultant linearly combined trait. Nearly all of these methods use pedigreebased heritability estimation (an exception is [23]), and all assume a genetic model that is based on a single causal variant, an assumption that is commonly violated for complex diseases.
Although the latest heritable component analysis method [23] is effective and computationally efficient, a fundamental question is how much heritability of the derived trait can be explained by the genotyped SNPs. Because GWAS and genomic predictions mainly utilize the genotyped SNPs, the utility of the derived trait may be limited by a low chip heritability. Thus, novel statistical models are needed to directly target high chip heritability. In this paper, we propose an approach to identify the components of a multivariate disease phenotype that maximizes the chip h^{2}. To estimate the chip heritability of a given trait, the latest methods use the restricted maximum likelihood (REML) method, which assumes that the trait follows a mixed effect model with random genetic effects, and fixed effects due to covariates, such as age, sex and race [8, 12]. To identify a trait of high chip h^{2}, we need to solve the inverse problem of (chip) heritability estimation. In other words, we now search for a trait (e.g., a linearlycombined trait) so that its chip heritability is high when estimated using the REML method. Directly solving the inverse problem leads to a quadratic optimization problem that can be optimized efficiently via a sequential quadratic programming algorithm. We validated the proposed approach in simulations as well as in the analysis of a realworld dataset that was aggregated from genetic studies of cocaine, opioid, and alcohol dependence. Our experimental results demonstrated the effectiveness and generalizability of the proposed approach.
Methods
The proposed statistical model
where${{\sigma}_{g}}^{\mathsf{\text{2}}}$ and${{\sigma}_{e}}^{\mathsf{\text{2}}}$ can be estimated by the REML method [24, 11]. The chip heritability estimated on the m causal variants is computed as${{h}_{p}}^{2}={{\sigma}_{g}}^{2}/{{\sigma}_{e}}^{2}$, where ${{\sigma}_{p}}^{2}={{\sigma}_{g}}^{2}+{{\sigma}_{e}}^{2}$ is the total phenotypic variance. Because the causal variants of y are usually unknown for a trait, recent research has proposed to estimate a GRM using genomewide SNPs [8, 12].
Given data on y, C and Z, ${\sigma}_{g}^{2}$ and ${\sigma}_{e}^{2}$ are obtained by maximizing the log likelihood of observing the trait values $\ell \left({\sigma}_{g}^{2},{\sigma}_{e}^{2};y\right)$ which corresponds to maximizing ${\ell}_{2}\left({\sigma}_{g}^{2},{\sigma}_{e}^{2};{\stackrel{\u0303}{\mathbf{\text{y}}}}_{2}\right)$[11]. The chip heritability of a trait y is computed using the resultant optimal ${\sigma}_{g}^{2}$ and ${\sigma}_{e}^{2}$.
In our study, however, we solve the inverse problem of the above estimation model. A definitive quantitative trait y is not known beforehand but needs to be derived from a set of known clinical variables. Let X_{ n×d } be the data matrix of d clinical variables x for the same n subjects as in Z. A trait y is defined by a linear function of y = w^{ ┬ }x where w is the vector of combination coefficients. Correspondingly, the trait values y = Xw. Unlike the heritability estimation process that finds the best values of ${\sigma}_{g}^{2}$ and ${\sigma}_{e}^{2}$ to maximize the likelihood of observing the values of y, the inverse problem searches for the best w so to form a trait y that maximizes the likelihood, (or equivalently the log likelihood $\ell \left({\sigma}_{g}^{2},{\sigma}_{e}^{2};y,C,Z\right)$), of observing a large heritability, i.e., a large ${\sigma}_{g}^{2}$ but small ${\sigma}_{e}^{2}$. For simplicity and easy interpretation of the resultant model, here we only consider linear models, but the proposed method can be easily extended to construct nonlinear models through kernel mapping [26].
where λ is a hyperparameter and needs to be tuned, and $\frac{1}{n}$and $\frac{1}{d}$are included to prebalance the two items in the objective function. The value of λ can either be chosen by users according to domain knowledge or determined using a crossvalidation process as done in our experiments. According to learning theory [26], minimizing $\frac{1}{n}{w}^{T}\left({X}^{T}PX\right)w$ corresponds to empirical risk minimization, whereas minimizing the objective in Eq.(10) corresponds to structural risk minimization that improves the generalizability of the resultant model. There are many different ways to define R(w) [23]. The L_{2} vector norm defined by $\left\rightw{}_{2}^{2}={\sum}_{i}{w}_{i}^{2}$ is a common choice. The L_{1} vector norm defined by $\left\rightw{}_{1}={\sum}_{i}{w}_{i}$ can be a better choice when model sparsity is required to select variables for use in the model. In more complicated applications where variables may be grouped and feature selection among groups is expected, a structured regularizer, such as the group lasso ${\left\rightw\left\right}_{2,1}=\sum _{K=1}^{L}\sqrt{{\sum}_{i\in {\mathcal{G}}_{k}}{w}_{i}^{2}}$, can be used where ${\mathcal{G}}_{k}$ contains the indices of variables belonging to a group k.
Optimization algorithm
The algorithm we will describe next, although is designed for Problem (11), can be modified to solve Problem (10) that may take another form of the regularizers.
where f denotes the objective function, g's denote the constraints, and e = 2d + 1, indicating the number of constraints in that group. It is straightforward to show that Eq.(12) is equivalent to Eq.(11) in the sense that at optimality w = u − v =γ(1 : d) − γ(d + 1 : 2d). When Eq.(12) reaches optimality, at least one of the two components u_{ i } and v_{ i } at ny ith position of the two vectors will be 0. Otherwise, by setting ${\u0169}_{i}={u}_{i}{v}_{i}$ and ${\u1e7d}_{i}=0$ if u_{ i } ≥ v_{ i }, or ${\u0169}_{i}=0$ and ${\u1e7d}_{i}={v}_{i}{u}_{i}$if u_{ i } < v_{ i }, we obtain a better solution with${\u0169}_{i}$ and ${\u1e7d}_{i}$ than (u, v). Therefore, at optimality, $\sum _{i=1}^{2d}{\gamma}_{i}=\sum _{i=1}^{d}{u}_{i}+{v}_{i}=\sum _{i=1}^{d}\left{w}_{i}\right={\u2225\mathbf{\text{w}}\u2225}_{1}$ Then, Eq.(12) becomes exactly the same as Eq.(11).
We summarize the proposed algorithm in Algorithm 1. It has been proved that a SQP based algorithm can converge to a local minimizer $\widehat{\gamma}$ of the optimization problem (12) [28].
Algorithm 1 A sequential quadratic programming approach to solving Problem (11)
Data sets
We validated the proposed approach in both simulations and the analysis of a realworld data set that was aggregated from multiple genetic studies of cocaine dependence (CD).
Cocaine use and related behaviors data
We used the SemiStructured Assessment for Drug Dependence and Alcoholism (SSADDA) dataset aggregated from genetic studies of drug dependence to evaluate the proposed algorithm. The SSADDA subjects were recruited from multiple sites, including the University of Connecticut Health Center, Yale University School of Medicine, the University of Pennsylvania School of Medicine, McLean Hospital and the Medical University of South Carolina. All subjects participated using procedures approved by the institutional review board at each participating site. There were 6,621 subjects genotyped with a total of 1,140,420 SNPs genomewide. Among the subjects, 2,674 were stratified into the African American population using STRUCTURE software v2.3 [29], and only these subjects were used in our experiments to avoid spurious findings due to population structure. We removed 537 subjects who had other family members in the data so the GRM was computed for unrelated individuals.
Input: Z, C, X, λ
 1.
 2.
Initialize γwith u = 1, v = 0.
 3.
Initialize the Lagrange multipliers α= 1.
 4.
Evaluate f, $\nabla f$, ∇g_{ i } and ${\nabla}^{\mathsf{\text{2}}}\mathcal{L}$with the current γand α.
 5.
Solve Problem (13) to obtain $\widehat{\mathbf{\text{p}}}$ and $\widehat{q}$.
 6.
Perform a line search to find the searching step size s.
 7.
Update γand αas in Eq.(14). Repeat 47 until γreaches a fixed point.
A series of data cleaning steps were performed to ensure the quality of genotypic markers. Markers that meet any of the following conditions were excluded: low call rate (<98% subjects received values for the marker), G/C and A/T markers (to avoid strand issues), deviation from HardyWeinberg equilibrium at p <10^{ −8 }, significant cohort calling discrepancy and being monomorphic. We also removed nonautosomal markers, so that only markers from the 22 autosomal chromosomes were used in the analysis. After these data cleaning steps, 690,864 SNPs remained. Genetic relationship was estimated for each pair of subjects by the genomewide complex trait analysis (GCTA) software [11] using all 690,864 SNPs. We then excluded 385 subjects whose relatedness to some subjects was greater than 0.025 (corresponding to the relatedness of second cousins). The remaining sample, 1,752 subjects, was used in our analysis.
All subjects were interviewed with a computerassisted assessment system called the SSADDA [4], which consists of survey questions designed for cocaine use and related behaviors. All subjects were reported to have used cocaine in their lifetime. The responses to those questions in the SSADDA led to the definition of thirteen important cocaine use related variables, based on which a diagnosis of CD was determined. There were seven binary variables as listed below, which represent the seven cocaine dependence criteria in DSMIV.

F1  tolerance to cocaine;

F2  withdrawal from cocaine;

F3  using cocaine in larger amounts or over longer period than intended;

F4  persistent desire or unsuccessful efforts to cut down or control cocaine use;

F5  great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine;

F6  gave up or reduced important social, occupational, or recreational activities because of cocaine use;

F7  cocaine use despite knowledge of persistent or recurrent physical or psychological problems likely to have been caused or exacerbated by cocaine. In our experiments, positive responses to the seven variables were coded by 1 and negative responses were coded by 0. We also included six continuous variables in the analysis as listed below:

F8  number of cocaine symptom endorsed;

F9  age when first used cocaine;

F10  age when last used cocaine;

F11  age when first diagnosed with DSMIV cocaine dependence;

F12  age when last diagnosed with DSMIV cocaine dependence;

F13  transition time in years between the first cocaine use and the first cocaine dependence diagnosis.
All these variables were normalized to the range of [0, 1] in the analysis.
Synthetic data
Following the same design principle used in the simulations for testing chip heritability in [8], we used the reallife genotypic data in the CD study but synthesized phenotypic data. We simulated quantitative traits based on the mixedeffect linear model shown in Eq.(1). We first synthesized a dataset that contained 5 phenotypic features, all of which were created with moderate to high heritability, and were used to form a quantitative trait of very high heritability reaching 0.8. We then added irrelevant features, which varied mainly due to covariates, to create five other simulated datasets. These datasets consisted of 10, 20, 30, 40 and 50 features where only the first 5 of them were used in the model of the final trait. These datasets were used to determine whether the proposed algorithm could identify the right features for use in the model.
To synthesize features with genetic effects, we randomly picked 2,000 of the 690,864 SNPs in the cocaine use data set and used them as the causal variants of these features. The random effect coefficient u_{ j } associated with each of the 2,000 markers was generated independently by sampling from the standard normal distribution N (0, 1). The residual component ε_{ i } for each individual was drawn from the normal distribution of mean 0 and variance var(z_{ i }u)(1/h^{2} − 1) where z_{ i } is the ith row of Z, var(·) is the sample variance of a random variable and h^{2} is the heritability of the feature. To synthesize features with no genetic effects, we ignored the term Zu in Eq.(1) and created ε by randomly sampling from the standard normal distribution. To further synthesize features with fixed covariate effects, we used sex and age of the individuals in the CD study as the covariates, and arbitrarily set their effects, i.e., the β coefficients, to 0.2 and 0.5.
We evaluated the proposed method in two different experimental settings:
Setting 1: This setting assumed that there were no covariate effects in the quantitative trait. The five relevant features were simulated as follows. We used the procedure described in the above paragraph to create four features with h^{2} equal to 0.2: x_{1}, · · · , x_{4}. Then we simulated the final quantitative trait y_{1} with h^{2} = 0.8 using the same procedure. A fiveentry weight vector was created with arbitrary values, such as w = [0.22, 0.67, 0.60, 0.30, 0.22], used in our experiments. Then, the fifth feature was directly computed as ${x}_{5}=\left({y}_{1}\sum _{i=1}^{4}{w}_{i}{x}_{i}\right)/{w}_{5}$. By simulating the data in this way, we knew that there was at least one linear combination of the five features in the data that would result in a composite trait (i.e., y_{1}) with h^{2} of 0.8. Hence, if our approach worked, it should at least find this linear combination if there was no any other one that gave even higher h^{2}. Note that the heritability of the fifth feature had to depend on the empirical estimation, but given how it was created, there were genetic effects in this feature.
We then created 45 other features that had no genetic effects, and added a certain number of these features to the original 5 features to create 5 other datasets. Hence, there were in total 6 datasets for 1,752 subjects with 5, 10, 20, 30, 40 and 50 features. We used this set of data (i.e., the discovery set) in training to retrieve the combination models. Then we repeated the above procedure to create another independent set of data (i.e., the validation set) to validate the resultant models.
Setting 2: This setting assumed that the two covariates, sex and age, had fixed effects to the features and the final trait. We generated 5 features by adding fixed effects to the 5 useful features created in Setting 1. Because fixed effects do not change h^{2} of a trait, we computed a composite trait y_{2} using the same prespecified weight vector w that was used in Setting 1. We then created 45 other features with only covariate effects using the procedure described early on. Five other datasets were generated consisting of 10, 20, 30, 40 and 50 features. Note that the optimal weight vector for these datasets should have zero entries for all features except the first 5 features that were synthesized. Similarly, a discovery suite of the six datasets and another suite of them were synthesized using the same procedure for training and validation, respectively.
We estimated the chip h^{2} of the features created in the synthetic datasets using GCTA software. The four features synthesized with a prespecified h^{2} = 0.2 had empirical chip h^{2} values 0.2 ± 0.01 in these datasets. The chip h^{2} estimate of the fifth feature was 0.57 in the discovery set and 0.6 in the validation set. Because fixed effects do not affect trait heritability, the five relevant features and the final quantitative traits had the same empirical chip h^{2} in Settings 1 and 2. The features simulated with no genetic effects had h^{2} estimates that ranged from 0 to 0.05, and most of these features had estimates less than 10^{−5}.
The proposed analyses
We first validated the proposed approach in a variety of experiments with the synthetic data. Then we applied our approach to the reallife cocaine use data to identify important components or subtypes of the disease defined by linear combinations of clinical features. Such a combination can be used to define a disease subtype because it produces a quantitative trait for each individual, which amounts to the membership likelihood of the individual in a subtype. Because the actual causal variants were known for synthetic data, we calculated the GRM of the individuals directly using the causal variants. In the case study for CD, because the real causal variants were unknown, we followed the commonlyused procedure in the literature on chip heritability estimation [30] and computed the GRM using all 690,864 SNPs that remained in the data. All of the reported chip heritability was estimated using GCTA software.
Tuning of the hyperparameter: For both the simulation and the CD case study, we performed 10 times threefold cross validation (CV) to help determine a proper value of λ. At each fold of the CV, a linear model was derived by running the proposed method on 2/3 of the data in the dataset, and then tested on the remaining 1/3 of the data. The cross validated h^{2} of the derived trait was estimated using only subjects in the remaining 1/3 of the data which was not used to train the model. We ran the same CV process for each prespecified choice of λ (the choices we used are reported in the results section) and chose the λ value that gave a trait of the highest cross validated h^{2} for each experimental setting.
Evaluation metrics: We reported and investigated the CV performance (including the mean values and standard deviations of the validation h^{2} obtained in the CV process described above) for each λ choice in each experiment. Once the best value of λ was chosen through the cross validation for a dataset, we applied the proposed approach with the best λ to the entire data in the dataset to derive a quantitative trait. The chip heritability of this trait was estimated using the separatelysynthesized validation datasets in simulations and by another cross validation process in the case study. In other words, in simulations, we estimated the valiation chip h^{2} using the trait values computed by the linear model on the newlysynthesized validation samples. In the CD case study, we computed the trait h^{2} using SNPs of 2/3 of the subjects randomly sampled from the dataset and repeated the random sampling 10 times to report the averaged h^{2} value. We named this process the evaluation CV. Moreover, besides the heritability as a major evaluation metric, we also measured the effectiveness of our approach by comparing the derived trait models and the linear model implanted in the simulated data. We calculated the squared difference between the learned weights $\widehat{\mathbf{\text{w}}}$ and the true weights w, i.e., $\mathsf{\text{SE}}\left(\mathbf{\text{w}}\right)=\phantom{\rule{2.36043pt}{0ex}}\left\right\mathbf{\text{w}}\widehat{w}{}_{2}^{2}$ and the mean of squared residuals $SE\left(\mathbf{\text{y}}\right)=\left(1/n\right)\sum _{i=1}^{n}{\left({y}_{i}{x}_{i}\hat{\mathbf{\text{w}}}\right)}^{2}$, and reported the values in plots. Additional evaluation steps were conducted in the case study to clinically interpret the resultant quantitative trait (see the later paragraph).
Comparison: The validated chip h^{2} of our derived traits was compared with that of all quantitative features in the data in both simulations and the CD case study. In each of the experiments, our derived trait was also compared with the commonlyused disease phenotype, often referred to as a symptom count, which was the quantitative trait created by equal weighted aggregation of all features in the data. Given that no prior method existed to identify heritable disease components using the genomewide SNPs, on the reallife cocaine use data, we compared our approach with a recently published method [23] that aimed to derive linearlycombined traits using pedigrees of related individuals. This comparison considered whether a pedigreebased heritable component analysis method can identify a disease component with a chip heritability comparable to that found by our approach. As multimember families were included in the original cocaine use dataset, i.e., a superset of the sample used by our approach, it was feasible to apply the method in [23] to derive a trait. Then we computed the trait values on the unrelated individuals used by our approach to compare the chip h^{2} of the two approaches using the evaluation CV. Note that the prior pedigreebased approach was actually given a favor because it used the superset of 2,674 subjects (unrelated individuals were treated as onemember pedigrees) to derive the trait in comparison with our approach that used only the 1,752 unrelated individuals.
Clinical interpretation: It is very important to understand the clinical implications of the quantitative trait (or an empirical subtype) derived by our approach from the aggregated CD study data. From prior work [17, 31], we identified three key steps to ensure the clinical validity of an empirical subtype. We first examined the specific features selected by our approach for use in the model. Second, we studied the distribution (or histogram) of the quantitative scores among the 1,752 subjects. From the distribution plot, we examined whether there were obvious subgroups of the scores. Third, the subgroups of subjects were characterized and compared on 11 of the most important clinical variables reflecting cocaine use and related behaviors including both the features selected and those not selected for use in the linear model. The individuals receiving very high or very low values of the quantitative trait may show the most representative features of the subtype.
Results
Simulations
We prespecified 21 different λ values ranging from 0 to 0.04 with step size 0.002 for use in the crossvalidation tuning process. The validation or test h^{2} for each λ choice was plotted for each of the six datasets in Figure 1 (for Setting 1 where data were generated without covariate effects) and Figure 2 (for Setting 2 where data were generated with covariate effects). The mean, median, and the standard deviations of the test h^{2} values in the cross validation were plotted for each tested λ. These two figures show that our approach could identify components (quantitative traits) with test h^{2} estimate of ~ 0.8, which was the heritability of the implanted heritable component (the simulated true model), for all datasets even with many irrelevant features in some of the datasets for both settings. This result demonstrates that our approach identified highly heritable disease components and could successfully correct for fixed covariate effects.
A case study of cocaine dependence
Figure 8 shows the distribution of the trait values (i.e., the membership scores) of the subjects. It shows that based on the scores, the samples can be partitioned into four subgroups. There were 250 subjects (14.27% of total) in Group 1, which had a mean score of 2.22. Group 2 consisted of 323 subjects and comprised 18.44% of the entire sample set. Its mean score was 0.8. Group 3 was the largest one and consisted of 821 subjects (46.86% of the sample). The mean score of this group was 0.2. Group 4 was the smallest, comprised of 237 individuals (13.53% of the sample), with a mean score of 1.22.
Characteristic of the three subject groups on important clinical variables related to cocaine use.
Variable  Group1 250(14.27)  Group2 339(19.35)  Group3 926(52.85)  Group4 237(13.53) 

Tolerance to cocaine  124(49.60)  80(23.60)  807(87.15)  123(51.90) 
Withdrawal from cocaine  14(5.60)  275(81.12)  813(87.80)  192(81.01) 
Using cocaine in larger amounts or over longer period than intended  249(99.60)  323(95.28)  816(88.12)  103(43.46) 
Persistent desiring or unsuccessful cutting down cocaine use  223(89.20)  326(96.17)  839(90.60)  233(98.31) 
Great amount of time spent in activities related to cocaine  240(96.00)  290(85.55)  823(88.88)  82(34.60) 
Gave up or reduced important activities because of cocaine use  170(68.00)  212(62.54)  806(87.04)  156(65.82) 
Cocaine use despite knowledge of problems likely caused by cocaine  209(83.60)  315(92.92)  817(88.23)  170(71.73) 
Number of CD criteria endorsed  4.92(1.07)  5.37(1.08)  6.18(2.10)  4.47(1.50) 
Age when first used cocaine  21.93(6.34)  22.32(6.51)  21.44(5.75)  22.97(8.42) 
Age onset of DSM4 cocaine dependence  28.24(7.88)  28.31(7.95)  26.63(6.70)  28.32(8.06) 
Transition time in years from first cocaine use to first CD diagnosis  8.01(12.05)  8.07(12.91)  12.15(21.30)  15.19(24.65) 
Conclusion
We developed an approach to identify composite traits from multivariate phenotypes that are highly heritable, as estimated using genomewide SNPs. The trait we derived is in the form of a linear combination of variables related to the phenotype, that is y = Xw. A quadratic optimization problem was formulated, in which optimal w was sought to optimize the log likelihood for estimating variance components in REML. In this formulation, variance components are set to their ideal values with the additive genetic variance component ${\sigma}_{g}^{2}$ equal to 1 and other components equal to 0. To avoid overfitting, we incorporated a regularization term in our formulation. An efficient algorithm based on the sequential quadratic programming framework was developed to solve the proposed optimization problem. We evaluated the proposed approach on both synthetic and real world data. The empirical results demonstrate the effectiveness of our approach as a means to identify traits with much higher chip h^{2} than commonlyused disease phenotypes.
In this paper, the pairwise genetic relationship among subjects was estimated from genomewide SNPs. However, it can also be estimated from SNPs restricted to a specific region, such as on a particular chromosome or in genes related to a pathway, to explore the genetic architecture of a trait. When SNPs within a specific region are used, the trait resulting from the proposed approach will achieve the maximized genetic variance component corresponding to this region. In an application, such as substance dependence, there are known pathways involved, so it may be of utility to determine whether there is a composite trait, the variance of which can be largely explained by the variants within the pathways. This will be a future application of our approach.
Declarations
Acknowledgements
This work was supported by NIH grant R01DA037349, NSF grant DBI1356655, and IIS1447711. Jinbo Bi was also supported by NSF grants IIS1320586 and IIS1407205. Henry R. Kranzler was also supported by NIH grants R01DA030976, R01AA021164, R01AA023192, and R01CA184315.
We thank Joel Gelernter, M.D. from Yale University who was instrumental in recruiting, characterizing, and genotyping the subjects in the SSADDA dataset used here. Kathleen Brady, M.D., Ph.D. of the Medical University of South Carolina, Roger Weiss, M.D., of McLean Hospital and Harvard Medical School, and David Oslin, M.D., of the University of Pennsylvania Perelman School of Medicine oversaw study recruitment at their respective sites. That work was supported by NIH grants R01AA011330, R01AA017535, R01DA012690, R01DA018432, and R01DA012849.
Publication costs for this article were funded by NIH grant R01DA037349.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 3, 2015: Selected articles from the IEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3.
Authors’ Affiliations
References
 Gelernter J, Sherva R, Koesterer R, Almasy L, Zhao H, Kranzler HR, Farrer L: Genomewide association study of cocaine dependence and related traits: Fam53b identified as a risk gene. Mol Psychiatry. 2013Google Scholar
 Gelernter J, Kranzler HR, Sherva R, Koesterer R, Almasy L, Zhao H, Farrer LA: Genomewide association study of opioid dependence: multiple associations mapped to calcium and potassium pathways. Biol Psychiatry. 2014, 76 (1): 6674. 10.1016/j.biopsych.2013.08.034.View ArticlePubMedGoogle Scholar
 Treutlein J, Rietschel M: Genomewide association studies of alcohol dependence and substance use disorders. Curr Psychiatry Rep. 2011, 13 (2): 14755. 10.1007/s1192001101764.View ArticlePubMedGoogle Scholar
 PierucciLagha A, Gelernter J, Chan G, Arias A, Cubells JF, Farrer L, Kranzler HR: Reliability of dsmiv diagnostic criteria using the semistructured assessment for drug dependence and alcoholism (ssadda). Drug Alcohol Depend. 2007, 91 (1): 8590. 10.1016/j.drugalcdep.2007.04.014.View ArticlePubMedPubMed CentralGoogle Scholar
 Balding DJ, Bishop MJ, Cannings C: Handbook of Statistical Genetics. 2007, John Wiley & Sons, Chichester, England; Hoboken, NJ, 3View ArticleGoogle Scholar
 de los Campos G, Gianola D, Allison DB: Predicting genetic predisposition in humans: the promise of wholegenome markers. Nature Reviews Genetics. 2010, 11 (12): 880886. 10.1038/nrg2898.View ArticlePubMedGoogle Scholar
 Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001, 157 (4): 18191829.PubMedPubMed CentralGoogle Scholar
 Yang J, Montgomery GW, Goddard ME, Visscher PM, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG: Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010, 42 (7): 56510.1038/ng.608.View ArticlePubMedPubMed CentralGoogle Scholar
 Makowsky R, Pajewski NM, Klimentidis YC, Vazquez AI, Duarte CW, Allison DB, de los Campos G: Beyond missing heritability: prediction of complex traits. PLoS Genetics. 2011, 7 (4): 100205110.1371/journal.pgen.1002051.View ArticleGoogle Scholar
 Hill WG, Wray NR: Heritability in the genomics eraconcepts and misconceptions. Nature Reviews Genetics. 2008, 9 (4): 255266. 10.1038/nrg2322.PubMedGoogle Scholar
 Yang J, Lee SH, Goddard ME, Visscher PM: Gcta: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011, 88 (1): 7682. 10.1016/j.ajhg.2010.11.011.View ArticlePubMedPubMed CentralGoogle Scholar
 Speed D, Hemani G, Johnson MR, Balding DJ: Improved heritability estimation from genomewide SNPs. American Journal of Human Genetics. 2012, 91 (6): 10111021. 10.1016/j.ajhg.2012.10.010.View ArticlePubMedPubMed CentralGoogle Scholar
 Kranzler HR, Wilcox M, Weiss RD, Brady K, Hesselbrock V, Rounsaville B, Farrer L, Gelernter J: The validity of cocaine dependence subtypes. Addict Behav. 2008, 33 (1): 4153. 10.1016/j.addbeh.2007.05.011.View ArticlePubMedGoogle Scholar
 Bi J, Gelernter J, Sun J, Kranzler HR: Comparing the utility of homogeneous subtypes of cocaine use and related behaviors with dsmiv cocaine dependence as traits for genetic association analysis. Am J Med Genet B Neuropsychiatr Genet. 2014, 165B (2): 14856.View ArticlePubMedGoogle Scholar
 Sun J, Bi J, Chan G, Oslin D, Farrer L, Gelernter J, Kranzler HR: Improved methods to identify stable, highly heritable subtypes of opioid use and related behaviors. Addictive Behaviors. 2012Google Scholar
 Gelernter J, Panhuysen C, Wilcox M, Hesselbrock V, Rounsaville B, Poling J, Weiss R, Sonne S, Zhao H, Farrer L, Kranzler HR: Genomewide linkage scan for opioid dependence and related traits. American Journal of Human Genetics. 2006, 78 (5): 759769. 10.1086/503631.View ArticlePubMedPubMed CentralGoogle Scholar
 Babor TF, Caetano R: Subtypes of substance dependence and abuse: implications for diagnostic classification and empirical research. Addiction (Abingdon, England). 2006, 101: 10410.View ArticleGoogle Scholar
 Hu VW, Addington A, Hyman A: Novel autism subtypedependent genetic variants are revealed by quantitative trait and subphenotype association analyses of published gwas data. PloS ONE. 2011, 6 (4): 1906710.1371/journal.pone.0019067.View ArticleGoogle Scholar
 Ott J, Rabinowitz D: A principalcomponents approach based on heritability for combining phenotype information. Hum Hered. 1999, 49 (2): 10611. 10.1159/000022854.View ArticlePubMedGoogle Scholar
 Wang Y, Fang Y, Jin M: A ridge penalized principalcomponents approach based on heritability for highdimensional data. Hum Hered. 2007, 64 (3): 18291. 10.1159/000102991.View ArticlePubMedGoogle Scholar
 Klei L, Luca D, Devlin B, Roeder K: Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol. 2008, 32 (1): 919. 10.1002/gepi.20257.View ArticlePubMedGoogle Scholar
 Oualkacha K, Labbe A, Ciampi A, Roy MA, Maziade M: Principal components of heritability for high dimension quantitative traits and general pedigrees. Statistical Applications in Genetics and Molecular Biology. 2012, 11 (2):Google Scholar
 Sun J, Bi J, Kranzler HR: Quadratic optimization to identify highly heritable quantitative traits from complex phenotypic features. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD '13. 2013, ACM, New York, NY, USA, 811819.View ArticleGoogle Scholar
 Patterson HD, Thompson R: Recovery of interblock information when block sizes are unequal. Biometrika. 1971, 58 (3): 545554. 10.1093/biomet/58.3.545.View ArticleGoogle Scholar
 Verbyla AP: A conditional derivation of residual maximum likelihood. Australian Journal of Statistics. 1990, 32 (2): 227230. 10.1111/j.1467842X.1990.tb01015.x. doi:10.1111/j.1467842X.1990.tb01015.xView ArticleGoogle Scholar
 Vapnik VN: An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999, 10 (5): 988999. 10.1109/72.788640.View ArticlePubMedGoogle Scholar
 Nocedal J, Wright SJ: Numerical Optimization. 2006, Springer, New York, 2Google Scholar
 Nocedal J, Wright SJ: Numerical Optimization. 2006, Springer, New YorkGoogle Scholar
 Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 94559.PubMedPubMed CentralGoogle Scholar
 Yang J, Benyamin B, Visscher PM: Common snps explain a large proportion of the heritability for human height. Nat Genet. 2010, 42 (7): 5659. 10.1038/ng.608.View ArticlePubMedPubMed CentralGoogle Scholar
 Hesselbrock VM, Hesselbrock MN: Are there empirically supported and clinically useful subtypes of alcohol dependence?. Addiction. 2006, 101 (Suppl 1): 97103.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.