Discovering weaker genetic associations guided by known associations

Background: The current understanding of the genetic basis of complex human diseases is that they are caused by many common and rare genetic variants. A considerable number of disease-associated variants have been identified by genome-wide association studies (GWAS); however, they can explain only a small proportion of the heritability. One possible reason for this missing heritability is that many undiscovered disease-causing variants are only weakly associated with the disease. This poses a serious challenge to many statistical methods, which seem capable of identifying only disease-associated variants with relatively strong coefficients.
Results: To help identify weaker variants, we propose a novel statistical method, the Constrained Sparse multi-locus Linear Mixed Model (CS-LMM), which aims to uncover genetic variants with weaker associations by incorporating known associations as prior knowledge in the model. Moreover, CS-LMM accounts for polygenic effects and corrects for complex relatedness. Our simulation experiments show that CS-LMM outperforms existing competing methods in various settings where the combinations of MAFs and coefficients reflect different scenarios in complex human diseases.
Conclusions: We also apply our method to GWAS data on alcoholism and Alzheimer's disease and, in an exploratory analysis, discover several SNPs. Many of these discoveries are supported by a literature survey. Furthermore, our association results strengthen the belief in genetic links between alcoholism and Alzheimer's disease.

This command will first fit CS-LMM with the SNPs that are validated to be associated with the phenotype (stored in knownMarkers.txt, one SNP id per line), and then run the second phase of CS-LMM to select 20 SNPs (together with the known ones) associated with the phenotype. The results will be stored in data/snps.132k.output.

Naive Usage

python cslmm.py -n data/mice.plink

This command does not require the user to specify any arguments. CS-LMM will first conduct a Wald test to identify the most significantly associated SNPs, and then fit the remaining SNPs to the residual of the fit on those SNPs. CS-LMM will also perform cross-validation to select an appropriate regularization weight.
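The naive usage above describes a two-phase procedure: a per-SNP Wald test picks the strongest associations, and the remaining SNPs are then fit to the residual with an L1-penalized model whose weight is chosen by cross-validation. The following is a minimal sketch of that idea, not the actual cslmm.py implementation; the function name, the Bonferroni threshold, and the plain Lasso in place of the full mixed model are all our own simplifying assumptions.

```python
# Hypothetical sketch of the two-phase selection described above
# (NOT the real CS-LMM code): Wald tests first, then a Lasso on the
# residual, with the L1 weight chosen by cross-validation.
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV, LinearRegression

def two_phase_select(X, y, alpha_level=0.05):
    n, p = X.shape
    # Phase 1: univariate Wald-style tests (slope p-value per SNP).
    pvals = np.empty(p)
    for j in range(p):
        result = stats.linregress(X[:, j], y)
        pvals[j] = result.pvalue
    known = np.where(pvals < alpha_level / p)[0]   # Bonferroni threshold
    # Residual after fitting the significant SNPs jointly.
    if len(known):
        fit = LinearRegression().fit(X[:, known], y)
        resid = y - fit.predict(X[:, known])
    else:
        resid = y.copy()
    # Phase 2: Lasso on the remaining SNPs; CV picks the penalty weight.
    rest = np.setdiff1d(np.arange(p), known)
    lasso = LassoCV(cv=5).fit(X[:, rest], resid)
    extra = rest[np.abs(lasso.coef_) > 1e-8]
    return known, extra
```

In the real method, phase 2 would also include the polygenic and relatedness corrections of the linear mixed model; the sketch only captures the two-stage selection logic.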

Longer List of Discovered Alcoholism SNPs
We ran our method with the same settings as in the main paper, except that we queried for a longer list (200 SNPs) of potentially associated SNPs.

AUC scores of the main result
We report the AUC scores of the precision-recall curves from the main manuscript in Table S8, to give a clearer picture of how these methods compare. The results demonstrate the strength of our method when heritability is intermediate, consistent with the main message of our manuscript.
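For readers reproducing this evaluation, the AUC of a precision-recall curve can be computed as the average precision over ranked association scores. The toy labels and scores below are made-up values for illustration, not numbers from the paper.

```python
# Illustrative only: AUC of a precision-recall curve via scikit-learn's
# average_precision_score, given per-SNP scores and ground-truth labels.
from sklearn.metrics import average_precision_score

truth  = [1, 0, 1, 0, 0, 1]                # 1 = truly associated SNP
scores = [0.9, 0.4, 0.7, 0.2, 0.35, 0.6]   # model-assigned association scores
auc_pr = average_precision_score(truth, scores)
# Here all three associated SNPs rank above the others, so auc_pr == 1.0.
```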
Simulation over a Larger Range of Settings
Figure S1 shows the results over a larger range of simulation settings. Figure S2 shows the results when we tune the parameters while mis-specifying the number of associated SNPs as half the actual number, and Figure S3 shows the results when we mis-specify that number as twice the actual number. These figures show that tuning the parameter directly by the number of associated SNPs selected is a valid and stable approach. More importantly, they show that CS-LMM remains a promising method even when the number of associated SNPs to select is mis-specified.

Figure S1: Simulation results of CS-LMM compared to other models in terms of the precision-recall curve when we query the models for K = k SNPs. The x-axis is recall and the y-axis is precision.

Figure S2: Simulation results of CS-LMM compared to other models in terms of the precision-recall curve when we query the models for K = k/2 SNPs. The x-axis is recall and the y-axis is precision.

Figure S3: Simulation results of CS-LMM compared to other models in terms of the precision-recall curve when we query the models for K = 2k SNPs. The x-axis is recall and the y-axis is precision.
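The tuning strategy discussed above, choosing the regularization weight so that the model selects a target number K of SNPs, can be sketched as a bisection over the L1 penalty. This is our own illustrative version, assuming a plain Lasso; the function name and search bounds are not taken from the CS-LMM code.

```python
# Minimal sketch: pick the L1 weight by bisection (on a log scale) so
# that the Lasso selects at most K SNPs. Illustrative, not CS-LMM's code.
import numpy as np
from sklearn.linear_model import Lasso

def tune_by_count(X, y, K, lo=1e-4, hi=10.0, iters=40):
    for _ in range(iters):
        mid = np.sqrt(lo * hi)          # geometric midpoint of the bracket
        coef = Lasso(alpha=mid, max_iter=5000).fit(X, y).coef_
        nz = int(np.sum(np.abs(coef) > 1e-8))
        if nz > K:
            lo = mid    # too many SNPs selected: raise the penalty
        else:
            hi = mid    # K or fewer selected: try a smaller penalty
    return hi           # smallest tested penalty that selects <= K SNPs
```

A larger penalty shrinks more coefficients to exactly zero, so the selected count is monotone in the penalty, which is what makes bisection valid here.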

Evaluation of Other Configurations (Different Number of Samples and Different Number of Associated SNPs)
We test the performance of our method against other models under different configurations of k (k ∈ {5, 10, 50}) and n (n ∈ {250, 500, 1000}), with coefficients e ∈ {5, 25} and MAFs m ∈ {0.005, 0.01}. Fig. S4 and Fig. S5 show that our method is superior to the other methods in most of these configurations. In both figures, the x-axis is recall and the y-axis is precision.
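A toy version of the simulation setup behind these configurations can help fix the meaning of k, n, e, and m: genotypes drawn as binomial(2, MAF) minor-allele counts, k causal SNPs given a fixed coefficient, and a Gaussian phenotype built additively. This is our own sketch under those assumptions, not the paper's exact simulation script.

```python
# Illustrative genotype/phenotype simulation for one configuration:
# n samples, p SNPs, k causal SNPs with coefficient `coef` at MAF `maf`.
import numpy as np

def simulate(n=500, p=1000, k=10, maf=0.01, coef=5.0, seed=0):
    rng = np.random.RandomState(seed)
    X = rng.binomial(2, maf, size=(n, p)).astype(float)  # minor-allele counts
    causal = rng.choice(p, size=k, replace=False)        # causal SNP indices
    beta = np.zeros(p)
    beta[causal] = coef
    y = X @ beta + rng.normal(size=n)                    # additive model + noise
    return X, y, causal
```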

Evaluation of Other Configurations (Large Scale Data)
We also tested the performance of CS-LMM against other methods on large-scale data (10,000 samples). As shown in Fig. S6, LMM-based methods perform comparably to CS-LMM, although CS-LMM still holds a slight advantage in most cases.

Other Evaluation Metrics Applied to the Methods
We also evaluate CS-LMM and the competing methods using true positives, false positives, and the area under the ROC curve (auROC) for the simulation experiments illustrated in Figure 2 of the main text. The results are shown in Table S9–Table S12. Note that these methods are evaluated only on the unknown SNPs. This is a challenging task, since other SNPs with stronger coefficients may mask the signals of the unknown SNPs, which explains why many methods, in particular the Lasso and its variants, perform only slightly better than chance in terms of the auROC score.
Looking at the 'auROC' column, we can see that CS-LMM has an advantage over the other methods particularly in the more challenging cases: rare variants with strong effects (Table S9), rare variants with small effects (Table S11), and common variants with small effects (Table S12). As the coefficients and MAFs increase (Table S10, representing the unusual scenario of common variants with large effects), the unknown SNPs carry signals as strong as the known SNPs, and the Wald test and LMM become comparable to, and even outperform, CS-LMM in terms of the auROC score. However, as the 'FP' column shows, both the Wald test with FDR control and LMM report too many false positives, making these methods impractical in real-life applications. The 'FP' column also shows that CS-LMM outperforms the other methods at controlling false positives in most cases. We further note that MLMM and FarmCPU control false positives well too, especially when the signals are very small.
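The three metrics in these tables can be computed directly from per-SNP scores and ground-truth labels: true and false positives among a fixed-size selection, and auROC over the full ranking. The labels and scores below are made-up toy values for illustration, not numbers from the tables.

```python
# Toy illustration of the evaluation metrics: TP/FP among the top-K
# selected SNPs, and auROC over per-SNP association scores.
import numpy as np
from sklearn.metrics import roc_auc_score

truth  = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # 1 = truly associated SNP
scores = np.array([0.8, 0.1, 0.6, 0.2, 0.3, 0.05, 0.7, 0.4])

selected = np.argsort(scores)[::-1][:3]        # select the top-3 SNPs
tp = int(np.sum(truth[selected] == 1))         # correctly selected SNPs
fp = int(np.sum(truth[selected] == 0))         # spurious selections
auroc = roc_auc_score(truth, scores)           # rank-based, threshold-free
```

Note the contrast the tables exploit: TP/FP depend on the selection cutoff (a method can rank well yet report many false positives at its chosen threshold), while auROC measures ranking quality alone.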