Integrative analysis of loss-of-function variants in clinical and genomic data reveals novel genes associated with cardiovascular traits

Background Genetic loss-of-function variants (LoFs) associated with disease traits are increasingly recognized as critical evidence for the selection of therapeutic targets. We integrated the analysis of genetic and clinical data from 10,511 individuals in the Mount Sinai BioMe Biobank to identify genes with loss-of-function variants (LoFs) significantly associated with cardiovascular disease (CVD) traits, and used RNA-sequence data of seven metabolic and vascular tissues isolated from 600 CVD patients in the Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) study for validation. We also carried out in vitro functional studies of several candidate genes, and in vivo studies of one gene. Results We identified LoFs in 433 genes significantly associated with at least one of 10 major CVD traits. Next, we used RNA-sequence data from the STARNET study to validate 115 of the 433 LoF harboring-genes in that their expression levels were concordantly associated with corresponding CVD traits. Together with the documented hepatic lipid-lowering gene, APOC3, the expression levels of six additional liver LoF-genes were positively associated with levels of plasma lipids in STARNET. Candidate LoF-genes were subjected to gene silencing in HepG2 cells with marked overall effects on cellular LDLR, levels of triglycerides and on secreted APOB100 and PCSK9. In addition, we identified novel LoFs in DGAT2 associated with lower plasma cholesterol and glucose levels in BioMe that were also confirmed in STARNET, and showed a selective DGAT2-inhibitor in C57BL/6 mice not only significantly lowered fasting glucose levels but also affected body weight. Conclusion In sum, by integrating genetic and electronic medical record data, and leveraging one of the world’s largest human RNA-sequence datasets (STARNET), we identified known and novel CVD-trait related genes that may serve as targets for CVD therapeutics and as such merit further investigation. Electronic supplementary material The online version of this article (10.1186/s12920-019-0542-3) contains supplementary material, which is available to authorized users.


Overview of Annotation of LoF variants
Genotyping and subsequent imputation yielded data for more than 37 million genetic variants in the BioMe Biobank cohort. There are a number of tools available to annotate effects of mutations on gene function and previous related studies have used various tools for their annotation procedures. As different annotators have their strengths and weaknesses, we decided to utilize three widely-utilized annotators to identify the predicted effects of these mutations: VAT [1], ANNOVAR [2] and SnpEff [3].
Following the example of previous studies, we limited LoF annotation to "stop gain", "splice site", and "frame shift" only with "high" or "full" impact. Notably, the proportion of three types of the predicted  Table 1. There are 10 variants that are not seen in these public datasets, but they were directly measured on the exome chip.

Determining genetic ancestry
We utilized Principal Component Analysis (PCA) for dimensionality reduction on genetic data of the cohort to obtain a metric for genetic ancestry. We performed the following QC steps on the genotype data: we cleaned the data for individual and site level missingness to a threshold of 95%, kept only common variants that had >=1% minor allele frequency, thinned linkage disequilibrium to a r 2 of 0.3, and removed human leukocyte antigen and lactase genetic regions. We ran this cleaned genotype data into the EIGENSOFT [4] smartpca tool to generate PCs for each individual by first training the space on the European ancestry individuals from Utah (CEU), Yoruba in Ibidan, Nigeria (YRI), and Han Chinese in Beijing, China (CHB), which were used as European, African, and East Asian ancestry reference panels from 1,000 genomes project, respectively [5]. For the models of our analyses, we used the top five PCs for each individual as covariates.
Trait measurement processing, quality control, and confounding factor filtering Furthermore, while there are some standardized procedures that exist for these purposes, such as adding 15 and 10 mmHg to systolic and diastolic measurements respectively to patients taking a blood pressure lowering medication [6], we opted to perform identical quality control procedures for each trait for a few reasons: first, our filtering methodology is more conservative; second, it is more comprehensive (including many potentially relevant medications); third, it follows the same rules for all traits, enhancing our ability to compare associations across traits; and lastly, it makes no assumptions on effect of drug (i.e. blood pressure medications might affect individuals differently, which is masked by a standard adjustment procedure). As an initial step, we only included measurements that were taken during an "Outpatient" encounter, excluding both "Inpatient" and "Ambulatory/Emergency" encounters as the latter are more likely to represent active disease states. Two physician cardiologists were in charge of identifying specific classes of medications that could affect each trait (Supplementary Table 2). If there was any disagreement, a third expert made the final decision. As such, we compiled a list of all medications that can affect each trait through mapping from Therapeutic class. Additionally, we identified classes of medications that can affect all traits of interest (e.g. cancer treatment medication). To enact this filtration procedure, we compared dates of measurement collection to dates of drug prescription. If an associated medication was prescribed three months before or three months after (180 day total window) a measurement was taken, we flagged it for removal (see Supplementary Fig. 2B for a visual description of this filtering procedure).
The three month window was selected as a typical timeline for which a prescription refill would be necessary. As mentioned, we also included a window to exclude the measurement if a relevant medication was prescribed within three months after the collection date. This was enacted to safeguard against latent or undiagnosed disease states that could still affect the trait level during pathogenesis.
Trait measurement exclusion criteria and procedure. We first filtered trait data for mismatched units.
However, clerical errors (i.e. incorrect measurement label or typos) could still allow physiologically impossible measurements to be in our dataset. Accordingly, we first compiled normal reference ranges per trait and then filtered values that were: greater than 3x the upper reference range and lower than 1/3 the lower reference range. We present the filtering figures due to outliers in Supplementary Table 4.
Once all incorrectly labeled, outlier, and medication/disease-affected measurements were filtered out, we combined any remaining measurements into a single median value per person. Any individuals without any remaining clean measurements were excluded from the particular trait analysis. We present a list of the number of individuals and affected measurements that were excluded per trait for drug filtering along with the total number of individuals and measurements that were excluded due to drugs and outliers in Supplementary Table 3 and 4 respectively. We compile the total number of measurements and individuals used before and after these filtration steps per trait in Supplementary   Table 5.

Gene ontology annotation
In order to identify knowledge-based associations of interest as further refined candidates for therapeutic validation, we intersected the combined significant associations between both cohorts with relevant Gene Ontology (GO) terms [7] (Supplementary Table 8). Specifically, we highlighted any association in which the gene had a related "Biological Process" to the traits of interest (e.g. lipid metabolic process, regulation of blood pressure).

Effect of medication on gene-trait association results
In total, 10,072 (95.8%) individuals in the utilized cohort had at least one prescription instance, which encompasses drug name, dosage, and route of administration. On average, individuals in our cohort had 132.5 ± 264.4 (std) prescription instances. If limited to unique prescription instance (i.e. drug name-dosage-route of administration combination), individuals in our cohort had an average of 50.6 ± 58.5 (std) prescription instances. The distribution of prescriptions per individual ( Supplementary Fig. 3) reflects the imperative need for this type of filtering.
To demonstrate the importance and utility of the rigorous quality control process for trait data, we performed the same analysis on the raw trait data prior to removal of medication/disease-affected measurements. While the overall numbers for each condition seem similar, the differences reveal the importance for this correction ( Supplementary Fig. 5). In cases where an association was significant in the raw data but not in the cleaned data, these can be interpreted as potential false positives. In total, we found 136 instances of these potential false positive associations (p-clean>0.05, p-raw<0.05).
Associations that were significant in the cleaned data but not in the raw can be interpreted as potential false negatives. We found 145 instances of potential false negatives (p-clean<0.05, p-raw->0.05). To illustrate these points, we present an example of the association between CPA1 gene and total cholesterol levels ( Supplementary Fig. 6). Using the raw data, we identified a significant association between LoF in the CPA1 gene and reduced cholesterol level (p=0.019; Supplementary   Fig. 6A). However, the association is no longer statistically significant after the data was corrected by removing medication-affected measurements (p=0.36; Supplementary Fig. 6B). Therefore this association represents a potential false positive result.

Novel genes implicated in cholesterol and triglyceride homeostasis
RNMTL1 (RNA methyltransferase-like protein 1) is a member of the RNA methyltransferase family responsible for methylation of G (1370) of the human 16S rRNA complex [8]. Inactivation of RNMTL1 in HeLa cells by RNA interference resulted in respiratory inefficiency due to reduced mitochondrial translation [9]. In humans, RNMTL1 is specifically expressed in liver [10]. In contrast, SCRN2 (secernin-2) is ubiquitously expressed across many tissues and is involved in exocytosis in mast cells [11]. Other than that, SCRN2 in muscle cells has been negatively correlated with plasma triglyceride levels in F2 mice [12]. PCK2 (phosphoenolpyruvate carboxykinase 2) is a mitochondrial gene also expressed in a variety of human tissues (mainly in liver, kidney, pancreas, intestine and fibroblasts). It has been implicated in glucose homeostasis in the liver (in its cytosolic form) [13].
SLC39A5 (solute carrier family 39 member 5) belongs to the ZIP family of zinc transporters and plays a crucial role in controlling intracellular zinc levels [14]. It is expressed in a variety of tissues including liver, kidney, pancreas, small intestine and colon [15]. In humans, mutations in this gene have been associated with autosomal dominant myopia [16] but its functional role requires further characterization. ABHD14B (abhydrolase domain-containing protein 14B) is also ubiquitously expressed across tissues [17], encoding an enzyme with largely unknown function. However, some evidence suggests it is involved in nuclear transcription activation [18]. Moreover, a genetic mutation in ABDH14B has been associated with a mitochondrial complex III enzyme deficiency [19]. NMRAL1 (NmrA like redox sensor 1) encodes a sensor protein that preferentially binds to NADPH. Association with argininosuccinate synthase (AS) impairs its activity and reduces the production of nitric oxide, which subsequently prevents apoptosis [20]. The encoded protein has also been shown to negatively regulate NF-kappaB in an ubiquitylation-dependent manner [21,22].  Table 9. siRNA and Taqman assays and the degree of siRNA inhibition. siRNA and TaqMan assays used to silence genes and measure the degree of gene silencing of HepG2 cells.

SUPPLEMENTARY TABLE AND FIGURES LEGENDS
Supplementary