Association between triglycerides, known risk SNVs and conserved rare variation in SLC25A40 in a multi-ancestry cohort

Background Elevated triglycerides (TG) are associated with, and may be causal for, cardiovascular disease (CVD), and co-morbidities such as type II diabetes and metabolic syndrome. Pathogenic variants in APOA5 and APOC3 as well as risk SNVs in other genes [APOE (rs429358, rs7412), APOA1/C3/A4/A5 gene cluster (rs964184), INSR (rs7248104), CETP (rs7205804), GCKR (rs1260326)] have been shown to affect TG levels. Knowledge of genetic causes for elevated TG may lead to early intervention and targeted treatment for CVD. We previously identified linkage and association of a rare, highly conserved missense variant in SLC25A40, rs762174003, with hypertriglyceridemia (HTG) in a single large family, and replicated this association with rare, highly conserved missense variants in a European American and African American sample. Methods Here, we analyzed a longitudinal mixed-ancestry cohort (European, African and Asian ancestry, N = 8966) from the Electronic Medical Record and Genomics (eMERGE) Network. We tested associations between median TG and the genes of interest, using linear regression, adjusting for sex, median age, median BMI, and the first two principal components of ancestry. Results We replicated the association between TG and APOC3, APOA5, and risk variation at APOE, APOA1/C3/A4/A5 gene cluster, and GCKR. We failed to replicate the association between rare, highly conserved variation at SLC25A40 and TG, as well as for risk variation at INSR and CETP. Conclusions Analysis using data from electronic health records presents challenges that need to be overcome. Although large amounts of genotype data is becoming increasingly accessible, usable phenotype data can be challenging to obtain. We were able to replicate known, strong associations, but were unable to replicate moderate associations due to the limited sample size and missing drug information.

We previously identified linkage and association of a rare, highly conserved missense SNV, rs762174003, in SLC25A40 with HTG in a large family, and replicated the association with rare missense SNVs at highly conserved sites in SLC25A40 in a European American (EA) and African American (AA) sample [21]. SLC25A40 encodes a mitochondrial transmembrane protein [22]. In drosophila, SLC25A40 loss-of-function was shown to contribute to mitochondrial damage due to excessive oxidation [23]. Here, we attempt to replicate the association between deleterious variation at SLC25A40, as well as other reported associations with variation at APOE, GCKR, APOA1/C3/A4/A5 gene cluster, INSR, and CETP, with TG in a multi-ancestry population from the Electronic Medical Records and Genomics (eMERGE) Network phase 3 study [24,25].

Participants and phenotype data
Participants were ascertained from each of nine phase 3 eMERGE sites, a study tasked with investigating return of results from the sequencing of a panel of ~ 100 genes in a large (N = 25,000), diverse cohort of participants [24]. Participants ages 18 years and older were included, as the distribution for TG levels in children differs from that of adults. We ascertained relevant patient data from the clinical records, including TG level measured in mg/dL (N = 12,229 adult participants), age, sex, weight, height and body mass index (BMI), measured at the same clinical visit. We also ascertained relevant ICD9 and ICD10 codes for lipidemias, pancreatic cancer, and pancreatitis. Thirteen participants with chylomicronemia (ICD9 code 272.3, ICD10 code E78.3), 12 participants with Kaposi's sarcoma (ICD9 code 176 and ICD10 code C46) and 2,539 participants with morbid obesity (ICD9 code 278.01 and ICD10 code E66.01) or bariatric surgery were removed as their TG values are likely associated with unaccounted for exposures. Quality control was performed within participants as well. Participants who had maximum TG (maxTG) < 40 and had more than one record verifying this low TG were retained. However, one participant who had a single record with TG = 19 mg/dL was removed. Records within 2 weeks prior or 4 weeks after a diagnosis of pancreatitis (ICD9 code 577.0, ICD10 code K85.9) for an individual were removed (382 participants), as high TG are correlated with pancreatitis and is likely due to an environmental exposure such as alcohol. Similarly, individual level records coincident and after a diagnosis of pancreatic cancer were removed (147 participants). Quality control of BMI was performed using height and weight data as described by Goodloe et. al [26]. For participants with only BMI data (i.e., no height or weight data), values more than 5 standard deviations from the participant specific median BMI were removed.

Genotype data
A subset of individuals (N = 24,956) in eMERGE were genotyped using a custom capture target containing sequence for 788 genes with partial to full exon coverage, including 58 of the 59 genes considered actionable by the American College of Medical Genetics (ACMG) [27,28], and detects 62,051 SNVs. Sequencing was performed at two sites: Partners Healthcare Laboratory for Molecular Medicine (LMM) and Baylor. Samples from dropped out participants, that failed quality control or had sex discrepancy were removed (N = 229). After removing duplicate data for a single participant, sequence data for all participants was aligned to a merged target sequence map, using the Burrows-Wheeler Aligner version 0.7.10 [29]. Genotypes were jointly called on all participants using the Genome Analysis Toolkit version 3.5 [30]. 896 Variants with genotype missingness rate > 5% were removed. Individual level genotyping rate was > 98% for all participants. After cleaning the data, 8970 participants with both phenotype and genotype data remained in the analysis. Four additional participants were removed due to their influence on the residual model (see results).
Principal components (PC) of ancestry was calculated on a pruned set of 1571 SNVs (r 2 < 0.7, MAF > 0.05). Selfreported ancestry aligned well with the first two PCs.

Analysis
The phenotype of interest, medTGres, was calculated as the residuals from the median log10 transformed TG for each individual, adjusted for median age, sex, median BMI, presence of hypertriglyceridemia (HTG, ICD9 code 272.1 and ICD10 code E78.1), presence of hyperlipidemia (HL, ICD9 codes 272.2, 272.4, 272.9 and ICD10 codes E78.2, E78.4, E78.9), APOE genotype, known risk SNVs, known APOC3 protective SNVs, the first two principal components (PCs) of ancestry, and site. The variable for median age was centered around age 50, as this is when TG levels tend to descend, and then squared, as the relationship between TG levels and age is parabolic. We log10 transformed TG as the distribution is highly skewed and the transformation makes it nearly Normal. We used the median transformed TG (medTG) value over all records for each individual because the median protects against outliers that could be due to non-fasting measurements. The resulting residuals from the above linear model (medTGres) were used as the primary phenotype. We also used a similar residual from the adjusted maximum log10 transformed TG (maxTGres), as using the maximum protects against errors introduced from treatment with Niacin, fibrates or binders for high TG. Given our more complete local data from the University of Washington and Kaiser Permanente Washington (UWKP) site, we expect about 3% of the participants to be receiving such treatment at some time point in their clinical records [31]. However, this data was not available to us study-wide.
In addition to assessing the significance of the known risk variation, we analyzed variants in SLC25A40 that are likely to affect TG levels [21]. This included rare (maximum population MAF < 0.005) variants that either caused an early termination in the first 90% of the coding region, or changed the amino acid sequence. Missense variants were further constrained to be evolutionarily conserved (Genomic Evolutionary Rate Profiling (GERP) score > 4.8) [32]. The maximum population MAF was derived from populations in the gnomAD database that have not experienced a bottleneck (i.e., ignored MAF from Ashkenazi Jewish or Finnish populations) [33,34]. We performed four separate gene-wise tests for association with medTGres and maxTGres using 1. Genotype at single SNVs with more than 10 heterozygotes, 2. All other SNVs collapsed into a single indicator variable, 3. All other rare missense SNVs collapsed into a single indicator variable and 4. All other early terminations collapsed into a single indicator variable.

Demographics
Demographics of the participants included in these analyses are given in Table 1. There were slightly more females than males at most sites. The distribution of age varied widely among the sites, with younger participants from Children's Hospital of Philadelphia (CHOP) and Cincinnati Children's Hospital Medical Center. Median BMI tended to be above normal (> 25), but there were some participants with low BMI (< 18.5) (Columbia, Mayo). There were 51 participants with very low median BMI < 11, all from Mayo, with at least 3 records supporting this measurement. The median TG ranged between 19 and 1371 mg/dL over all participants, but the majority of participants had median TG < 150 mg/dL, which is defined as normal (Fig. 1). The number of TG records available per participant varied widely, with a median of 6 overall. Most participants were of European ancestry (77%) followed by African ancestry (7%) and Asian ancestry (6%) ( Table 2, Fig. 2). Most of the participants of Asian ancestry (82%) were from UWKP and most of the participants of African ancestry (62%) were from Northwestern and Cincinnati.

SLC25A40 genotype data
Twenty-one rare SNVs, consisting of 2 stop gains, 3 frame shifts and 16 evolutionarily conserved missense variants, were heterozygous in the data set for 32 participants (Table 5). There were 119 participants who were heterozygous at the missense rs724665 and this SNV is considered separately from the others. Each participant was heterozygous for at most one of  the SNVs. Three participants were heterozygous at the frame shifts and 2 participants were heterozygous at the stop gains, limiting the power of the tests.

Residual model
The final residual model, containing only significant covariates is shown in Table 6. Although centeredmedian-age-squared was significantly associated with log10(medTG), its effect is near zero (β = − 4e−05). Males had higher log10(medTG) (β = 0.02). Increased BMI was associated with higher log10(medTG) (β = 0.01). Similarly, individuals with HTG or HL had higher log10(medTG) (β = 0.3 and 0.08, respectively) than other participants. SITE was also highly associated with log10(medTG) (F-test p < 2e−16). Neither the INSR or CETP SNVs were significantly associated with median or maximum TG (p > 0.16), and were not included in the model (data not shown). Four potentially influential outlying participants (as defined by their cook's distance [35]) in either model (medTGres or maxTGres) were removed, leaving a total of 8,966 participants with complete phenotype and genotype data for analysis. None of the participants with low BMI (median BMI < 15) appeared to influence the models.

Primary analysis
None of the SLC25A40 SNVs were significantly associated with medTGres, either separately or in combination  Fig. 3). Furthermore, the estimated effect of the variants was negative in all situations (rs724665 β = − 0.01; all SNVs excluding rs724665 β = − 0.02; all rare missense β = − 0.01; rare early terminations β = − 0.07). Similar results were observed for maxTGres and for analysis including the EA-only subset.

Discussion
We replicated association between TG and risk SNVs at APOE, GCKR, APOA5 and APOC3, with the respective directions of effect as expected. Association between variation at APOA5 and APOC3 with TG has been well   documented, and our ability to detect these effects supports the reliability of these TG data. Similarly, replication of an association between APOE and TG lends further support to the association of APOE ε2/3/4 and lipid traits in general. The association between GCKR and TG further supports a biological mechanism between diabetes and TG [36]. However, we did not replicate the reported association between TG and risk SNVs at INSR and CETP. The association between INSR rs7248104 and TG was first reported by a large GWAS with a sample size of 176,000 subjects [18]. It is possible that the sample size here is too small to detect any effect. Similarly, the association with CETP rs7205804 was discovered in a cohort of 95,000 subjects [37], and therefore it is possible that the sample size here is too small to detect any effect with TG. Additionally, we failed to replicate an association between putatively deleterious variation at SLC25A40 and TG. Previous studies relied on linkage in a large family and data from the Exome Sequencing Project (ESP). It is possible that the original signal found in the family is due to another variant in linkage with rs762174003, but unrelated to the function of SLC25A40. Further research into this locus may benefit from whole genome sequencing of selected individuals from this family. However, this would not explain why an association was detected in the relatively small sample from the ESP. Results for maximum TG also showed no association.
It is also possible that there is a relationship between SLC25A40 and TG, but we are not able to detect it in this sample. One major caveat is that we are missing data on treatment for high TG. In our local data set, UWKP, where treatment data were available, we discovered approximately 3% of individuals were being treated for high TG with niacin, fibrates or binders [31]. When we accounted for these treatments in the local data we found evidence supporting an association between TG and rare early termination and conserved missense variation at SLC25A40 [31]. As the number of participants who harbor such a variant in SLC25A40 is small (1.7% of our sample), an unaccounted for 3% treatment rate could obscure any signal. Although we used median TG to reduce the Fig. 3 Association between variation at SLC25A40 and medTGres. a rs724665; b all rare variation; c all rare missense variation; d all rare stop gains and frameshifts. Width of the boxplots is proportional to sample size effect of unknown treatment, it may have been insufficient as we do not know when treatment would have begun. We did not use the mean as it can be influenced by outliers and TG have a right skewed distribution. Furthermore, this is secondary use of data, which is prone to a high error rate in the data. Although we attempted to clean the data using multiple strategies, it is possible that enough error remained in the data to obscure any true association with SLC25A40. In addition, the small numbers of heterozygotes that we observe in this study limit the power of our tests.

Conclusion
Although we replicated known, strong associations with TG, these data do not replicate the previously reported association between SL25A40 variation and TG. Larger datasets with more complete data on fasting and medication use may be required to further explore this association. Furthermore, use of secondary data, such as EHR data, needs extensive quality control and would benefit from more comprehensive data extraction methods.