Skip to main content

Global-scale GWAS associates a subset of SNPs with animal-adapted variants in M. tuberculosis complex



While Mycobacterium tuberculosis complex (MTBC) variants are clonal, variant tuberculosis is a human-adapted pathogen, and variant bovis infects many hosts. Despite nucleotide identity between MTBC variants exceeding 99.95%, it remains unclear what drives these differences. Markers of adaptation into variants were sought by bacterial genome-wide association study of single nucleotide polymorphisms extracted from 6,362 MTBC members from varied hosts and countries.


The search identified 120 genetic loci associated with MTBC variant classification and certain hosts. In many cases, these changes are uniformly fixed in certain variants while absent in others in this dataset, providing good discriminatory power in distinguishing variants by polymorphisms. Multiple changes were seen in genes for cholesterol and fatty acid metabolism, pathways previously proposed to be important for host adaptation, including Mce4F (part of the fundamental cholesterol intake Mce4 pathway), 4 FadD and FadE genes (playing roles in cholesterol and fatty acid utilization), and other targets like Rv3548c and PTPB, genes shown essential for growth on cholesterol by transposon studies.


These findings provide a robust set of genetic loci associated with the split of variant bovis and variant tuberculosis, and suggest that adaptation to new hosts could involve adjustments in uptake and catabolism of cholesterol and fatty acids, like the proposed specialization to different populations in MTB lineages by alterations to host lipid composition. Future studies are required to elucidate how the associations between cholesterol profiles and pathogen utilization differences between hosts and MTBC variants, as well as the investigation of uncharacterized genes discovered in this study. This information will likely provide an understanding on the diversification of MBO away from humans and specialization towards a broad host range.

Peer Review reports


The Mycobacterium tuberculosis complex (MTBC) has afflicted human and animal health since the dawn of civilization. This ancient pathogen, typified by M. tuberculosis variant tuberculosis (MTB), infects humans primarily and is considered specialized for this niche [1, 2]. Its subversion of host immune responses, dormancy in granulomas for years or decades, and transmissibility suggest fine adaptation to humans, potentially to per-lineage adaptation to different human populations [3, 4]. MTB infects non-human hosts, including primates, and other animals (such as cattle) more infrequently [5, 6]. On the other hand, M. tuberculosis variant bovis (MBO) is a generalist pathogen – its host range includes foxes, seals, cattle, cervids, lions, dogs, mustelids, badgers, and others [1, 7,8,9,10]. The eponymous bovine reservoir is one of several for MBO [10], including white-tailed deer, elk, or bison in the US and Canada [9, 11, 12], red deer and wild boar populations in Spain [7], European badgers in the UK and Ireland [13, 14], and possums in New Zealand [14]. Despite host range differences, MTBC variants show a rigid population structure [15]. From the initial whole genome sequencing (WGS), researchers were surprised to find MTB and MBO shared 99.95% nucleotide identity, excluding genomic deletions in MBO [16]. Within a few years of the first MBO genome being sequenced, research began on what might drive variant differences, including gene expression (17), omics analysis [8], and metabolism [18, 19], among others. Meaningful variations have been reported, but it remains uncertain how MBO has evolved towards a generalist lifestyle away from a presumed MTB-like specialist ancestor.

To help address this gap in knowledge, WGS datasets were collected for MTBC variants from diverse hosts and countries. Paired-end read SRA datasets with metadata including country and host of isolation, and MTBC variant (n = 6,360 taxa, plus reference and outgroup) were used to create a set of 9,755 SNPs for a bacterial genome-wide association study (bGWAS). This sought to detect loci associated with classification as MTB or MBO, as well as any detectable host-specific markers (e.g., SNPs associated with isolation from cervids). Using RAxML-NG [20], prewas [21], and TreeWAS [22], bGWAS was performed with isolates classified by MBO (1) or not (0).


Core SNP extraction was successful for 6,362 isolates. Isolates were from 27 countries (Fig. 1A); included 2,096 MTB (including reference), 4,105 MBO, 152 variant caprae, and 8 variant orygis (Fig. 1B); and across 30 hosts (Fig. 1C). We additionally added M. canettii as the MTBC outgroup (n = 1). All sequence data used in this project are publicly available through NCBI and ENA. Accession IDs for all data are recorded in the table Additional file 1, with BioProjects in Column A, and corresponding SRA identifiers in Column B per sequence. The masked core SNP alignment was used for phylogenetic tree generation by RAxML-NG (Additional file 2), which shows splits based on MTBC variant, but is only used for GWAS and not intended for visualization due to its scale. The unmasked core SNP alignment is also provided for reference (Additional file 3). After prewas and maximum likelihood-based ancestral reconstruction, a final set of 7,524 variants over the 6,362 taxa was used for treeWAS input.

Fig. 1
figure 1

MTBC isolate collection metadata. A Geographical distribution of isolates worldwide, where darker colors represent more isolates from that country. B MTBC variant makeup of the collection. Values are based on NCBI/ENA SRA designations. C Host origin for the collection. Bos taurus is the dominant host type, followed by Homo sapiens and Meles meles

TreeWAS runs three tests for statistical significance – the terminal, simultaneous, and subsequent tests. The terminal test identifies broad associations between genotype and phenotype looking only at terminal nodes in the tree; the simultaneous test more stringently identifies deterministic relationships of genotype and phenotype, without necessitating the relationship occur at all branches; the subsequent test utilizes the terminal test but adds ancestral state reconstruction to analyze all nodes of the tree [22]. A thorough explanation of these tests is provided by Dr. Collins on the treeWAS GitHub page [23]. The simultaneous and subsequent tests were used initially, as an ancestral reconstruction was available. When analyzing by phenotype of MBO (1) and Not MBO (0), treeWAS produced significant loci for both simultaneous and subsequent statistical scoring metrics (Fig. 2). Analysis by phenotype of Bovidae also produced the same 120 significant loci by the same subsequent test but did not produce loci for the simultaneous test (Fig. 3). Another search by phenotype of Homo sapiens produced the same 120 loci again by subsequent test and no loci by simultaneous test (Fig. 4). A phenotype of “non-standard hosts” (where Homo sapiens is the standard host for MTB, Bovidae for MBO, Capra for MCP, and Oryx for MOR) yielded no significant hits (Additional file 4). Analysis by phenotype Cervidae returned no significance (Additional file 5), but phenotype Meles meles (European badger) showed an unusual pattern by subsequent statistical test where nearly all SNPs clustered just around the significance cutoff, yielding hundreds of loci technically exceeding the cutoff yet are tightly clustered with those under it (Additional file 6A). The simultaneous score did not similar hits (Additional file 6B). A similar pattern was seen for Sus (Additional file 7). To investigate if these associations reflected geographic effects for Meles meles, as all badger samples were from the UK, GWAS analysis was performed by phenotype of UK origin, which yielded no significance by the subsequent test, and a single locus by simultaneous test (Additional file 8). This hit, for a variant present in only two isolates, is a spurious result. For Meles meles and Sus, it would appear the dataset composition makes it difficult to detect significance against a background already tightly associated with specific genotypes. MCP produced no significant hits (Additional file 9), and MOR was not attempted due to low representation on M. orygis samples in the dataset (n = 8).

Fig. 2
figure 2

Manhattan plots of bGWAS results for phenotype “M. tuberculosis variant bovis.A Simultaneous test of association, showing 32 loci ranked to be significant, of which 10 are of dubious quality. B Subsequent test of association, showing 120 loci are ranked to be significant. Details for each locus are available in Tables 1 (simultaneous) and 2 (subsequent)

Fig. 3
figure 3

Manhattan plots of bGWAS results for phenotype “Bovidae” host. A Simultaneous test of association, showing no significantly ranked loci. B Subsequent test of association, showing 120 loci are ranked to be significant. The 120 loci identified here are identical to those seen in Table 2 and Fig. 2

Fig. 4
figure 4

Manhattan plots of bGWAS results for phenotype “Homo sapiens” host. A Simultaneous test of association, showing no significantly ranked loci. B Subsequent test of association, showing 120 loci are ranked to be significant. The 120 loci identified here are identical to those seen in Table 2 and Figs. 2 and 3

The 32 loci identified by the simultaneous test for MBO are listed in Table 1.

Table 1 SNPs significantly associated with classification as MBO by Simultaneous statistical test

As mentioned for the spurious hit for UK samples, the simultaneous test can report loci as associated even if a SNP is only present in a few isolates. These false positives are included in the data table but are marked by bolding to indicate they are not meaningful. Subtracting these spurious hits, the simultaneous test identifies 22 loci, all of which are also identified by the subsequent test. The 120 loci concordantly identified as associated by Bovidae, MBO, and Homo sapiens by the subsequent test are listed in Table 2.

Table 2 SNPs significantly associated with classification as MBO by Subsequent statistical test

The subset of loci called by both tests in MBO (n = 22) are presented in Table 3.

Table 3 SNPs concordantly significantly associated with classification as MBO by Subsequent and Simultaneous statistical tests

After GWAS, several apparent genotypic edge cases arose. FadD11 was highlighted as significantly associated by a single non-synonymous variant, FadD11 L286S, which appeared fixed in MBO and MCP, while MTB and MOR showed WT nearly exclusively. Of 4,105 MBO isolates and 152 MCP isolates, only 1 MBO isolate showed WT at this position, suggesting a reversion in this genome. Likewise, of 2,087 MTB and 8 MOR isolates, only 9 MTB isolates showed L286S. These genotypic exceptions were checked further: a Chinese cattle isolate SRR16278270 for the 1 MBO outlier, and 9 UK MTB isolates from humans for the MTB outliers (Table 4).

Table 4 SNP typing improperly labeled MTBC isolates by MBO-lineage markers

These 10 isolates’ VCF files were checked against the SNP barcode [24], with lineage-determining positions searched per VCF through Unix command “grep” and the SNP coordinate. The MBO isolate bore no MBO lineage-determining SNPs, and instead was cleanly typable as MTB lineage 2.2.1 (Table 5). Evidently, this isolate is a case of bovine MTB being incorrectly identified as MBO when uploaded to NCBI. Likewise, of the 9 human MTB cases that stood out, all bore the 3 lineage-determining SNPs for MBO (Table 4), and no MTB lineage-determining markers. In these cases, 9 cases of human MBO were misclassified as MTB. These results were validated using SNP-IT software [25], which also typed ERR387001 as a BCG strain, suggesting a case of BCG-osis misdiagnosed or mislabeled as TB. After correcting these calls, there is a perfect divide between MTB/MOR and MBO/MCP, with 100% of isolates bearing the WT for MTB/MOR, and 100% of isolates bearing the SNP for MBO/MCP. This discrepancy may have affected robustness of GWAS based on phenotype of “MBO variant.” However, it is noted that comparisons for “host Bovidae” and “host Homo sapiens” are unaffected by this, and all 3 analyses produced perfect concordance of their 120 associated loci by subsequent tests, suggesting this mislabeling had minimal influence. In summary, a set of 120 discriminatory loci was identified, which was also identified by bGWAS of phenotypes classified as a host of Bovidae (1) or not (0), and Homo sapiens (1) or not (0).

Table 5 SNP typing improperly labeled MBO isolate by MTB-lineage markers


Despite remarkable similarity between MTB and MBO, evidence of clear divides was present in SNPs highlighted as associated by treeWAS analysis.

The Fad family of proteins are important in MTBC, with MTB known to carry 36 FadD and FadE loci [26, 27]. GWAS identified SNPs in FadD11, FadE5, FadE27, and FadE32 associated with differentiation into MBO. These genes are involved in fatty acid and cholesterol handling inside the environment of the host [28]. Mycobacterial reliance on cholesterol is known to be critical for pathogenesis, and MTB features around 80 genes involved with cholesterol balance and metabolism [29]. Disruption of cholesterol import is severely disruptive to infection and persistence [29, 30].

Cholesterol intake in MTBC requires a functional Mce4 system [30]. A missense mutation (A734G, Asp254Gly) in Mce4F was seen in all MBO/MCP isolates, and only a single MTB isolate from Russia (ERR108427) which bore a unique SNP signature: G3836739A (lineage 4.8), and G1759252T (lineage 4.9). No other SNPs for lineage 4 or any other lineage were identified. A literature search turned up Congo type MTB that can present lineage 4.8 and 4.9 SNPs, but only in combination with 4.7 SNPs that were absent in this sample [31]. Except this atypical MTB specimen, another split by this mce4f SNP separated MTB/MOR and MBO/MCP.

Catabolized cholesterol products fuel core acyl-CoA metabolism pathway, as well as polyketide synthesis, a pathway already known to differ between MBO and MTB [32]. GWAS identified two separate missense mutations in ppsD, and synonymous changes in ppsB and pks15, all polyketide synthase genes. Genes annotated in roles of cholesterol and fatty acid metabolism or in pathways downstream of these processes are among all loci identified by the MBO subsequent tests are shown in Table 6. These associated SNPs are scattered across lipid metabolic pathways and include members whose exact function is unclear. Ten out of fourteen of these SNPs are non-synonymous.

Table 6 Genes identified by GWAS associated with fatty acid and cholesterol metabolism

Other identified loci with functions separate from cholesterol and lipid metabolism include AccD1, involved in leucine degradation [33], which bore a fixed SNP of Phe343Leu. SNP 1739294 in the essential isoleucyl-tRNA synthetase IleS causes a Pro926Ala substitution, SNP 3152421 in the essential prolyl-tRNA synthetase ProS yields His177Arg, SNP 3371365 in the essential glutamyl-tRNA amidotransferase subunit GatA causes Ala24Thr, and a synonymous change is seen at 1,260,537 for methionine synthesis gene MetE. Related to translational machinery, SNP 3198332 causes Thr259Met in essential elongation factor Tsf. SNP 1129160 impacts both RpfB (Resuscitation promoting factor B) and KsgA (a dimethyladenosine transferase) genes, resulting in RpfB Ala357Val and a synonymous mutation in KsgA. RpfB is thought to be involved in the transition from dormancy to active replication, and is co-transcribed with ksgA and ispE, genes involved in ribosome maturation and cell wall synthesis, respectively [34]. After accounting for mislabeling cases in deposited isolates, these SNPs are all fixed and exclusive in this dataset either in MBO and MCP, or MBO alone.

MTBC physiology and function remain uncertain, and 48/120 loci identified are in genes annotated only by locus identifier and generically, like “conserved protein” or “possible oxidoreductase” even after PE/PPE gene filtering, removing a largely uncharacterized family comprising ~ 10% of MTBC genes. Even among genes with fuller annotations, nearly all include “probable” in their descriptions. The genes presented in Table 6 are not comprehensive and given the uncertainty in function across many loci, other important genes both inside and outside lipid metabolism almost certainly exist in the bGWAS output of Table 2. Loci associated with adaptation towards new hosts and lifestyles are useful then to highlight for characterization, as it narrows the still vast pool of MTBC genes with uncertain functions towards a subset of genes with fixed changes in some variants.

Any associated loci may signify adaptive roles in differentiation from a specialist infection by MTB and a pathogen with a much broader host range, like MBO. It is well-reported that members of the MTBC are clonal, and not only is horizontal gene transfer vanishingly rare, mutation rates in members of this complex are low (~ 2 × 10–10 mutations per cell division) [35]. Given 99.95% genetic identity between MTB and MBO, < 2000 polymorphisms differentiate divergent variants (disregarding RDs/large sequence polymorphisms), and fixed changes are an even smaller subset. While this research cannot draw conclusions about how specific changes might alter metabolism or virulence to better reflect new host environs, it does highlight multiple SNPs across multiple genes in MTBC metabolic pathways. Metabolic differences are known to exist between variants and even between lineages of MTB, including in lipid profile [36, 37]. While these data are only associations, they may support findings by Griffin et al. reporting cholesterol utilization in MTB is key to host adaptation [29]. MTB drives macrophages to import lipids for utilization as an energy and carbon source [38,39,40,41]. The human cholesterol profile is LDL-dominant [42], as is the guinea pig [42], a model of tuberculosis that better recapitulates human disease [43] vs. the mouse model [5, 43, 44], an animal model with an HDL-dominant cholesterol profile and a lower overall cholesterol load [45, 46]. The MBO bovine host is HDL-dominant, for comparison [47]. Others have reported MTB infections are influenced differently by HDL vs. LDL cholesterol [48]. MTB is known to exploit lipid-rich “foamy macrophages,” and research has shown MTB trehalose dimycolate and other factors are associated with lipid droplet and foamy macrophage formation [38,39,40]. Foamy macrophage formation is associated with higher levels and intake of circulating LDL cholesterol, but recent research found MTB-infected macrophages have different lipid profiles from foamy macrophages characterized in atherosclerosis and other diseases [45], indicating disease-specific responses lead to buildup of certain lipids [41]. Finally, higher HDL levels are known to counteract foamy macrophage formation through classical LDL intake, HDL suppresses TNFα production in MTB-infected macrophages, and mice are more resistant to foamy macrophage formation, compared to humans, guinea pigs, rabbit, or primate models [42, 45, 48]. Variation in use of cholesterol and fatty acids is known to exist between MTBC variants and lineages. Finally, though research is more limited in this area, studies have demonstrated mice are more susceptible to disease and death by MBO infection than by MTB [49, 50]. Biological reality is undoubtedly far more complex, but from existing literature, host lipid profiles differ, lipid availability and sequestration are key to MTB virulence, animal models with a lipid profile closer to humans better reproduce “classic” granulomatous tuberculosis by MTB as seen in humans, and MTB lineages and MTBC variants utilize lipids differentially. Cholesterol/fatty acid metabolic pathways associated by bGWAS showing variant-specific between MTB and MBO are suggestive of a potential contributor towards host adaptation.

Adaptation to specific hosts was not detectable with this approach, which could be improved through routine sequencing of isolates from non-standard host types, which are currently rare and geographically biased. While efforts were taken to select from countries worldwide, the dataset itself is necessarily a biased sampling as well – very few isolates of the vast number of infections worldwide are ever sequenced, and of those that are, even fewer meet the inclusion criteria for metadata and sequencing coverage. Furthermore, in many clinical MTB genomes utilized in this work, evolutionary pressures imparted by antibiotic use are a powerful influence on SNP distribution, and drug-resistance associated SNPs were not filtered from the input dataset. This adds an additional layer of bias in that antibiotic selection pressures vary wildly across MTBC members. For future studies and adaptations of these findings, it is critical to note that our use of MTB H37Rv as a reference to study the entire MTBC and its evolutionary history is necessarily missing data. While the use of this reference as a standard is common practice and the H37Rv coordinates are typically how MTB SNPs are defined, the H37Rv lineage is a modern one, and any regions of difference absent in H37Rv but present in other MTBC members will not be SNP-called by our method, excluding potentially tens of thousands of base pairs. Future studies should consider using more “ancestral” variants like M. canettii, performing an MBO-specific bGWAS to try to identify host adaptation within the variant, or even by constructing an MTBC pangenome to call SNPs against for the maximum possible number of informative protein-coding changes. We hope our research serves as a launching point for these studies, for more routine sequencing and better metadata classification of isolates across the MTBC, and for better understanding of MTBC divergence, including through generation of time-measured phylogenies to estimate SNP occurrences through evolutionary history.


Analysis of the genetic differences between MTBC isolates and their association with classification as certain variants or isolation from certain hosts by GWAS was performed. The 120 SNPs identified through this analysis provide a trove of genes and pathways implicated in adaptation towards a generalist lifecycle, including loci across cholesterol and fatty acid uptake, catabolism, and downstream processing pathways, important for central metabolism in MTBC organisms and critical for pathogenesis [28, 36, 38, 39, 51]. The understanding of host adaptation in Mycobacterium is a major outstanding knowledge gap. While tremendous work has been performed by groups worldwide, the mechanisms of how exactly MTBC members present different and infect different hosts is unresolved. Detail transcriptomic profiling has been performed in cattle infections by MTB and MBO, which demonstrate that macrophages have very distinct response profiles to each variant, as well as what appears to be a fine-tuned engagement with the bovine innate immune system for MBO [8]. It is known that MTB is attenuated in cattle [8], and that cellular-level differences manifest in overall disease course differences in other hosts like mice [49]. As discussed above, one rare aspect of TB pathogenesis is its critical utilization of host cholesterol for survival and success [30], and prior research within MTB hypothesizes human population-specific cholesterol utilization adaptation has occurred across MTB lineages [4]. Multiple studies have found that cholesterol utilization is in fact fundamental to pathogenesis, as covered in the review by Moopanar and Mvubu [40]. While the findings in this work are only suggestive, they again highlight cholesterol and fatty acid-associated genetic pathways, and future research in the lab should assess how loci identified herein may functionally differ in the context of different host pathways for cholesterol metabolism. Available lipid and cholesterol pools in bovine, murine, and other non-human hosts, differences between host cholesterol and lipid profiles, and the possible lineage-specific adaptation identified in the loci in this analysis could contribute to the elusive mechanisms of pathogen host preferences.

In summary, these SNPs reliably differentiate MTBC variants, and importantly, they may inform research into genes that differ between variants, narrowing the pool of uncharacterized proteins to study in the MTBC.


Existing datasets were collected, including those where bGWAS was performed to answer other questions. Dong et al. (2022) genome-sequenced 74 Chinese cattle MBO isolates, and performed bGWAS analysis on 3,227 total MBO isolates from around the world [52]. Additionally, sequences used by Coll et al. in designing the MTBC SNP barcode are a validated set of primarily human MTB isolates [24]. Both datasets were included in this analysis, along with many smaller sets. FASTQ download URLs were acquired through SRA-Explorer [53], formatted for Globus-CLI [54, 55] and downloaded to MSU’s High Performance Computing Center (HPCC) for processing. After transfer, single-end read data were excluded, and Snippy [56] run with default parameters (BWA base quality = 13, minimum SNP coverage = 10 reads, minimum VCF SNP quality = 100), paired-end read input, and MTB H37Rv as the reference genome (AL123456.3). A complete list of accession IDs for all sequences used are provided in Additional file 1. Rare cases of genomes with unusually high numbers of SNPs (> 3,000) were excluded, as were genomes with alignment coverage < 90% of the reference length (cutoff < 3.96mbp). Additionally, metadata were compiled for all isolates (Additional file 1), and only datasets with host, MTBC variant, and country of isolation were included. Taxonomic classification by Kraken 2 [57] revealed several isolates primarily (> > 50%) contained plant or insect genomes, and were excluded, at which point remaining samples contained classified reads mapping primarily to Mycobacterium. Remaining paired-end read sets were selected (n = 6,360) to build the final “snippy-core” core SNP alignment, along with the H37Rv reference and M. canettii (GCF_000253375.1) as an outgroup, while masking PE/PPE genes using the H37Rv-specific.bed file provided by default in Snippy. Core SNPs were used for phylogenetic tree generation in RAxML-NG [20] (substitution model GTR + G selected by ModelTest-NG [58]; bootstrap analysis: seed = 774,900,118, bootstrap trees = 300; for tree search analysis: seed = 4,949,250,770, 50 parsimony-based, 50 random-based starting trees for tree search; applying bootstrap support to best ML tree: –consense MRE). On a Windows 10 desktop PC, RStudio (2022.07.2 + 576) [59], R (v4.0.5) [60], and the R package vcfR (v1.12.0) [61] were used to generate a vcfR object for import with prewas [21]. In prewas (v1.1.1), the VCF object containing variant calls was processed with an input tree generated from RAxML-NG, the H37Rv GFF3 file, and with ancestral reconstruction flag set to TRUE, for maximum likelihood-based reconstruction through the ape R package it uses. On an HPCC cluster, a Conda [62] environment was created containing GCC (v11.2.0) [63], OpenMPI (v4.1.1) [64], and R (v4.1.2). The R package devtools (v2.4.5) [65] was installed, and used to install prewas (v1.1.1) [66] and treeWAS (v1.1) [67] from Github. The RData object containing prewas output from the desktop PC was uploaded to HPCC and used for ancestral reconstruction state, binary variant matrix, and phylogenetic tree inputs, along with binary metadata phenotype matrices. All other parameters were left at their defaults. For MTBC lineage determination, the Coll et al. SNP barcode and SNP-IT tool were used [24, 25]. SnpEff (v4.2) was used to annotate variants separately [68]. TreeWAS generated default Manhattan plots and distribution graphics, and text output was collected in.csv files.

Availability of data and materials

All sequence data used in this project are publicly available through NCBI and ENA. Accession IDs for all data are recorded in the table Additional file 1, with BioProjects in Column A, and corresponding SRA identifiers in Column B per sequence.


  1. Mostowy S, Inwald J, Gordon S, Martin C, Warren R, Kremer K, et al. Revisiting the evolution of Mycobacterium bovis. J Bacteriol. 2005;187(18):6386–95.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Stucki D, Brites D, Jeljeli L, Coscolla M, Liu Q, Trauner A, et al. Mycobacterium tuberculosis lineage 4 comprises globally distributed and geographically restricted sublineages. Nat Genet. 2016;48(12):1535–43.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Freschi L, Vargas R, Husain A, Kamal SMM, Skrahina A, Tahseen S, et al. Population structure, biogeography and transmissibility of mycobacterium tuberculosis. Nat Commun. 2021;12(1):6099.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Gagneux S, DeRiemer K, Van T, Kato-Maeda M, De Jong BC, Narayanan S, et al. Variable host-pathogen compatibility in mycobacterium tuberculosis. Proc Natl Acad Sci U S A. 2006;103(8):2869–73.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Kaushal D, Mehra S, Didier PJ, Lackner AA. The non-human primate model of tuberculosis. J Med Primatol. 2012;41(3):191–201.

  6. Lombard JE, Patton EA, Gibbons-Burgener SN, Klos RF, Tans-Kersten JL, Carlson BW, et al. Human-to-Cattle Mycobacterium tuberculosis Complex Transmission in the United States. Front Vet Sci. 2021;8(July):1–11.

  7. Naranjo V, Gortazar C, Vicente J, de la Fuente J. Evidence of the role of European wild boar as a reservoir of mycobacterium tuberculosis complex. Vet Microbiol. 2008;127(1–2):1–9.

    Article  PubMed  Google Scholar 

  8. Malone KM, Rue-Albrecht K, Magee DA, Conlon K, Schubert OT, Nalpas NC, et al. Comparative ’omics analyses differentiate mycobacterium tuberculosis and mycobacterium bovis and reveal distinct macrophage responses to infection with the human and bovine tubercle bacilli. Microb Genomics. 2018;4(3).

  9. Wobeser G. Bovine tuberculosis in Canadian wildlife: an updated history. Can Vet J La Rev Vet Can. 2009;50(11):1169–76.

    Google Scholar 

  10. Ayele WY, Neill SD, Zinsstag J, Weiss MG, Pavlik I. Bovine tuberculosis: an old disease but a new threat to Africa. Int J Tuberc Lung Dis. 2004;8(8):924–37.

    CAS  PubMed  Google Scholar 

  11. VerCauteren KC, Lavelle MJ, Campa H. Persistent spillback of bovine tuberculosis from white-tailed deer to cattle in Michigan, USA: Status, Strategies, and Needs. Front Vet Sci. 2018;5(NOV):1–13.

  12. Sunstrum J, Shoyinka A, Power LE, Maxwell D, Stobiersky MG, Signs K, et al. Zoonotic Mycobacterium bovis disease in deer hunters – Michigan, 2002–2017. Morb Mortal Wkly Rep. 2019;68(37):807–8.

    Article  Google Scholar 

  13. Gormley E, Corner LAL. Pathogenesis of Mycobacterium bovis Infection: The Badger model as a paradigm for understanding tuberculosis in animals. Front Vet Sci. 2018;4(JAN):1–11.

    Google Scholar 

  14. Buddle BM, Vordermeier HM, Chambers MA, de Klerk-Lorist LM. Efficacy and safety of BCG vaccine for control of tuberculosis in domestic livestock and wildlife. Front Vet Sci. 2018 Oct 26;5(OCT):1–17.

  15. Zimpel CK, Patané JSL, Guedes ACP, de Souza RF, Silva-Pereira TT, Camargo NCS, et al. Global distribution and evolution of Mycobacterium bovis lineages. Front Microbiol. 2020;11(May):1–19.

    Google Scholar 

  16. Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, et al. The complete genome sequence of Mycobacterium bovis. Proc Natl Acad Sci U S A. 2003;100(13):7877–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Rehren G, Walters S, Fontan P, Smith I, Zárraga AM. Differential gene expression between Mycobacterium bovis and Mycobacterium tuberculosis. Tuberculosis. 2007;87(4):347–59.

    Article  CAS  PubMed  Google Scholar 

  18. Sohaskey CD, Modesti L. Differences in nitrate reduction between mycobacterium tuberculosis and mycobacterium bovis are due to differential expression of both narGHJI and narK2. FEMS Microbiol Lett. 2009;290(2):129–34.

    Article  CAS  PubMed  Google Scholar 

  19. Lofthouse EK, Wheeler PR, Beste DJV, Khatri BL, Wu H, Mendum TA, et al. Systems-based approaches to probing metabolic variation within the mycobacterium tuberculosis complex. PLoS ONE. 2013;8(9):1–14.

    Article  Google Scholar 

  20. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35(21):4453–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Saund K, Lapp Z, Thiede SN, Pirani A, Snitkin ES. Prewas: data pre-processing for more informative bacterial gwas. Microb Genomics. 2020;6(5):1–8.

    Article  CAS  Google Scholar 

  22. Collins C, Didelot X. A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination. McHardy AC, editor.  PLOS Comput Biol. 2018;14(2):e1005958.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Collins C. How treeWAS works: Tests of Association [Internet]. GitHub repo for treeWAS. 2018. Available from:

  24. Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, et al. A robust SNP barcode for typing mycobacterium tuberculosis complex strains. Nat Commun. 2014;5:4–8.

    Article  Google Scholar 

  25. Lipworth S, Jajou R, De Neeling A, Bradley P, Van Der Hoek W, Maphalala G, et al. SNP-IT tool for identifying subspecies and associated lineages of mycobacterium tuberculosis complex. Emerg Infect Dis. 2019;25(3):482–8.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Glickman MS, Jacobs WR. Microbial pathogenesis of mycobacterium tuberculosis: Dawn of a discipline. Cell. 2001;104(4):477–85.

    Article  CAS  PubMed  Google Scholar 

  27. Wipperman MF, Yang M, Thomas ST, Sampson NS. Shrinking the fadE proteome of mycobacterium tuberculosis: insights into cholesterol metabolism through identification of an α2β2 heterotetrameric acyl coenzyme a dehydrogenase family. J Bacteriol. 2013;195(19):4331–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Fieweger, Wilburn, Van der Ven. Comparing the Metabolic Capabilities of Bacteria in the Mycobacterium tuberculosis Complex. Microorganisms. 2019;7(6):177.

  29. Griffin JE, Gawronski JD, DeJesus MA, Ioerger TR, Akerley BJ, Sassetti CM. High-resolution phenotypic profiling defines genes essential for mycobacterial growth and cholesterol catabolism. PLoS Pathog. 2011;7(9):1–9.

    Article  Google Scholar 

  30. Pandey AK, Sassetti CM. Mycobacterial persistence requires the utilization of host cholesterol. Proc Natl Acad Sci U S A. 2008;105(11):4376–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Malm S, Linguissi LSG, Tekwu EM, Vouvoungui JC, Kohl TA, Beckert P, et al. New mycobacterium tuberculosis complex sublineage, Brazzaville. Congo Emerg Infect Dis. 2017;23(3):423–9.

    Article  CAS  PubMed  Google Scholar 

  32. Marri PR, Bannantine JP, Golding GB. Comparative genomics of metabolic pathways in mycobacterium species: gene duplication, gene decay and lateral gene transfer. FEMS Microbiol Rev. 2006;30(6):906–25.

    Article  CAS  PubMed  Google Scholar 

  33. Ehebauer MT, Zimmermann M, Jakobi AJ, Noens EE, Laubitz D, Cichocki B, et al. Characterization of the Mycobacterial Acyl-CoA Carboxylase Holo Complexes Reveals Their Functional Expansion into Amino Acid Catabolism. Schnappinger D, editor. PLOS Pathog. 2015;11(2):e1004623.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Schwenk S, Moores A, Nobeli I, McHugh TD, Arnvig KB. Cell-wall synthesis and ribosome maturation are co-regulated by an RNA switch in mycobacterium tuberculosis. Nucleic Acids Res. 2018;46(11):5837–49.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Ford CB, Lin PL, Chase MR, Shah RR, Iartchouk O, Galagan J, et al. Use of whole genome sequencing to estimate the mutation rate of mycobacterium tuberculosis during latent infection. Nat Genet. 2011;43(5):482–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Moopanar K, Mvubu NE. Lineage-specific differences in lipid metabolism and its impact on clinical strains of Mycobacterium tuberculosis. Microb Pathog. 2020;146(April):104250.

  37. Gonzalo-Asensio J, Malaga W, Pawlik A, Astarie-Dequeker C, Passemar C, Moreau F, et al. Evolutionary history of tuberculosis shaped by conserved mutations in the PhoPR virulence regulator. Proc Natl Acad Sci U S A. 2014;111(31):11491–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Muñoz S, Rivas-Santiago B, Enciso JA. Mycobacterium tuberculosis Entry into Mast Cells Through Cholesterol-rich Membrane Microdomains. Scand J Immunol [Internet]. 2009 Sep;70(3):256–63.

  39. Kim MJ, Wainwright HC, Locketz M, Bekker LG, Walther GB, Dittrich C, et al. Caseation of human tuberculosis granulomas correlates with elevated host lipid metabolism. EMBO Mol Med. 2010;2(7):258–74.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Moopanar K, Mvubu NE. Lineage-specific differences in lipid metabolism and its impact on clinical strains of Mycobacterium tuberculosis. Microb Pathog. 2020;146(April).

  41. Guerrini V, Prideaux B, Blanc L, Bruiners N, Arrigucci R, Singh S, et al. Storage lipid studies in tuberculosis reveal that foam cell biogenesis is disease-specific. PLoS Pathog. 2018;14(8):1–27.

    Article  Google Scholar 

  42. Fernandez ML, Volek JS. Guinea pigs: a suitable animal model to study lipoprotein metabolism, atherosclerosis and inflammation. Nutr Metab. 2006;3:1–6.

    Article  Google Scholar 

  43. Orme IM, Ordway DJ. Mouse and Guinea Pig Models of Tuberculosis. In: Tuberculosis and the Tubercle Bacillus [Internet]. Washington, DC: ASM Press; 2017. p. 143–62. Available from:

  44. Cooper AM. Mouse model of tuberculosis. Cold Spring Harb Perspect Med. 2015;5(2):1–8.

    Article  CAS  Google Scholar 

  45. Oppi S, Lüscher TF, Stein S. Mouse models for atherosclerosis research—Which is my line? Front Cardiovasc Med. 2019;6(April):1–8.

    Google Scholar 

  46. Gordon SM, Li H, Zhu X, Shah AS, Lu LJ, Davidson WS. A Comparison of the Mouse and Human Lipoproteome: Suitability of the Mouse Model for Studies of Human Lipoproteins. J Proteome Res. 2015;14(6):2686–95.

  47. Duran MJ, Kannampuzha-Francis J, Nydam D, Behling-Kelly E. Characterization of particle size distribution of plasma lipoproteins in dairy cattle using high-resolution polyacrylamide electrophoresis. Front Anim Sci. 2021;2(July):1–10.

    Google Scholar 

  48. Inoue M, Niki M, Ozeki Y, Nagi S, Chadeka EA, Yamaguchi T, et al. High-density lipoprotein suppresses tumor necrosis factor alpha production by mycobacteria-infected human macrophages. Sci Rep. 2018;8(1):1–11.

    Article  CAS  Google Scholar 

  49. Dong H, Lv Y, Sreevatsan S, Zhao D, Zhou X. Differences in pathogenicity of three animal isolates of mycobacterium species in a mouse model. PLoS ONE. 2017;12(8):1–17.

    Article  Google Scholar 

  50. Medina E, Ryan L, LaCourse R, North RJ. Superior virulence of Mycobacterium bovis over mycobacterium tuberculosis (Mtb) for Mtb-resistant and Mtb-susceptible mice is manifest as an ability to cause extrapulmonary disease. Tuberculosis. 2006;86(1):20–7.

    Article  PubMed  Google Scholar 

  51. Gatfield J, Pieters J. Essential role for cholesterol in entry of mycobacteria into macrophages. Science (80-). 2000;288(5471):1647–50.

    Article  CAS  Google Scholar 

  52. Dong Y, Feng Y, Ou X, Liu C, Fan W, Zhao Y, et al. Genomic analysis of diversity, biogeography, and drug resistance in Mycobacterium bovis. Transbound Emerg Dis. 2022;69(5):e2769–78. Available from:

  53. Ewels P. SRA-Explorer [Internet]. Available from:

  54. Foster I. Globus online: accelerating and democratizing science through cloud-based services. IEEE Internet Comput. 2011;15(3):70–3.

    Article  Google Scholar 

  55. Allen B, Bresnahan J, Childers L, Foster I, Kandaswamy G, Kettimuthu R, et al. Software as a service for data scientists. Commun ACM. 2012;55(2):81–8.

  56. Seemann T. snippy: fast bacterial variant calling from NGS reads [Internet]. 2015. Available from:

  57. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. bioRxiv. 2019;1–13.

  58. Darriba D, Posada D, Kozlov AM, Stamatakis A, Morel B, Flouri T. ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Mol Biol Evol. 2020;37(1):291–4.

    Article  CAS  PubMed  Google Scholar 

  59. Allaire J. RStudio: integrated development for R [Internet]. RStudio Team. Boston, MA; 2012. Available from:

  60. RDC T. A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. 2010. Available from:

  61. Knaus BJ, Grünwald NJ. vcfr: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour. 2017;17(1):44–53.

  62. Anaconda. Anaconda Software Distribution. [Internet]. Computer software. 2016. p. Vers. 2–2.4.0. Available from:

  63. GCC Team. GCC, the GNU Compiler Collection [Internet]. 2013. Available from:

  64. Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) [Internet]. 2004. p. 97–104. Available from:

  65. Wickham H, Hester J, Chang W. Tools to make developing R packages easier - Package “devtools” [Internet]. 2021. Available from:,

  66. Saund K, Lapp Z, Thiede SN, Pirani A, Snitkin ES. Prewas: Data pre-processing for more informative bacterial gwas [Internet]. Vol. 6, Microbial Genomics. GitHub; 2020. p. 1–8. Available from:

  67. Collins C, Didelot X. treeWAS: A phylogenetic tree-based approach to genome-wide association studies in microbes [Internet]. GitHub; 2022. Available from:

  68. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly (Austin) [Internet]. 2012 Apr 27;6(2):80–92. Available from:

Download references




Research in the Sreevatsan lab is funded by USDA (2018–67015-28288) and start-up funds provided by College of Veterinary Medicine, Michigan State University.

Author information

Authors and Affiliations



Evan Brenner performed all data download, data coding, and data analysis, wrote the first draft of the paper, and developed the database. Sreevatsan provided the idea (that Evan built on), got funding, and edited the manuscript.

Corresponding author

Correspondence to Srinand Sreevatsan.

Ethics declarations

Ethics approval and consent to participate

NA. No animal or human subjects were involved in this study. The study was all computational and as such no IACUC, IRB, or IBC approvals were necessary.

Consent for publication


Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Brenner, E.P., Sreevatsan, S. Global-scale GWAS associates a subset of SNPs with animal-adapted variants in M. tuberculosis complex. BMC Med Genomics 16, 260 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: