This article has Open Peer Review reports available.
Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data
© Hettne et al.; licensee BioMed Central Ltd. 2013
Received: 18 May 2012
Accepted: 25 January 2013
Published: 29 January 2013
Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles.
We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate -corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals.
Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a similar toxicity pattern as the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals.
Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
In toxicogenomics, gene and protein activity within a particular cell or tissue of an organism in response to toxicants is studied with the aim to predict in vivo effects from in vitro models. The assumption that similar gene expression profiles dictate similar physiological responses underlies the use of gene expression profiling in toxicogenomics to discern the toxicological properties of a chemical entity. Connectivity mapping is a promising method, based on gene-expression similarity, that can be applied to toxicogenomic data to understand mechanisms of toxicity (for a recent review see ). In essence, toxicological properties of a chemical entity are discerned by connecting a query gene signature generated as a result of exposure of a biological system (whole animal, tissue or cell line) to the chemical, to other, already known, chemical entities via a database of chemical compound reference gene expression signatures (pioneered in ). In contrast to gene expression data repositories such as Chemical Effects in Biological Systems Knowledge Base (http://www.niehs.nih.gov/research/resources/databases/cebs/) that contains data from different laboratories running different experimental platforms on different biological samples, the data in a reference gene expression signature database comes from systematic screening of chemical compounds against specific cell lines simulating biological conditions. Unfortunately, building such a reference gene expression signature database with many different cell types and compound concentrations represented is not easily feasible due to high costs and long development time . For example, in the case of the Connectivity Map (cMap) (http://www.broadinstitute.org/cmap/), which is the largest public reference gene expression signature database (7,000 expression profiles representing 1,309 compounds), not all small molecules were tested in every cell model, and not all were tested across the same spectrum of concentrations. Moreover, how to best interpret the result of a query is still an open question.
As an alternative to such reference databases of gene expression signatures, genes could be annotated to a chemical response through text mining (TM) techniques. Text mining is the use of automated methods for exploiting the enormous amount of information available in the biomedical literature. Next-generation text mining (next-gen TM) refers to knowledge discovery based on concept profile matching (see section "Next-gen TM for chemical-gene associations"), in comparison to first-generation text mining where associations are extracted purely from concept co-occurrence. The biomedical literature contains information about small molecules that is not currently stored in gene expression databases, and this information could be used to build chemical response-specific signatures. For example, searching the biomedical literature database PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) with the CAS number and substance name search string “79983-71-4 OR hexaconazole” results in 69 entries, while no information about the chemical can be found in the CEBS Knowledge Base or cMap (search performed Nov 21, 2011). In addition, next-gen TM-derived chemical response-specific signatures would not be specific to a certain biological system or compound concentration, and might therefore provide a less biased view on compound action than standard connectivity maps. Given a gene expression experiment where a biological system has been exposed to a chemical, these next-gen TM-generated chemical response-specific signatures could then be tested against the gene expression data set using gene set analysis (GSA) methods  that utilize statistic tests suitable for this purpose. The test would produce a ranking of chemicals in a way similar to the results from a connectivity-mapping query, with the difference that the chemicals are represented by gene sets derived from the literature instead of gene expression signatures from a reference database. Gene set analysis has fast become one of the standard methods in bioinformatics, and there are many tools available (for a review of tools, see ). Most GSA tools provide gene sets based on the Gene Ontology (GO) , with only a few providing additional sources of gene sets such as metabolic pathways, protein domains, disease associations, tissue expression, transcription factors sequence motifs, miRNA sequences, drug-gene associations  and toxicologically relevant gene sets (Boorsma et al., submitted). We hypothesize that more gene sets can be created by next-gen TM, and that these can be used with GSA methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles.
GSA with literature-derived chemical response-specific gene sets has been used before to relate chemical structures to gene expression patterns in microarray experiments. Minguez et al. used their tool MarmiteScan to associate chemicals with the characteristics of acute myeloid leukemia cell differentiation. However, there is no information about the size and scope of the chemical dictionary they used to mine the literature and their gene sets are not separately available, thus forcing the researcher to use their GSA method. There is also no possibility to test a sub-set of the gene sets, for example only those that are relevant for evaluation of developmental toxicity. In contrast, we provide chemical response-specific gene sets that can be used with any GSA tool that allows for user-supplied gene sets. Patel and Butte  used the hypergeometric test to associate gene sets derived from curated chemical-gene interactions from > 4,000 chemicals in the Comparative Toxicogenomics Database (CTD)  with six gene expression data sets selected based on their diversity with respect to species, chemical exposure and cell type. Manual curation of chemical-gene interactions from publications is a very time-consuming process producing high-quality information but with limited coverage, reflected by the number of chemicals that Patel and Butte could create gene sets for (1338 chemicals). CTD's chemical dictionary is a modified subset of descriptors from the “Chemicals and Drugs” category and Supplementary Concept Records from the U.S. National Library of Medicine (NLM) Medical Subject Headings, a hierarchical vocabulary used to index PubMed articles containing > 100 000 chemicals. We aim to increase the number of chemicals that can be annotated with genes by using next-gen TM instead of manual curation. Jelier et al. used the weighted global test, with weights based on drug-gene associations generated from next-gen TM, to associate known peroxisome proliferator-activated receptor alpha (PPARalpha) agonists with the gene expression differences between PPARalpha-null and wild-type mice exposed to the fibrate WY14643. Although Jelier et al. demonstrated the usefulness and performed extensive evaluation of the next-gen TM technology for gene set testing, its reliability still remains to be investigated by evaluating other chemical classes than drugs and considering a wider range of gene expression data sets using other GSA methods; this validation is one of the aims of the present study.
According to a predefined procedure which avoids researcher bias we generated gene sets using next-gen TM technology, and tested these gene sets using three different GSA tools: ToxProfiler (Boorsma et al., submitted; http://ntc.voeding.tno.nl/toxprofiler_test/), which implements the unpaired t-test to score the difference in gene expression between the genes from a pre-defined gene set and the remainder of the genes, the weighted global test (http://www.bioconductor.org/packages/release/bioc/html/globaltest.html), which implements a regression model where the gene expression measurements correspond to the covariates and the phenotype corresponds to the response, and GeneCodis (http://genecodis.dacya.ucm.es), which implements the hypergeometric test for gene set over-representation in a list of differentially expressed genes with respect to a reference gene list. Hereby we would like to show that the usefulness of next-gen TM-based gene sets is not tied to a specific GSA method. To show the versatility of data for which the technology is applicable, we test our next-gen TM-based gene sets with the GSA tools in three different case studies. The first study focuses on identification of the chemical treatment, the second on pharmacological mechanism elucidation, and the third on compound toxicity profile comparison. We name these case studies 1, 2 and 3 in the rest of this paper.
Case study 1 comprises the same gene expression experiments used by Patel and Butte , as mentioned earlier. We aim to compare the performance of CTD-based gene sets with next-gen TM-based gene sets in predicting the particular chemical treatment response. In case study 2 we analyze the gene expression experiment used by Jelier and coworkers , as mentioned earlier. For this case study, we aim to predict the PPARalpha agonism characteristic of fibrates. Fibrates have a profound pharmacological response with well-defined molecular toxicological properties, and constitute a clear test case in which both methods (CTD, next-gen TM) together with different gene-set testing approaches (ToxProfiler, weighed global test, GeneCodis) should perform well. To compare our approach to connectivity mapping, we compare the results from the GSA tools with the results from the cMap. Since case study 2 is focused on associating drugs with a gene expression profile and cMap contains mainly drugs, this case study was considered an appropriate choice for which to include a comparison with cMap. In case study 3, we demonstrate the power of the next-gen TM-based gene sets to link chemicals with similar gene expression response, where the manually curated gene sets from the CTD fail due to lacking chemical-gene annotations. We do this by analyzing a recently published in vitro gene expression data set on three chemicals belonging to the triazole class of developmental toxicants  with the aim to find chemicals with known, or unknown (to our knowledge), links to the class. Using next-gen TM, we make chemical response-specific gene sets for the chemicals contained in the ToxRefDB_DevTox database (http://www.epa.gov/ncct/toxrefdb/)  and link these gene sets to the data set for the triazoles. For the same gene expression data set we show that next-gen TM-derived gene sets also can be used for other purposes than chemical similarity matching. As an example, gene sets associated with embryonic structures are used to discriminate triazoles from other developmental toxicants, and from non-developmental toxicants.
Associating chemicals with genes in the literature
Next-gen TM for chemical-gene associations
The next-gen TM-based gene sets were compiled based on a literature-derived matching score for the chemical-gene association. The technology uses the vector space model to relate two concepts (such as chemicals and genes) to each other . Concepts originated from a thesaurus (see section “Thesaurus”) that contains terms referring to biomedical and chemical concepts, and a list of term synonyms. The vector space model yields a measure of the strength of the association through the matching score determined by the cosine of the angle between the two concept vectors . We call the vectors “concept profiles”. More precisely, a concept profile is a list of concepts with for every concept a weight to indicate its association to the root concept, based on concept co-occurrence statistics from the scientific literature database PubMed (http://www.ncbi.nlm.nih.gov/pubmed). A total of 13,834,150 PubMed IDs from January 1, 1980 to January 6, 2011 were used to build the concept profiles (PubMed IDs referring to experiments used in the case studies in this work were excluded). The co-occurrence statistics were calculated as follows. Peregrine (https://trac.nbic.nl/data-mining/), a software that performs thesaurus-based indexing and disambiguation of concepts in text, was used together with the thesaurus to identify concepts in PubMed records. Stop words were removed and words were stemmed to their uninflected form by the lexical variant generator (LVG) normalizer (http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/userDoc/tools/norm.html). For all concepts except genes the PubMed records were comprised of the texts in which the concept was mentioned (titles, abstracts and Medical Subject Headings). For genes only a subset of PubMed records were used in order to limit the impact of ambiguous terms and distant homologs. Gene Ontology (GO) terms are sometimes given as words or phrases that are infrequently found in the normal texts. To still provide broad coverage of GO terms, the PubMed records that were used as evidence for annotating genes with this GO term were added. For every concept in the thesaurus that was associated to at least five PubMed records, a concept profile was created. This concept profile is thus in reality a vector containing all concepts associated with the root concept by direct co-occurrence, weighted by the symmetric uncertainly coefficient . The top 200 concepts in a concept profile were used when calculating the matching scores. To make the gene sets we used a concept profile matching score cut-off of 1e-04  in combination with a maximum of 1000 gene associations per chemical. To allow for comparison between the CTD-based and next-gen TM-based gene sets, only chemicals with Chemical Abstract Service (CAS) numbers were included. Using these restrictions, we were able to create 30,211 chemical response-specific gene sets for mouse and human.
The thesaurus was composed of four parts: the 2010AB version of Unified Medical Language System (UMLS) (http://www.nlm.nih.gov/research/umls/), a gene thesaurus derived from multiple databases ), a chemical thesaurus derived from multiple databases , and a toxicology thesaurus derived from the International Union of Pure and Applied Chemistry (IUPAC) glossary of terms used in toxicology (http://sis.nlm.nih.gov/enviro/iupacglossary/frontmatter.html). For each part, we gather the synonyms and the definition for each concept. We then execute a number of rewriting and suppression rules based on term structure [13, 14], and perform a manual analysis step of the top 250 frequent terms from a PubMed-indexation using Peregrine. Next, terms in the thesaurus are checked for in-thesaurus homonyms, and the 250 terms with the most homonyms are inspected manually for removal. When forming the master thesaurus, the UMLS, gene, chemical and toxicity thesauri are merged based on term overlap and patterns for recognizing gene and protein names . The different steps are performed by a series of coupled java scripts.
Processing the CTD for chemical-gene associations
The CTD includes manually curated cross-species interactions between chemicals, genes, and diseases. CTD specifically targets individual chemicals for curation from a priority list; in addition, CTD also incidentally curates all interactions appearing in individual articles irrespective of whether they happen to have been targeted. We downloaded the chemical–gene interaction database from the CTD on January 6, 2011. The database contained 266,266 interactions in total. After filtering for H. sapiens and M. musculus (the two species included in the gene expression experiments used in this paper), 185,792 interactions remained. Neither the nature nor the curation level of the chemicals were taken into account. All different gene interactions with a specific chemical were regarded as associations, and summarized into one chemical response-specific gene set. During this step, all interactions based on any of the gene expression experiments used in the case studies in this work were removed. From the single chemical-gene associations in the CTD, we created gene sets with at least five genes (similar to Patel and Butte ) per chemical. For every chemical-gene association, CTD provides the CAS number for the chemical (if available), the Entrez Gene ID for the gene, and the PubMed ID and organism for which the association was reported. When filtering for chemicals with a CAS number and restricting the organisms to human and mouse we were able to make a total of 1,189 (H. sapiens) and 588 (M. musculus) gene sets. Sometimes in the CTD, H. sapiens Entrez Gene IDs are incorrectly annotated to M. musculus. These H. sapiens Entrez Gene IDs were mapped to M. musculus Entrez Gene IDs using the Homologene database (http://www.ncbi.nlm.nih.gov/homologene).
We selected three different GSA tools that all allow for user-supplied gene sets but implement different statistical tests. 1) ToxProfiler, which implements the unpaired t-test to score the difference in gene expression between the genes from a pre-defined gene set and the remainder of the genes . 2) The weighted version of the global test , which implements a regression model where the gene expression measurements correspond to the covariates and the phenotype corresponds to the response. In the weighted global test the next-gen TM-derived matching scores are used to weigh the contribution of each gene in a gene set to the test . 3) GeneCodis , where biological annotations, or relationships among annotations based on co-occurrence patterns, are tested for over-representation in a list of differentially expressed genes with respect to a reference gene list using the hypergeometric test or the chi-square test. ToxProfiler and the weighted global test do not require a pre-selection of differentially expressed genes while GeneCodis does require such a list. For all tools, p-values were corrected for multiple testing using the False Discovery Rate (FDR) , and corrected p-values < 0.05 were considered significant.
In case study 2 we use the text-mining tool Anni (http://www.biosemantics.org/anni)  to explore the relevance of the top significant results on a biological process level. Anni is based on the same concept profile technology that was used to generate the next-gen TM-based gene sets, and provides direct links to the literature for the produced annotations.
Gene expression microarray data sets selection and pre-processing
Gene expression profiling of Bisphenol A effects on human Ishikawa cells (GEO accession number: GSE17624, set short name: BPA)
The aim of the BPA study was to provide a comprehensive evaluation of changes in gene expression during treatment with Bisphenol A in vitro, and the study was performed using five doses at three different time points with four replicates each . RNA was hybridized on an Affymetrix Human Genome U133 Plus 2.0 Array. For this study, we used the high dosed (10 nM) cells at 48 hours (four treated samples and four control samples).
Gene expression profiling of 17 beta-Estradiol effects on human MCF7 breast cancer cells (GEO accession number: GSE11352, set short name: ESThsa)
The aim of the ESThsa study was to identify 17 beta-Estradiol responsive genes in the estrogen-receptor positive breast cancer cell line, MCF7 . MCF7 cells were exposed to 10 nM Estradiol (or vehicle only) at 12, 24, and 48 hours. Each time point was performed in triplicate. RNA was hybridized on an Affymetrix Human Genome U133 Plus 2.0 Array. For this study, we used only the samples from 24 hours with their corresponding controls (three treated samples and three control samples).
Gene expression profiling of 17 beta-Estradiol effects on mouse thymus (GEO accession number: GSE2889, set short name: ESTmmu)
The aim of the ESTmmu study was to compare the effects of Estradiol and its analog Genistein on mouse thymus . Control samples were from the mice that were untreated (day 0). Two treatments (Estradiol injection and Genistein diet) and three time points were studied. Two replicated samples at each time point and each treatment were collected. RNA was hybridized on an Affymetrix Mouse Expression 430A Array. For this study, we used only the samples from the mice treated with Estradiol (day 2) and their corresponding controls (two treated samples and two control samples).
Gene expression profiling of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) effects on mouse liver (GEO accession number: GSE10082, set short name: TCDD)
The aim of the TCDD study was to map the complete spectrum of aryl hydrocarbon receptor (Ahr) dependent genes in male adult liver by contrasting mRNA profiles of Ahr-null mice (Ahr−/−) with those in mice with wild-type Ahr (Ahr+/+) . Transcript profiles were determined both in untreated mice and in mice treated 19 h earlier with 1000 μg/kg TCDD. RNA was hybridized on an Affymetrix Mouse Genome 430 2.0 Array. For this study, we used only the data from the wild-type mice (six treated samples and five control samples).
Gene expression profiling of 1alpha,25-Dihydroxyvitamin D3 (VitD3) effects on bronchial smooth muscle cells (GEO accession number: GSE5145, set shortname: VitD3)
The aim of the VitD3 study was to study gene regulation in bronchial smooth muscle cells following VitD3 stimulation . The cells were from the same patient in all hybridization. Cells were treated for 24 hours with 100 nM of VitD3 or with the same concentration of vehicle (ethanol at 0.05%). The experiment was done in triplicates and a total of six samples (three treated and three control) were analyzed. RNA was hybridized on an Affymetrix Human Genome U133 Plus 2.0 Array.
Gene expression profiling of zinc sulfate (ZnSO4) effects on human bronchial epithelial cells (GEO accession number: GSE2111, set short name: ZnSO4)
The aim of the ZnSO4 study was to discriminate vanadium (VOSO4) from ZnSO4 using gene profiling . Human bronchial epithelial cells were treated with vehicle (control), VOSO4 (50 uM) or ZnSO4 (50 uM) for four hours (four replicates each). RNA was hybridized on an Affymetrix Human Genome U133A Array. For this study, we used only the data from the four samples treated with ZnSO4 and their corresponding four control samples.
Gene expression profiling of WY14643 effects on mouse small intestine (GEO accession number: GSE9533, set short name: PPARA)
The aim of the PPARA study was to examine the effects of acute nutritional activation of PPARalpha on expression of genes encoding intestinal barrier proteins . Male, four months old Wild-type (129S1/SvImJ) and PPARalpha −/− mice (129S4/SvJae) were exposed to dietary fatty acids with WY14643 as a reference during an exposure time of six hours, after which the small intestines were removed. RNA was hybridized on an Affymetrix Mouse Genome 430 2.0 Array. Here we use the samples corresponding to the PPARalpha activation by WY14643 (four treated samples and four control samples).
Gene expression data processing
The Affymetrix CEL files for all the different gene expression data sets listed above were downloaded from GEO and pre-processed using GenePattern . CEL files were normalized applying Robust Multichip Average (RMA), using the ExpressionFileCreator module. In addition, the MBNI Custom CDF, which contains updated probe set definitions for Entrez GeneIDs, was applied: http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/11.0.1/entrezg.asp. After normalization, the data was floored by setting a threshold value of 50 for all probe sets. Log(2)ratios were created by dividing the median of the treatment values of each probe set by the median of the control values. Probe sets were discarded if both values were equal to 50.
No supplementary CEL files were available for the zinc sulfate dataset (GEO: GSE2111) and the Estradiol mouse dataset (GEO: GSE2889), instead the provided MAS5-calculated signal intensities were used to calculate the log(2)ratios. Affymetrix probeset identifiers were mapped to entrez gene identifiers using the appropriate affymetrix CDF. Log(2)ratios were created by dividing the median of the values of each probe set by the median of the time matched control where only probesets were selected that contained a “present” flag.
Gene expression profiling of triazole effects on mouse embryonic stem cell differentiation (ArrayExpress accession number: E-MTAB-300, set short name: triazoles)
In the triazoles experiment, M. musculus embryonic stem cells were exposed to a range of developmental toxicants and non-developmental toxicants (carbamazepine, flusilazole, hexaconazole, methotrexate, methylmercury chloride, monobutyl phthalate, monoethylhexyl phthalate, monomethyl phthalate, nitrofen, saccharine, triadimefon, warfarin) with the aim to evaluate developmental toxicant identification using gene expression profiling in embryonic stem cell test (EST) differentiation cultures . RNA was hybridized on an Affymetrix Mouse Genome 430 2.0 Array. For the analysis with the weighted global test, we used the data at the 24 hour timepoint for the three different triazoles (flusilazole, hexaconazole, and triadimefon), one negative control (saccharine), and one unexposed control (dimethyl sulfoxide), each with eight replicates. The Affymetrix CEL files for the triazoles gene expression data set were normalized using the expresso package in R with the default settings. Due to the large number of replicates, no probe filtering was performed. Probesets were annotated with Entrez gene IDs using the bioconductor 4302.db package. To summarize the data at the Entrez gene ID level, read-outs from probesets with the same Entrez gene ID were averaged. For the PCA analysis, we used the data from all groups. The Affymetrix CEL files were normalized using Robust Multichip Average (RMA) normalization and probe to gene mapping was performed as described previously . Probe sets for Affymetrix internal controls or probe sets that did not correspond to an Entrez gene ID were not used in further analyses.
Case study 1
Chemical treatment prediction ranks
Weighted global test
GeneCodis single annotation
The CTD-based gene sets ranked highest when using the GeneCodis single annotation option (Table 1), but for some data sets the gene set representing the treatment did not score significant. For CTD, the Estradiol and ZnSO4 treatments were not significant. When opting for co-occurring annotations, GeneCodis additionally predicted the CTD-based Estradiol mouse gene set for the ESTmmu gene expression data set as significant with a rank of 14, together with the co-occurring gene sets TCDD and Tretinoin. Zinc and zinc chloride CTD-based gene sets were reported as significant at rank 1 together for the ZnSO4 gene expression data set, but not ZnSO4. For next-gen TM, three additional gene sets were reported significant when opting for co-occurring annotations: the Estradiol human gene set for the ESThsa gene expression data set with a rank of 164 together with nine other chemicals; the Estradiol mouse gene set for the ESTmmu gene expression data set with a rank of 41 in a cluster with 11 other chemicals; the ZnSO4 gene set for the ZnSO4 gene expression data set with a rank of 1 in a cluster together with 12 other chemicals.
Among the other tools, the weighted global test had the highest number of significant scoring gene sets while ToxProfiler had the better ranking in most cases (Table 1).
In short, the CTD-based gene sets ranked higher than the next-gen TM-based gene sets for all gene expression data sets except the ZnSO4 data set, but t-profiling indicated a similar significance pattern for 50% of the next-gen TM-based and CTD-based gene sets.
Case study 2
In this case study we analyzed a PPARalpha-knock-out gene expression data set with the aim to predict the PPARalpha agonism characteristic of fibrates. When intersecting the CAS numbers for the CTD and next-gen TM gene sets, 585 M. musculus gene sets remained. The resulting 585 M. musculus CTD-based and next-gen TM-based gene sets were tested with the GSA tools against the PPARA gene expression data set, and compared to the results from the cMap. As mentioned before, ToxProfiler and the weighted global test do not require a list of differentially expressed genes, but GeneCodis does. To generate the list of differentially expressed genes needed for the analysis with GeneCodis, we used the topTable function from the bioconductor Limma package in R to obtain a ranked list of probes with the most evidence of differential expression between the knockout and wild-type samples. Probes with an adjusted p-value of less than 0.05 where kept (930 probes). Probesets were annotated with EntrezGene IDs using the bioconductor 4302.db package. To compare the results with cMap, we started with the same probesets and proceeded to make the query signature. cMap only accepts probes from the Affymetrix GeneChip Human Genome U133A Array. We therefore used the NetAffx tool supplied by Affymetrix (http://www.affymetrix.com/analysis/netaffx/index.affx) to map the probe IDs from our signature to Affymetrix GeneChip Human Genome U133A Array IDs.
Fibrate drug prediction ranks
Weighted global test
GeneCodis single annotation
None of the fibrate signatures in cMap scored significant against the PPARA gene expression set signature.
In short, the next-gen TM-based fibrate gene sets ranked similar to or better than the CTD-based fibrate gene sets using two of the GSA tools, and all next-gen TM-based fibrate gene sets were significant using all GSA tools. The top-scoring chemicals from the GSA analyses represented underlying biological processes relevant to the gene expression experiment, both for the CTD-based and next-gen TM-based gene sets. In contrast, none of the fibrate signatures in cMap scored significant against the PPARA gene expression set signature, and fewer relevant biological processes were found for the top-ranking chemicals from cMap.
Case study 3
The triazoles gene expression data set was analyzed with two aims: 1) to link chemicals with biological effects derived from -omics data similar to triazoles, and 2) to discriminate triazoles from other developmental toxicants (carbamazepine, methotrexate, methylmercury chloride, monobutyl phthalate, monoethylhexyl phthalate, nitrofen, warfarin), and from non-developmental toxicants (monomethyl phthalate and saccharine) using gene sets associated with embryonic structures.
Only the weighted global test was applied in this case study. The triazoles gene expression data set show very small variation in gene expression levels . The weighted global test has a very strong null hypothesis, asserting that no gene with a positive importance weight is associated with the response, making it an appropriate test for analyzing such a data set.
For aim 1, a subset of the 30,211 next-gen TM gene sets derived from chemicals with CAS numbers was used. We only considered mouse genes, since the gene expression experiment was performed on mouse embryonic stem cells. Chemicals were selected based on the list of 384 compounds tested for developmental toxicity contained in the ToxRefDB_DevTox database. The number of similar morphological developmental toxicity endpoints in vivo in rabbit or rat (as recorded in the ToxRefDB_DevTox database) between the significantly scoring chemicals and the triazoles was used as outcome similarity measure. The morphological developmental toxicity endpoints were the following: cleft lip and/or cleft palate; variations or abnormalities of the limb, including scapula, clavical and pelvis; variations or abnormalities of the vertebral column, ribs or sternum; variations or abnormalities of the cranium; abnormalities of the metanephric kidney; and abnormalities of the ureter. The triazoles also caused general developmental toxicity endpoints: change in weight of fetus at near-term of pregnancy; histopathological, clinical, or unclassified abnormalities in fetus; preimplantation loss, postimplantation loss (resorptions) or fetal death impacting litter size; and pregnancy loss or maternal wastage, but these endpoints were not considered specific enough to be included in the similarity outcome measure.
Number of similar toxicological endpoints
# similar general developmental toxicity endpoints
# similar morphological developmental toxicity endpoints
Monosodium methane arsenate
The top 25 embryonic structure gene sets
neural plate and or tube
Posterior neuropore open
second branchial arch structure
Branchial bars deformed
fourth branchial arch structure
Branchial bars deformed
entire fourth branchial arch
Branchial bars deformed
structure of first pharyngeal pouch
Otic vesicles deformed
entire first pharyngeal pouch
Otic vesicles deformed
entire neural tube
Posterior neuropore open
Branchial bars deformed
entire second branchial arch
Branchial bars deformed
structure of tympanic annulus
Otic vesicles deformed
entire tympanic annulus
Otic vesicles deformed
Posterior neuropore open
entire cloacal membrane
No corresponding scoring parameter available in the WEC.
branchial arch structure
Branchial bars deformed
structure of cloacal membrane
No corresponding scoring parameter available in the WEC.
entire ostium secundum
Heart ventrally turned
structure of third aortic arch
Heart ventrally turned
entire branchial arch
Branchial bars deformed
Otic vesicles deformed
entire auditory vesicle
Otic vesicles deformed
structure of embryo stage 6
No corresponding scoring parameter available in the WEC.
structure of early somite stage
entire early somite stage
third branchial arch structure
Branchial bars deformed
entire third branchial arch
Branchial bars deformed
The neural plate/tube gene set
Entrez gene ID
In short, 33 chemicals could be linked to the gene expression changes induced by triazoles. Toxicants corresponding to 21 of these 33 had a similar morphological developmental in vivo toxicity pattern as the triazoles. Using next-gen TM-derived gene sets for embryonic structures, we confirmed the relation between gene expression patterns and embryotoxic effects seen in the Whole Embryo Culture, and discriminated triazoles from other chemicals in a principal component analysis.
In the present study we perform identification of the chemical treatment, pharmacological mechanism elucidation, and compound toxicity profile comparison by testing chemical response-specific next-gen TM-based gene sets with GSA methods against different gene expression data sets. Such experiments have been performed before but were limited to evaluation of one specific tool and used a limited number of literature-based gene sets [3, 7, 9]. In contrast to the tool-specific gene sets that were used for identification of the chemical treatment in Minguez et al., we provide the size and scope of our chemical response-specific gene sets, and make them available for download in a generic format (http://www.biosemantics.org/index.php?page=chemical-response-specific-gene-sets). We extend the evaluation of literature-based gene sets for identification of the chemical treatment performed by Patel and Butte to include next-gen TM. In doing this, we were able to create many more chemical response-specific gene sets (~30,200 gene sets for both human and mouse using next-gen TM compared to ~1,200 gene sets for human and ~600 gene sets for mouse using the CTD). While Jelier and co-workers tested next-gen TM-based gene sets for drugs on one gene expression data set using one GSA method, we tested next-gen TM-based gene sets on diverse sets of gene expression data with a variety of methods. No GSA method consistently outperformed the other, and we recommend using more than one GSA method when analysing gene expression data. We showed that gene sets generated by next-gen TM are less compound-specific than gene sets based on manual curation efforts but perform equal to or better than manual curation efforts in elucidating the pharmacological mechanism of fibrates. In addition, we show that next-gen TM allows for toxicity profile comparison of compounds for which manual annotation efforts are lacking (triazoles). We also successfully use next-gen TM-based gene sets for embryonic structures to describe effects induced by the triazoles hexaconazole, triadimefon and flusilazole in the EST, and to discriminate the triazoles from other chemicals using PCA.
A possible limitation of our approach is that we do not take the nature of the relation (for example expression (negative or positive) or phosphorylation) between the genes and chemicals or genes and embryonic structures into account when creating the gene sets. In the CTD, such information has been manually curated for the chemical-gene interactions. For the gene sets generated by text mining, such relations could be mined from the literature by using an ontology of known biological relations. Even though it might seem logical to only include curated expression relations, it also limits the possibility of finding new relations. From our omics point of view, all associations are of possible interest. In the future, it could be worth investigating how the nature of the relation influences the results of the GSA.
Another limitation relates to the number of gene expression data sets we used to test our methods on and the species variety (mouse and human). We choose these two species as a starting point since the annotation information available is more complete than for other genomes, but the amount of gene expression data and gene annotations for other model organisms in toxicology such as rat and zebrafish is increasing, and gene set creation and testing for these and other organisms is a topic for future research.
The amount of information in PubMed and CTD continue to increase. For example, following our “79983-71-4 OR hexaconazole” PubMed search example where we were able to retrieve 69 articles on November 21 (2011), the same search now (October 6 (2012)) retrieves 80 articles. A topic for future research would be to investigate how this information growth affects the gene set predictions.
Also, the process of updating the concept association scores is non-trivial. A conversion of the methods to a Web service environment, preferably with a Web interface, would greatly enhance its usability and is a project we are currently undertaking.
Identification of chemical treatment
In this case study, we aimed to show that next-gen TM-based gene sets compare well to CTD-based gene sets in associating chemicals with gene expression data sets. The CTD-based gene sets ranked higher than the next-gen TM-based gene sets for all gene expression data sets except the ZnSO4 data set. This might be expected since the CTD-based gene sets are curated. However, t-profiling of the CTD-based and next-gen TM-based gene sets showed similar t-profiles for three of the six gene sets, indicating a similar gene set significance pattern for these gene sets despite the higher ranks of the CTD-based gene sets. The ranks for the CTD-based and next-gen TM-based gene sets differed between the different tools and between the experiments, indicating that in general, not one tool is to be preferred over the other when performing GSA analysis-based connectivity mapping. One needs to select the tool best suited for the experimental design. For example, ToxProfiler performs well with small number of replicates, while the weighted global test deals well with small changes in gene expression levels. The CTD-based gene sets had the highest significant ranks when using GeneCodis, which is in line with what Patel and Butte  reported for these experiments (using the hypergeometric test). The co-occurring annotations option in GeneCodis proved useful compared to the single annotation option in some cases.
Pharmacological mechanism elucidation
In this case study, we aimed to associate PPARalpha agonists (fibrates) with a PPARalpha knock-out gene expression data set, and to show that the significantly scoring chemicals relate to relevant biological processes. The next-gen TM-based fibrate gene sets ranked similar or better than the CTD-based fibrate gene sets using two of the GSA tools, and all next-gen TM-based fibrate gene sets were significant using all GSA tools. Clearly, for this experiment, next-gen TM presents a comparable alternative to manual inspection of the literature. This might be the case since there is pharmacological and toxicological consensus on the action of peroxisome proliferators, and as such the literature accumulated over the past is more clearly described. The CTD-based gene sets for gemfibrozil and bezafibrate did not score significant using ToxProfiler and GeneCodis. These fibrates also had the lowest number of genes annotated to them: only four for gemfibrozil and nine for bezafibrate could be mapped to the current gene expression experiment. The size of the gene set has less influence when using the weighted global test compared to the other tools, which might explain that these gene sets scored significant using this method. The next-gen TM fibrate gene sets ranked much lower when using the weighted global test compared to the other tools. However, rankings were improved compared to the results by Jelier and coworkers . The non-significant results for the fibrates when using cMap indicates that GSA tools present a good alternative when performing connectivity mapping.
We annotated the top-scoring chemicals for CTD and next-gen TM in Anni with biological processes to investigate if these corresponded to the ones annotated by hand in the PPARA study. For all GSA tools, biological processes corresponding to all categories reported in the PPARA study were found, except intestinal motility concepts. Intestinal motility concepts could be annotated to the significantly scoring signatures in cMap, while only one more (fatty acid oxidation) of the other five biological processes could be found using this method. These results show that the top scoring chemicals from the GSA analyses represent underlying biological processes relevant to the gene expression experiment, both for the CTD-based and next-gen TM-based gene sets.
Compound toxicity profile comparison
The triazoles gene expression data set was analyzed with the aim to predict chemicals with a toxicity profile similar to triazoles and to discriminate triazoles from other developmental toxicants, and non-toxic compounds. Many (64%) of the predicted chemicals had an in vivo toxicity pattern corresponding to that of the triazoles. The highest ranking compound with a toxicology pattern different from the triazoles according to ToxRefDB_DevTox was Monosodium methane. Three of the eight genes that contributed most to the significance of this compound where annotated with the concept “axial skeletal structure” in Anni. Triazoles cause malformations in this structure according to the ToxRefDB_DevTox. Even though Monosodium methane does not cause malformations in the axial skeletal structure of rats or rabbits according to the ToxRefDB_DevTox, our results suggest that the compound should be tested for in vivo toxicity also in mouse. On the level of risk assessment we are interested in the situation in man, and the more information about the toxic effects of compounds in different systems the better. The highest ranking non-toxic compound (Zoxamide) scored significant mainly because of the presence of the gene CYP51A1 in the gene set. The fungicidal mode of action of triazoles is based on the inhibition of this gene . We noted five significantly scoring chemicals for the saccharine gene expression experiment of which four were developmental toxicants, which confirms that gene set testing with chemical response-specific gene sets should be considered as hypothesis generation and that every significantly scoring compound need to be investigated further. This is something that also applies to standard connectivity mapping. Using the weighted global test, further investigation would constitute inspection of the leaf nodes in the gene sets, since these genes contribute most to the test. Anni can be used to infer the function of the leaf nodes, give information on how these are connected to each other, and via direct links to the literature more thorough information can easily be found.
De Jong and co-workers  showed that the embryonic stem cell test (EST) is able to give a relatively good potency ranking compared with the in vivo developmental toxicity potency of triazoles, but pointed out that the system gives little information on the type of effects that can occur after exposure to the chemicals. We show that by associating embryonic structure gene sets with a triazole gene expression profile obtained through the EST, effect information becomes readily available. Also, the genes in the embryonic structure gene set that contributed most to the weighted global test could separate triazoles from other chemicals in a PCA. Genes that effectively can separate between chemical classes are usually searched for by applying a statistical test on differential gene expression (see for example ). Here, we show that such genes can also be found by applying the weighted global test with relevant gene sets (in this case embryonic structure gene sets).
In conclusion, GSA with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
This research was supported by the Dutch Technology Foundation STW, applied science division of NWO and the Technology Program of the Ministry of Economic Affairs. We would like to acknowledge Herman van Haagen and Erik Schultes for advice on the cut-off for the number of concepts in a concept profile used for matching.
- Smalley JL, Gant TW, Zhang SD: Application of connectivity mapping in predictive toxicology based on gene-expression similarity. Toxicology. 2010, 268: 143-146. 10.1016/j.tox.2009.09.014.View ArticlePubMedGoogle Scholar
- Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, et al: The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006, 313: 1929-1935. 10.1126/science.1132939.View ArticlePubMedGoogle Scholar
- Patel CJ, Butte AJ: Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics. 2010, 3: 17-10.1186/1755-8794-3-17.View ArticlePubMedPubMed CentralGoogle Scholar
- Bussemaker HJ, Ward LD, Boorsma A: Dissecting complex transcriptional responses using pathway-level scores based on prior information. BMC Bioinformatics. 2007, 8 (Suppl 6): S6-10.1186/1471-2105-8-S6-S6.View ArticlePubMedPubMed CentralGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13. 10.1093/nar/gkn923.View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.View ArticlePubMedPubMed CentralGoogle Scholar
- Minguez P, Al-Shahrour F, Montaner D, Dopazo J: Functional profiling of microarray experiments using text-mining derived bioentities. Bioinformatics. 2007, 23: 3098-3099. 10.1093/bioinformatics/btm445.View ArticlePubMedGoogle Scholar
- Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ: The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 2010, 39 (Database issue): D1067-72.PubMedPubMed CentralGoogle Scholar
- Jelier R, Goeman JJ, Hettne KM, Schuemie MJ, den Dunnen JT, ‘t Hoen PAC: Literature-aided interpretation of gene expression data. Brief Bioinform. 2011, 12: 518-529. 10.1093/bib/bbq082.View ArticlePubMedGoogle Scholar
- van Dartel DA, Pennings JL, Robinson JF, Kleinjans JC, Piersma AH: Discriminating classes of developmental toxicants using gene expression profiling in the embryonic stem cell test. Toxicol Lett. 2011, 201: 143-151. 10.1016/j.toxlet.2010.12.019.View ArticlePubMedGoogle Scholar
- Knudsen TB, Martin MT, Kavlock RJ, Judson RS, Dix DJ, Singh AV: Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA’s ToxRefDB. Reprod Toxicol. 2009, 28: 209-219. 10.1016/j.reprotox.2009.03.016.View ArticlePubMedGoogle Scholar
- Jelier R, Schuemie MJ, Veldhoven A, Dorssers LC, Jenster G, Kors JA: Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol. 2008, 9: R96-10.1186/gb-2008-9-6-r96.View ArticlePubMedPubMed CentralGoogle Scholar
- Schuemie MJ, Mons B, Weeber M, Kors JA: Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification. J Biomed Inform. 2007, 40: 316-324. 10.1016/j.jbi.2006.09.002.View ArticlePubMedGoogle Scholar
- Hettne KM, Stierum RH, Schuemie MJ, Hendriksen PJ, Schijvenaars BJ, Mulligen EM, Kleinjans J, Kors JA: A dictionary to identify small molecules and drugs in free text. Bioinformatics. 2009, 25: 2983-2991. 10.1093/bioinformatics/btp535.View ArticlePubMedGoogle Scholar
- Boorsma A, Foat BC, Vis D, Klis F, Bussemaker HJ: T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res. 2005, 33: W592-W595. 10.1093/nar/gki484.View ArticlePubMedPubMed CentralGoogle Scholar
- Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004, 20: 93-99. 10.1093/bioinformatics/btg382.View ArticlePubMedGoogle Scholar
- Nogales-Cadenas R, Carmona-Saez P, Vazquez M, Vicente C, Yang X, Tirado F, Carazo JM, Pascual-Montano A: GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Res. 2009, 37: W317-W322. 10.1093/nar/gkp416.View ArticlePubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological). 1995, 57: 289-300.Google Scholar
- Naciff JM, Khambatta ZS, Reichling TD, Carr GJ, Tiesman JP, Singleton DW, Khan SA, Daston GP: The genomic response of Ishikawa cells to bisphenol A exposure is dose- and time-dependent. Toxicology. 2010, 270: 137-149. 10.1016/j.tox.2010.02.008.View ArticlePubMedGoogle Scholar
- Lin CY, Vega VB, Thomsen JS, Zhang T, Kong SL, Xie M, Chiu KP, Lipovich L, Barnett DH, Stossi F, et al: Whole-genome cartography of estrogen receptor alpha binding sites. PLoS Genet. 2007, 3: e87-10.1371/journal.pgen.0030087.View ArticlePubMedPubMed CentralGoogle Scholar
- Selvaraj V, Bunick D, Finnigan-Bunick C, Johnson RW, Wang H, Liu L, Cooke PS: Gene expression profiling of 17beta-estradiol and genistein effects on mouse thymus. Toxicol Sci. 2005, 87: 97-112. 10.1093/toxsci/kfi219.View ArticlePubMedGoogle Scholar
- Tijet N, Boutros PC, Moffat ID, Okey AB, Tuomisto J, Pohjanvirta R: Aryl hydrocarbon receptor regulates distinct dioxin-dependent and dioxin-independent gene batteries. Mol Pharmacol. 2006, 69: 140-153.PubMedGoogle Scholar
- Bosse Y, Maghni K, Hudson TJ: 1alpha,25-dihydroxy-vitamin D3 stimulation of bronchial smooth muscle cells induces autocrine, contractility, and remodeling processes. Physiol Genomics. 2007, 29: 161-168.View ArticlePubMedGoogle Scholar
- Li Z, Stonehuerner J, Devlin RB, Huang YC: Discrimination of vanadium from zinc using gene profiling in human bronchial epithelial cells. Environ Health Perspect. 2005, 113: 1747-1754. 10.1289/ehp.7947.View ArticlePubMedPubMed CentralGoogle Scholar
- den Bosch HM d V-v, Bunger M, de Groot PJ, Bosch-Vermeulen H, Hooiveld GJ, Muller M: PPARalpha-mediated effects of dietary lipids on intestinal barrier gene expression. BMC Genomics. 2008, 9: 231-10.1186/1471-2164-9-231.View ArticleGoogle Scholar
- Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet. 2006, 38: 500-501. 10.1038/ng0506-500.View ArticlePubMedGoogle Scholar
- van Dartel DA, Pennings JL, de la Fonteyne LJ, van Herwijnen MH, van Delft JH, van Schooten FJ, Piersma AH: Monitoring developmental toxicity in the embryonic stem cell test using differential gene expression of differentiation-related genes. Toxicol Sci. 2010, 116: 130-139. 10.1093/toxsci/kfq127.View ArticlePubMedGoogle Scholar
- de Jong E, Barenys M, Hermsen SA, Verhoef A, Ossendorp BC, Bessems JG, Piersma AH: Comparison of the mouse Embryonic Stem cell Test, the rat Whole Embryo Culture and the Zebrafish Embryotoxicity Test as alternative methods for developmental toxicity testing of six 1,2,4-triazoles. Toxicol Appl Pharmacol. 2011, 253: 103-111. 10.1016/j.taap.2011.03.014.View ArticlePubMedGoogle Scholar
- Vanden Bossche H, Marichal P, Gorrens J, Coene MC: Biochemical basis for the activity and selectivity of oral antifungal drugs. Br J Clin Pract Suppl. 1990, 71: 41-46.PubMedGoogle Scholar
- Pennings JL, van Dartel DA, Robinson JF, Pronk TE, Piersma AH: Gene set assembly for quantitative prediction of developmental toxicity in the embryonic stem cell test. Toxicology. 2011, 284: 63-71. 10.1016/j.tox.2011.03.017.View ArticlePubMedGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1755-8794/6/2/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.