SEURAT: Visual analytics for the integrated analysis of microarray data
- Alexander Gribov†1,
- Martin Sill†2,
- Sonja Lück3,
- Frank Rücker3,
- Konstanze Döhner3,
- Lars Bullinger3,
- Axel Benner2 and
- Antony Unwin1
© Gribov et al; licensee BioMed Central Ltd. 2010
Received: 11 March 2010
Accepted: 3 June 2010
Published: 3 June 2010
In translational cancer research, gene expression data are collected together with clinical data and genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such high-dimensional data sets together with clinical data are required.
We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised methods such as clustering, seriation and biclustering algorithms.
The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomic and clinical data.
The rapid development of microarray technologies in recent years has led to the possibility of acquiring a large spectrum of different molecular data types. In translational cancer research, gene expression data are usually collected together with additional clinical information and genomic data from other high throughput technologies such as microarray-based comparative genomic hybridization (array CGH) or SNP (single nucleotide polymorphism) arrays. The availability of these related, mostly high-dimensional data sets calls for software tools which can analyze them all together in an integrated fashion. Currently there is a lack of such applications that enable exploratory analysis of integrated data sets. Most visualization and clustering tools are limited in their ability to handle gene expression, genomic and clinical data together. To our knowledge only a few software tools are able to perform an integrated analysis.
The VAMP software is able to visualize genomic gain and loss information together with gene expression data. The focus of VAMP is on the comparison of the genomic information between tumors, and thus all data types are displayed along the physical position in the genome. It is not possible to reorder the gene expression data according to the expression patterns, and clustering algorithms can only be applied to cluster different tumors. A single graphic allows the display of additional clinical data by a simple color code, and this representation is limited to categorical variables. In addition, the graphics are not linked, so that each graphic has to be interpreted separately.
Other tools able to visualize gene expression data together with genetic variations and other molecular data types like RNAi data and methylation data are the Integrative Genomics Viewer (IGV) developed by the Broad Institute and the Integrated Genome Browser. These tools organize the different data types in the form of tracks within a browser window, similar to the well-known UCSC Genome Browser. The different data types are displayed one below the other along the physical positions of the genome. This visualization allows the user to examine relations between different molecular data at specific known genomic locations, but it cannot reveal new trans-regulatory relations. Furthermore, with an increasing number of subjects and molecular data types the comparison of the many tracks becomes complicated. IGV additionally offers the possibility of aligning clinical data using color codes. For continuous data, and especially for time-to-event data like survival times, such a representation is not sufficient.
Besides these open source software solutions, some proprietary software tools are able to perform an integrated analysis, e.g., the Genomic Workbench (Agilent Technologies, Santa Clara, California) or Acuity (Enterprise Microarray Informatics). However, although they can handle the different data types, visualizations are limited to stand-alone graphics that are not linked to other displays such as clustering results or summary statistics of clinical variables.
In order to reveal new biologically meaningful relations possibly hidden inside the different data sets, we follow the philosophy of exploratory data analysis. Our approach to this problem was to develop open source software capable of performing in-depth exploratory analyses with the help of interactive graphics. In contrast to other software tools, which usually aim to visualize the information of the different data types within a single graphic, we display each data type in its own graphic and link the graphics interactively. Each graphic corresponds to the usual visualization of the corresponding data type and can easily be interpreted. Combining these dynamic graphics by linking, so that selected objects are highlighted in all other graphics, and providing unsupervised statistical methods enables users to perform very effective exploratory analyses. The proposed software does not compete with the usual software approaches that offer inferential statistics, but provides a complementary analytical approach. The advantage of our exploratory software for the analysis of high-dimensional integrated data sets is demonstrated by an analysis of data collected from acute myeloid leukemia (AML) patients.
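The linking idea described above can be illustrated with a small Java sketch. It is not taken from the SEURAT source code: the names SelectionModel, SelectionListener and LinkedPlot, as well as the patient IDs, are hypothetical and were chosen only for illustration. The sketch assumes that every graphic registers with one shared selection object and is notified whenever the selection changes, so that each graphic can highlight those selected objects it actually displays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not SEURAT's actual class design.

/** A graphic that wants to be told when the shared selection changes. */
interface SelectionListener {
    void selectionChanged(Set<String> selectedIds);
}

/** Shared selection of object IDs (e.g. patients) that links all graphics. */
class SelectionModel {
    private final Set<String> selected = new LinkedHashSet<>();
    private final List<SelectionListener> listeners = new ArrayList<>();

    void addListener(SelectionListener listener) {
        listeners.add(listener);
    }

    /** Replaces the current selection and notifies every registered graphic. */
    void select(Set<String> ids) {
        selected.clear();
        selected.addAll(ids);
        Set<String> snapshot =
                Collections.unmodifiableSet(new LinkedHashSet<>(selected));
        for (SelectionListener listener : listeners) {
            listener.selectionChanged(snapshot);
        }
    }
}

/** Stand-in for a heatmap, barchart, etc.: records which objects to highlight. */
class LinkedPlot implements SelectionListener {
    final String name;
    final Set<String> displayed;
    final Set<String> highlighted = new LinkedHashSet<>();

    LinkedPlot(String name, Set<String> displayed) {
        this.name = name;
        this.displayed = displayed;
    }

    @Override
    public void selectionChanged(Set<String> selectedIds) {
        // Highlight only those selected objects that this graphic displays.
        highlighted.clear();
        for (String id : selectedIds) {
            if (displayed.contains(id)) {
                highlighted.add(id);
            }
        }
    }
}

class LinkingDemo {
    public static void main(String[] args) {
        SelectionModel model = new SelectionModel();
        LinkedPlot heatmap = new LinkedPlot("heatmap",
                new LinkedHashSet<>(Arrays.asList("P1", "P2", "P3")));
        LinkedPlot barchart = new LinkedPlot("barchart",
                new LinkedHashSet<>(Arrays.asList("P2", "P3", "P4")));
        model.addListener(heatmap);
        model.addListener(barchart);

        // A selection made in any graphic goes through the shared model.
        model.select(new LinkedHashSet<>(Arrays.asList("P2", "P4")));
        System.out.println(heatmap.name + " highlights " + heatmap.highlighted);
        System.out.println(barchart.name + " highlights " + barchart.highlighted);
    }
}
```

Keeping the selection in one shared object, rather than letting graphics talk to each other directly, means a new type of graphic only has to implement the listener interface to take part in the linking.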
To ensure portability and platform independence, SEURAT has been written in Java. Most of the GUI elements are based on Java Swing packages, so that SEURAT has a uniform look and feel independent of the underlying platform. The software establishes a connection to the R statistical software via Rserve. Rserve is a TCP/IP server which allows other programs to communicate with R. This connection potentially provides access to all functions implemented in R and Bioconductor. For clustering and seriation algorithms SEURAT uses the facilities of the R packages amap, seriation and biclust. In order to use SEURAT, R, the relevant R packages, and the Java Runtime Environment (JRE) 1.6 or higher need to be installed on the user's computer.
The software focuses on performing exploratory, visual analyses. To simplify the data import, all datasets are assumed to be preprocessed and stored in tab-delimited ASCII form. Preprocessing includes the data management and quality control of the different microarray data as well as the normalization, gene filtering and annotation of the data. SEURAT was tested with different data sets, and works well with data from custom two-color gene expression arrays (Stanford 40 k DNA microarrays) and CGH arrays (2.8 k BAC/PAC microarrays) as well as with Affymetrix exon arrays (GeneChip Human Exon 1.0 ST Arrays) and SNP arrays (Genome-Wide Human SNP Arrays 6.0). The preprocessing was performed using R and Bioconductor. For the preprocessing of the exon arrays and for extracting raw copy numbers from the SNP array data we used statistical methods available within the R package aroma.affymetrix [11, 12]. To extract the genomic regions showing the same genomic variations from array CGH and SNP data we applied the GLAD (Gain and Loss Analysis of DNA) algorithm. This algorithm is available within the Bioconductor package GLAD as well as within the R package aroma.affymetrix. Alternatively, other methods could be used in this context, e.g. a hidden Markov model approach or the faster circular binary segmentation algorithm. Additional annotations not available from the Affymetrix annotation files were added using the capabilities of BioMart, which are accessible through the Bioconductor package biomaRt. Detailed R scripts describing each step of the preprocessing are available at the project website.
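As an illustration of what importing a preprocessed, tab-delimited data set can involve, the following Java sketch parses a small expression matrix. It is not SEURAT's import code. The file layout is an assumption made for this example (a header row of sample IDs, then one row per gene with the gene ID in the first column, and "NA" or an empty cell for missing values), and the class name ExpressionMatrix and the example values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical container for a preprocessed expression matrix (genes x samples). */
class ExpressionMatrix {
    final List<String> sampleIds = new ArrayList<>();
    final List<String> geneIds = new ArrayList<>();
    final List<double[]> values = new ArrayList<>();

    /**
     * Parses tab-delimited text. Assumed layout: the first line holds a label
     * for the ID column followed by the sample IDs; every further line holds a
     * gene ID followed by one value per sample.
     */
    static ExpressionMatrix parse(String text) {
        ExpressionMatrix matrix = new ExpressionMatrix();
        String[] lines = text.split("\r?\n");
        String[] header = lines[0].split("\t", -1);
        matrix.sampleIds.addAll(Arrays.asList(header).subList(1, header.length));

        for (int r = 1; r < lines.length; r++) {
            if (lines[r].isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = lines[r].split("\t", -1);
            if (fields.length != header.length) {
                throw new IllegalArgumentException("line " + (r + 1) + " has "
                        + fields.length + " fields, expected " + header.length);
            }
            double[] row = new double[fields.length - 1];
            for (int c = 1; c < fields.length; c++) {
                // Empty cells and "NA" are treated as missing values.
                boolean missing = fields[c].isEmpty() || fields[c].equals("NA");
                row[c - 1] = missing ? Double.NaN : Double.parseDouble(fields[c]);
            }
            matrix.geneIds.add(fields[0]);
            matrix.values.add(row);
        }
        return matrix;
    }
}

class ImportDemo {
    public static void main(String[] args) {
        // Made-up example content; real input would be read from a file.
        String text = "gene\tP1\tP2\n"
                + "TP53\t1.5\tNA\n"
                + "FLT3\t-0.25\t2.0\n";
        ExpressionMatrix matrix = ExpressionMatrix.parse(text);
        System.out.println(matrix.geneIds.size() + " genes x "
                + matrix.sampleIds.size() + " samples");
    }
}
```

Rejecting rows whose field count differs from the header is a deliberate choice in this sketch: a shifted column in a preprocessed file would otherwise go unnoticed and assign values to the wrong samples.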
Results and Discussion
SEURAT is a new software tool which is capable of integrated analysis of gene expression, array CGH, SNP array and clinical data using interactive graphics. The focus of SEURAT is on exploratory analysis that enables biological and medical experts to uncover new relations in high-dimensional biological and clinical datasets and thus supports the process of hypothesis generation. To our knowledge, no other software that aims to perform integrated analysis of microarray data offers such a high level of interactivity. The concept of combining many interactive graphics by logical linking and the broad spectrum of unsupervised methods is unique. Because of the object-oriented design of the software, it will be possible to add further graphics with interactive capability, such as parallel coordinates. In addition, with the use of Rserve, the complete functionality of R and Bioconductor is available to include more statistical methods in SEURAT. In particular, further clustering algorithms (e.g. model-based clustering) will be investigated for adoption in later versions of SEURAT. In the future, we plan to adapt SEURAT to integrate other microarray-based data types such as loss of heterozygosity data, also available from SNP arrays, as well as information from protein arrays or epigenetic data arising from methylation arrays. While this will further improve SEURAT, the current version already provides a powerful means for the integrative and interactive analysis of complex genomic data sets. Therefore, SEURAT will likely contribute to refined insights into the biology of cancers such as acute myeloid leukemia.
Availability and requirements
Project name: SEURAT
Project home page: http://seurat.r-forge.r-project.org/
Operating system(s): Platform independent
Programming language: Java and R
Other requirements: Java 1.6 or higher, R 2.8 or higher, R packages: Rserve, amap, seriation and biclust
License: GNU GPLv3
Any restrictions to use by non-academics: None
- AML: acute myeloid leukemia
- ASCII: American Standard Code for Information Interchange
- BAC/PAC: bacterial artificial chromosome/P1-derived artificial chromosome
- CGH: comparative genomic hybridization
- FAB classification: French American British classification
- GUI: graphical user interface
- inv(16): inversion mutation at chromosome 16
- JRE: Java Runtime Environment
- SNP: single nucleotide polymorphism
- TCP/IP: Transmission Control Protocol/Internet Protocol
- t(8;21): translocation mutation between chromosome 8 and 21.
This project was supported by the Deutsche José Carreras Leukämie-Stiftung e.V. (Project Number 07/30v and 07/09v).
- Rosa PL, Viara E, Hupé P, Pierron G, Liva S, Neuvial P, Brito I, Lair S, Servant N, Robine N, Manipé E, Brennetot C, Janoueix-Lerosey I, Raynal V, Gruel N, Rouveirol C, Stransky N, Stern MH, Delattre O, Aurias A, Radvanyi F, Barillot E: VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics. 2006, 22 (17): 2066-2073. 10.1093/bioinformatics/btl359.
- Broad Institute: Integrative Genomics Viewer. [http://www.broadinstitute.org/igv]
- Nicol JW, Helt GA, Blanchard SG, Raja A, Loraine AE: The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics. 2009, 25 (20): 2730-2731. 10.1093/bioinformatics/btp472.
- Tukey JW: Exploratory Data Analysis. 1977, Reading, Mass, Addison-Wesley
- Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics. 1996, 5 (3): 299-314. 10.2307/1390807.
- Urbanek S: Rserve -- A Fast Way to Provide R Functionality to Applications. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). 2003, 20-22.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004, 5: R80.
- Lucas A: amap: Another Multidimensional Analysis Package 2010, Version 0.8-5. [http://cran.r-project.org/web/packages/amap/index.html]
- Hahsler M, Hornik K, Buchter C: Getting Things in Order: An Introduction to the R Package seriation. Journal of Statistical Software. 2008, 25.
- Kaiser S, Santamaria R, Sill M, Theron R, Quintales L, Leisch F: biclust: BiCluster Algorithms. Version 0.9.1. 2009, [http://cran.r-project.org/web/packages/biclust/index.html]
- Bengtsson H, Simpson K, Bullard J, Hansen K: aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. 2008, Tech. rep., Department of Statistics, University of California, Berkeley
- Bengtsson H, Irizarry R, Carvalho B, Speed TP: Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008, 24 (6): 759-767. 10.1093/bioinformatics/btn016.
- Hupé P, Stransky N, Thiery JPP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (Oxford, England). 2004, 20 (18): 3413-3422. 10.1093/bioinformatics/bth418.
- Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657-663. 10.1093/bioinformatics/btl646.
- Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005, 21 (16): 3439-3440. 10.1093/bioinformatics/bti525.
- Goldman A: EVENTCHARTS: Visualizing Survival and Other Timed-Events Data. The American Statistician. 1992, 46: 13-18. 10.2307/2684402.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/3/21/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.