PhylOnt aims to characterize selected "phylogenetic resource" concepts and the relationships among these concepts. In this context, we define a "phylogenetic resource" as any uniquely identifiable object or procedure from the domain of phylogenetic research, ranging from the granular, e.g. a specific node in a tree, to the holistic, e.g. a study, or a step in an analysis workflow. PhylOnt includes concepts for estimation programs, models of evolution, methods of analysis, search algorithms, support assessments, and relevant provenance information. PhylOnt will grow as new tree estimation technologies are developed and used in published phylogenetic studies. Developing an ontology and using it to annotate the data and services in analysis workflows can provide a foundation for other semantic technologies, such as concept-based searches and comprehensive federated queries over data sources.
Systematic approach for ontology development
In developing PhylOnt we worked closely with phylogeneticists and computer scientists to iteratively validate the ontology based on community feedback. As shown in Figure 1, development of PhylOnt started with data collection and organization of concepts in relational diagrams. Concept maps drawn from the primary literature included descriptions, properties, metadata, usage of concepts and relations between them. Subsequently, the version of PhylOnt presented here was developed in Protege 4.1.0 (Figure 2), which supports the Web Ontology Language (OWL). PhylOnt is accessible at NCBO through BioPortal [21]. A phylogenetics-specific extension of the Kino annotation package [12, 13] was used in the PhylAnt platform to facilitate annotation of, and faceted search over, resources including scientific literature and NeXML data files.
Data collection
Data resources in phylogenetic studies can be classified into primary and metadata categories. Primary data exist as published data files, literature with text, images, Excel files, and other supplementary materials. Primary data can also refer to methods, models, programs and even parameters used in applications and web services. Metadata includes information such as when and where the primary data were created. This information plays a very important role in enabling reusability.
To perform data extraction, a well-framed approach was required to identify and capture steps in phylogenetic workflows described in published phylogenetic studies [17]. We used PhyloWays [22], a set of interpreted phyloinformatic workflows described in the primary phylogenetics literature. We identified all the information required to repeat the analyses presented in the PhyloWays papers, including the phylogeny estimation programs used in each paper, methods of analysis, evolutionary models and provenance information. These descriptions paved the way for classification of concepts associated with phylogenetic data (including provenance information), phylogenetic workflows, and the results of phylogenetic analysis.
Based on discussions with domain experts, literature reviews and the data in PhyloWays we then created concept maps describing methods of phylogenetic analysis, evolutionary models used in applications of these methods, and widely used phylogenetic software.
Methods of phylogenetic analysis
Phylogenetic methods vary considerably in approaches for assessing alternative hypotheses (i.e. trees), traversing through the complex universe of alternative hypotheses (i.e. tree and model parameter landscapes) and characterizing the degree of support for an optimized solution. As shown in Figure 3, a hierarchical classification of optimality criteria, search algorithms and uncertainty assessment concepts is implemented in PhylOnt. For example, tree inference methods based on maximum parsimony, maximum likelihood or Bayesian statistics rely on the analysis of a homologized character state matrix, whereas UPGMA, neighbor joining and distance-Wagner are based on sets of pairwise distances that may be estimated from a character state matrix or computed in some other way.
The universe of possible trees is extremely complex and identifying the optimal tree in this tree landscape is an NP-hard computational problem. Therefore, there is a variety of heuristic approaches for traversing the tree space in search of the optimal tree. Most maximum parsimony and maximum likelihood analysis methods build an initial tree and then iteratively test for improvement by rearranging the tree topology using branch-swapping algorithms such as nearest neighbor interchange (NNI), subtree pruning and regrafting (SPR), tree bisection and reconnection (TBR), or combinations thereof. Bayesian inference methods also include a branch-swapping process within a Markov Chain Monte Carlo (MCMC) strategy for sampling tree space.
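As an illustration of the branch-swapping idea described above, the following is a minimal sketch of nearest neighbor interchange (NNI) on a rooted binary tree stored as nested 2-tuples. The representation and the function name `nni_neighbors` are assumptions made for this example and are not taken from any particular phylogenetics program. Around each edge that joins a node to an internal child, one of the child's two subtrees is exchanged with the child's sibling.

```python
def nni_neighbors(tree):
    """Yield every tree one NNI move away from a rooted binary tree.

    A tree is either a leaf label (any non-tuple value) or a 2-tuple
    (left, right) of subtrees. This encoding is an assumption of the sketch.
    """
    if not isinstance(tree, tuple):
        return  # a leaf has no internal edge to rearrange
    left, right = tree

    # Edge between this node and an internal left child (a, b):
    # exchange the sibling `right` with b, then with a.
    if isinstance(left, tuple):
        a, b = left
        yield ((a, right), b)
        yield ((right, b), a)

    # Edge between this node and an internal right child (c, d):
    # exchange the sibling `left` with c, then with d.
    if isinstance(right, tuple):
        c, d = right
        yield (c, (left, d))
        yield (d, (c, left))

    # Apply the same moves deeper in each subtree, keeping the rest fixed.
    for new_left in nni_neighbors(left):
        yield (new_left, right)
    for new_right in nni_neighbors(right):
        yield (left, new_right)


start = (("A", "B"), ("C", "D"))
neighbors = list(nni_neighbors(start))
```

A heuristic search would score each neighbor under the chosen optimality criterion, move to the best-scoring one, and repeat until no neighbor improves the score. Because the sketch works on a rooted encoding, some of the generated trees may correspond to the same unrooted topology.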
Assessment of support for a phylogenetic inference is key in deciding whether an optimized solution is acceptable [23]. Bayesian inference methods provide posterior probabilities for the relationships conveyed in a phylogenetic tree, whereas other methods typically use bootstrap or jackknife resampling to assess the degree of support for hypothesized relationships. Resampling approaches can also be combined with MCMC sampling in Bayesian analyses; randomly resampling the original data matrix typically reduces posterior probabilities relative to those reported for MCMC searches without resampling [4].
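The resampling step behind bootstrap support can be sketched as follows. This is a minimal illustration that assumes the character state matrix is held as a dictionary mapping taxon names to equal-length strings; the function name `bootstrap_replicate` and the toy data are hypothetical. Each replicate draws alignment columns with replacement until it has as many columns as the original matrix.

```python
import random


def bootstrap_replicate(matrix, rng):
    """Return one bootstrap pseudo-replicate of a character state matrix.

    `matrix` maps taxon name -> string of character states (all the same
    length). Columns are sampled with replacement, and the same sampled
    column indices are applied to every taxon so that sites stay aligned.
    """
    taxa = list(matrix)
    n_sites = len(matrix[taxa[0]])
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(matrix[t][i] for i in sampled) for t in taxa}


# Toy alignment (invented data, for illustration only).
alignment = {"taxon1": "ACGTAC", "taxon2": "ACGTTC", "taxon3": "AGGTAC"}
rng = random.Random(42)  # fixed seed so the example is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
```

In a full analysis, a tree would be estimated from each replicate, and the support for a grouping would be reported as the fraction of replicate trees that contain it. A jackknife differs in that it deletes a fraction of the columns rather than sampling with replacement.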
Models in phylogenetic analysis
All phylogenetic analyses are performed with an explicit or implicit model of character evolution. Maximum likelihood, Bayesian inference and most distance-based methods rely on nucleotide or amino acid substitution models. Branch lengths for phylogenetic trees often represent time or evolutionary change. Correct interpretation of branch lengths requires an understanding of the models used to estimate time or evolutionary change. Separate substitution models are used for analyses of DNA and protein sequence alignments. Nucleotide substitution models include JC69, K80, HKY85, SYM, F81, and GTR [24, 25]. Commonly used amino acid substitution models include PAM [26], JTT [27] and WAG [28]. Gene sequences typically include conserved domains and less conserved regions. The resulting among-site variation in substitution rates is often modeled in phylogenetic analysis of either nucleotide or amino acid alignments using a discrete approximation of the gamma distribution [29], a fraction of invariant sites [30], or a combination thereof. Both of these forms of rate variation can be layered upon the nucleotide and amino acid substitution models described above. Figure 4 shows a hierarchy of concepts used to describe evolutionary models most commonly used in phylogenetic studies.
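To make the role of a substitution model concrete, the sketch below computes a pairwise distance under JC69, the simplest of the nucleotide models listed above. The observed proportion of differing sites p underestimates the true amount of change because multiple substitutions at one site are hidden; under JC69 the corrected distance is d = -(3/4) ln(1 - 4p/3). The function names and the toy sequences are assumptions made for this example.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of comparable aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore sites where either sequence has a gap or an ambiguous state.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """JC69-corrected distance in expected substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # 1 difference in 10 sites
d = jc69_distance(p)                        # slightly larger than p
```

Here p is 0.1 and the corrected distance is about 0.107. A matrix of such corrected distances is the input that distance-based methods such as neighbor joining operate on. Richer models (for example HKY85 or GTR, with or without gamma-distributed rates) replace the single-rate assumption of JC69 with additional parameters.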
Phylogenetic methods and the models they use are constantly changing as the phylogenetics community works to make more accurate and precise inferences about relationships and evolutionary processes. PhylOnt is therefore necessarily incomplete, but it can easily be extended to include additional models.
Programs in phylogenetic analysis
At the time of writing, there are approximately 400 phylogeny packages and more than 50 free web servers for phylogenetic analysis [17]. PhylOnt currently identifies the most commonly used phylogenetic inference programs, such as MrBayes [31] and PAUP* [32]. Programs can be categorized based on the methods they use. For example, PAUP* can be used to perform most major methods of analysis, such as maximum parsimony and maximum likelihood. For more details about the programs, such as a description of each and the relations between programs, models and methods, readers are referred to the PhylOnt project page on BioPortal [21].
PhylAnt, a platform for semantic annotation, indexing and searching of phylogenetic resources
Semantic annotation maps target data resources to concepts in ontologies. In the process of annotation, extra information is added to the resource to connect it to its corresponding concept(s) in the ontology. PhylAnt offers a semi-automatic approach for such annotation of phylogenetic resources with the help of a suite of tools called Kino-Phylo. The complete suite of tools and instructions can be found at [33].
Annotating phylogenetic documents with Kino-Phylo
Kino for phylogenetics, also known as Kino-Phylo [13, 17], is built on top of the Kino platform [12, 33]. It is an integrated suite of tools that enables scientists to annotate phylogeny-related documents in the PhylAnt platform. Kino-Phylo can annotate documents by accessing PhylOnt and other NCBO ontologies via the NCBO Web API.
Kino-Phylo presents a comprehensive architecture for annotating and indexing phylogenetics-oriented resources that should be of great use to the phylogenetics community. The system includes two main components: a browser-based annotation front end integrated with NCBO, and an annotation-aware back-end index that provides faceted search capabilities. It is designed around a basic workflow consisting of three steps: annotation, indexing, and searching [12, 17]:
1. Annotation: In the annotation step, users provide annotations via a browser plug-in. After the annotations are added, the augmented document can be directly submitted to the indexing engine.
2. Indexing: Indexing is performed using Apache SOLR. It can be installed as an independent application and exposes multiple interfaces for client programs. SOLR provides the isolation for the index as well as support for faceting. Note that the SOLR interfaces are not directly exposed. They are wrapped by the Kino-Phylo submission API, described later in this paper. The annotation-aware back-end index is exposed via a RESTful API. It is designed such that the browser plug-in can directly submit the annotated web pages to the indexing engine.
3. Search: The search is performed via a Web interface. It presents the notions of a typical search engine and additionally gives the ability to filter the results via the facets. The current UI is built upon the JSON based Kino search API, which can be used to integrate other tools as well.
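To give a feel for the kind of faceted request that an annotation-aware index can answer, the sketch below assembles a query string from standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). It is not the Kino search API, which wraps Solr and is not reproduced here; the function name and the field name `concept` are hypothetical and serve only to illustrate filtering full-text hits by annotated concept.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_query(text, concept_filters=(), facet_fields=("concept",)):
    """Build a Solr-style query string for a faceted search.

    `text` is the free-text query. `concept_filters` is a sequence of
    (field, value) pairs that narrow the result set. `facet_fields` lists
    the fields for which facet counts are requested. Field names are
    assumptions of this sketch, not the schema used by Kino-Phylo.
    """
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f'{field}:"{value}"') for field, value in concept_filters]
    return urlencode(params)


# Search for the word "parsimony", restricted to documents annotated
# with the concept "maximum parsimony".
query_string = build_facet_query(
    "parsimony", concept_filters=[("concept", "maximum parsimony")]
)
decoded = parse_qs(query_string)  # decode again to inspect the parameters
```

Each `fq` entry narrows the result set without affecting relevance ranking, which is what allows a user interface to add or remove facet selections independently of the free-text query.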
Browser plug-in for phylogenetic annotation
To use the browser plug-in, the user opens a topical web document in her browser and highlights words and phrases of interest. The plug-in provides hints on matching concepts fetched from NCBO. The user can also opt to browse for a concept in any ontology in NCBO (Figure 5). Once the annotations are added, the user can submit the annotations to a predefined Kino-Phylo instance (configured via the plug-in configuration page), by selecting the "publish annotations" menu item.
The plug-in modifies the HTML source of the document and embeds annotations using the SA-REST specification [12]. At submission, the augmented document tree in the browser is serialized and submitted as XML to the back-end index via the document submission API (see next section).
Kino-Phylo index and search manager
The Kino-Phylo index manager is based on Java JSP/Servlet technology and includes two major components: the Document Submission API and the Search API. The submission API acts as the receiver for submitted documents. Once a document has been received via the Document Submission API, it is filtered for embedded annotations and indexed. The indexer performs full-text indexing as well as dedicated indexing of the extracted concepts. Additionally, the indexing process retrieves extra information (such as synonyms) from NCBO and inserts this information into the index as well.
The Kino-Phylo search interface includes a selection window that helps users filter search results. For example, a user can search for parsimony as a concept or as a word. Once she finds a set of documents, they can be further filtered by co-occurring concepts. For example, among the documents that mention parsimony, she can keep only those in which parsimony is annotated as a method used in the phylogenetic study. The user interface includes an intuitive facet selection tool that helps the user filter the results.
Annotation of NeXML files with Kino-Phylo
Vos et al. [11] proposed NeXML as an exchange standard for representing phylogenetic data, inspired by the commonly used NEXUS format [34] but more robust and easier to process. XML formats such as NeXML play an essential role in promoting the accessibility and reuse of data on the web. Using this technology can simplify the processing of rich phylogenetic data, make it more robust, and enable reuse of the data.
Annotations in NeXML are expressed using recursively nested "meta" elements that conform to RDFa syntax. The annotations thus form triples of subject, predicate, and object, where the subject is either a fundamental data resource from the NeXML document, such as a tree, character state matrix, or taxon, or, transitively, the object of another triple. Instead of trying to provide vocabulary for all metadata types within the NeXML standard, users can thus use vocabularies or ontologies in common usage in the phyloinformatics community to annotate NeXML files. To demonstrate this facility, we annotated NeXML documents using Kino-Phylo. With this approach, users can identify concepts from any NCBO ontology using exact or approximate searches to annotate a selected element in a NeXML file (Figure 6). Users can then attach the desired triple to a NeXML element, so that a statement can be made such as (subject NeXML element) "tree" (predicate) has-substitution-model (object) nucleotide-substitution-model.
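The triple structure described above can be sketched with Python's standard `xml.etree.ElementTree` module. The example attaches a "meta" child to a "tree" element, so that the tree is the subject, the `rel` attribute carries the predicate, and the `href` attribute carries the object. The helper name `add_resource_meta`, the `phylont:` prefix, the example.org URI, and the exact predicate spelling are placeholders chosen for illustration; they are not the identifiers used by PhylOnt or Kino-Phylo, and the sketch assumes that the `nex` and `phylont` prefixes are declared on the document root.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Keep readable prefixes when the document is serialized.
ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)


def add_resource_meta(subject, meta_id, predicate, object_uri):
    """Attach a (subject, predicate, object) annotation to a NeXML element.

    The new "meta" child makes `subject` the subject of the triple. Because
    the returned element can itself be passed as `subject`, annotations can
    be nested recursively.
    """
    meta = ET.SubElement(subject, f"{{{NEX}}}meta")
    meta.set(f"{{{XSI}}}type", "nex:ResourceMeta")
    meta.set("id", meta_id)
    meta.set("rel", predicate)    # predicate, written as a prefixed name
    meta.set("href", object_uri)  # object, identified by a URI
    return meta


tree = ET.Element(f"{{{NEX}}}tree", {"id": "tree1"})
meta = add_resource_meta(
    tree,
    "meta1",
    "phylont:has_substitution_model",                  # hypothetical predicate
    "http://example.org/NucleotideSubstitutionModel",  # placeholder URI
)
xml_text = ET.tostring(tree, encoding="unicode")
```

Because the object is identified by a URI rather than free text, any tool that understands the ontology can interpret the annotation without NeXML itself having to define the vocabulary.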