Background

In 1859, Darwin produced one of the first illustrations of a phylogenetic tree; notably, this was the only figure included in The Origin of Species [1]. Since then, biologists have used trees to depict the relationships between organisms, genes and genomes. The number of studies depicting phylogenies exploded (see Figure 1) with the development of the polymerase chain reaction technique, and journals were created specifically for publishing the molecular phylogenies generated by researchers (e.g., Molecular Phylogenetics and Evolution, established in 1992). In the early years of morphological and molecular phylogenetics, embedding illustrations in manuscripts may have been the most appropriate way to disseminate knowledge, but this practice has locked phylogenetic hypotheses into the pages of journals and books, with no easy way to access the information.

Figure 1

Percentage of articles with phylogen* in the title. The percentage of articles with phylogen* in the title out of the total number of publications in PubMed for each year since 1980.

The construction of the relationships between the 1.8 million currently estimated species largely depends on the unprecedented growth of molecular sequence data [2], which makes GenBank the most accessible source of comparative data for most taxa in the tree of life [3]. More sequence data, more powerful computers and improved phylogenetic reconstruction algorithms will enable researchers to generate up-to-date phylogenies from the raw data in the future. Nevertheless, past phylogenetic inferences will remain central to guiding researchers towards poorly supported relationships and under-sampled lineages. They are also central to studying the effects of new phylogenetic methodologies and of new and larger datasets [2].

Not all phylogenetically informative data are confined to sequence databases. TreeBASE is a very valuable repository because it holds morphological or genetic data together with the associated published phylogeny [4]. However, as few publishers require submission to TreeBASE as a prerequisite for publication, a large number of phylogenies remain embedded as images in published articles. Indeed, the rapid growth of published phylogenies is not matched by the availability of those trees in databases (see Figure 1 in [5]).

The idea of using a program to convert a tree image into a computer-readable representation of that tree was first implemented in TreeThief [6], which required the user to trace a tree by clicking on each of its nodes in turn. TreeThief is only available for the discontinued operating system Mac OS 9. Laubach and von Haeseler [7] provided a conceptual advance with a semi-automatic program called TreeSnatcher, which has recently been updated [8]. TreeSnatcher uses image-processing methods to prepare a tree image and detect the tree structure; it works on rectangular and freeform trees (e.g., radial and star). The user supervises the tree recognition process by making corrections to the image. For example, the user can modify the image to make the foreground dark and the background light, fill gaps in lines and identify the foreground. The program then determines the locations of inner nodes and tips. The user can add or remove further nodes and delete or add branches. Finally, the user must assign species names to the tips before the program can build the Newick tree code.

Here, we review the ways researchers present their phylogenies, demonstrate the feasibility of fully automated tree recognition software and provide a dataset of tree images and associated tree files for training and/or benchmarking future programs.

Implementation

The current version of TreeRipper opens tree-image files in PNG, JPG/JPEG or GIF format. Images must meet the following prerequisites:

  • The tree must have the root on the left and the leaves on the right.

  • The branches must be horizontal.

  • The tree must constitute a dark foreground on a light homogeneous background (no background boxes or shading).

  • The tree must be bi- or multifurcating (not a network).

  • The inner nodes are branching points between lines and have no circles, rectangles, etc. inscribed.

  • Tip branches must have branch lengths greater than 0.

TreeRipper is written in C++ using a set of Standard Template Library (STL) algorithms provided by Magick++. The image is first converted to black and white and rescaled so that horizontal lines are on average 2 pixels thick. The image is then cleaned by removing a series of patterns, such as black pixels surrounded by a box of white pixels and horizontal lines that are not connected to vertical lines. Lines and corners are patched up before the contour is traced and the topology detected. The locations of the branch tips are then used to crop the tip labels from the original image, and the tip labels are converted to text using the freely available tesseract-ocr program. The steps in the program are depicted in Figure 2. The web application, written in PHP, enables visualization of the tracing and allows the labels to be edited.
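Two of the steps described above, removing black pixels surrounded by white pixels and rescaling to a target line thickness, can be illustrated with a minimal Python sketch. This is not TreeRipper's C++/Magick++ code. The image representation (a list of rows with 1 for black and 0 for white), the function names, and the use of the median height of vertical black runs as a proxy for line thickness are all assumptions made for illustration; the text does not describe how TreeRipper measures thickness.

```python
from statistics import median
from typing import List

Image = List[List[int]]  # assumed encoding: 1 = black (foreground), 0 = white


def remove_isolated_pixels(img: Image) -> Image:
    """Return a copy in which black pixels with no black 8-neighbour are whitened."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 0
    return out


def typical_line_thickness(img: Image) -> float:
    """Median height of vertical runs of black pixels (hypothetical thickness proxy)."""
    runs = []
    h, w = len(img), len(img[0])
    for x in range(w):
        run = 0
        for y in range(h):
            if img[y][x] == 1:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:  # run reaching the bottom edge of the image
            runs.append(run)
    return float(median(runs)) if runs else 0.0


# A 4-pixel-thick horizontal line plus one speck in the bottom-right corner.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
cleaned = remove_isolated_pixels(image)
thickness = typical_line_thickness(cleaned)
scale_factor = 2.0 / thickness  # factor that would bring lines to about 2 pixels
```

In this toy image the speck is removed and the line is 4 pixels thick, so the image would be scaled by 0.5. The median is used rather than the mean so that the long runs produced by vertical branches do not dominate the estimate.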

Figure 2

Architecture of the software design for TreeRipper. The input image is scaled, node labels are removed, branches are smoothed and corners patched up, and the contour is detected. Tip locations are used to determine leaf label boxes, for which the text is recognised using Tesseract. TreeRipper summarizes the tree topology and labels in a text file and an SVG file, which shows the contours.
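To make the final step concrete, the sketch below shows how a recognised topology and its OCR labels could be serialised as a Newick string. The nested-list representation and the function names are hypothetical; the layout of TreeRipper's own text file is not described here, and branch lengths are left out.

```python
import re
from typing import List, Union

# Hypothetical in-memory topology: a leaf is a label string,
# an inner node is a list of child nodes.
Node = Union[str, List["Node"]]


def quote_label(label: str) -> str:
    """Quote a label for Newick if it contains anything beyond simple characters."""
    if re.fullmatch(r"[A-Za-z0-9_.\-]+", label):
        return label
    # Newick quoted labels use single quotes; embedded quotes are doubled.
    return "'" + label.replace("'", "''") + "'"


def to_newick(node: Node) -> str:
    """Recursively serialise a topology (no branch lengths, no trailing ';')."""
    if isinstance(node, str):
        return quote_label(node)
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Illustrative tree with OCR-style labels containing spaces.
topology = [["Homo sapiens", "Pan troglodytes"], "Gorilla gorilla", "Pongo"]
newick = to_newick(topology) + ";"
print(newick)
```

Labels returned by OCR often contain spaces or punctuation, which is why the sketch quotes them rather than writing them out verbatim.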

Results and Discussion

We downloaded 322 images that had phylogen* or supertree in their caption from 249 articles published in the Open Access journal BMC Evolutionary Biology between 1997 and 2009. For only eleven of these 249 articles had the alignment and tree files been submitted to TreeBASE. All images were visually inspected to check whether they met the prerequisites. Twenty-four images were not phylogenies, 26 were represented as radial tree layouts, 8 as polar tree layouts and 5 as cladograms. Of those represented with a rectangular tree layout, 40 had background boxes, 31 had lines intersecting branches or branches drawn with dotted or dashed lines, 32 had circles or boxes as nodes, 6 were illustrated over multiple pages, 4 had triangles as tip leaves and 3 had leaves with zero branch lengths. A further 29 would need some form of pre-processing (rotating or splitting into component images). Of the 298 images of phylogenies downloaded, only 114 (38%) met the prerequisites for this program, which are very similar to those of the original semi-automatic recognition software TreeSnatcher [7]. This small proportion illustrates the plethora of ways in which trees are currently represented in one journal alone. Of the 114 phylogenies that met the prerequisites, the topologies of 37 trees (i.e., 32%) were successfully recognised by TreeRipper without any prior processing. The proportion of successfully recognised images was higher for phylogenies with fewer leaves (Figure 3), and the largest phylogeny successfully recognised had 115 leaves. The average processing time was 127 seconds (ranging from 4 to 562 seconds) on a MacBook Pro (2.4 GHz Intel Core 2 Duo with 2 GB 667 MHz DDR2 SDRAM). We do not review the accuracy of the OCR here, as this has been done elsewhere (see [9]).
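The counts reported above can be checked with a few lines of arithmetic. The sketch below assumes the exclusion categories are mutually exclusive, which is consistent with the totals given; the variable names and the overall success rate are derived here for illustration and are not quoted from the text.

```python
total_images = 322
not_phylogenies = 24
phylogenies = total_images - not_phylogenies

# Images failing a prerequisite, assumed to be counted once each.
excluded = {
    "radial layout": 26,
    "polar layout": 8,
    "cladogram": 5,
    "background boxes": 40,
    "intersecting, dotted or dashed lines": 31,
    "circles or boxes as nodes": 32,
    "spread over multiple pages": 6,
    "triangles as tip leaves": 4,
    "zero-length tip branches": 3,
    "needs pre-processing": 29,
}
meeting_prerequisites = phylogenies - sum(excluded.values())

recognised = 37
pct_meeting = round(100 * meeting_prerequisites / phylogenies)
pct_recognised = round(100 * recognised / meeting_prerequisites)
pct_overall = round(100 * recognised / phylogenies)  # derived, not stated in the text

print(phylogenies, meeting_prerequisites, pct_meeting, pct_recognised, pct_overall)
```

Under that assumption the categories account for all 184 excluded phylogenies, leaving the 114 eligible images reported, and the 37 recognised trees correspond to roughly 12% of all 298 phylogeny images.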

Figure 3

Proportion of images successfully recognised. The proportion of tree images successfully recognised by TreeRipper according to the number of leaves on the phylogeny.

The successfully recognised tree images along with a further 63 images manually converted to tree files are provided as supplementary material in NEXUS, Newick and phyloXML formats [10] (Additional file 1) for training and/or benchmarking future programs.
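For readers unfamiliar with how these formats relate, the sketch below wraps a Newick string in a minimal NEXUS TREES block. The tree, the tree name and the function name are hypothetical, and the layout of the files in Additional file 1 may differ.

```python
def newick_to_nexus(newick: str, tree_name: str = "tree1") -> str:
    """Wrap a single Newick tree in a minimal NEXUS file with one TREES block."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"  # both formats terminate the tree statement with ';'
    lines = [
        "#NEXUS",
        "BEGIN TREES;",
        f"    TREE {tree_name} = {newick}",
        "END;",
        "",
    ]
    return "\n".join(lines)


nexus_text = newick_to_nexus("((A,B),C)")
print(nexus_text)
```

A benchmarking script could use such a wrapper to compare a program's Newick output against reference trees stored in either format.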

Conclusions

Although the program has a high failure rate, it is a first step towards an automated approach for optical tree recognition and demonstrates the feasibility of an approach that will allow us to defrost published phylogenetic hypotheses. We are unlikely ever to be able to create an application that recognises all possible trees, owing to the sheer diversity of ways in which phylogenies have been illustrated. At the very least, however, this program could be used to automate tree recognition for large sets of tree images, leaving manual conversion or semi-automated programs like TreeSnatcher for the trees that were not converted.

As phylogenetics enters a third phase of growth with the advent of next-generation sequencing, one hopes that the work of future phylogeneticists will be published in a format that enables the digital curation and preservation of their hard work.

Availability and requirements

Project name: TreeRipper (automated phylogeny recognition from images)

Webserver: http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper

Project home page: https://code.google.com/p/treeripper/

Programming language: C++, with a PHP web interface

License: GNU GPL v3

Prerequisites

Tesseract-OCR, licensed under the Apache 2.0 License, except for tesseractTrainer.py, which is licensed under the GPL: http://code.google.com/p/tesseract-ocr

ImageMagick, whose license is compatible with the GPL: http://www.imagemagick.org/