Software review: the KNIME workflow environment and its applications in genetic programming and machine learning
Software comes in various forms, from the hair-shirt style of the command line to full-blown, GUI-based commercial offerings. The former tends to give its users more control, but disenfranchises many potential users who cannot themselves program yet who might otherwise benefit from it. A kind of halfway house is represented by software environments that provide both flexibility (power) and ease of use. A particular subset is represented by workflow environments, in which loosely coupled, individual processing nodes can be ‘bolted together’ to permit complex computational operations. Taverna (http://www.taverna.org.uk/) is a very well-known scientific workflow system, especially in bioinformatics. It is a fully open environment, freely available, and workflows can be shared via its sister site myExperiment (http://www.myexperiment.org/). It has some extensions for cheminformatics. One of its strengths is the means by which it can use Web services to link federated Web-based resources, a characteristic feature of bioinformatics.
For cheminformatics (see also ), we have been using the KNIME environment [4, 5]. KNIME stands for the KonstaNz Information MinEr and is pronounced ‘NIME’ (with a silent ‘K’, as in ‘knife’). It is freely available via www.knime.org for unrestricted use on the desktop, with versions for MS Windows, Linux and Mac OS X. As datasets may be large, a reasonably beefy machine is advised. The download itself is just over 1 GB, and installation is both automated and simple. (There is an otherwise identical commercial offering available at www.knime.com; its chief differences are that the environment may be extended to servers, and to clusters that run the Sun Grid Engine.) Our main experience is with the Windows desktop version. Under the hood, KNIME is built on the Eclipse environment, with Java as its main internal language. Many other languages can be used with it, however, as detailed below.
The particular beauty of KNIME for cheminformaticians is that a great many tools have been produced that allow standard procedures, e.g. converting chemical structures to computer-readable encodings, to be implemented without additional programming. We regularly use the RDKit (e.g. ) nodes. Most nodes shown in the figure come with the vanilla-flavoured version of KNIME and/or the many free add-ons. One such is the ‘Tree Ensemble Learner’, which is from KNIME Labs. However, programmers can create nodes of arbitrarily complex function by ‘wrapping’ code in nodes that ‘understand’ (parse) one of a number of languages, such as MATLAB, R, Perl and Python (native nodes are written in Java using a freely available SDK). Thus the PLS regression node simply wraps a call to a standard R library, while the GP metanode wraps a fairly standard but detailed GP written (by SO’H) in Python. This metanode can easily be configured by its user. The chief disadvantage of this implementation is that one cannot watch the GP while it runs, but its progress can be recorded and exported post hoc (here to show fitness vs. time for training and validation sets). The final two windows show a list of available workflows (top left) and a list of frequently or recently used nodes.
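The idea of recording GP progress for export after the run can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GP described in this review: the function names `record_generation` and `history_to_csv`, the column names and the toy fitness values are all hypothetical, and the mechanism by which a KNIME scripting node would hand the resulting table on to the rest of the workflow is deliberately not shown.

```python
import csv
import io


def record_generation(history, generation, train_fitness, valid_fitness):
    """Append one row of progress to an in-memory history list."""
    history.append(
        {
            "generation": generation,
            "train_fitness": train_fitness,
            "valid_fitness": valid_fitness,
        }
    )


def history_to_csv(history):
    """Serialise the recorded history as CSV text for export after the run."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["generation", "train_fitness", "valid_fitness"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(history)
    return buffer.getvalue()


# Toy usage with made-up fitness values (hypothetical, not results from the review).
history = []
for gen, (train, valid) in enumerate([(0.50, 0.48), (0.62, 0.55), (0.70, 0.58)]):
    record_generation(history, gen, train, valid)

csv_text = history_to_csv(history)
```

Keeping the history as a plain list of rows means the evolutionary loop itself needs no plotting or display code; the table can be charted by whatever downstream tool is preferred once the run has finished.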
To give an idea of speed, writing the GP metanode took a few days. Given this, however, assembling the workflow of Fig. 1 took just a couple of hours, and running it for 1000 GP generations with a population size of 200 and including niching (the slow step) took only 20 min on a standard desktop PC.
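One plausible reason for niching being the slow step is that many niching schemes compare every individual with every other in each generation. The review does not say which scheme was used, so the sketch below assumes classic fitness sharing with a triangular sharing function; the names `sharing`, `shared_fitness`, `sigma_share` and `pairwise_comparisons`, and the one-dimensional toy genomes, are hypothetical.

```python
def sharing(distance, sigma_share):
    """Triangular sharing function: 1 at distance 0, falling to 0 at sigma_share."""
    if distance >= sigma_share:
        return 0.0
    return 1.0 - distance / sigma_share


def shared_fitness(genomes, raw_fitness, sigma_share, distance):
    """Divide each raw fitness by its niche count (assumes higher fitness is better).

    This straightforward loop evaluates all n * n distances, including each
    individual against itself, so the niche count is always at least 1.
    """
    shared = []
    for i, g_i in enumerate(genomes):
        niche_count = sum(sharing(distance(g_i, g_j), sigma_share) for g_j in genomes)
        shared.append(raw_fitness[i] / niche_count)
    return shared


def pairwise_comparisons(population_size):
    """Number of distinct pairs if the symmetry of the distance is exploited."""
    return population_size * (population_size - 1) // 2


# Toy example: genomes are plain numbers and distance is the absolute difference.
genomes = [0.0, 0.1, 5.0]
raw = [1.0, 1.0, 1.0]
result = shared_fitness(genomes, raw, sigma_share=1.0, distance=lambda a, b: abs(a - b))

# Population size taken from the text above.
pairs_per_generation = pairwise_comparisons(200)
```

In the toy example the two crowded genomes are penalised while the isolated one keeps its raw fitness. For the population of 200 quoted above, this gives 19,900 distinct pairs per generation, or about 19.9 million distance evaluations over 1000 generations, which is consistent with niching dominating the run time when each distance is itself costly to compute.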
Where KNIME and related workflow systems come to the fore is in their ability to let ‘naïve’ users (re)create complex analyses simply by reusing existing nodes or whole workflows, sometimes by changing nothing more than file names. Thus some rather sophisticated workflows that compared the structures of ‘natural’ human metabolites with those of marketed drugs and other chemicals, outputting the analysis in the form of a 2D-biclustered heatmap [4, 5], were actually just a single workflow with simple filename changes. Given the base workflow, a novice could learn to make these changes in less than an hour, though of course the time spent learning to create new workflows can be almost limitless. There are also API links to commercial software such as the Spotfire visualisation system.
Overall, this is a very sophisticated and professional piece of software. Because of its flexibility, it is nowadays our chief cheminformatics workhorse, and voting with one’s feet is surely the best possible endorsement. The KNIME philosophy and business model of mixed commercial and free (but Open) software allows its continued improvement while making it freely available to desktop users. One minor gripe is that it seems able to read but not write .xlsx files; we are confident that someone will soon write a node to let it do so. There is a substantial and steadily growing community of users, along with many training schools and the like. Because of this, we think it will continue to grow in popularity. It is well worth a look for the GP community.
We thank the Biotechnology and Biological Sciences Research Council (BBSRC) for financial support under Grant BB/M017702/1. This is a contribution from the Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM).