1 Introduction

Modern molecular biology is a science dominated by very large quantities of data, yet most useful knowledge remains locked in the scientific literature. Though large, this literature can be explored efficiently through text mining, the application of computational methods to identify and extract entities and their relationships from text (Ananiadou and McNaught 2006). There are already many text mining services for biology that enrich papers with semantic annotations for richer querying and that extract relations between annotated entities. To illustrate, current biological text mining can identify proteins within text (Rebholz-Schuhmann et al. 2007; Nobata et al. 2008), pull out their physical interactions (Miyao et al. 2009) and their associations with disease states, phenotypes and other terms (Hoffmann and Valencia 2005; Tsuruoka et al. 2008). One can also resolve biological abbreviations (Okazaki et al. 2010), resolve species ambiguity (Wang et al. 2010), or make semantically rich queries over the literature (“what activates p53?” being a more meaningful search than simply “p53 activation”) (Miyao et al. 2006). For most applications the leading methods compare favourably to expert annotators but can be applied on a much larger scale, a task made simple by workflow systems (Kano et al. 2009, 2010). Indeed, given the increasing rate of scientific publication, it is clear that increased automation through text mining is the only way to reach a useful understanding of the biological literature (Ananiadou et al. 2006, 2010).

In the post-genomic era, through the burgeoning discipline of systems biology (Kell 2009), we are beginning to be able to consider molecular biology properly as the integrated system it evidently is. Underpinning systems biology are ideas connecting data-rich experimental approaches with computational simulations (Mendes et al. 2009) of the underlying biochemistry, moving toward ever more accurate depictions of how life operates at the molecular level. It is therefore necessary to understand the network of interactions and reactions that occur in the cell, most usefully in a standardized format such as SBML (Hucka et al. 2003, 2004). Such reconstructions have benefitted greatly from genome-driven identification of the metabolic enzymes and transporters that constitute the macromolecular ‘parts list’ of metabolism. The remaining molecular species required are the small endogenous molecules of the metabolome. While these are increasingly identified by high-throughput experiments, most knowledge of metabolites and their reactions is still reported primarily in the scientific literature. Considerable manual effort has gone into extracting metabolite information from papers into metabolite databases such as HMDB (Wishart et al. 2009) and reaction databases such as KEGG (Kanehisa and Goto 2000; Kanehisa et al. 2006, 2010) and BioCyc (Karp et al. 2005), and major community efforts have produced robust, well-annotated reaction networks defined in SBML. It remains the case, however, that the very large literature around our organism of interest, the yeast Saccharomyces cerevisiae, still harbors many uncaptured metabolites. Here we apply text mining and cheminformatics methods to detect and structurally identify novel metabolites, moving toward a definition of the reported yeast metabolome for consideration in future metabolic network reconstructions.

Few small molecule resources are limited solely to metabolites (much less only to yeast metabolites), owing to a lack of consensus among biologists on the definition of metabolism. Without a strict definition it is inevitable that drugs, nutrients and other molecules at the boundary of metabolism are arbitrarily included in, or excluded from, different metabolite resources according to the idiosyncrasies of database requirements or annotator interpretation. Our challenge here is not to remedy this by imposing a particular definition of metabolism, but to replicate the standard of manually curated databases without dwelling too much on the ill-defined region. In using text mining to emulate the somewhat subjective and rather ill-defined biologist’s concept of metabolism, the challenge differs from related work on chemical text mining (Banville 2006; Corbett and Copestake 2008; Jiao and Wild 2009; Pirkola 2008; Townsend et al. 2004; Wilbur et al. 1999; Wren 2006), whose scope is, by contrast, rather easier to define and mark up in documents. The broader scope of these tools leads to the identification of many chemicals that are not metabolites. While such tools are relevant, the narrower focus of the metabolite mining task precludes their direct application.

The identification of metabolite names within the scientific literature by text mining is only one part of the problem. Once a metabolite name is found it is most useful to identify its molecular structure. Various mechanisms for converting chemical names into structures exist. Some rely upon systematic naming conventions (for example, IUPAC nomenclature) that can be interpreted directly (Corbett and Murray-Rust 2006; Eller 2006; Klinger et al. 2008), although metabolite names rarely follow such conventions rigorously; others are primarily underpinned by large chemical dictionaries (Hettne et al. 2009; Klinger et al. 2008; Brecher 1999; Goebels et al. 1991; Wisniewski 1990). Given the tendency toward non-standard naming of metabolites, together with the size of ChemSpider (2007) and its machine accessibility through web services, ChemSpider forms the basis of our name-to-structure conversions.

Detecting metabolites in text is here posed as a Named Entity Recognition (NER) task. NER is a technique that finds the boundaries and semantic categories of specific terms in text. NER was originally defined for information extraction from newswire articles (MUC6 1995), in which the NE categories are proper nouns (such as person and location names) and numerical expressions (such as dates and monetary amounts). The same technique has also been applied in the biomedical domain to annotate protein, gene or organism names (Takeuchi and Collier 2002; Kim and Tsujii 2002; Kazama et al. 2002; Kim et al. 2003, 2004; Tsuruoka and Tsujii 2004; Cohen 2005; Finkel et al. 2005; Hirschman et al. 2005; Krallinger et al. 2005). In this paper we report the manually annotated corpus we created for metabolite NER and the evaluation results of the NER system trained on the corpus. We also apply the NER system to a large set of un-annotated abstracts from the yeast literature to extract novel metabolites. These entities are then mapped by name to ChemSpider to resolve their structures and move toward a definition of the reported yeast metabolome.

2 Materials and methods

2.1 Construction of manually annotated data

Two domain experts (Annotators A and B) annotated metabolite expressions in MEDLINE (2007) abstracts. The target documents are 296 MEDLINE abstracts included in version 1 of the yeast metabolic network reconstruction (Herrgård et al. 2008). Each domain expert annotated both metabolite and enzyme names in the abstracts. Annotations were restricted to names appearing in the context of metabolic pathways. For example, in the sentence “glucose is an economically important chemical in the food industry”, glucose does not appear in its role as a metabolite. When a metabolite name appears as part of an enzyme name, the metabolite part is not annotated (e.g. “diadenosine hexaphosphate hydrolases” is annotated as an enzyme, and the embedded “diadenosine hexaphosphate” is not annotated as a metabolite).

In this work we have focused on the annotation of metabolite names; the enzyme annotations were used only to exclude a metabolite name when it forms part of an enzyme name. The gold-standard (consensus) data were created by integrating the two manual annotations. The evaluation results of the two manual annotations compared to the gold-standard data are shown in Table 1. The f-scores are 88.49 (Annotator A) and 78.35 (Annotator B). The difference between the annotators was mainly due to Annotator B annotating more terms than Annotator A. This also demonstrates the ambiguity that can arise in annotating metabolites, as defining “a metabolite” is itself inherently difficult.

Table 1 Evaluation of manually annotated data
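For concreteness, the span-level scoring used for these comparisons can be sketched as follows. This is an illustrative Python fragment, not the evaluation code used in this work; it assumes annotations are represented as exact character-offset spans and reports the standard precision, recall and balanced f-score.

```python
def evaluate_spans(gold, predicted):
    """Exact-match precision, recall and f-score over annotation spans.

    `gold` and `predicted` are sets of (doc_id, start, end, label) tuples,
    e.g. ("PMID:123", 10, 17, "metabolite").
    """
    tp = len(gold & predicted)          # spans present in both annotations
    fp = len(predicted - gold)          # predicted spans absent from the gold standard
    fn = len(gold - predicted)          # gold spans the prediction missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Example: compare one annotator's spans against the consensus gold standard.
gold = {("PMID:1", 0, 7, "metabolite"), ("PMID:1", 20, 31, "metabolite")}
annotator_a = {("PMID:1", 0, 7, "metabolite"), ("PMID:1", 40, 45, "metabolite")}
print(evaluate_spans(gold, annotator_a))  # (0.5, 0.5, 0.5)
```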

2.2 Methods of named entity recognition

Our method of recognizing NEs is similar to the system described by Sasaki et al. (2008) and consists of two components. The first, dictionary-based tagging, finds candidate entities using a dictionary; the second is a supervised method trained on the results of dictionary-based NER together with the manually annotated data.

The first component uses a dictionary of metabolites from the yeast consensus metabolic reconstruction to annotate metabolite names appearing in the abstracts. It also uses enzyme term lists (Bairoch 2000) so that the system can ignore metabolite names that occur as part of enzyme names (Table 2).

Table 2 Numbers of entries used in dictionary-based NER
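The dictionary-based tagging step can be pictured with a minimal sketch along the following lines. It is illustrative only: the matching strategy and data structures are assumptions, not the published implementation. It applies longest-match lookup of metabolite terms and discards any metabolite mention nested inside a matched enzyme name.

```python
import re

def dictionary_tag(text, metabolite_terms, enzyme_terms):
    """Longest-match dictionary tagging of metabolite names, ignoring any
    metabolite mention that falls inside a matched enzyme name.

    Returns a list of (start, end, surface_form) candidate metabolite spans.
    """
    def find_spans(terms):
        spans = []
        for term in sorted(terms, key=len, reverse=True):   # prefer longer names
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.I):
                if not any(s <= m.start() < e for s, e in spans):
                    spans.append((m.start(), m.end()))
        return spans

    enzyme_spans = find_spans(enzyme_terms)
    candidates = []
    for start, end in find_spans(metabolite_terms):
        inside_enzyme = any(s <= start and end <= e for s, e in enzyme_spans)
        if not inside_enzyme:
            candidates.append((start, end, text[start:end]))
    return candidates


text = "Diadenosine hexaphosphate hydrolases act on diadenosine hexaphosphate."
metabolites = ["diadenosine hexaphosphate"]
enzymes = ["diadenosine hexaphosphate hydrolases"]
print(dictionary_tag(text, metabolites, enzymes))
# Only the second, free-standing mention is kept as a metabolite candidate.
```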

The second component, statistical sequential labeling, is a supervised method trained on the manually annotated data. The module uses the results of dictionary-based NER, together with word, orthographic and part-of-speech information, as features to predict the NE labels. We also added the results of dictionary lookup against ChEBI (Degtyarenko et al. 2008) and HMDB (Wishart et al. 2009) as a further feature. Table 3 shows the numbers of entries obtained from the databases used in annotating metabolites.

Table 3 Numbers of entries extracted from databases

We use the open-source morphological analyzer Mecab (2008) as a POS tagger. Word features are the surface form of the word and its suffixes (the last two and the last four letters). Orthographic features are the first letter and the last four letters of the normalized word form, in which capital letters are normalized to “A”, lower-case letters to “a” and digits to “0”; for example, the word form of IL-2 is AA-0.
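A minimal sketch of this feature extraction is shown below. The exact feature templates and their encoding are assumptions for illustration; only the word, suffix and orthographic normalization behaviour described above is reproduced.

```python
import re

def word_shape(token):
    """Normalize capitals to 'A', lower case to 'a' and digits to '0',
    e.g. 'IL-2' -> 'AA-0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return shape

def token_features(token, pos_tag, dict_label):
    """Features for one token: surface form, suffixes, orthographic shape,
    POS tag and the dictionary-based NER label (a simplified feature set)."""
    shape = word_shape(token)
    return {
        "word": token,
        "suffix2": token[-2:],       # last two letters
        "suffix4": token[-4:],       # last four letters
        "shape_first": shape[:1],    # first letter of the normalized form
        "shape_last4": shape[-4:],   # last four letters of the normalized form
        "pos": pos_tag,
        "dict": dict_label,          # e.g. "B-metabolite" from dictionary NER
    }

print(token_features("IL-2", "NN", "O"))
# {'word': 'IL-2', 'suffix2': '-2', 'suffix4': 'IL-2', 'shape_first': 'A',
#  'shape_last4': 'AA-0', 'pos': 'NN', 'dict': 'O'}
```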

The NE labels adopt the IOB2 format (Sang and Veenstra 1999): the first token of a target sequence is labeled “B” for “Beginning” (e.g. B-metabolite), the intermediate and final tokens of the sequence are labeled “I” for “Intermediate” (e.g. I-metabolite), and all other tokens are labeled “O” for “Other”. For instance, the sequence “7-keto-8-aminopelargonic acid” is annotated as “B-metabolite I-metabolite I-metabolite”. Conditional Random Field (CRF) models (Lafferty et al. 2001) are used to predict the IOB2 labels from the above features.
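The labeling scheme and the CRF step can be illustrated with the sketch below, which uses the open-source sklearn-crfsuite package as a stand-in for the CRF implementation (the toolkit used in this work is not specified here); the toy feature dictionaries and training sentence are invented for illustration.

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

# One toy training sentence: per-token features (simplified) and IOB2 labels.
X_train = [[
    {"word": "D-glucose",      "dict": "B-metabolite"},
    {"word": "6-phosphate",    "dict": "I-metabolite"},
    {"word": "is",             "dict": "O"},
    {"word": "phosphorylated", "dict": "O"},
]]
y_train = [["B-metabolite", "I-metabolite", "O", "O"]]

# Linear-chain CRF predicting IOB2 labels from per-token features.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # predicted IOB2 label sequence for the sentence
```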

2.3 Cheminformatics workflows

All cheminformatics workflows were implemented in Pipeline Pilot (Accelrys, San Diego, CA). ChemSpider searches and retrieval used the ChemSpider search web service. Clustering of non-unique hits was performed using the ‘Cluster Molecules’ component in a connectivity fingerprint space (FCFP4) at a maximum Tanimoto distance of 0.15 (Dobson et al. 2009). Lists of molecules were collapsed using the ‘Merge molecules’ component, which acts upon canonical SMILES representations of molecular structure.
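For readers without access to Pipeline Pilot, the two components can be approximated with open-source tooling. The sketch below uses RDKit, with feature-based Morgan fingerprints of radius 2 as a rough analogue of FCFP4 and Butina clustering at a Tanimoto distance cutoff of 0.15; it is an approximation of, not a substitute for, the components actually used, and the example molecules are illustrative.

```python
# pip install rdkit   (an open-source stand-in for the Pipeline Pilot components)
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def cluster_molecules(smiles_list, cutoff=0.15):
    """Butina clustering on feature-based Morgan fingerprints (radius 2,
    a rough open-source analogue of FCFP4) at a maximum Tanimoto distance
    of `cutoff`. Returns tuples of molecule indices, one tuple per cluster."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048, useFeatures=True)
           for m in mols]
    dists = []                                   # flat lower-triangle distance matrix
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

def merge_molecules(smiles_list):
    """Collapse a list of structures to unique canonical SMILES."""
    return sorted({Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in smiles_list})

# Two glucopyranose stereoisomers cluster together; ATP stands apart.
hits = ["OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O",
        "OC[C@H]1OC(O)[C@@H](O)[C@@H](O)[C@@H]1O",
        "Nc1ncnc2n(cnc12)C1OC(COP(=O)(O)OP(=O)(O)OP(=O)(O)O)C(O)C1O"]  # ATP-like
print(len(cluster_molecules(hits)))   # 2 clusters
print(merge_molecules(hits[:2]))      # two distinct canonical SMILES
```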

3 Results and discussion

3.1 Evaluation of the NER system on a test corpus

We conducted an experiment to evaluate our NER system on the manually annotated corpus. As stated above, the corpus contains 296 abstracts included in the yeast consensus metabolic reconstruction, with annotations of metabolites and enzymes. Enzyme annotations are used only to select proper metabolite names; the system does not annotate enzyme names. Table 4 shows the evaluation results. The dictionary-based NER system (Dict-NER) identifies metabolites purely by reference to dictionaries, whereas the supervised NER system (CRF-NER) extends this by learning from linguistic cues in the annotated corpus. The CRF-NER system is evaluated with ten-fold cross validation. By applying the CRF to the results of dictionary-based NER, the system improved the f-score from 63.66 to 78.49.

Table 4 Evaluation of experimental results with the test corpus
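The cross-validation protocol can be sketched as below (illustrative only, again using sklearn-crfsuite). Note that the scores reported in this paper are entity-level f-scores from span matching, whereas this simplified fragment scores token labels with sklearn-crfsuite's flat metrics, so the numbers it produces are not directly comparable.

```python
from sklearn.model_selection import KFold
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def cross_validate(X_sents, y_sents, n_splits=10):
    """Ten-fold cross validation; each item is one tokenized sentence
    (grouping sentences by abstract would be more faithful)."""
    f_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X_sents):
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit([X_sents[i] for i in train_idx], [y_sents[i] for i in train_idx])
        y_pred = crf.predict([X_sents[i] for i in test_idx])
        # Token-level F over the metabolite labels only.
        f_scores.append(metrics.flat_f1_score(
            [y_sents[i] for i in test_idx], y_pred,
            average="weighted", labels=["B-metabolite", "I-metabolite"]))
    return sum(f_scores) / len(f_scores)
```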

We also compared our evaluation results to those available through Whatizit pipelines (Rebholz-Schuhmann et al. 2007). Whatizit is a text processing system that identifies molecular biology terms in text. Three Whatizit pipelines that annotate chemical entities are relevant to metabolite annotation: whatizitOscar3, whatizitChebiDict and whatizitChemicals. The whatizitOscar3 pipeline provides annotations of chemical entities by Oscar3 (Corbett and Murray-Rust 2006; Batchelor and Corbett 2007), and whatizitChebiDict provides annotations of ChEBI entities based on a dictionary approach. The whatizitChemicals pipeline combines the annotations of the whatizitOscar3 and whatizitChebiDict pipelines together with drug and protein names. The evaluation results are shown in Table 5.

Table 5 Evaluation of Whatizit results with the test corpus

For the results from Oscar3, we compared entities annotated as CM (chemical molecules) and as ONT (ontology terms) with the test corpus, and ignored the other categories (i.e. CPR, RN, ASE) because they do not include metabolite annotations. For the whatizitChemicals pipeline we show two results. The best precision and f-score are achieved when only the CM and drug names are used in the evaluation (CM, DRUG). The best recall is achieved when all annotated categories except Oscar3’s ignored categories are used (CM, ONT, DRUG, PROTEIN).

Although the recall of these pipelines is higher than that of our system, their precision and f-scores are lower because they annotate non-metabolites. Because these pipelines are intended to annotate general chemical entities, their results include more false positives than our system does for metabolite annotation.

For example, the underlined expressions in the following sentence are annotated by the whatizitChemicals pipeline but are not suitable annotations of metabolite names: “The YJR019C product is highly similar to tesB, a bacterial acyl-CoA thioesterase, and carries a tripeptide sequence, alanine-lysine-phenylalanineCOOH, that closely resembles the consensus sequence for type-1 peroxisomal targeting signals.”

3.2 Name to structure via ChemSpider

The metabolite recognition tool was applied to the corpus of S. cerevisiae literature (Gene Literature 2010) taken from the Saccharomyces Genome Database (SGD) (2010), which contains about 53,000 MEDLINE abstracts. The probability generated by the CRF model is attached to each entity to indicate the plausibility of its annotation. If the same entity is annotated more than once in the document set, its highest probability is assigned. Results are summarized in Table 6. The tool identifies 4,326 unique metabolite names; 2,441 (56%) of these are known from the dictionaries used in training, while 1,885 (44%) are potentially novel metabolites not found in the dictionaries. All names were searched against the chemical database ChemSpider to identify appropriate structures. Of the 1,885 potentially novel metabolites, ChemSpider searches return one or more hits for 1,245 names, the service fails on 7 names, and no record is matched for 633 names (33% of novel names). These 633 names therefore represent molecules that are not named in ChemSpider (as of 29 March 2010) and can be found in supplementary table S1. Of the 2,441 names known from the training dictionaries, ChemSpider searches return one or more hits for 2,338; the service fails once, and no record is matched for 102 names. These 102 names can be found in supplementary table S2.

Table 6 Summary of name to structure efforts. ‘All names’ is the number of names detected by NER, including duplicates
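The de-duplication step, which keeps the highest CRF probability per recognized name, might look like the following sketch (the case-insensitive merge is an assumption for illustration; how surface variants of a name are unified is not specified here).

```python
from collections import defaultdict

def collapse_entities(annotations):
    """Keep one record per unique metabolite name, retaining the highest
    CRF probability seen anywhere in the document collection.

    `annotations` is an iterable of (surface_name, probability) pairs,
    one per recognized mention.
    """
    best = defaultdict(float)
    for name, prob in annotations:
        key = name.lower()                 # simple case-insensitive merge (assumed)
        best[key] = max(best[key], prob)
    return dict(best)

mentions = [("glutathione", 0.91), ("Glutathione", 0.97), ("zymosterol", 0.64)]
print(collapse_entities(mentions))
# {'glutathione': 0.97, 'zymosterol': 0.64}
```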

As is to be expected, the majority of names from the dictionaries (96%) match some record in ChemSpider, whereas one-third of the novel names do not. Of the 1,245 novel and 2,338 dictionary names with matches against ChemSpider, 1,228 and 2,305 sets of structural records were retrieved (web service failures account for the remainder). 620 (50% of 1,228) novel and 1,267 (55% of 2,305) names match one and only one structure. Collapsing these to remove structural redundancy (identical structures merged into one record on the basis of their canonical SMILES representation), the 620 novel names become 528 unique structures and the 1,267 dictionary names become 1,003 unique structures. Removing redundancy from the union of both sets yields 1,435 unique chemical structures, which can be found in supplementary table S3.

Searches can hit multiple structures either because synonyms are duplicated across multiple ChemSpider records (for example, ‘glucose’ matches structures that differ only in the detail to which stereochemistry is specified), or because, upon failing to match exactly, the search service breaks the query string into parts for approximate matching and retrieves multiple inappropriate structures corresponding only to parts of the name. In the former case minor structural variants (charge, stereochemistry, and so on) are likely and are not overly difficult to reconcile manually, whereas the latter case is not useful in the context of name-to-structure conversion. To discriminate between these scenarios the results are clustered by molecular structure (maximum cluster Tanimoto distance of 0.15 in FCFP4 space). Of the 608 novel and 1,038 dictionary names with non-unique matches, 139 and 389 respectively form one and only one tight cluster. The remaining 469 and 649 names, whose hits form more than one tight cluster, are likely ChemSpider mismatches that may further extend the database in the same way as names that do not match at all. These incorrect structural matches are reported in supplementary tables S5 (dictionary names) and S6 (novel names). All 528 queries that generate only one cluster are reported with structures and ChemSpider identifiers in supplementary table S4.
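The discrimination step amounts to a per-query classification: if all retrieved structures for a name collapse into a single tight cluster they are treated as minor variants of one compound, otherwise the hits are treated as likely mismatches. The fragment below is an illustrative sketch that reuses the same RDKit fingerprint and Butina recipe as the Sect. 2.3 sketch, not the Pipeline Pilot workflow itself.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def classify_hits(hit_smiles, cutoff=0.15):
    """Decide whether the ChemSpider hits for one query name look like minor
    structural variants (one tight cluster) or likely mismatches (several
    clusters), using feature-based Morgan fingerprints and Butina clustering."""
    if len(hit_smiles) < 2:
        return "variants"                      # a single hit needs no clustering
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2,
                                                 nBits=2048, useFeatures=True)
           for s in hit_smiles]
    dists = [1.0 - sim
             for i in range(1, len(fps))
             for sim in DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])]
    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
    return "variants" if len(clusters) == 1 else "likely mismatch"
```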

3.3 Discussion

Metabolic network reconstructions for systems biology require knowledge of the list of ‘molecular parts’ involved. Macromolecular components can largely be elucidated from the genome, yet the endogenous small molecules remain mostly understood through the scientific literature. Extraction of metabolites from the literature is an arduous task for a well-studied organism such as S. cerevisiae on which well over 100,000 papers have been published. Text mining approaches are absolutely required.

The task of metabolite recognition is inherently difficult as there is no real consensus among biologists on the definition of metabolites and metabolism. Despite this, metabolic databases, even though potentially polluted with non-metabolites or liable to omit real metabolites, are of tremendous value in metabolomics and in guiding metabolic network reconstructions. Here we have constructed our metabolite recognition tool to emulate how these valuable databases are constructed, allowing the loose definition to remain.

The gold-standard corpus of 296 MEDLINE abstracts, created by two independent annotators working to agreed guidelines, contains 1,853 annotations. The corpus is provided as stand-off annotations in supplementary material S7. Agreement between the annotators is slightly lower than for similar NER tasks, largely because of ambiguity in the definition of metabolites. Expert-comparable performance on this corpus corresponds to an f-score of 78.35 (the lower annotator’s f-score). The metabolite NER reported here does slightly better, with an f-score of 78.49.

It is not possible to draw very close comparisons to competing methods, as the NER method presented here is the first of its kind to focus specifically on the recognition of endogenous metabolites. There are, however, chemical recognition tools that detect any type of small molecule in text. Being more general, such tools tend to out-perform our method on recall but are considerably bettered on precision. For example, used as a metabolite recognition tool (which it is not), the Whatizit Chemicals pipeline ‘erroneously’ identifies 1,695 chemicals in our corpus that are not metabolites. Here the benefit of a metabolite-focused NER becomes apparent, as a far more focused and manageable 220 false positives are generated, many of which, on closer analysis, appear to be attributable to misannotations (particularly where the definition of metabolism is problematic). In the sentence “<metabolite>Thioredoxin</metabolite> has been implicated in the reduction of PAPS in Saccharomyces cerevisiae.”, for example, ‘thioredoxin’ was not annotated as a metabolite, presumably because it is a protein encoded in the genome, but in many ways it is functionally more akin to metabolites and, by some definitions, might be considered as such. Curation of metabolite NER false positives will be a more efficient way to extend metabolite dictionaries than general chemical recognition tools.

The metabolite recognition tool was applied to the abstracts of yeast papers collated by SGD to identify the reported yeast metabolome. Owing to its size and high level of curation, the ChemSpider database formed the basis of our efforts to annotate names with structural information. 735 identified names do not match any record and 1,118 names match improperly; these 1,853 names represent possible extensions to ChemSpider. 1,887 names match one and only one ChemSpider record. 528 names match sets of very closely related structures, most likely because similar structures improperly share synonyms, and require further curation to identify the correct form. In total, some level of structural annotation has been attached to 2,415 names, and the 1,887 uniquely matched names collapse to 1,435 structures. Although 55% of unique names could be associated with some structural information, it is clear from NER performance on the manually annotated corpus, and from examination of names not matched by ChemSpider, that many unmatched names are real yeast metabolites. To augment the outstanding 45% of names with structural information, improved name-to-structure methods are required. Given the non-standard nature of many metabolite names, the most successful approach is likely to be based on extending ChemSpider (and similar databases) rather than on improved algorithms. Such extensions will need to cover more metabolites and improve synonym listings. The ability to automatically mine metabolites from the literature in a robust and discriminating fashion is essential to the problem of efficiently extending metabolite databases and continuing to improve metabolic network reconstructions.

4 Concluding remarks

We have created an NER system for metabolites using term lists from the yeast consensus metabolic reconstruction, trained with annotation data that we manually created from 296 MEDLINE abstracts. Our NER system achieves an f-score of 78.49 under ten-fold cross validation, which is comparable to the lower annotator’s f-score of 78.35. We have also applied our NER system to a corpus of about 53,000 MEDLINE abstracts covering S. cerevisiae, and the recognized names were searched against ChemSpider, a major chemical database, to identify appropriate structures. We identified structures for 55% of unique names (2,415/4,326), and also found many likely real yeast metabolites among the unmatched names, which are good candidates for extending metabolite databases. Defining the reported yeast metabolome has created a useful resource for the yeast and metabolomics communities. It is anticipated that the metabolite NER will also be of value for other organisms, although it should be used with caution. An important future direction is to automatically recognize reactions and metabolic pathways described in documents, eventually moving toward a fully-automated network reconstruction platform for systems biology. NER for metabolites is a vital step toward this.