Natural language query in the biochemistry and molecular biology domains based on Cognition Search

Motivation: With the tremendous growth in scientific literature, it is necessary to improve upon the standard pattern-matching style of the available search engines. Semantic NLP may be the solution to this problem. Cognition Search (CSIR) is a natural language technology. It is best used by asking a simple question that might be answered in the textual data being queried, such as MEDLINE. CSIR has a large English dictionary and semantic database. Cognition's semantic map enables the search process to be based on meaning rather than statistical word pattern matching and, therefore, returns more complete and relevant results. The Cognition Search engine uses downward reasoning and synonymy, both of which improve recall. It improves precision through phrase parsing and word sense disambiguation.
Results: Here we have carried out several projects to "teach" the CSIR lexicon medical, biochemical and molecular biological language and acronyms from curated, free web-based sources. Vocabulary from the Alliance for Cell Signaling (AfCS), the HUGO Gene Nomenclature Committee (HGNC), the Unified Medical Language System (UMLS) Metathesaurus, and the International Union of Pure and Applied Chemistry (IUPAC) was introduced into the CSIR dictionary and curated. The resulting system was used to interpret MEDLINE abstracts. Meaning-based search of MEDLINE abstracts yields high precision (estimated at >90%) and high recall (estimated at >90%) where synonym information has been encoded. The present implementation can be found at http://MEDLINE.cognition.com.


Introduction
With the increasing complexity of the biomedical literature, several labs and companies have attempted to develop better search engines for MEDLINE (1)(2)(3)(4)(5). A few free sources are available on the web, e.g. Google Scholar (http://scholar.google.com/), HighWire Press (http://highwire.stanford.edu/lists/freeart.dtl) and Medscape (http://www.medscape.com/home), whereas commercial sources of this information include Scopus (http://www.scopus.com/scopus/home.url), Ovid (http://www.ovid.com/site/index.jsp), and Infotrieve (http://www4.infotrieve.com/newMEDLINE/search.asp). We think that semantic NLP is required to properly access the biomedical literature, a view shared with many others (4)(6)(7)(8)(9)(10)(11)(12)(13). To our knowledge, however, Cognition's semantic NLP is the only technology that has thoroughly unraveled the full complexity of ordinary English. The architecture and databases of the software are such that multiple meanings of ordinary words and synonymy are resolved. The goal in search technology is to create software that finds all the desired information (full recall) without producing undesired information (high precision). Cognition's Semantic MEDLINE has the ability to target and locate specific data that are otherwise hidden in masses of information. Its comprehensive Semantic Map includes words, phrases and idioms. CSIR is also able to select the intended senses of ambiguous words, giving much better results than pattern matching.

Architecture of CSIR™
CSIR™ is a natural language processing (NLP) technology that has been under development for several years.
The patented meaning-based architecture and methods have been described previously (14)(15)(16). The technology contains a broad semantic map of English based on word senses, their synonyms (6), hypernyms (higher nodes in an ontology) (7) and sense contexts. The CSIR Indexer uses its NLP component to build a cognitive model of the text in which all of the concepts (word meanings) of a document are indexed, as well as word strings. The NLP component relies on its dictionary, semantic map, and morphological and syntactic tags (Fig. 1). At search time, CSIR interprets the query for meaning and searches for that meaning in the concept index. Since the original descriptions of this technology, significant improvements have been introduced, including sense disambiguation (8), phrase parsing (17), data compression and speed upgrades (18). The morphology and tokenization components were built in-house (patent pending). The software also uses relatively simple algorithms for phrasal parsing and document relevancy to improve precision. Demonstrations of CSIR are available at http://medline.cognition.com and http://wikipedia.cognition.com. The search engine is best used by asking a straightforward question that might be answered in MEDLINE, such as "Oxidative stress in plants," "spectroscopy of amidohydrolases," or "Depression in aging." Retrieval time on the 17 million MEDLINE abstracts is sub-second on Xeon Dual Core 3.0 GHz computers with 1 GB of RAM.

Ontology:
To augment the ontology for Biochemistry and Molecular Biology, a top ontology was constructed by hand, based upon our own domain knowledge. Websites of curated biomedical terminology were crawled to obtain a complete list of their ontological attachments. These were then mapped to our top ontology by hand.

Lexical and Concept Thesaurus Augmentation:
Biomedical terminological databases were crawled and the vocabulary (terms, phrases and acronyms) extracted, along with their synonyms and ontological classes, where available.
All vocabulary was checked for frequency in the MEDLINE abstracts, and any items with fewer than 20 occurrences were deleted. Redundancy with the current dictionary was checked automatically, and redundant items were curated by hand.
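The frequency cut-off described above can be illustrated with a minimal sketch. The function names, the whole-term matching rule and the in-memory list of abstracts are assumptions made for illustration; only the threshold of 20 occurrences comes from the text, and this is not the actual CSIR tooling.

```python
import re
from collections import Counter

MIN_OCCURRENCES = 20  # threshold stated in the text


def count_term_occurrences(terms, abstracts):
    """Count case-insensitive, whole-term occurrences of each term.

    Whole-term matching (no adjacent word characters) is an assumption;
    it keeps "ERK" from being counted inside "ERK2".
    """
    counts = Counter()
    for term in terms:
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        counts[term] = sum(len(pattern.findall(text)) for text in abstracts)
    return counts


def filter_by_frequency(terms, abstracts, min_occurrences=MIN_OCCURRENCES):
    """Keep only the terms that occur at least min_occurrences times."""
    counts = count_term_occurrences(terms, abstracts)
    return [term for term in terms if counts[term] >= min_occurrences]


if __name__ == "__main__":
    # Toy corpus: "ERK" occurs 20 times, "paraxis" only 3 times.
    corpus = ["ERK phosphorylates ATF2."] * 20 + ["Paraxis is a bHLH protein."] * 3
    print(filter_by_frequency(["ERK", "paraxis"], corpus))
```

On a corpus the size of MEDLINE, a single pass that tokenizes each abstract once would be preferable to one regular expression per term; the version above favors readability.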
Specialized programs were written to crawl each website. Curated terms, synonyms and attachments were automatically added to the CSIR semantic map. Acronym spell-outs were used as sense contexts for acronym meanings (9).

Precision and Relative Recall Test of CSIR vs PubMed.
We formulated 50 queries for the MEDLINE abstracts. The total number of CSIR retrievals was recorded, and the relevance evaluated for the top 10 and top 20 retrievals, as assessed by the UT Southwestern team. The same queries were posed to PubMed for comparison (in a Boolean format: "genetic" AND "interaction" AND "BCL2"). Relative recall was assessed by taking as full recall the largest number of relevant results found by either search engine. The queries used can be seen on the E.J.

Scale and Scope:
CSIR functions optimally when the semantic map "knows" the vocabulary in the documents. At the initiation of this project, a lexical evaluation of MEDLINE showed that CSIR was missing 66,000 tokens (words). The total number of biomedical terms, mostly phrases, is estimated at over a million, a much larger number (10). Before this work, the CSIR lexicon contained about 20,000 medical or biological terms (species, cells, anatomy, etc.). Here we added about 85,000 protein names, 35,000 chemical names, an ontology for biochemistry and molecular biology possessing 2,400 nodes, and over 30,000 biomedical synonym classes. Together with other ongoing lexical augmentations, the entire Cognition semantic map is described in detail in Table 1. (7), or TAMBIS (11). The very top of our ontology discriminates "proteins," "laboratory procedures," etc.; an intermediate level of protein and gene names was inspired by the ontology in the AfCS (e.g., "binding protein," "g-protein," "transcription-factors"), and by an ontology of terms in the HGNC that categorizes proteins and genes (Table 2).
We introduced vocabulary from the UMLS Metathesaurus.
We built a map from the Metathesaurus ontology to our existing ontology, and then introduced the UMLS vocabulary into the lexicon automatically.
Multi-sense words were inspected by a linguist to prevent duplication. Synonyms, with the appropriate senses, were introduced to the Concept Thesaurus automatically.
Normalization included removal of plurals, redundant capitalized versions, and re-ordered versions. Automatic discovery of additional normalization rules, as in Wellner (2005) and Yoshimasa (2008) (22,23), would be a further step. This database includes both nouns and verbs covering biological sciences and medicine, amounting to 88,423 word senses and 76,816 synonyms.
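The three normalization steps named above (plurals, capitalized variants, re-ordered variants) can be sketched as a single normalization key used for de-duplication. The specific rules below, namely a naive plural stripper and the un-inverting of comma-inverted forms such as "Kinase, Protein", are illustrative assumptions and not the rules actually applied to the UMLS vocabulary.

```python
def normalize_term(term):
    """Return a hypothetical normalization key for a lexicon entry."""
    # Redundant capitalized versions: compare in lower case.
    term = term.strip().lower()
    # Re-ordered versions: turn "kinase, protein" into "protein kinase".
    if ", " in term:
        head, _, modifier = term.partition(", ")
        term = f"{modifier} {head}"
    words = term.split()
    # Plurals: naive rule on the last word only. Words ending in
    # "ss", "is" or "us" (e.g. "apoptosis") are left untouched.
    if words:
        last = words[-1]
        if len(last) > 3 and last.endswith("s") and not last.endswith(("ss", "is", "us")):
            words[-1] = last[:-1]
    return " ".join(words)


def deduplicate(terms):
    """Keep the first surface form seen for each normalization key."""
    first_seen = {}
    for term in terms:
        first_seen.setdefault(normalize_term(term), term)
    return list(first_seen.values())


if __name__ == "__main__":
    variants = ["protein kinase", "Protein Kinases", "Kinase, Protein", "histone"]
    print(deduplicate(variants))
```

A rule-based key like this over-merges and under-merges in places (irregular plurals, names that legitimately end in "s"), which is one reason the text points to automatic discovery of normalization rules as a further step.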
We then obtained additional word senses, all nouns, from the Alliance for Cell Signaling (www.alliance.org) (19). This source is current and curated, and offers ontological entries, giving 15,661 new or improved word senses. The adoption of this vocabulary was accomplished through a combination of automated tasks and expert curation. Duplicates were curated. Unknown vocabulary was then added to the semantic map automatically, including ontological attachments and synonyms. Data from the HGNC (www.genenames.org) (20) have also been partially introduced. About 30 ontologies of protein families in HGNC have been imported, including AKAPs, ADAM proteases, bcl, BRCA, channel proteins, P450s, tubulins, ubiquitin ligases, phosphatases, TNF-receptors, histones, and SMADs. We also introduced the IUPAC enzyme names and EC numbers, over 6,000 names. These were chosen because of the well-thought-out ontology that may be accessed with the EC numbers. A difficulty with this augmentation is the lack of natural language usage and the lack of synonymy. In a separate project we introduced natural language terms by finding synonyms for the EC numbers in the UMLS.

Vocabulary growth
At the beginning of this project, there were 66,000 missing tokens (words). At present, we have completed the addition and curation of all words with a frequency greater than 35 (Fig. 2), and about 5,000 words with a frequency greater than 20 remain to be added. MEDLINE abstracts were also searched to find verbs, which were curated to find words (such as express, silence, translocate, spin, sandwich, bait, prey) that have domain-specific meanings. This project has led to 225 new word senses. The added verb definitions contribute to improved precision, and will be useful when full sentence parsing is included in CSIR (12).
Fifty typical queries for MEDLINE were formulated as simple questions in the areas of biochemistry, molecular biology and medicine. The UT Southwestern team tabulated the relevance of the retrievals in http://MEDLINE.cognition.com. The reader is perhaps the best judge of the performance of the search engine. However, we compared Cognition's retrievals with those of PubMed (http://pubmed.com). To make the evaluation manageable, we used the "relative recall" technique, wherein full recall is estimated as the greatest number of relevant retrievals achieved by either search engine. For example, one of the queries was "genetic correlates of alcoholism". Of the first twenty CSIR retrievals, 16 were relevant. Thus CSIR's precision was 16/20, or 0.8. The total number of retrievals for CSIR was 1,436. To extrapolate the number of relevant retrievals, we multiplied the precision of 0.8 by 1,436 to yield an extrapolated recall of 1,149. A similar calculation for PubMed, with a precision of 0.3 and a total of 44 retrievals, yields an extrapolated recall of 13 (Table 3).
Of the two extrapolated recall numbers, CSIR's is the greater, so it is taken to be full recall on this query. Recall for the two search engines on this query is then calculated as 1,149/1,149 = 1 for CSIR and 13/1,149, or about 0.01, for PubMed. Precision and recall ratios for all 50 queries are averaged to calculate the overall precision and recall.
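The relative recall calculation for a single query can be written out as a short sketch. The function names and the dictionary layout are assumptions for illustration; the numbers are those of the "genetic correlates of alcoholism" example in the text.

```python
def extrapolated_relevant(precision, total_retrievals):
    """Estimate the number of relevant retrievals from sampled precision."""
    return round(precision * total_retrievals)


def relative_recall(results):
    """Compute relative recall per engine for one query.

    results maps an engine name to (precision, total_retrievals).
    Full recall is taken to be the largest extrapolated number of
    relevant retrievals achieved by any engine.
    """
    extrapolated = {
        engine: extrapolated_relevant(precision, total)
        for engine, (precision, total) in results.items()
    }
    full_recall = max(extrapolated.values())
    return {
        engine: (count / full_recall if full_recall else 0.0)
        for engine, count in extrapolated.items()
    }


if __name__ == "__main__":
    # Precision is measured on the top 20 retrievals (16 of 20 relevant for CSIR).
    query_results = {"CSIR": (16 / 20, 1436), "PubMed": (0.3, 44)}
    for engine, recall in relative_recall(query_results).items():
        print(f"{engine}: relative recall = {recall:.2f}")
```

Averaging these per-query values, together with the per-query precision, over all 50 queries gives the overall figures reported.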

Bootstrapping ontological attachments:
Most of the vocabulary derived from the acronym database and the UMLS had poor (very general) ontological attachments (e.g., "amino-acid"). About 80,000 of 136,000 protein names were poorly attached. Attachments of well-classified words were spread to their synonyms, resulting in 20,000 better attachments. A bootstrapping method took substrings as triggers; for example, "helix-loop-helix" as a substring of "transcription-factor-15-basic-helix-loop-helix" suggests an attachment to the node "helix-loop-helix." This attachment was then assigned to the synonyms "bHLH-EC2-protein" and "paraxis".
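A minimal sketch of the substring-trigger idea follows, using the example from the text. The data structures and the tie-breaking rule (prefer the deepest matching node, so that a specific class wins over a general one) are assumptions; the text does not say how competing substring matches were resolved.

```python
def bootstrap_attachments(terms, node_depths, synonyms):
    """Suggest ontological attachments from node names found as substrings.

    terms       -- lexicon entries lacking a specific attachment
    node_depths -- ontology node name -> depth (larger means more specific);
                   preferring the deepest match is an assumed tie-break
    synonyms    -- term -> list of its synonyms, which inherit the attachment
    """
    attachments = {}
    nodes_deepest_first = sorted(node_depths, key=node_depths.get, reverse=True)
    for term in terms:
        match = next(
            (node for node in nodes_deepest_first if node in term and node != term),
            None,
        )
        if match is None:
            continue
        attachments[term] = match
        # Spread the new attachment to the synonyms of the term.
        for synonym in synonyms.get(term, ()):
            attachments.setdefault(synonym, match)
    return attachments


if __name__ == "__main__":
    term = "transcription-factor-15-basic-helix-loop-helix"
    suggested = bootstrap_attachments(
        [term],
        {"transcription-factor": 2, "helix-loop-helix": 3},
        {term: ["bHLH-EC2-protein", "paraxis"]},
    )
    for name, node in suggested.items():
        print(f"{name} -> {node}")
```

Suggestions produced this way are candidates only; as elsewhere in this work, they would still need curation before entering the semantic map.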

Discussion
We think that the natural language approach of CSIR has an important role in future access to textual information in the biomedical domain. This effort is our first pass at introducing biochemistry and molecular biology terms into the CSIR lexicon.
Other sources of new words will come from tracking user queries, evaluation of MEDLINE, and other curated databases. Efforts directed toward database integration may provide useful definitions, synonymy and ontology in molecular biology (13). We also plan to introduce additional parsing functions (12)(24), which should improve the precision of Cognition Search. CSIR works equally well on full text and on abstracts. This work contributes to precise interpretation of biomedical texts for purposes of search (1,3,25), research (4) and data mining (2,26).

Uses and Applications of CSIR:
It is useful to review which linguistic processes produce these improved results. Morphology improves recall: the user can state a query term in one of its morphological variants, and CSIR automatically finds all other forms, as in "phosphorylate" and "phosphorylation". Synonymy improves recall because one member of a synonym class retrieves documents containing any of its members, as in "CD116," "GM-CSF receptor alpha subunit," etc. Ontological reasoning improves recall as the software reasons down from higher-level concepts to lower-level concepts. For example, you can query "what MAP kinase phosphorylates ATF2" and get documents with "ERK" and "p38", which are kinds of MAP kinases.
Sense disambiguation improves precision because only the documents that contain the query terms in the meanings intended by the user are retrieved. Phrase parsing improves both precision and recall. It improves precision by avoiding retrievals that happen to contain parts of a phrase in various positions, but not as the phrase. So "RNA", "binding" and "protein" might all appear in an abstract that has nothing to do with RNA binding proteins. It improves recall because it enables the mapping of synonym relations between phrases, and between phrases and acronyms, as in "TUBB" and "beta-tubulin". Biomedical language also possesses ontological relationships for proteins, genes, the Tree-of-Life animals, diseases, etc. CSIR includes the function of downward reasoning in ontologies. Thus, CSIR NLP technology can help to solve problems in medicine by finding material about specific instances of general concepts such as "heart disease medicine".
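How synonymy and downward reasoning widen a query can be shown with a toy sketch. The dictionaries below are hypothetical stand-ins for the semantic map, the function names are invented for illustration, and the sketch deliberately leaves out morphology, phrase parsing and sense disambiguation.

```python
def descendants(node, children):
    """Collect every concept below node in the ontology.

    children maps a concept to the list of its direct hyponyms.
    """
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expand_query_term(term, children, synonym_classes):
    """Expand a query term by downward reasoning, then by synonymy."""
    concepts = {term} | descendants(term, children)
    expanded = set(concepts)
    for concept in concepts:
        expanded.update(synonym_classes.get(concept, ()))
    return expanded


if __name__ == "__main__":
    # Toy fragments of an ontology and a synonym table (illustrative only).
    ontology = {"MAP kinase": ["ERK", "p38"]}
    synonym_table = {"TUBB": ["beta-tubulin"]}
    print(sorted(expand_query_term("MAP kinase", ontology, synonym_table)))
    print(sorted(expand_query_term("TUBB", ontology, synonym_table)))
```

Expansion of this kind raises recall; the precision side of the account above depends on restricting each expanded term to the sense intended in the query.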

Areas for improvement
Precision is lowered when words are difficult to disambiguate, such as "Bad", which is an apoptosis protein but at present is recognized as the ordinary English word "bad". Missing terms will be relatively easy to address, since we know there are still 5,000 individual terms used in MEDLINE with a frequency of 20 or more that we need to define. We will use the methods of Tsuruoka (27) for future term recognition, synonymy expansion and evaluation of coverage.