Federated Access to Heterogeneous Information Resources in the Neuroscience Information Framework (NIF)
- Cite this article as:
- Gupta, A., Bug, W., Marenco, L. et al. Neuroinform (2008) 6: 205. doi:10.1007/s12021-008-9033-y
The overarching goal of the NIF (Neuroscience Information Framework) project is to be a one-stop-shop for Neuroscience. This paper provides a technical overview of how the system is designed. The technical goal of the first version of the NIF system was to develop an information system that a neuroscientist can use to locate relevant information from a wide variety of information sources by simple keyword queries. Although the user would provide only keywords to retrieve information, the NIF system is designed to treat them as concepts whose meanings are interpreted by the system. Thus, a search for a term should also find records containing synonyms of the term. The system is targeted to find information from web pages, publications, databases, web sites built upon databases, XML documents and any other modality in which such information may be published. We have designed a system to achieve this functionality. A central element in the system is an ontology called NIFSTD (for NIF Standard), constructed by amalgamating a number of known and newly developed ontologies. NIFSTD is used by our ontology management module, called OntoQuest, to perform ontology-based search over data sources. The NIF architecture currently provides three different mechanisms for searching heterogeneous data sources including relational databases, web sites, XML documents and full text of publications. Version 1.0 of the NIF system is currently in beta test and may be accessed through http://nif.nih.gov.
Keywords: Ontology, Data federation, Neuroscience resource
A more focused search that might actually help the neuroscientist better is shown in Fig. 1(b). This result is mostly about resources like Open Biosystems that can be used as a resource for cDNA libraries for mouse models of human diseases. The resource finding problem gets compounded if the neuroscientist wants to search for the information not only from the web, but over any kind of information resource mentioned in the previous paragraph, because there are no search tools that provide adequate functionality to satisfy the information needs of our neuroscientist.
Although a domain user searches for resources using keywords, the intent of the search is conceptual. Thus, although the search term is astrocytoma, there is an implicit expectation that a resource about astrocytic glioma or glial malignancy will be part of the result. Most search engines do not provide a semantic search facility, which includes search not only by synonyms but also by terms that are notionally related to the search terms. As another example, the user searching for hippocampal formation would possibly also be interested in the cell types found there because they are semantically related.
- The web is an important class of information resource, but discipline-specific web search suffers from three significant limitations:
◦ Most web search engines are not discipline specific and are based on a general PageRank-like mechanism. Therefore, the ranking of a query result is more likely to reflect the general popularity of a page than its discipline-specific relevance. Thus a query on knockout is likely to rank web pages on “knockout matches” in sports higher than “knockout animals”, which are more relevant to biological sciences.
◦ The problem of the “deep web”, whereby a resource does not expose the content of the database but allows a user to access it only through forms or some other functional interface, is not yet solved. It is an active research area among information management and information retrieval researchers.
◦ The web is not a coherently designed information system. So it does not resolve or correlate an information entity found in one source to another information entity found in another source, even though they might refer to the same real world entity. Thus two web sites referring to the same publication are considered to be different pieces of information, and will typically produce duplicated results for a query.
Keyword-based search is the most natural form of information search when the user knows very little about the structure and content of an information resource. However, not all information resources provide a facility for keyword search. For example, database systems or data files may support only a very limited form of keyword search, if any (although academic researchers are working in this area, e.g., Hristidis et al. (2003)). On the other hand, as the user gets to know an information source better, the user prefers some other mode of information access including data browsing, queries, or special purpose techniques like atlas exploration. There are no search/query tools that let the user search uniformly over all types of heterogeneous information resources and then refine the search procedure as the user gains the ability to perform deeper search.
To address this problem, we have developed an information federation framework called NIF (Neuroscience Information Framework), which provides federated access to heterogeneous information resources through a shared ontology. This framework is designed to admit resources that provide different degrees of access to their data content. An extensible OWL (Web Ontology Language, see http://www.w3.org/TR/owl-ref/) ontology called NIFSTD (NIF Standard) has been constructed based on sound ontological principles. We have constructed OntoQuest, an ontology management system that permits a user to store, search and navigate any number of OWL-structured ontologies. A fully functional web-accessible system, NIF version 1.0, currently in beta release (available through http://nif.nih.gov), has been developed.
In the following sections we describe the overall architecture and different components of the NIF system. More details on the background of the NIF project are covered in the NIF white paper (Gardner et al. (2008)) in this issue.
The NIF System Architecture
Recently, the term dataspace has been introduced in Franklin et al. (2005) to refer to an information management scenario where the data resides not just within the custody of a managed storage-and-retrieval software like a DBMS (DataBase Management System like MySQL or Oracle), but in text files, emails, software-produced documents, and yet provides a set of common services for search and information organization. The NIF architecture is an example of a dataspace system that provides search and data exploration services over heterogeneous information systems whose capabilities and ontological descriptions are registered to a few central catalogs.
A NIF Web Resource is a web site that has information relevant to Neuroscientists. Such a resource can be an informational web site that only allows browsing, a web site that allows browsing and queries through web forms, software sites, sites for chemicals like reagents, and so on.
A NIF Data Resource is a database that enables an external application to send a query using a query API or a query language.
A Data Mediator is a data integration engine, developed in the context of the BIRN (Biomedical Informatics Research Network) project that allows one to query a set of distributed relational databases, and computation engines, by creating a single virtual database on top of them.
A Mediated NIF Data Resource is a database that can be queried by NIF only through the Data Mediator.
The NIF Literature Resource is a text processing system that parses publications, extracts their metadata, marks up their content using a known vocabulary and allows the NIF system to search for publications through keyword and metadata queries. Currently, the NIF literature resource is assembled through the Textpresso text indexing system (see Müller et al. 2008).
The NIF Ontology is a human curated, semi-automatically assimilated OWL-structured ontology called NIFSTD (NIF Standard, see Bug et al. (2008) for details) that contains terms and inter-term relationships relevant to neuroscience researchers.
The overall architecture of the NIF system is shown in Fig. 2. The different building blocks are discussed below.
The client software that allows an authorized user to add a new resource name to the NIF Web Catalog.
Web Search Client
This is the web client that is used by the simple and the advanced query interfaces.
User Request Manager: The Request Manager is the entry point of the system, where users can either add a new entry to the NIF Web Catalog or, more importantly, perform a conceptual search operation. The application logic handles the request, for example, by passing on the query to the Search Coordinator, described next. It also controls the display of the results.
NIF Search Coordinator: An integral part of the application logic, the NIF Search Coordinator takes the user’s keyword query and in the most common case, performs an ontological search to retrieve conceptual terms that closely match the terms in the ontology, and if desired, the neighborhood of these ontological terms. This process of exploring the ontology to find related terms is performed interactively. When the user settles on the final query terms, the keyword module uses the index to locate sources that have the data or web documents satisfying the keywords. Once the data sources are located, the source query wrapper module transforms the query into queries against all sources and broadcasts these transformed queries. The process of transformation converts the query keywords into SQL (or HTTP calls and so on) for structured data sources, XML requests, search against the web index and so forth. If the user’s search terms are not found in the ontology, the search coordinator allows the query to be posted directly against the sources as a string search.
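The coordinator flow described above can be illustrated with a minimal sketch. It is not the NIF implementation: the `ontology` dictionary, the `wrappers` mapping, and the `to_sql` and `coordinate` helpers are all hypothetical names chosen for illustration, and the ontology lookup is reduced to a simple synonym table.

```python
def to_sql(table, column, terms):
    """Build a parameterized SQL query matching any term in one text column."""
    where = " OR ".join(f"{column} LIKE ?" for _ in terms)
    return f"SELECT * FROM {table} WHERE {where}", [f"%{t}%" for t in terms]


def coordinate(query, ontology, wrappers):
    """Expand a keyword through the ontology, then build one request per source.

    ontology: dict mapping a known term to a list of related terms (toy stand-in).
    wrappers: dict mapping a source name to a function(terms) -> request.
    """
    key = query.strip().lower()
    if key in ontology:
        # Term is known: search for it together with its related terms.
        terms = [key] + list(ontology[key])
    else:
        # Term is unknown: fall back to a plain string search.
        terms = [query.strip()]
    # Broadcast: every registered source gets its own transformed query.
    return {name: wrap(terms) for name, wrap in wrappers.items()}


# Toy data, assumed for illustration only.
ontology = {"astrocytoma": ["astrocytic glioma"]}
wrappers = {"db": lambda terms: to_sql("records", "description", terms)}
requests = coordinate("Astrocytoma", ontology, wrappers)
```

Keeping the per-source transformation behind a function per source means a new kind of source (an XML request, a web index lookup) only needs a new wrapper, not a change to the coordinator.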
Keyword Query Processor: This module transforms the user’s keyword queries into an internal form.
Index Manager: We use the term Index Manager to refer to the indexing engine and the controlling program surrounding it. The NIF system uses the Lucene indexing engine from Apache to create an inverted index of the results of the web crawl. The Lucene index is also used to index all readable data sources, both relational and XML. The index manager contains the methods to create, update and access the index, and is primarily used by the NIF Search Coordinator.
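To make the notion of an inverted index concrete, here is a minimal pure-Python sketch. It does not use Lucene and does not reflect the NIF Index Manager's actual methods; `tokenize`, `build_inverted_index` and `search` are hypothetical names, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy documents, assumed for illustration only.
docs = {"a": "Purkinje cell morphology", "b": "Hippocampal cell types"}
index = build_inverted_index(docs)
```

Because the index maps terms to documents rather than documents to terms, a keyword lookup costs one dictionary access per term regardless of how many documents were crawled.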
NIF Web Catalog Manager: The NIF Web Catalog (also called the “NIF Registry”) is a repository of NIF Web Resources. For each resource, NIF maintains a number of attributes that characterize the resource. Of these, some like the URL of the resource, or the rough classification of the source are mandatory, while others, like the detailed description of the Web Resource are optional. In the current version of the NIF system, the category assigned to a Web Resource comes from a simple hierarchical vocabulary (e.g., a neural modeling resource comes under the category software resource) assembled by the NIF team. In the current implementation, both the catalog and the vocabulary are structured as XML documents. The Catalog Handler is a set of methods and index structures that enable searching of the catalog information. Currently, both keyword queries and XML queries are supported by the handler.
To include more web pages, we developed “NIF Web”, which uses a web crawler to traverse the web sites contained in the NIF catalog. This expands the scope of NIF search beyond the web sites of the NIF catalog, but still keeps the scope within the realms of Neuroscience. Using these seed sites, the Nutch web crawler from Apache (http://lucene.apache.org/nutch/) is used to crawl the links to a depth of 15. Even as the number of seed sites grows, we have found that the 15-deep crawl provides a sufficiently broad coverage and yet retrieves web pages that largely contain information relevant to Neuroscience. The results from the web crawl are harvested and sent to the Index manager.
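The depth-limited crawl from seed sites can be sketched as a breadth-first traversal. This is not how Nutch is configured or invoked; `crawl` and `get_links` are hypothetical names, and the link graph is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=15):
    """Breadth-first traversal from the seed pages, following links up to max_depth.

    get_links: function(url) -> iterable of linked urls (assumed helper).
    Returns a dict mapping every reached url to its link distance from a seed.
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if depth[url] >= max_depth:
            continue  # Do not follow links beyond the depth limit.
        for linked in get_links(url):
            if linked not in depth:
                depth[linked] = depth[url] + 1
                queue.append(linked)
    return depth


# Toy link graph, assumed for illustration only.
links = {"seed": ["a"], "a": ["b"], "b": ["c"]}
reached = crawl(["seed"], lambda url: links.get(url, []), max_depth=2)
```

With `max_depth=2` the page `c`, three links away from the seed, is never visited; the same cut-off at 15 is what bounds the NIF Web to the neighborhood of the catalog sites.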
The Data Structure Layer contains different index structures to make queries faster. A technical description of this layer is beyond the scope of this paper.
NIF Ontology Manager: The NIF Ontology (NIFSTD) is a large and growing OWL entity that is itself a combination of several ontologies (see Bug et al. (2008)). These ontologies are stored in OntoQuest (Chen et al. (2006)). Partly inspired by the IODT framework from IBM (Mei et al. 2006), OntoQuest stores all distinguished relationships permitted by OWL (e.g., subclass-of, allValuesFrom, disjoint, etc.) in separate tables, while all user-defined relation names are stored in a quad-store. Logically, OntoQuest views the ontology as a graph and performs graph-like operations (e.g., finding the k-neighborhood) on it. It contains specialized indexes (see Chen et al. (2005)) to quickly find ancestor-descendant relationships for transitive relations like subclass-of and part-of. OntoQuest contains its own query processing engine to support ontological queries.
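The two graph operations mentioned above, the k-neighborhood and the ancestor lookup over a transitive relation, can be sketched on a small edge list. This is only an illustration and says nothing about OntoQuest's tables, indexes or query engine; the function names and the toy edges are assumptions.

```python
from collections import deque


def k_neighborhood(edges, start, k):
    """Return all terms within k edges of start, treating edges as undirected."""
    adjacency = {}
    for src, _relation, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # Stop expanding once the radius k is reached.
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance) - {start}


def ancestors(edges, term, relation="subclass-of"):
    """Follow one transitive relation upward and collect every ancestor."""
    parents = {}
    for src, rel, dst in edges:
        if rel == relation:
            parents.setdefault(src, set()).add(dst)
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy (subject, relation, object) edges, assumed for illustration only.
edges = [
    ("Purkinje cell", "subclass-of", "neuron"),
    ("neuron", "subclass-of", "cell"),
    ("Purkinje cell", "located-in", "cerebellum"),
]
```

The ancestor walk above recomputes the closure on every call; the specialized indexes mentioned in the text exist precisely to avoid that cost on a large ontology.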
Structured Data Integrator: We use the term “structured data” to refer to relational databases that can be accessed in any of the following ways — 1) directly by querying an SQL database (e.g., Cell Centered Database or CCDB, Martone et al. (2003)), 2) through an HTTP GET or POST operation executed against a database exposed through a web form (e.g., the CRISP grants database from NIH available at http://crisp.cit.nih.gov/), 3) invoking a function or a web service, 4) by querying the BIRN mediator (Gupta et al. (2003)), which in turn integrates multiple databases (e.g., the Senselab database from Yale). The structured data integrator module uses the mediator’s data integration registry to find the schemas of the databases, and performs a federated query by sending SQL queries created in the manner described below. The result of the federated query is sent back to the Search Coordinator Module.
Web Result Post-processor: For the NIF Web, the results of the keyword search are passed through two additional steps. The first step ranks the results, placing higher importance on the title and the relative frequency (the tf-idf score) of the query keywords in the content of the document than, for instance, on its recency. In the second step, the results are sent to a post-clustering module, currently implemented with the fuzzyAnts algorithm (see Weiss (2006)) of the Carrot Clustering engine (see http://demo.carrot2.org), which organizes the results into groups of related web sites whose pages significantly share common terms.
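The ranking step can be illustrated with a small scoring function that combines a tf-idf score over the page body with a fixed bonus for title matches. The weighting scheme, the smoothed idf formula and the `title_boost` value are assumptions made for this sketch; the actual scoring used by the NIF Web is not specified here.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(pages, query, title_boost=2.0):
    """Order page urls by body tf-idf plus a bonus for each term found in the title.

    pages: list of dicts with 'url', 'title' and 'body' keys (assumed layout).
    """
    terms = tokenize(query)
    bodies = [tokenize(page["body"]) for page in pages]
    n = len(pages)
    scored = []
    for page, tokens in zip(pages, bodies):
        title_terms = set(tokenize(page["title"]))
        score = 0.0
        for term in terms:
            df = sum(1 for body in bodies if term in body)
            idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed idf (assumption)
            tf = tokens.count(term) / len(tokens) if tokens else 0.0
            score += tf * idf
            if term in title_terms:
                score += title_boost
        scored.append((score, page["url"]))
    scored.sort(key=lambda item: -item[0])
    return [url for _score, url in scored]


# Toy pages, assumed for illustration only.
pages = [
    {"url": "u1", "title": "Sports news", "body": "knockout match knockout"},
    {"url": "u2", "title": "Knockout mice", "body": "knockout mouse models"},
]
ordered = rank(pages, "knockout")
```

Here the sports page has the higher term frequency, but the title bonus places the page about knockout mice first, which is the kind of discipline-specific tuning a dedicated index allows.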
The Textpresso Subsystem: The NIF Literature search system provides the ability to search text from publications. This is performed through the Textpresso subsystem, which indexes full-text publications and categorizes all non-trivial terms against predefined term categories. The user’s keyword query is posed against the Textpresso system to retrieve publications with the search terms and synonyms highlighted. In the NIF infrastructure, Textpresso is accessed as a set of web services. The web services are implemented as a two-step process. The first step is to run a search on the server; the second is to retrieve results from the server. Such a process is necessary because the search results (in XML format) may be on the order of several megabytes, and forming the XML file may take longer than the client’s timeout limit. Also, the client may not need all the documents returned by the search. In most cases, users are interested only in the documents (and the sentences therein) that have the maximum scores, similar to how users look only through the first few pages of a Google or Yahoo search. The current set-up allows the client to retrieve a maximum of 500 documents in one call. To retrieve more than 500 documents, the client needs to send additional queries with the appropriate document ranges. This system, currently indexing about 67,000 papers, is described in more detail in Müller et al. (2008).
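The range-based retrieval described above can be sketched as follows. The Textpresso web service calls themselves are not shown in this paper, so `fetch_range` is a hypothetical stand-in for the second step (retrieving one range of results), and the use of 1-based inclusive ranges is an assumption.

```python
def document_ranges(total, page_size=500):
    """Yield inclusive, 1-based (first, last) ranges that cover `total` documents."""
    first = 1
    while first <= total:
        last = min(first + page_size - 1, total)
        yield (first, last)
        first = last + 1


def fetch_all(total, fetch_range, page_size=500):
    """Retrieve every result by issuing one call per document range.

    fetch_range: function(first, last) -> list of documents (assumed helper).
    """
    documents = []
    for first, last in document_ranges(total, page_size):
        documents.extend(fetch_range(first, last))
    return documents


# A search reporting 1200 hits would need three retrieval calls.
ranges = list(document_ranges(1200))
```

A client that only wants the highest-scoring documents would simply stop after the first range instead of calling `fetch_all`.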
How the NIF System Works
At present, the NIF system is only partially capable of performing more complex matching strategies where, for example, a search on “neuron” will also match “neural”. We have implemented such a “fuzzy search” on the NIF Web Catalog content on an experimental basis — the user may optionally use this feature. Based on community feedback, and the response time to perform such a search on large volumes of data, we might add this feature to other resources in future versions of the system.
An important feature of the NIF Web is the ability to control factors used to rank results. Because this is a web index built specifically for neuroscientists, we can develop appropriate criteria for determining the rank order of returned results. We envision that such a system could be tuned by different groups hosting a NIF site depending upon their constituents. For example, the NIF Web may be tuned to rank NIH Blueprint-sponsored resources higher than non-Blueprint resources so that they appear higher in the returned list in the NIF Web. Many of these resources are small and do not have the web traffic to rank highly in the commercial search engines. However, through the NIF, these resources can be given more weight.
For NIF literature, the Textpresso system (Müller et al. (2008)) returns an XML result. The NIF system not only displays the data, but automatically constructs links to PubMed and Google Scholar from which the articles can be downloaded if the appropriate permissions are in place. It also links the results to the Textpresso-annotated records at the Textpresso site where the full capabilities of the Textpresso web site (http://www.textpresso.org/neuroscience/) can be utilized.
Adding a New Data Resource to NIF
An important aspect of NIF is that new information resources can be added to it without having to change the infrastructure. NIF allows one to add two categories of information sources — those that are accessed as web sites (e.g., CRISP), and those that are accessed as databases.
To add a new web resource, the NIF system needs to determine how to convert a keyword query posed by a user to an equivalent HTTP query to the web resource. At this time, this is accomplished semi-automatically. The new website entry points are analyzed to determine how an HTTP GET or an HTTP POST can be constructed for the specific web site with the keywords. Sometimes, as in the case of CRISP, additional parameters need to be supplied (e.g., number of results desired); a set of default values is used for this purpose. In future versions, these parameters can be made user-selectable. This information is stored in a site wrapper specifically created for that source. In our experience, in most cases, this step takes at most a couple of hours for each new source.
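A site wrapper of the kind described above can be thought of as a small record holding the entry point, the name of the keyword parameter, and the default values for any additional parameters. The sketch below is an assumption about what such a wrapper might store; the `SiteWrapper` class, the example URL and the parameter names are hypothetical and do not describe CRISP or any real site.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class SiteWrapper:
    """Per-site description of how to turn keywords into an HTTP GET request."""

    base_url: str
    keyword_param: str
    defaults: dict = field(default_factory=dict)

    def build_get_url(self, keywords):
        """Combine the site's default parameters with the user's keywords."""
        params = dict(self.defaults)
        params[self.keyword_param] = " ".join(keywords)
        return f"{self.base_url}?{urlencode(params)}"


# Hypothetical site: one default parameter limits the number of results.
wrapper = SiteWrapper(
    base_url="http://example.org/search",
    keyword_param="terms",
    defaults={"max_results": 50},
)
url = wrapper.build_get_url(["glioma"])
```

Making the defaults part of the wrapper, rather than hard-coding them in the query logic, is what would allow them to become user-selectable later.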
Adding a new database source to NIF is a little more involved and requires IT personnel, such as a database administrator, to go through a process called database registration and then, optionally, a step called concept mapping. The database registration step is based on the information integration mechanism developed for the NIH/NCRR funded BIRN (Biomedical Informatics Research Network) project (see http://www.nbirn.net/). The registration maker uses a tool called Fuente (Astakhov et al. (2006)), which connects to the database being registered and reads the full schema into a visual tool. From this schema the registration maker determines which tables and columns should be accessed by the integration engine, and how to map the data types of the database to the data types known to the integration engine. Once this mapping is specified, Fuente exports it to the integration system, which in turn stores the schema in a registry. When a new schema is deposited in the registry, the NIF system updates its configuration so that the next query is also broadcast to the new schema, and the results are reported in a new panel on the interface. The configuration can be modified by the NIF operators to decide which tables and columns should be visible to the NIF user. When a new database is registered, the NIF indexing mechanism updates the NIF indexes so that the keyword queries can operate efficiently against the new data source. In our experience, the whole process of adding a new database takes between 2 and 4 h, depending on the size of the database and the efficiency of the manual part of the process. Currently, Fuente can connect to MySQL, PostgreSQL, Oracle, SQL Server, and a couple of smaller DBMS systems.
The concept mapping step can occur after a database has been registered. The goal of this step is to create a mapping from the field names and terms used within the database to the terms known to the NIFSTD ontology. For example, if the database has the term “electron tomography” and the ontology does not have this term, then a knowledgeable and authorized user of the database can map it to a nearby term in the ontology like “electron microscopic imaging technique”. If such a mapping is created, a query on an ontological term like “electron microscopic imaging technique” will also retrieve the data record on “electron tomography”, which would otherwise have been impossible to retrieve. In NIF we have created the first version of a concept mapping tool that can map one term of a database to one term of the ontology. The mappings are stored in one part of the NIF infrastructure called the Term Index Source. We estimate that the concept mapping process currently requires from a few hours to several days of effort, depending on the complexity of the information to be shared. In future versions, this tool will be upgraded to add further automation and the ability to specify more complex mappings.
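The effect of a one-to-one concept mapping on retrieval can be shown with a small sketch. The `TermIndex` class below is a hypothetical, in-memory stand-in and does not describe how the Term Index Source is actually stored; the records are toy strings.

```python
from collections import defaultdict


class TermIndex:
    """Stores mappings from database terms to ontology terms (toy stand-in)."""

    def __init__(self):
        self._by_concept = defaultdict(set)

    def add_mapping(self, db_term, ontology_term):
        """Record that a database term should be found under an ontology term."""
        self._by_concept[ontology_term.lower()].add(db_term.lower())

    def expand(self, ontology_term):
        """Return the ontology term plus every database term mapped to it."""
        key = ontology_term.lower()
        return {key} | self._by_concept.get(key, set())


def search_records(records, index, ontology_term):
    """Return the records that mention the term or any database term mapped to it."""
    terms = index.expand(ontology_term)
    return [r for r in records if any(t in r.lower() for t in terms)]


# Toy records, assumed for illustration only.
records = [
    "Electron tomography of a dendritic spine",
    "Light microscopy of a neuron",
]
index = TermIndex()
index.add_mapping("electron tomography", "electron microscopic imaging technique")
hits = search_records(records, index, "electron microscopic imaging technique")
```

Without the mapping, the same query returns nothing, which mirrors the example in the text.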
Information Content of the NIF System
The NIF Web Catalog: The content of the NIF Web catalog is created by expert contributors, by selecting web sites that represent different forms of Neuroscience resources. Each entry of the NIF Catalog (NIF Registry) is annotated with high level descriptors from a controlled vocabulary that describes the resource type, its general content and other information about the resource. As of this writing, there are a total of 388 resources registered to the NIF. A breakdown of these resources according to the high level categories established by NIF is given in Fig. 3. Many of these resources were imported directly from the Internet Analysis Tool Registry (IATR) (http://www.cma.mgh.harvard.edu/iatr/), an existing resource that maintained a list of software tools for neuroscience, leading to a heavy representation of software tools. In addition, the NIF was selective in the types of resources that it catalogued: no commercial sites or products were included; the resources had to provide information or tools directly relevant to performing neuroscientific research.
The NIF Web: The limited web crawling process outlined in the previous section has turned out to be quite effective. The crawling depth of 15 almost always captures the content of the entire web site of each seed site. In 90% of the cases, the links leading outside of the seed sites turn out to point to neuroscience-relevant sites, such as NIH sites. The crawler also indexes Word and PDF documents accessible from the web site. This gives us the extra benefit that even if these pages were not initially marked up through the ontology, the system can still perform ontological search on them after the indexed text content has been brought into the NIF system. At the present time about 10 million relevant web pages are indexed and are searchable. In 10% of the cases, however, our current crawling strategy produces extraneous content not connected to Neuroscience. For example, a pointer to a newspaper article about a neuroscientific discovery may further link to other unrelated content from the same newspaper. One hindrance encountered in operating the NIF Web crawler is that some very informative web sites like The Antibody Resource Page (http://www.antibodyresource.com/) have explicit directives for crawlers not to crawl the site. Since we have to respect such provider directives, we cannot completely cover all content through the NIF Web. Further note that the current version of the NIF system does not address the “hidden web problem”.
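Respecting provider directives can be sketched with Python's standard `urllib.robotparser` module, assuming the directives are expressed in a robots.txt file (the text above does not say which mechanism the sites use). The `allowed_urls` helper, the `nif-crawler` user agent string and the example rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="nif-crawler"):
    """Keep only the urls that the given robots.txt content permits crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical directives: everything under /private/ is off limits.
robots_txt = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.org/public/page.html",
    "http://example.org/private/data.html",
]
permitted = allowed_urls(robots_txt, candidates)
```

A site that disallows its root path would leave the list empty, which is the situation described for sites that cannot be covered by the NIF Web.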
External Databases: Currently, relatively few neuroscience resources use a well-designed robust relational database system. Even those that do usually do not allow external systems to query their databases directly. However, we believe that data sharing, including database sharing, will be much more common for Neuroscience in the future. To illustrate how such community-wide sharing might occur, we chose five databases, each with unique but overlapping content. The Cell-Centered Database (Martone et al. (2003)) at UCSD provides access to multi-resolution cellular data captured by different imaging and volume reconstruction techniques. The Senselab system at Yale (http://senselab.med.yale.edu/) provides access to physiological models of neuronal circuits. The SUMSDB database at Washington University (http://brainmap.wustl.edu/caret/) provides access to cortical maps of human, macaque and rodent brains. The Neuromorpho database (http://www.neuromorpho.org) at George Mason University provides synthetically constructed neuron models (see Halavi et al. (2008)). The NeuroMAB database at University of California Davis (http://www.neuromab.org) is an antibody supply catalog for mouse models that was included because it is a “facilities” type of resource that can be accessed based on molecular targets like potassium channels, transporters and scaffold proteins.
Informal Testing of the NIF Version 1.0
The simple user interface was used by more people initially, but there was a steady increase in the use of the advanced interface with time. This illustrated that with some degree of experience, the users, particularly the more knowledgeable users, found the use of ontology to be more useful.
A number of users pointed out that it would be beneficial if they could (a) select the sources they wanted to search over, and (b) specify the type of results they wanted (e.g., results with image content only).
A number of users showed instances where the NIF Web retrieves some data pages that are not in the domain of neuroscience.
Users almost unanimously stated that they wanted the results of the queries to be organized by ontological terms instead of (or in addition to) by resource type. The primary argument was that an ontology-based result presentation will scale much better as more data sources are added.
A number of users wanted a greater variety of content to be covered, ranging from genetic data to the latest imaging techniques to drugs used for neurological disorders. For these cases, the users found that Google had better coverage than NIF. We verified that this observation results from the following: (a) the NIF Registry sites we have used to seed the NIF Web search did not have a well-rounded coverage, while Google’s coverage, albeit not focused, is much more universal, and (b) sometimes Google’s ranking of the results was preferred by users compared to the ranking produced by NIF web.
While users liked the fact that NIF Web results were clustered, the quality of clustering produced mixed reactions because for some searches the groupings produced by the clustering algorithm were considered “not useful”.
The response of some of the system components like Textpresso and science.gov became slower as the number of query terms was increased. This could be partly rectified in the testing period, but needs to be investigated more thoroughly in future.
These findings, albeit coming from a non-rigorous testing process, highlight some of the mismatches between the users’ expectations and the current capabilities of the NIF version 1.0 system.
Conclusion and Future Work
In this paper we have described the technical design and functionality of version 1.0 of the NIF system. We expect this system to evolve in the future with a number of enhancements besides the ones listed under the test feedback. We plan to include genetic and proteomic data and computational resources related to Neuroscience. For instance, web-accessible genetic data from NCBI, mouse model data from the Jackson Laboratory (http://www.jax.org) and QTL data from University of Tennessee (http://www.genenetwork.org/) are likely to be added to NIF. A future version of the NIF system will also be able to query and access RDF-formatted data (RDF stands for Resource Description Framework, which is an emerging standard for representing semantic information for the web) from the Neurocommons project (http://neurocommons.org, see also Ruttenburg et al. (2007)) and academic systems such as Lam et al. (2007), O’Connor et al. (2007a, b). We also plan to experiment with different variations of the user interface for different categories of users to determine how the intended behavior of the system differs across audiences.
Information Sharing Statement
The NIFSTD and BIRNLex ontologies are available at http://purl.org/nif/ontology/nif.owl and http://purl.org/nbirn/birnlex/ontology/birnlex.owl respectively. The NIF is offered under BSD and MIT compatible OS licenses (http://opensource.org/licenses).
This project has been funded in whole or in part through the NIH Blueprint for Neuroscience Research with Federal funds from the National Institute on Drug Abuse, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN271200577531C. The mediator and concept mapping tools were adopted from the Biomedical Informatics Research Network, supported by an award from the National Center for Research Resources (U24-RR019701).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.