Design and implementation of Metta, a metasearch engine for biomedical literature retrieval intended for systematic reviewers
Individuals and groups who write systematic reviews and meta-analyses in evidence-based medicine regularly carry out literature searches across multiple search engines linked to different bibliographic databases, and thus have an urgent need for a suitable metasearch engine to save time spent on repeated searches and to remove duplicate publications from initial consideration. Unlike general users, who typically carry out searches to find a few highly relevant (or highly recent) articles, systematic reviewers seek to obtain a comprehensive set of articles on a given topic, satisfying specific criteria. This creates special requirements and challenges for metasearch engine design and implementation.
We created a federated search tool that is connected to five databases: PubMed, EMBASE, CINAHL, PsycINFO, and the Cochrane Central Register of Controlled Trials. Retrieved bibliographic records were shown online; optionally, results could be de-duplicated and exported in both BibTeX and XML format.
The query interface was extensively modified in response to feedback from users within our team. Besides a general search track and one focused on human-related articles, we also added search tracks optimized to identify case reports and systematic reviews. Although users could modify preset search options, they were rarely if ever altered in practice. Up to several thousand retrieved records could be exported within a few minutes. De-duplication of records returned from multiple databases was carried out in a prioritized fashion that favored retaining citations returned from PubMed.
Systematic reviewers are used to formulating complex queries using strategies and search tags that are specific for individual databases. Metta offers a different approach that may save substantial time but which requires modification of current search strategies and better indexing of randomized controlled trial articles. We envision Metta as one piece of a multi-tool pipeline that will assist systematic reviewers in retrieving, filtering and assessing publications. As such, Metta may find wide utility for anyone who is carrying out a comprehensive search of the biomedical literature.
Keywords: Information retrieval; Metasearch engine; Bibliometrics; Medical informatics; Evidence-based medicine; Meta-analysis; Systematic reviews
A metasearch engine is a federated search tool that supports unified access to multiple search systems. It contains a query interface in which a user enters a single query that is sent to multiple search engines linked to different databases; the responses returned from the search engines are gathered and merged in real time, and are displayed in a concise, organized manner. The process only appears simple: a great number of technical, informatics and design issues must be solved in order to make a practical metasearch engine. Technical issues include creating a global query interface from the query interfaces of individual search engines, making sure that queries are understood meaningfully by each search engine, that responses occur in a timely fashion, and that results remain correct and reliable in the face of changes and updates that may occur, independently and unpredictably, within each of the search engines or their linked databases. Informatics issues include synonym and abbreviation recognition, and other natural language processing steps intended to make queries robust and comprehensive. Design issues include making sure the interface is intuitive and easy to use, and that using the metasearch engine actually saves time and maintains performance relative to conducting searches through each search engine separately.
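The fan-out and merge behavior described above can be illustrated with a minimal sketch. It is not Metta's implementation: the engine functions are hypothetical in-memory stubs (`make_stub_engine`, `ENGINES`), and the thread pool, timeout and error handling are assumptions chosen only to show one query being sent to several engines, with the responses gathered into a single list even when one engine fails.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_stub_engine(name, records):
    """Return a hypothetical search function standing in for one database."""
    def search(query):
        # A real adapter would translate the query and call the remote engine;
        # this stub only does a case-insensitive substring match on the title.
        return [dict(r, source=name) for r in records
                if query.lower() in r["title"].lower()]
    return search


# Hypothetical engines and records, for illustration only.
ENGINES = {
    "PubMed": make_stub_engine("PubMed", [{"title": "Aspirin for migraine"}]),
    "EMBASE": make_stub_engine("EMBASE", [{"title": "Aspirin for migraine"},
                                          {"title": "Aspirin in pregnancy"}]),
}


def metasearch(query, engines=ENGINES, timeout=10.0):
    """Send one query to every engine in parallel and merge the responses."""
    merged, errors = [], {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in engines.items()}
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                merged.extend(future.result())
            except Exception as exc:
                # One failing engine should not discard the other results.
                errors[name] = str(exc)
    return merged, errors
```

Calling `metasearch("aspirin")` returns the merged records, each labeled with its source, together with a dictionary of per-engine errors that the interface could report to the user.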
Individuals and groups who write systematic reviews and meta-analyses in evidence-based medicine regularly carry out literature searches in multiple bibliographic databases, and thus have an urgent need for a suitable metasearch engine to save time and to remove duplicate publications from initial consideration. However, they also have special needs that create special requirements and challenges for metasearch engine design and implementation. Most general users carry out searches to find a few highly relevant (or highly recent) articles, but systematic reviewers seek to obtain a comprehensive set of articles on a given topic, satisfying specific criteria. The focus on high recall means that the metasearch engine needs to retrieve large numbers of bibliographic records from all search engines, and not merely the highest few from a ranked list based on relevance or recency. (Note that a record contains citation information such as author, title and journal, and may include additional fields such as abstract and database-specific annotations and indexing terms.) Because the indexing of articles is imperfect and not uniform across search engines, and because term usage varies considerably, systematic reviewers tend to create very large, complex queries that are tailored differently to take into account unique features of each search engine/database. The high cost of missing any relevant articles leads to a situation in which the initial set of retrieved records requires manual inspection and may be 10–100 times greater than the final set deemed truly relevant for consideration in a given systematic review.
While the process of updating systematic reviews is ongoing, and there is continuing need for systematic reviews on new topics, the resources currently available to conduct them are constrained. One of the goals of Metta is to enable a larger group of teams with clinical, but not specialized informatics, expertise to perform systematic reviews. By moving the burden of comprehensive searching from search experts to the Metta system, the resources necessary to conduct a systematic review in a single, perhaps highly specialized, topic could be lessened. This could greatly expand the number of topics covered by systematic reviews by enabling a variety of teams with appropriate clinical expertise to conduct and publish these reviews.
In the present paper, we describe our experience in creating Metta, a metasearch engine which is intended to serve as the first step in a multi-step pipeline of informatics tools designed to reduce time and effort during the initial stages of compiling a set of relevant articles for consideration in a systematic review. Because most of the bibliographic databases have copyright and subscription restrictions, Metta is not available for use by the general public, but a working prototype for soliciting feedback and comments can be viewed at http://mengs1.cs.binghamton.edu/metta/search.action.
Our research team is an NIH-funded multi-institutional consortium that includes computer science experts on metasearch engines (CY and WM) and their students (LJ, CL and YJ), investigators with expertise in information retrieval and data mining (NS and AC), as well as experienced systematic reviewers with backgrounds in clinical research and information science (JMD, CEA, MSM, Samantha Roberts, and Karla Soares-Weiser) who are affiliated with several major systematic review groups, namely, the Cochrane Collaboration (Schizophrenia Review Group) and the AHRQ and multi-state funded Drug Effectiveness Review Project (DERP).
We asked each of the systematic reviewers to provide a list of the most important bibliographic databases that they search in their own studies, and arrived at a consensus that five are the most important for inclusion in a metasearch engine: PubMed (which encompasses MEDLINE as well as additional records indexed in PubMed and deposited in PubMed Central); EMBASE (which overlaps extensively with MEDLINE but includes a wider range of topics in zoology and chemistry); CINAHL (which focuses on allied health fields); the Cochrane Central Register of Controlled Trials (which lists certain types of clinical trial articles); and PsycINFO (which focuses on psychology and related social science fields).
Except for PubMed, online access to all of these databases generally requires a subscription. (The Cochrane Library is freely available in many countries but not in most of the US.) Even among leading academic medical institutions, not all have institutional subscriptions to all of these databases, and to ensure that copyrights and licenses are not violated, it is important that potential users (who may be located outside of our own research group) only have access to those databases to which they have a subscription. For the purposes of making the prototype version of Metta, for internal development by our own group, we routed users through the institutional log-in of the University of Illinois at Chicago, which has subscriptions to all databases.
Results and discussion
In general, Metta employed the default query expansion and processing strategies of each database, with some exceptions (e.g. full-text searching was turned off for PsycINFO). Because all five databases employed three common search tags (title [ti], author [au] and abstract [ab]) these were permitted in user-entered queries. The permitted tags are shown as examples on the right side of the homepage, and in detail on the Help page (Additional file 1).
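As an illustration of restricting user queries to the three shared search tags, the following minimal sketch flags any bracketed tag that is not one of [ti], [au] or [ab]. The function name, the regular expression and the idea of checking before submission are assumptions made for this example; the paper does not describe how Metta itself validates tags.

```python
import re

# The three tags the text says are common to all five databases.
ALLOWED_TAGS = {"ti", "au", "ab"}

# Matches the contents of any [bracketed] tag in a query string.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unsupported_tags(query):
    """Return the bracketed tags in `query` that are not in ALLOWED_TAGS."""
    found = [tag.strip().lower() for tag in TAG_PATTERN.findall(query)]
    return sorted({tag for tag in found if tag not in ALLOWED_TAGS})
```

A query such as `smith[au] AND aspirin[ti]` yields an empty list, whereas a query containing a database-specific tag such as `[mesh]` would be reported, so that the user can be warned before the query is sent to engines that might reject it.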
Design and user issues
The front-end of Metta was intended to satisfy a wide range of different types of users, as well as a wide range of different search strategies. Certainly Metta was aligned to the needs of the majority of people searching the biomedical literature, who tend to carry out only one or two queries at a time, employ only one or a few search terms, and do not routinely use search tags [7, 8]. To accommodate the needs of users who employ an iterative approach (in which initial results are examined, and the initial query modified and resubmitted), we cached queries so that all previous queries from the same session were visible as users began to type in the query box.
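The session-level query cache mentioned above might look something like the following sketch. The class name and the prefix-matching rule are assumptions for illustration; the text states only that earlier queries from the same session became visible as the user began to type.

```python
class SessionQueryHistory:
    """Hypothetical per-session store of previously submitted queries."""

    def __init__(self):
        self._queries = []  # oldest first, no duplicates

    def record(self, query):
        """Remember a submitted query, moving repeats to the most recent slot."""
        query = query.strip()
        if not query:
            return
        if query in self._queries:
            self._queries.remove(query)
        self._queries.append(query)

    def suggest(self, typed):
        """Return earlier queries starting with `typed`, most recent first."""
        typed = typed.strip().lower()
        return [q for q in reversed(self._queries)
                if q.lower().startswith(typed)]
```

With an empty prefix, `suggest("")` returns the whole session history, which matches the case in which the user has only just clicked into the query box.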
A much more difficult decision was how to reconcile a unified query interface with the prevailing practice of many systematic reviewers to carry out long, complex queries that involve search tags and advanced commands which are specific to individual databases. The individual databases all have optional advanced search interfaces that allow users to build up queries consisting of separate query terms (linked to field tags) concatenated with AND, OR or NOT, as well as restricting search terms to specific text and meta-data fields. However, entering a search tag into Metta that is inappropriate for one or more of the databases will result in query errors. We did initially implement an Advanced Search page in Metta (patterned after PubMed’s Advanced Search Builder), but finally decided to remove it, and only the basic query interface is currently active.
This decision was reinforced by an analysis of the reasons that reviewers exclude initially-retrieved clinical trial articles from final inclusion in systematic reviews (discussed in detail in [9, 10]). Briefly, we found that reviewers did not trust that metadata indexing of articles was adequately reliable, particularly with regard to study design aspects (such as randomization or use of placebos), and so a) did not employ these restrictions while carrying out initial searches, and b) utilized many different word and phrase combinations in the query, in an attempt to capture all possible relevant articles. This results in an initial search that may retrieve 10–100 times more records than are finally deemed to be relevant for inclusion in the systematic review.
The resources required to conduct systematic reviews are too great for several reasons: construction of the query is a complex process carried out by specialists; typical searches return many articles that will ultimately be excluded from the systematic review; and a comprehensive search across multiple databases is time-consuming and incurs considerable manual work in de-duplication and other tasks necessary to clean up the search results. Metta is only the first step in a pipeline project that creates a series of computer-assisted tools to assist systematic reviewers (another step of which is to re-tag clinical trial articles with study design labels). The pipeline is intended to offer an alternative way of performing the literature search and initial review for a systematic review, one that reduces resource use in several of the most time- and resource-consuming areas. We felt that it was more important to keep Metta simple and predictable in its output, rather than designing it to handle extremely large sets of retrieved records (~5,000 or more) or to facilitate the entry of extremely complex queries by users via an advanced query builder.
Thus, for systematic reviewers to be expected to adopt Metta on a routine basis, it will be necessary to re-engineer multiple steps in the process by which articles are queried, retrieved, filtered and examined for inclusion. For example, better study design annotation and filtering of articles retrieved by Metta searches should give overall high-recall retrieval with better precision than is currently obtained by conventional searches, and thus largely obviate the need for reviewers to formulate highly complex queries. In any case, one of us (NRS) has found that it is easier simply to build up complex queries in a blank Word document and then cut-and-paste them into the Metta query window, than to build up queries step by step within an advanced query page.
a) Validating permissions for each search engine.
b) Submitting queries to each search engine.
c) Ensuring that a meaningful result is returned in real time.
d) Exporting retrieved records.
This includes export of full bibliographic records in XML format intended for use by automated informatics processing tools residing later in the pipeline project. We modified the XML format used by PubMed so that it was applicable across all of the databases. In addition, we created a separate link containing a text file of abbreviated bibliographic information in BibTeX format, which can be readily imported into commercial reference manager software. A significant issue in this module is to identify articles that are retrieved from multiple databases. This allows duplicate results to be removed (only one copy is kept), which saves time for systematic reviewers. Identification of multiple records that correspond to the same real-world entity is known as entity identification and record linkage, and the problem has received much attention in the database community (e.g., [15, 16]). For Metta, the problem is solved in three steps. First, the semantic data units within each record are identified. For example, a citation record may have semantic data units Author, Title, Journal Name, etc. Second, a set of distance/similarity functions is used to compute how well two values of the same semantic data unit from different records are matched. For example, an edit-distance function can be used to compute how similar two titles are. Third, for each pair of records, based on how similar their corresponding data units are, a decision is made on whether the two records are matched. Once citation records corresponding to the same article are identified, de-duplication in Metta is carried out by first retaining records that were indexed in PubMed (because of its better and more elaborate indexing scheme) and then following a priority order: EMBASE, Cochrane, CINAHL and lastly, PsycINFO. The details about the algorithms and code for de-duplication in Metta are described in a separate publication.
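The three-step matching and the prioritized retention described above can be sketched as follows. This is a simplified illustration, not the algorithm from the separate publication: the record fields (`title`, `year`, `source`), the 0.9 title-similarity threshold, the exact-year requirement and the greedy pairwise comparison are all assumptions made for the example. Only the use of an edit-distance function on titles and the priority order of the databases come from the text.

```python
# Priority order stated in the text: PubMed first, PsycINFO last.
PRIORITY = ["PubMed", "EMBASE", "Cochrane", "CINAHL", "PsycINFO"]


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def is_same_article(r1, r2, threshold=0.9):
    """Assumed decision rule: same year and near-identical title."""
    return (r1["year"] == r2["year"]
            and similarity(r1["title"], r2["title"]) >= threshold)


def deduplicate(records):
    """Keep one record per article, preferring higher-priority sources."""
    kept = []
    for rec in sorted(records, key=lambda r: PRIORITY.index(r["source"])):
        if not any(is_same_article(rec, k) for k in kept):
            kept.append(rec)
    return kept
```

Because the records are visited in priority order, a citation returned by both PubMed and EMBASE is kept only in its PubMed form. Comparing every record against every retained record is quadratic in the number of records, which is one reason a production system would likely need a blocking step of the kind discussed in the entity-resolution literature.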
Metta is a working demonstration site that continues to undergo field testing and modification to serve systematic reviewers. However, a number of important challenges will need to be tackled if Metta is to be deployed as a stable production service:
A metasearch engine for systematic reviewers is a deceptively simple endeavor, in which the technical issues are significant but the human issues are predominant. Users need to trust that a single query can effectively retrieve all relevant records from five heterogeneous databases. Each database employs a different indexing scheme, and each has its own way of carrying out query processing and expansion, which constantly changes and evolves over time. Using a metasearch engine also removes some flexibility: for example, when a person carries out a search directly through PubMed, the displayed results page allows iterative filtering of the retrieved set of records, which is not possible in Metta.
Our feeling is that a metasearch engine such as Metta can play a valuable role in speeding up the process of retrieving the initial set of records during the preparation of a systematic review, as part of an overall re-engineering of the process. One option is to allow the metasearch engine stage to identify articles based on subject matter relevance, and later stages to filter these based on such aspects as publication type and study design. It is important that clinical trial articles can be comprehensively re-tagged to indicate these attributes more reliably and with better granularity, and thus overcome a major limitation of current indexing schemes. In conclusion, Metta is envisioned as comprising one component of a pipeline of informatics tools that, taken together, may re-engineer the workflow of writing systematic reviews.
Thanks to Hong Wang for programming and Giovanni Lugli for evaluating the query interface. Thanks also to Samantha Roberts and Karla Soares-Weiser for their advice and guidance in literature searching. This study was supported by NIH grant R01 LM010817.
- 1. Meng W, Yu C: Advanced Metasearch Engine Technology. 2010, Morgan & Claypool: San Rafael, CA
- 2. Dragut EC, Meng W, Yu C: Deep Web Query Interface Understanding and Integration. 2012, Morgan & Claypool: San Rafael, CA
- 6. PubMed: PubMed. [http://www.ncbi.nlm.nih.gov/pubmed]
- 7. Islamaj Dogan R, Murray GC, Névéol A, Lu Z: Understanding PubMed user search behavior through log analysis. Database (Oxford) 2009, 2009:bap018
- 9. Edinger T, Cohen AM: A large-scale analysis of the reasons given for excluding articles that are retrieved by literature search during systematic review. AMIA Annu Symp Proc. 2013, in press
- 11. Cohen AM, Adam CE, Davis JM: Evidence-based medicine, the essential role of systematic reviews, and the need for automated text mining tools. Proc 1st ACM Int Symp. 2010, 376-380. doi:10.1145/1882992.1883046
- 16. Shu L, Lin C, Meng W, Han Y, Yu C, Smalheiser NR: A framework for entity resolution with efficient blocking. IEEE Inter Confe on Info Reuse and Integ (IRI). 2012, 431-440.
- 21. Hopewell S, Clarke M, Lefebvre C, Scherer R: Handsearching versus electronic searching to identify reports of randomized trials. Cochrane Database Syst Rev. 2007, 2: MR000001
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.