Abstract
We identify two issues with searching literature digital collections within digital libraries. (a) There are no effective paper-scoring and ranking mechanisms; without them, users are often forced to scan a large and diverse set of publications listed as search results and may miss the important ones. (b) Topic diffusion is a common problem: publications returned by a keyword-based search query often fall into multiple topic areas, not all of which are of interest to the user. This paper proposes a new literature digital collection search paradigm that effectively ranks search outputs while controlling the topic diversity of keyword-based query output. Our approach is as follows. First, before querying, publications are assigned to pre-specified ontology-based contexts, and query-independent context scores are attached to papers with respect to their assigned contexts. When a query is posed, relevant contexts are selected, the search is performed within the selected contexts, the context scores of publications are revised into relevancy scores with respect to both the query at hand and the context they are in, and query outputs are ranked within each relevant context. This way, we (1) minimize query output topic diversity, (2) reduce query output size, (3) decrease user time spent scanning query results, and (4) increase query output ranking accuracy. Using genomics-oriented PubMed publications as the testbed and Gene Ontology terms as contexts, our experiments indicate that the proposed context-based search approach produces search results with up to 50% higher precision, and reduces the query output size by up to 70%.
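The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the paper's method: the toy papers and their context scores, the term-overlap rule for selecting contexts, the `text_match` stand-in for a real text-similarity measure, and the linear combination controlled by the hypothetical `weight` parameter.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> (text, {context name: query-independent context score}).
PAPERS = {
    "p1": ("dna repair pathway in yeast", {"DNA repair": 0.9}),
    "p2": ("dna binding protein structure", {"DNA binding": 0.8, "DNA repair": 0.2}),
    "p3": ("repair of dna damage after radiation", {"DNA repair": 0.6}),
    "p4": ("cell cycle regulation", {"cell cycle": 0.7}),
}


def text_match(query, text):
    """Fraction of query terms found in the text (stand-in for a real similarity measure)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def build_context_index(papers):
    """Pre-query phase: group papers by context, keeping their context scores."""
    index = defaultdict(dict)
    for pid, (_, ctx_scores) in papers.items():
        for ctx, score in ctx_scores.items():
            index[ctx][pid] = score
    return index


def select_contexts(query, index):
    """Query phase, step 1: keep contexts whose name shares a term with the query (assumed rule)."""
    q_terms = set(query.lower().split())
    return [ctx for ctx in index if q_terms & set(ctx.lower().split())]


def context_search(query, papers, index, weight=0.5):
    """Query phase, steps 2-4: search inside each selected context, turn context
    scores into relevancy scores, and rank the output per context."""
    results = {}
    for ctx in select_contexts(query, index):
        ranked = []
        for pid, ctx_score in index[ctx].items():
            match = text_match(query, papers[pid][0])
            if match == 0.0:
                continue  # paper does not match the query at all
            # Assumed combination: weighted sum of context score and text match.
            relevancy = weight * ctx_score + (1 - weight) * match
            ranked.append((pid, round(relevancy, 3)))
        ranked.sort(key=lambda item: item[1], reverse=True)
        results[ctx] = ranked
    return results


if __name__ == "__main__":
    index = build_context_index(PAPERS)
    for ctx, ranked in context_search("dna repair", PAPERS, index).items():
        print(ctx, ranked)
```

For the query "dna repair", the sketch never searches the "cell cycle" context, so `p4` cannot appear in the output; this is the sense in which restricting search to selected contexts can reduce both output size and topic diversity.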
Cite this article
Ratprasartporn, N., Po, J., Cakmak, A. et al. Context-based literature digital collection search. The VLDB Journal 18, 277–301 (2009). https://doi.org/10.1007/s00778-008-0099-9
DOI: https://doi.org/10.1007/s00778-008-0099-9