The VLDB Journal, Volume 18, Issue 1, pp 277–301

Context-based literature digital collection search

  • Nattakarn Ratprasartporn
  • Jonathan Po
  • Ali Cakmak
  • Sulieman Bani-Ahmad
  • Gultekin Ozsoyoglu
Regular Paper


We identify two issues with searching literature digital collections within digital libraries: (a) There are no effective paper-scoring and ranking mechanisms. Without a scoring and ranking system, users are often forced to scan a large and diverse set of publications listed as search results, and may miss the important ones. (b) Topic diffusion is a common problem: publications returned by a keyword-based search query often fall into multiple topic areas, not all of which are of interest to users. This paper proposes a new literature digital collection search paradigm that effectively ranks search outputs while controlling the topic diversity of keyword-based search query outputs. Our approach is as follows. First, during pre-querying, publications are assigned to pre-specified ontology-based contexts, and query-independent context scores are attached to papers with respect to the assigned contexts. When a query is posed, relevant contexts are selected, search is performed within the selected contexts, context scores of publications are revised into relevancy scores with respect to the query at hand and the context that they are in, and query outputs are ranked within each relevant context. This way, we (1) minimize query output topic diversity, (2) reduce query output size, (3) decrease user time spent scanning query results, and (4) increase query output ranking accuracy. Using genomics-oriented PubMed publications as the testbed and Gene Ontology terms as contexts, our experiments indicate that the proposed context-based search approach produces search results with up to 50% higher precision, and reduces the query output size by up to 70%.
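The two-phase flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the context names, paper identifiers, scores, the term-overlap matcher, and the linear blend of context score and text match are hypothetical stand-ins, not the score functions or data used in the paper.

```python
# Pre-query phase (hypothetical data): each context maps the papers assigned
# to it to a query-independent context score.
CONTEXT_SCORES = {
    "apoptosis": {"p1": 0.9, "p2": 0.4},
    "dna repair": {"p2": 0.8, "p3": 0.6},
}

# Hypothetical paper texts, keyed by paper id.
PAPERS = {
    "p1": "caspase activation during apoptosis",
    "p2": "dna damage triggers apoptosis and repair",
    "p3": "mismatch repair of dna",
}


def text_match(query, text):
    """Fraction of query terms found in the text (an assumed, simplistic matcher)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def select_contexts(query, threshold=0.5):
    """Pick contexts whose name matches the query well enough (assumed rule)."""
    return [ctx for ctx in CONTEXT_SCORES if text_match(query, ctx) >= threshold]


def context_search(query, weight=0.5):
    """Search inside the selected contexts and rank papers within each one.

    The relevancy score is an assumed linear blend of the pre-computed
    context score and the query/text match.
    """
    results = {}
    for ctx in select_contexts(query):
        ranked = []
        for paper_id, ctx_score in CONTEXT_SCORES[ctx].items():
            match = text_match(query, PAPERS[paper_id])
            if match == 0.0:
                continue  # the paper does not match the query at all
            relevancy = weight * ctx_score + (1.0 - weight) * match
            ranked.append((paper_id, relevancy))
        # Highest relevancy first; ties broken by paper id for determinism.
        ranked.sort(key=lambda item: (-item[1], item[0]))
        results[ctx] = ranked
    return results


if __name__ == "__main__":
    for context, papers in context_search("dna repair").items():
        print(context, papers)
```

Because the search runs only inside the selected contexts, papers assigned solely to unrelated contexts never appear in the output, which is how this kind of scheme limits topic diffusion and output size.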


Context-based search, Digital collections, Ontology, Context score, Ranking





Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Nattakarn Ratprasartporn (1)
  • Jonathan Po (1)
  • Ali Cakmak (1)
  • Sulieman Bani-Ahmad (1)
  • Gultekin Ozsoyoglu (1)
  1. Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, USA
