International Journal on Digital Libraries, Volume 20, Issue 4, pp 307–334

Capisco: low-cost concept-based access to digital libraries

  • Annika Hinze
  • David Bainbridge
  • Sally Jo Cunningham
  • Craig Taube-Schock
  • Rangi Matamua
  • J. Stephen Downie
  • Edie Rasmussen


In this article, we present the conceptual design and report on the implementation of Capisco, a low-cost approach to concept-based access to digital libraries. Capisco avoids the need for complete semantic document markup using ontologies by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Capisco disambiguates terms in the documents by their semantics and context and identifies the relevant CiC concepts. In addition, search queries are disambiguated interactively, to fully utilize the domain knowledge of the scholar. For established digital library systems, completely replacing, or even significantly changing, the document retrieval mechanism (document analysis, indexing strategy, query processing, and query interface) would require major technological effort and would most likely be disruptive. In addition to presenting Capisco, we describe ways to harness the results of our semantic analysis and disambiguation while retaining the existing keyword-based search and lexicographic index. We engineer this so that the output of semantic analysis (performed off-line) is suitable for direct import into existing digital library metadata and index structures, and can thus be incorporated without architecture modifications.
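
As a rough illustration of the last point, the sketch below shows one way the output of an off-line disambiguation step could be written into ordinary document metadata, so that an unmodified keyword index can search it. Everything in it is an assumption made for illustration: the toy `CONCEPTS` table, the context-overlap scoring, the `concept_...` token format, and the function names are hypothetical and are not taken from Capisco or its CiC network.

```python
import re

# Hypothetical toy concept table (NOT the CiC network):
# ambiguous term -> {candidate concept id: words typical of its context}
CONCEPTS = {
    "java": {
        "concept_java_island": {"island", "indonesia", "volcano", "jakarta"},
        "concept_java_language": {"programming", "code", "compiler", "class"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(term, context_words):
    """Return the candidate concept whose context words overlap most with
    the given words, or None if the term is unknown or nothing overlaps."""
    best_id, best_score = None, 0
    for concept_id, context in sorted(CONCEPTS.get(term, {}).items()):
        score = len(context & context_words)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id


def enrich_metadata(doc):
    """Return a copy of the document record with a 'concepts' metadata
    field that a keyword index can treat as ordinary searchable tokens."""
    words = tokenize(doc["text"])
    word_set = set(words)
    concepts = []
    for term in dict.fromkeys(words):  # unique terms, original order
        concept_id = disambiguate(term, word_set)
        if concept_id is not None and concept_id not in concepts:
            concepts.append(concept_id)
    enriched = dict(doc)
    enriched["concepts"] = concepts
    return enriched


doc = {"id": "d1", "text": "The volcano on Java is east of Jakarta."}
enriched = enrich_metadata(doc)
print(enriched["concepts"])  # ['concept_java_island']
```

In this sketch the concept identifier is an opaque token stored alongside the text, so a keyword search for `concept_java_island` would match through the existing index without changes to the retrieval architecture. Interactive query disambiguation could then amount to showing the user the candidate concepts for a query term and searching for the token they choose.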


Semantic analysis, Disambiguation, Indexing, Semantic enrichment, Metadata enrichment



The authors thank the Andrew W. Mellon Foundation for their support of this work (Grant Reference Numbers 21300666 and 41500672). We also thank the staff at the HathiTrust Research Center for their assistance, and Tom Ryan, a humanities scholar at the University of Waikato.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. University of Waikato, Hamilton, New Zealand
  2. University of Illinois, Urbana-Champaign, Urbana, USA
  3. University of British Columbia, Vancouver, Canada
