Capisco: low-cost concept-based access to digital libraries


In this article, we present the conceptual design and report on the implementation of Capisco, a low-cost approach to concept-based access to digital libraries. Capisco avoids the need for complete ontology-based semantic markup of documents by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. The Capisco system disambiguates the terms in the documents using their semantics and context and identifies the relevant CiC concepts. Search queries, in turn, are disambiguated interactively so as to make full use of the scholar's domain knowledge. For established digital library systems, completely replacing, or even significantly changing, the document retrieval mechanism (document analysis, indexing strategy, query processing, and query interface) would require major technological effort and would most likely be disruptive. In addition to presenting Capisco, we therefore describe ways to harness the results of our semantic analysis and disambiguation while retaining the existing keyword-based search and lexicographic index. We engineer this so that the output of the semantic analysis (performed off-line) can be imported directly into existing digital library metadata and index structures, and thus incorporated without architectural modifications.
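The last point of the abstract, importing off-line semantic analysis results into an unchanged keyword index, can be illustrated with a minimal sketch. It is not the Capisco implementation: the function names, the `concepts` metadata field, and the `concept:` identifiers are all hypothetical, and the toy inverted index merely stands in for whatever index structure an existing digital library already uses.

```python
from collections import defaultdict


def build_index(docs):
    """Plain keyword index: token -> sorted list of document ids.

    Stands in for an existing, unmodified indexing component.
    """
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for text in fields.values():
            for token in text.lower().split():
                index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def import_concepts(docs, concept_annotations, field="concepts"):
    """Copy off-line concept annotations into an ordinary metadata field.

    The field name and the identifier format are assumptions made
    for this illustration only.
    """
    enriched = {doc_id: dict(fields) for doc_id, fields in docs.items()}
    for doc_id, concept_ids in concept_annotations.items():
        enriched[doc_id][field] = " ".join(concept_ids)
    return enriched


docs = {
    "d1": {"text": "the bank of the river flooded"},
    "d2": {"text": "the bank raised interest rates"},
}

# Hypothetical output of an off-line disambiguation step.
concept_annotations = {
    "d1": ["concept:river_bank"],
    "d2": ["concept:bank_finance"],
}

# The indexer itself is untouched; it simply sees one more metadata field.
index = build_index(import_concepts(docs, concept_annotations))
```

In this sketch a keyword query for `bank` still matches both documents, while a query for one of the made-up concept identifiers matches only the document annotated with that sense. The concept identifiers behave like ordinary index terms, which is why no change to the retrieval architecture is needed in this simplified setting.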



Notes

  6. Technical non-experts are users who are domain experts but are not familiar with the technical detail of semantic concepts [30].

  10. These documents and other test collections have been provided by the HathiTrust.

  12. For simplicity, we abstract from the precise locations in which the terms appear on each page.

  13. Such as the advanced search for HathiTrust items at

  14. The references link to the publications in which the corpora were first introduced.


References

  1. Cunningham, S.J., Hinze, A.M., Bainbridge, D., Taube-Schock, C., Ryan, T.: Building heritage document collections for Pacific Island nations using semantic-enriched search. In: Proceedings of the Samoa Conference III. Sāmoa: National University of Sāmoa (2014)
  2. Duineveld, A.J., Stoter, R., Weiden, M.R., Kenepa, B., Benjamins, V.R.: Wondertools? A Comparative Study of Ontological Engineering Tools
  3. Airio, E., Järvelin, K., Saatsi, P., Kekäläinen, J., Suomela, S.: Ciri-an ontology-based query interface for text retrieval. In: Web Intelligence: Proceedings of the 11th Finnish Artificial Intelligence Conference, Citeseer (2004)
  4. Apperley, M., Cunningham, S.J., Keegan, T.T., Witten, I.H.: Niupepa: a historical newspaper collection. Commun. ACM 44(5), 86–87 (2001)
  5. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval—The Concepts and Technology Behind Search, 2nd edn. Addison-Wesley, Reading (2011)
  6. Bainbridge, D., Don, K.J., Buchanan, G.R., Witten, I.H., Jones, S., Jones, M., Barr, M.I.: Dynamic digital library construction and configuration. In: Heery, R., Lyon, L. (eds.) Proceedings of the Research and Advanced Technology for Digital Libraries: 8th European Conference, ECDL 2004, Bath, UK, September 12–17, 2004, pp. 1–13. Springer, Berlin (2004)
  7. Berrios, D.C.: Methods for Semi-automated Index Generation for High Precision Information Retrieval. PhD thesis, Stanford University (2001)
  8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
  9. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: 11th Conference of the European Chapter of the Association for Computational Linguistics, ACL, pp. 9–16 (2006)
  10. Campbell, I.: The Ostensive Model of Developing Information-Needs. PhD thesis, University of Glasgow (2000)
  11. Carpineto, C., Romano, G.: A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44(1), 1:1–1:50 (2012)
  12. Churchill, W.: Niue: a reconnaissance. Bull. Am. Geogr. Soc. 40(3), 150–156 (1908)
  13. Cimiano, P., Schultz, A., Sizov, S., Sorg, P., Staab, S.: Explicit versus latent concept models for cross-language information retrieval. IJCAI 9, 1513–1518 (2009)
  14. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Association for Computational Linguistics, Prague, Czech Republic, pp. 708–716 (2007)
  15. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
  16. Downie, J.S., Cole, T., Senseney, M., Jett, J., Page, K., Hinze, A., Muñoz, T., Audenaert, N.: Workset Creation for Scholarly Analysis: Recommendations and Prototyping Project Reports. University of Illinois at Urbana-Champaign, Tech. rep. (2015)
  17. Dugan, J.M., Berrios, D.C., Liu, X., Kim, D.K., Kaizer, H., Fagan, L.M.: Automation and integration of components for generalized semantic markup of electronic medical texts. In: Proceedings of the AMIA Symposium, American Medical Informatics Association, pp. 736–740 (1999)
  18. Efthimiadis, E.N.: Interactive query expansion: a user-based evaluation in a relevance feedback environment. J. Am. Soc. Inf. Sci. 51(11), 989–1003 (2000)
  19. El-Beltagy, S.R., Rafea, A.: KP-Miner: a keyphrase extraction system for English and Arabic documents. Inf. Syst. 34(1), 132–144 (2009)
  20. Fellbaum, C.: WordNet. Wiley, New York (1998)
  21. Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G.: Ontology change: classification and survey. Knowl. Eng. Rev. 23(02), 117–152 (2008)
  22. Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: The vocabulary problem in human–system communication. Commun. ACM 30(11), 964–971 (1987)
  23. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611. Morgan Kaufmann (2007)
  24. Ganea, O.E., Ganea, M., Lucchi, A., Eickhoff, C., Hofmann, T.: Probabilistic bag-of-hyperlinks model for entity linking. In: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 927–938 (2016)
  25. Griffiths, T.L., Steyvers, M., Blei, D.M., Tenenbaum, J.B.: Integrating topics and syntax. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS’04), pp. 537–544. MIT Press, Cambridge, MA, USA (2004)
  26. Grishman, R., Sundheim, B.: Message understanding conference—6: a brief history. In: Proceedings of the 16th Conference on Computational Linguistics, ACL, COLING ’96, pp. 466–471 (1996)
  27. Guha, R., McCool, R., Miller, E.: Semantic search. In: Proceedings of the 12th International Conference on World Wide Web. ACM, pp. 700–709 (2003)
  28. Guppy, H.B.: Coral Islands and Savage Myths. Victoria Institute and Philosophical Society of Great Britain, London (1889)
  29. Harris, P., Matamua, R., Smith, T., Kerr, H., Waaka, T.: A review of Māori astronomy in Aotearoa-New Zealand. J. Astron. Hist. Herit. 16(3), 325–336 (2013)
  30. Hinze, A., Heese, R., Luczak-Rösch, M., Paschke, A.: Semantic enrichment by non-experts: usability of manual annotation tools. In: The Semantic Web—ISWC 2012, pp. 165–181. Springer, Berlin (2012)
  31. Hinze, A., Heese, R., Schlegel, A., Luczak-Rösch, M.: User-defined semantic enrichment of full-text documents: experiences and lessons learned. In: Theory and Practice of Digital Libraries, pp. 209–214. Springer, Berlin (2012)
  32. Hinze, A., Taube-Schock, C., Bainbridge, D., Cunningham, S.J., Downie, J.S.: Introducing Capisco: A semantically-enhanced search and discovery system for large-scale text corpora. ACM SIGWEB Newsl. Autumn 2015, 4:1–4:14 (2015)
  33. Hinze, A., Taube-Schock, C., Bainbridge, D., Matamua, R., Downie, J.S.: Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation. In: Proceedings of the ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 147–156. ACM (2015)
  34. Hinze, A., Bainbridge, D., Cunningham, S.J., Downie, J.S.: Low-cost semantic enhancement to digital library metadata and indexing: simple yet effective strategies. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 93–102. ACM (2016)
  35. Hinze, A., Coleman, M., Cunningham, S.J., Bainbridge, D.: Semantic bookworm: mining literary resources revisited. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 227–228. ACM (2016b)
  36. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 782–792 (2011)
  37. Hovy, E., Navigli, R., Ponzetto, S.P.: Collaboratively built semi-structured content and artificial intelligence: the story so far. Artif. Intell. 194, 2–27 (2013)
  38. Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering documents using a Wikipedia-based concept representation. In: Proceedings of 13th Pacific-Asia Conference, pp. 628–636. Springer, Berlin (2009)
  39. Jean-Louis, L., Zouaq, A., Gagnon, M., Ensan, F.: An assessment of online semantic annotators for the keyword extraction task. In: PRICAI 2014: Trends in Artificial Intelligence, pp. 548–560. Springer, Berlin (2014)
  40. Johnes, A.J.: Johnes on the causes which have produced dissent from the established church in the principality of Wales. Henry Hooper, London (1870)
  41. Don, K.J., Bainbridge, D., Witten, I.H.: The Design of Greenstone 3: An Agent Based Dynamic Digital Library. Tech. rep., Department of Computer Science, University of Waikato (2002)
  42. Karger, D.: Unference: UI (Not AI) as Key to the Semantic Web. Panel on Interaction Design Grand Challenges and the Semantic Web, at the 3rd International Semantic Web User Interaction Workshop (2006)
  43. Karger, D., Schraefel, M.: The pathetic fallacy of RDF. In: International Workshop on the Semantic Web and User Interaction (SWUI) 2006 (2006)
  44. Kim, D.K., Fagan, L.M., Jones, K.T., Berrios, D.C., Yu, V.L.: MYCIN II: design and implementation of a therapy reference with complex content-based indexing. In: Proceedings of the AMIA Symposium, pp. 175–179. American Medical Informatics Association (1998)
  45. Köhncke, B., Balke, W.T.: Context-sensitive ranking using cross-domain knowledge for chemical digital libraries. In: International Conference on Theory and Practice of Digital Libraries, pp. 285–296. Springer, Berlin (2013)
  46. Köhncke, B., Siehndel, P., Balke, W.T.: Bridging the gap–using external knowledge bases for context-aware document retrieval. In: International Conference on Asian Digital Libraries, pp. 11–20. Springer, Berlin (2013)
  47. Kohomban, U.S., Lee, W.S.: Learning semantic classes for word sense disambiguation. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp. 34–41 (2005)
  48. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 457–466. ACM (2009)
  49. Lei, Y., Uren, V., Motta, E.: Semsearch: a search engine for the semantic web. In: International Conference on Knowledge Engineering and Knowledge Management, pp. 238–245. Springer, Berlin (2006)
  50. Leonard, P.: Mining large datasets for the humanities. In: World Library and Information Congress. International Federation of Library Associations (2014)
  51. Lin, Y., Michel, J.B., Aiden, E.L., Orwant, J., Brockman, W., Petrov, S.: Syntactic annotations for the Google books ngram corpus. In: Proceedings of the ACL 2012 System Demonstrations, pp. 169–174. ACL (2012)
  52. Lytras, M., Sicilia, M., Davies, J., Kashyap, V., Stojanovic, N.: On the conceptualisation of the query refinement task. Library Manag. 26(4/5), 281–294 (2005)
  53. Mäkelä, E.: Survey of semantic search research. In: Proceedings of the Seminar on Knowledge Management on the Semantic Web. Department of Computer Science, University of Helsinki, Helsinki (2005)
  54. Mangold, C.: A survey and classification of semantic search approaches. Int. J. Metadata Semant. Ontol. 2(1), 23–34 (2007)
  55. Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1318–1327. ACL (2009)
  56. Mihalcea, R., Csomai, A.: Wikify! Linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 233–242. ACM (2007)
  57. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the ACM Conference on Information and Knowledge Management, pp. 509–518. ACM (2008)
  58. Milne, D., Witten, I.H.: An open-source toolkit for mining Wikipedia. Artif. Intell. 194, 222–239 (2013)
  59. Milne, D., Medelyan, O., Witten, I.H.: Mining domain-specific thesauri from Wikipedia: a case study. In: Proceedings IEEE/WIC/ACM International Conference on Web Intelligence, pp. 442–448. IEEE (2006)
  60. Milne, D.N., Witten, I.H., Nichols, D.M.: A knowledge-based search engine powered by Wikipedia. In: Proceedings of the ACM Conference on Information and Knowledge Management, pp. 445–454. ACM (2007)
  61. Moldovan, D.I., Mihalcea, R.: Using WordNet and lexical operators to improve internet searches. IEEE Internet Comput. 4(1), 34–43 (2000)
  62. Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, pp. 219–226. Springer, Berlin (2009)
  63. Nakayama, K., Hara, T., Nishio, S.: A thesaurus construction method from large scale web dictionaries. In: 21st International Conference on Advanced Information Networking and Applications, 2007 (AINA’07), pp. 932–939. IEEE (2007)
  64. Navigli, R.: Word sense disambiguation: a survey. ACM Comput. Surv. (CSUR) 41(2), 10:1–10:69 (2009)
  65. O’Brien, R.B. (ed.): Home Rule, Speeches by John Redmond. T. F. Unwin, London (1910)
  66. Peat, H.J., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval systems. J. Am. Soc. Inf. Sci. 42, 378–383 (1991)
  67. Plale, B., Prakash, A., McDonald, R.: The Data Capsule for Non-consumptive Research: Final report. Tech. rep., Indiana University (2015)
  68. Potthast, M., Stein, B., Anderka, M.: A Wikipedia-based multilingual retrieval model. In: European Conference on Information Retrieval, pp. 522–530. Springer, Berlin (2008)
  69. Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and global algorithms for disambiguation to Wikipedia. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1375–1384. ACL (2011)
  70. Rito, J.S.T., Healy, S.M. (eds.): Proceedings of the Traditional Knowledge Conference 2008: Traditional Knowledge and Gateways to Balanced Relationships. New Zealand’s Māori Centre of Research Excellence (2008)
  71. Rizzo, G., Troncy, R.: Nerd: evaluating named entity recognition tools in the web of data. In: ISWC’11, Workshop on Web Scale Knowledge Extraction (WEKEX’11) (2011)
  72. Scheau, C., Rebedea, T., Chiru, C., Trausan-Matu, S.: Improving the relevance of search engine results by using semantic information from Wikipedia. In: 9th RoEduNet IEEE International Conference, pp. 151–156. IEEE (2010)
  73. Shapira, B., Ofek, N., Makarenkov, V.: Exploiting Wikipedia for information retrieval tasks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’15, pp. 1137–1140. ACM (2015)
  74. Silverstein, C., Henzinger, M., Marais, H., Moricz, M.: Analysis of a very large altavista query log. ACM SIGIR Forum 33, 6–12 (1998)
  75. Sinkkilä, R., Suominen, O., Hyvönen, E.: Automatic semantic subject indexing of web documents in highly inflected languages. In: The Semantic Web: Research and Applications, pp. 215–229. Springer, Berlin (2011)
  76. Soderland, S., Aronow, D., Fisher, D., Aseltine, J., Lehnert, W.: Machine Learning of Text Analysis Rules for Clinical Records. Tech. rep., Dept. of Computer Science, University of Massachusetts (1995)
  77. Sorg, P., Cimiano, P.: Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng. 74, 26–45 (2012)
  78. Sowa, J.F.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley Longman, Reading (1984)
  79. Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant. Anal. 427(7), 424–440 (2007)
  80. Stojanovic, N.: Information-need driven query refinement. Web Intell. Agent Syst. 3(3), 155–169 (2005)
  81. Stojanovic, N., Studer, R., Stojanovic, L.: An approach for step-by-step query refinement in the ontology-based information retrieval. In: International Conference on Web Intelligence, WI’04, pp. 36–43. IEEE (2004)
  82. Sykes, W.R.: Contributions to the Flora of Niue. Department of Scientific and Industrial Research, Christchurch (1970)
  83. Tregear, E.: The Maori Race. AD Willis, Wanganui (1904)
  84. Voorhees, E.M.: Query expansion using lexical-semantic relations. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 61–69. Springer, Berlin (1994)
  85. Wallach, H.M.: Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 977–984. ACM (2006)
  86. Wei, W., Barnaghi, P.M., Bargiela, A.: Search with meanings: an overview of semantic search systems. Int. J. Commun. SIWN 3, 76–82 (2008)
  87. Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30. AAAI Press, Chicago (2008)
  88. Witten, I.H., Boddie, S.J., Bainbridge, D., McNab, R.J.: Greenstone: a comprehensive open-source digital library software system. In: Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 113–121. ACM, New York (2000)
  89. Witten, I.H., Bainbridge, D., Nichols, D.M.: How to Build a Digital Library, 2nd edn. Morgan Kaufmann, San Francisco (2009)
  90. Yeh, E., Ramage, D., Manning, C.D., Agirre, E., Soroa, A.: Wikiwalk: random walks on Wikipedia for semantic relatedness. In: Proceedings of the 2009 Workshop on Graph-Based Methods for Natural Language Processing, pp. 41–49. Association for Computational Linguistics (2009)
  91. Yesilada, Y., Bechhofer, S., Horan, B.: Cohse: Dynamic Linking of Web Resources. Tech. rep., Sun Microsystems Inc. (2007)
  92. Zhang, L.: Interactive Retrieval Based on Wikipedia Concepts (2014). arXiv preprint arXiv:1412.8281


Acknowledgements

The authors thank the Andrew W. Mellon Foundation for its support of this work (Grant Reference Numbers 21300666 and 41500672). We also thank the staff at the HathiTrust Research Center for their assistance, and Tom Ryan, a humanities scholar at the University of Waikato.

Author information



Corresponding author

Correspondence to Annika Hinze.

Additional information

This manuscript is an extension of the authors’ earlier work presented at the ACM/IEEE-CS Joint Conference on Digital Libraries: [33] (JCDL 2015) and [34] (JCDL 2016).

Cite this article

Hinze, A., Bainbridge, D., Cunningham, S.J. et al. Capisco: low-cost concept-based access to digital libraries. Int J Digit Libr 20, 307–334 (2019).


Keywords

  • Semantic analysis
  • Disambiguation
  • Indexing
  • Semantic enrichment
  • Metadata enrichment