How Ontology Based Information Retrieval Systems May Benefit from Lexical Text Analysis

  • Sylvie Ranwez
  • Benjamin Duthil
  • Mohameth François Sy
  • Jacky Montmain
  • Patrick Augereau
  • Vincent Ranwez
Part of the Theory and Applications of Natural Language Processing book series (NLP)


The exponential growth of available electronic data is of little use without efficient tools to retrieve the right information at the right time. This is especially crucial in the context of decision making (e.g. for politicians), innovation (e.g. for scientists and industry) or economic development (e.g. for market or competition studies). It is now widely acknowledged that information retrieval systems (IRSs) need to take semantics into account. In this context, semantic Web technologies have spread rapidly and gained wide acceptance. This chapter surveys semantic-based methodologies designed to efficiently retrieve and exploit information. Some of them, based on terminologies, are suited to open contexts and deal with heterogeneous and unstructured data, while others, based on taxonomies or ontologies, are semantically richer but require a formal knowledge representation of the studied domain. Hence, a continuum of solutions exists, from terminology-based to ontology-based IRSs. These approaches are often seen as competing and mutually exclusive, but this chapter asserts that their advantages may be efficiently combined in a hybrid solution built upon a domain ontology. The original approach presented here benefits from both lexical and ontological document descriptions, and combines them in a software architecture dedicated to information retrieval in specific domains. Relevant documents are first identified via their conceptual indexing based on the domain ontology, and then each document is segmented to highlight the text fragments that deal with users’ information needs. The system thus explains why these documents have been chosen and facilitates end-user information gathering.


Domain Ontology; Query Term; Information Retrieval System; Text Segmentation; Ontology Concept
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work is partially supported by the AVieSan national program (French national alliance for life sciences and health) and by the French Agence Nationale de la Recherche “Investissements d’avenir/Bioinformatique” [ANR-10-BINF-01-02 “Ancestrome”].



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Sylvie Ranwez (1, corresponding author)
  • Benjamin Duthil (1)
  • Mohameth François Sy (1)
  • Jacky Montmain (1)
  • Patrick Augereau (3)
  • Vincent Ranwez (2)

  1. LGI2P Research Center from Ecole des Mines d’Alès, Parc scientifique G. Besse, Nîmes Cedex 1, France
  2. SupAgro Montpellier (UMR AGAP), Montpellier Cedex 1, France
  3. IRCM, Institut de Recherche en Cancérologie de Montpellier, Inserm U896 and Université Montpellier 1, CRLC Val d’Aurelle Paul Lamarque, Montpellier, France
