A Proximity Measure and a Clustering Method for Concept Extraction in an Ontology Building Perspective

  • Guillaume Cleuziou
  • Sylvie Billot
  • Stanislas Lew
  • Lionel Martin
  • Christel Vrain
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4203)


In this paper, we study the problem of clustering textual units in order to help an expert build a specialized ontology. This work was carried out in the context of a French project, Biotim, which handles botany corpora. Building an ontology, whether automatically or semi-automatically, is a difficult task. We focus on one of the main steps of that process, namely structuring the textual units occurring in the texts into classes likely to represent concepts of the domain. The approach we propose relies on the definition of a new non-symmetrical measure for evaluating the semantic proximity between lemmas, taking into account the contexts in which they occur in the documents. Moreover, we present an unsupervised classification algorithm designed for this task and this kind of data. The first experiments performed on botanical data have given relevant results.
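The abstract does not give the definition of the proximity measure. As a purely illustrative sketch, one simple way to obtain a non-symmetrical, context-based proximity between lemmas is to take the share of one lemma's contexts that are also observed with the other. The function names, the ratio itself, and the toy botanical pairs below are all assumptions made for illustration; they are not the measure defined in the paper.

```python
from collections import defaultdict


def build_contexts(pairs):
    """Map each lemma to the set of contexts it occurs in.

    `pairs` is an iterable of (lemma, context) tuples; this input shape
    is a hypothetical simplification, not the paper's representation.
    """
    contexts = defaultdict(set)
    for lemma, context in pairs:
        contexts[lemma].add(context)
    return dict(contexts)


def proximity(a, b, contexts):
    """Share of a's contexts that are also contexts of b.

    Assumed illustrative definition: |C(a) & C(b)| / |C(a)|.
    It is asymmetric because the denominator depends only on `a`.
    """
    ctx_a = contexts.get(a, set())
    ctx_b = contexts.get(b, set())
    if not ctx_a:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a)


# Toy data (invented for the example): lemma / context pairs.
pairs = [
    ("leaf", "oblong"), ("leaf", "glabrous"),
    ("leaf", "alternate"), ("leaf", "petiolate"),
    ("leaflet", "oblong"), ("leaflet", "glabrous"),
]
contexts = build_contexts(pairs)

p_leaflet_leaf = proximity("leaflet", "leaf", contexts)  # 2 shared / 2 contexts
p_leaf_leaflet = proximity("leaf", "leaflet", contexts)  # 2 shared / 4 contexts
```

With this toy data, every context of "leaflet" is shared with "leaf" but only half of the contexts of "leaf" are shared with "leaflet", so the two directions give different values. This asymmetry is the property the abstract highlights; how the paper actually weights contexts is not stated here.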


Textual Representation; Proximity Measure; Expression Tree; Textual Unit; Common Context
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Guillaume Cleuziou (1)
  • Sylvie Billot (1)
  • Stanislas Lew (1)
  • Lionel Martin (1)
  • Christel Vrain (1)

  1. Laboratoire d’Informatique Fondamentale (LIFO), Université d’Orléans, Orléans, France
