A Proximity Measure and a Clustering Method for Concept Extraction in an Ontology Building Perspective
In this paper, we study the problem of clustering textual units in order to help an expert build a specialized ontology. This work was carried out within a French project, Biotim, which handles botany corpora. Building an ontology, whether automatically or semi-automatically, is a difficult task. We focus on one of the main steps of that process: structuring the textual units occurring in the texts into classes likely to represent concepts of the domain. The approach we propose relies on a new non-symmetrical measure for evaluating the semantic proximity between lemmas, taking into account the contexts in which they occur in the documents. We also present an unsupervised classification algorithm designed for this task and this kind of data. The first experiments, performed on botanical data, have given relevant results.
Keywords: Textual Representation, Proximity Measure, Expression Tree, Textual Unit, Common Context