Text clustering to help knowledge acquisition from documents

  • Stéphane Lapalut
Eliciting Knowledge from Textual and Other Sources
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1076)


At the early stage of the knowledge acquisition process, interviews with experts produce a large amount of rich but ill-structured text. Knowledge engineers need tools to help them exploit these texts. We propose the use of a statistical method, top-down hierarchical classification, together with a new interpretation of its results. The statistical analysis originally proposed by M. Reinert [16, 17] gives two kinds of results: first, a segmentation of the texts that reflects their “semantic contexts”, which we use to bring out the structure of the texts; and second, classes of significant terms belonging to these contexts, which can be related to the experts or to their specialities. In this paper, we describe the method, its empirical validity and a comparison with similar approaches, and its uses, with examples and results. We conclude with some research directions for extending the exploitation of the analysis results.


Keywords: Knowledge Acquisition, Semantic Context, Knowledge Engineer, Text Corpus, Expository Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. J.P. Benzecri. L'analyse des Données. Dunod, Paris, 1973.
  2. D. Bourigault. Lexter, a terminology extraction software for knowledge acquisition from texts. In Proceedings of the 9th Knowledge Acquisition for Knowledge Based Systems Workshop, Banff, Canada, 1995.
  3. E. Charniak. Statistical language learning. Bradford books. The MIT Press, Cambridge, Mass., 1993.
  4. J. Chaumier and M. Dejean. L'indexation assistée par ordinateur, principes et méthodes. Documentaliste — Sciences de l'information, 29(1):3–6, 1992.
  5. C. Desjardins, C. Riccardi-Rigault, P. Plante, L. Dumas, and F. Henri. ACTIA. In F. Maurer, editor, 2nd Knowledge Engineering Forum, number SFB 501 Bericht 01/96, Kaiserslautern University, Germany, 1996.
  6. S.K. Fall, T.C. Crawford, S.L. Souders, and M.J. Rabin. Automated knowledge acquisition technics for intelligence analysts. AAI, 1095 of SPIE:66–77, 1989.
  7. B.R. Gaines and M.L.G. Shaw. Using knowledge acquisition and representation tools to support scientific communities. In Proceedings of the Twelfth National Conference on Artificial Intelligence, volume 1, pages 707–712. AAAI Press, 1994.
  8. M.A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the ACL, Las Cruces, NM, June 1994.
  9. P.S. Jacobs. Using statistical methods to improve knowledge-based news categorization. IEEE Expert, 8(2):13–23, April 1993.
  10. K.A. Kaufman, R.S. Michalski, and L. Kerschberg. Knowledge extraction from database: design principles of the INLEN system. In Proceedings of the 6th International Symposium on Methodology for Intelligent Systems, number 542 in LNCS, pages 152–161, Berlin, 1991. Springer-Verlag.
  11. M. Kendall and A. Stuart. Inference and Relationship, volume 2 of The Advanced Theory of Statistics. Charles Griffin and Co Ltd, 1979.
  12. S. Lapalut. Text clustering to support knowledge acquisition from documents. Technical Report RR-2639, INRIA U.R. de Sophia Antipolis, BP 93, 06902 Sophia Antipolis Cedex, 1995.
  13. S. Lapalut. How to handle multiple expertise from several experts: a general text clustering approach. In F. Maurer, editor, 2nd Knowledge Engineering Forum, number SFB 501 Bericht 01/96, Kaiserslautern University, Germany, 1996.
  14. B. Moulin and D. Rousseau. Automated knowledge acquisition from regulatory texts. IEEE Expert, 7(5):27–35, October 1992.
  15. M.S. Register and N. Kannan. A hybrid architecture for text classification. In Fourth International Conference on Tools with Artificial Intelligence, TAI'92, pages 286–92, Arlington, VA, USA, 1992. IEEE Comput. Soc. Press.
  16. M. Reinert. Classification descendante hiérarchique pour l'analyse de contenu et traitement statistique de corpus. PhD thesis, Université Paris 6, Paris, 1979.
  17. M. Reinert. Notice du logiciel ALCESTE, version 2.0, 1992.
  18. M.L.G. Shaw and B.R. Gaines. KITTEN: Knowledge initiation and transfer tools for experts and novices. Int. J. Man-Machine Studies, 27:251–280, 1987.
  19. Z.B. Wu, L.S. Hsu, and C.L. Tan. A survey on statistical approaches to natural language processing. Technical report TRA4/92, Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, April 1992.

Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Stéphane Lapalut, Projet ACACIA, INRIA Sophia Antipolis, Sophia Antipolis Cedex, France
