Text Segmentation into Paragraphs Based on Local Text Cohesion

  • Igor A. Bolshakov
  • Alexander Gelbukh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2166)


The problem of automatic text segmentation comprises two distinct problems: thematic segmentation into rather large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method for splitting text into paragraphs in a reasonable way, based on a text cohesion measure. Specifically, we propose a method for quantitative evaluation of text cohesion based on a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in the text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing the result with a specially constructed threshold.
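The abstract names four stages (cohesion evaluation, smoothing, normalization, thresholding) but gives no formulas, so the following is only a minimal sketch of how such a pipeline might look, not the authors' method. Everything specific in it is an assumption made for illustration: the toy `links` set standing in for the collocation database, the pair-counting cohesion score, the (0.25, 0.5, 0.25) smoothing kernel, normalization by the mean, and the constant threshold of 0.9 in place of the "specially constructed" one.

```python
from typing import FrozenSet, List, Sequence, Set


def cohesion_at_boundaries(
    sentences: List[List[str]], links: Set[FrozenSet[str]]
) -> List[float]:
    """Raw cohesion at each boundary between adjacent sentences.

    Assumed score: the number of word pairs (one word from each side)
    that are identical or listed in the toy link set.
    """
    values = []
    for left, right in zip(sentences, sentences[1:]):
        count = sum(
            1
            for a in set(left)
            for b in set(right)
            if a == b or frozenset((a, b)) in links
        )
        values.append(float(count))
    return values


def smooth(
    values: List[float], weights: Sequence[float] = (0.25, 0.5, 0.25)
) -> List[float]:
    """Weighted moving average; weights are renormalized at the edges."""
    radius = len(weights) // 2
    out = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for k, w in enumerate(weights):
            j = i + k - radius
            if 0 <= j < len(values):
                total += w * values[j]
                weight_sum += w
        out.append(total / weight_sum)
    return out


def normalize(values: List[float]) -> List[float]:
    """Divide by the mean so the threshold does not depend on text density."""
    mean = sum(values) / len(values) if values else 0.0
    if mean == 0.0:
        return [0.0] * len(values)
    return [v / mean for v in values]


def paragraph_breaks(
    sentences: List[List[str]],
    links: Set[FrozenSet[str]],
    threshold: float = 0.9,  # assumed constant, chosen for the toy data
) -> List[int]:
    """Indices of sentences that would start a new paragraph."""
    raw = cohesion_at_boundaries(sentences, links)
    scores = normalize(smooth(raw))
    return [i + 1 for i, score in enumerate(scores) if score < threshold]


if __name__ == "__main__":
    # Hypothetical toy data: tokenized sentences and a tiny link set.
    sentences = [
        ["strong", "tea"],
        ["drink", "tea", "cup"],
        ["heavy", "rain"],
        ["rain", "fall", "storm"],
    ]
    links = {
        frozenset(("tea", "cup")),
        frozenset(("tea", "drink")),
        frozenset(("rain", "storm")),
        frozenset(("rain", "fall")),
    }
    print(paragraph_breaks(sentences, links))
```

On the toy data the only boundary with no shared or linked words is the one between the second and third sentences, so that is where the sketch places a paragraph break. A real system would need lemmatization, a far larger link resource, and a threshold tuned to the text.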





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Igor A. Bolshakov (1)
  • Alexander Gelbukh (1)
  1. Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
