Text Segmentation into Paragraphs Based on Local Text Cohesion
The problem of automatic text segmentation divides into two distinct problems: thematic segmentation into fairly large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method for splitting text into paragraphs in a reasonable way, based on a text cohesion measure. Specifically, we propose a method for the quantitative evaluation of text cohesion that relies on a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in the text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing the result with a specially constructed threshold.
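The four steps named in the abstract (cohesion evaluation, smoothing, normalization, threshold comparison) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy `links` set standing in for the collocation database, the moving-average smoothing, the max-based normalization, and the constant threshold are all assumptions made for illustration, since the abstract does not specify how the cohesion function or the threshold is constructed.

```python
from typing import FrozenSet, List, Set


def cohesion(sentences: List[List[str]], links: Set[FrozenSet[str]]) -> List[float]:
    """Raw cohesion at each gap between sentence i and sentence i + 1.

    Assumption: cohesion is the number of word pairs across the gap that are
    linked in the (hypothetical) collocation resource `links`.
    """
    values = []
    for left, right in zip(sentences, sentences[1:]):
        count = sum(
            1
            for a in set(left)
            for b in set(right)
            if frozenset((a, b)) in links
        )
        values.append(float(count))
    return values


def smooth(values: List[float], radius: int = 1) -> List[float]:
    """Moving average over a window of up to 2 * radius + 1 gaps (an assumed choice)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1] by dividing by the maximum (an assumed choice)."""
    peak = max(values, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in values]


def paragraph_breaks(
    sentences: List[List[str]],
    links: Set[FrozenSet[str]],
    radius: int = 1,
    threshold: float = 0.5,
) -> List[int]:
    """Indices of sentences that start a new paragraph.

    Assumption: a constant threshold replaces the paper's specially
    constructed one; a break is placed wherever cohesion falls below it.
    """
    scores = normalize(smooth(cohesion(sentences, links), radius))
    return [i + 1 for i, score in enumerate(scores) if score < threshold]


# Toy data: two sentence pairs, each internally linked, with no links between the pairs.
toy_links = {
    frozenset(("tea", "drink")),
    frozenset(("tea", "cup")),
    frozenset(("rain", "umbrella")),
    frozenset(("rain", "heavy")),
}
toy_sentences = [
    ["strong", "tea"],
    ["drink", "cup"],
    ["rain", "fell"],
    ["heavy", "umbrella"],
]
breaks = paragraph_breaks(toy_sentences, toy_links, radius=0, threshold=0.5)
```

On such a short toy text the smoothing window is set to zero, because averaging over neighbouring gaps would wash out the single low-cohesion point; on longer texts a wider window would be expected to suppress spurious one-sentence dips instead.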