We propose and motivate a novel task: paragraph segmentation. We compare this task with text segmentation and discourse parsing, and present a system that performs it with high accuracy. We propose a variety of features and examine them in detail; the best models combine lexical, coherence, and structural features.


Keywords (machine-generated, not supplied by the authors): Wall Street Journal; Text Segmentation; Previous Sentence; Rhetorical Structure; Graph Boundary




Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Dmitriy Genzel, Department of Computer Science, Brown University, Providence, USA
