A Dynamic Programming Algorithm for Linear Text Segmentation

Journal of Intelligent Information Systems

Abstract

In this paper we introduce a dynamic programming algorithm which performs linear text segmentation by global minimization of a segmentation cost function which incorporates two factors: (a) within-segment word similarity and (b) prior information about segment length. We evaluate segmentation accuracy of the algorithm by precision, recall and Beeferman's segmentation metric. On a segmentation task which involves Choi's text collection, the algorithm achieves the best segmentation accuracy so far reported in the literature. The algorithm also achieves high accuracy on a second task which involves previously unused texts.
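
The idea of globally minimizing a segmentation cost can be illustrated with a minimal Python sketch. It is not the authors' implementation. The binary shared-word similarity, the Gaussian-style length penalty, the weight `gamma`, and the parameters `mu` and `sigma` are all assumptions chosen for illustration; the abstract says only that the cost combines within-segment word similarity with prior information about segment length.

```python
from math import inf
from typing import List, Set


def segment(sentences: List[Set[str]], mu: float = 3.0, sigma: float = 1.0,
            gamma: float = 0.2) -> List[int]:
    """Return segment end indices (exclusive) that minimize a total cost.

    Hypothetical cost of a segment covering sentences i..j-1:
        gamma * (len - mu)^2 / (2 sigma^2)  -  (1 - gamma) * S(i, j) / len
    where S(i, j) counts sentence pairs in the segment sharing a word.
    """
    n = len(sentences)
    # D[a][b] = 1 if sentences a and b share at least one word, else 0.
    D = [[1 if sentences[a] & sentences[b] else 0 for b in range(n)]
         for a in range(n)]

    # 2-D prefix sums so that any square block of D can be summed in O(1).
    P = [[0] * (n + 1) for _ in range(n + 1)]
    for a in range(n):
        for b in range(n):
            P[a + 1][b + 1] = D[a][b] + P[a][b + 1] + P[a + 1][b] - P[a][b]

    def cost(i: int, j: int) -> float:
        length = j - i
        similarity = P[j][j] - P[i][j] - P[j][i] + P[i][i]
        length_term = (length - mu) ** 2 / (2 * sigma ** 2)
        return gamma * length_term - (1 - gamma) * similarity / length

    # best[j] = minimal cost of segmenting the first j sentences.
    best = [inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j], back[j] = candidate, i

    # Walk the back-pointers to recover the optimal boundaries.
    boundaries = []
    j = n
    while j > 0:
        boundaries.append(j)
        j = back[j]
    return sorted(boundaries)
```

Because every candidate start `i` is tried for every end `j`, the recursion considers all possible segmentations in O(n²) cost evaluations, which is what makes the minimization global rather than greedy.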

References

  • Beeferman, D., Berger, A., and Lafferty, J. (1997a). A Model of Lexical Attraction and Repulsion. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (pp. 373–380).

  • Beeferman, D., Berger, A., and Lafferty, J. (1997b). Text Segmentation Using Exponential Models. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (pp. 35–46).

  • Beeferman, D., Berger, A., and Lafferty, J. (1999). Statistical Models for Text Segmentation. Machine Learning, 34, 177–210.

  • Blei, D.M. and Moreno, P.J. (2001). Topic Segmentation with an Aspect Hidden Markov Model. Tech. Rep. CRL 2001-07, COMPAQ Cambridge Research Lab.

  • Choi, F.Y.Y. (2000). Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (pp. 26–33).

  • Choi, F.Y.Y., Wiemer-Hastings, P., and Moore, J. (2001). Latent Semantic Analysis for Text Segmentation. In Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (pp. 109–117).

  • Francis, W.N. and Kucera, H. (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin.

  • Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman.

  • Hearst, M.A. (1993). TextTiling: A Quantitative Approach to Discourse Segmentation. Tech. Rep. 93/24, Dept. of Computer Science, University of California, Berkeley.

  • Hearst, M.A. (1994a). Context and Structure in Automated Full-Text Information Access. Ph.D. Thesis, Report No. UCB/CSD-94/836, Dept. of Computer Science, University of California, Berkeley.

  • Hearst, M.A. (1994b). Multi-Paragraph Segmentation of Expository Texts. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 9–16).

  • Hearst, M.A. and Plaunt, C. (1993). Subtopic Structuring for Full-Length Document Access. In Proceedings of the 16th Annual International Conference on Research and Development in Information Retrieval of the Association for Computing Machinery Special Interest Group on Information Retrieval (ACM-SIGIR) (pp. 59–68).

  • Hearst, M.A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics (pp. 539–545).

  • Heinonen, O. (1998). Optimal Multi-Paragraph Text Segmentation by Dynamic Programming. In Proceedings of 17th International Conference on Computational Linguistics (COLING-ACL'98) (pp. 1484–1486).

  • Hirschberg, J. and Litman, D. (1993). Empirical Studies on the Disambiguation of Cue Phrases. Computational Linguistics, 19, 501–530.

  • Kan, M., Klavans, J.L., and McKeown, K.R. (1998). Linear Segmentation and Segment Significance. In Proceedings of the 6th International Workshop of Very Large Corpora (pp. 197–205).

  • Kan, M.-Y. (2000). Combining Visual Layout and Lexical Cohesion Features for Text Segmentation. Tech. Rep. CUCS-002-01, Dept. of Computer Science, Columbia University.

  • Kozima, H. (1993). Text Segmentation Based on Similarity Between Words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (pp. 286–288).

  • Kozima, H. and Furugori, T. (1993). Similarity Between Words Computed by Spreading Activation on an English Dictionary. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (pp. 232–239).

  • Kozima, H. and Furugori, T. (1994). Segmenting Narrative Text into Coherent Scenes. Literary and Linguistic Computing, 9, 13–19.

  • Mittendorf, E. and Schäuble, P. (1996). Document and Passage Retrieval Based on Hidden Markov Models. In Proceedings of the 19th Annual International Association for Computing Machinery Special Interest Group on Information Retrieval (ACM/SIGIR) Conference on Research and Development in Information Retrieval (pp. 318–327).

  • Morris, J. (1988). Lexical Cohesion, the Thesaurus and the Structure of the Text. Tech. Rep. CSRI-219, Computer Systems Research Institute, University of Toronto.

  • Morris, J. and Hirst, G. (1991). Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics, 17, 21–42.

  • Passonneau, R. and Litman, D.J. (1993). Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of the 31st Meeting of the Association for Computational Linguistics (pp. 148–155).

  • Passonneau, R. and Litman, D.J. (1996). Empirical Analysis of Three Dimensions of Spoken Discourse: Segmentation, Coherence and Linguistic Devices. In E.H. Hovy and D.R. Scott (Eds.), Computational and Conversational Discourse: Burning Issues-An Interdisciplinary Account, Springer (pp. 161–194).

  • Ponte, J.M. and Croft, W.B. (1997). Text Segmentation by Topic. In Proceedings of the 1st European Conference on Research and Advanced Technology for Digital Libraries (pp. 120–129).

  • Porter, M.F. (1980). An Algorithm for Suffix Stripping. Program, 14, 130–137.

  • Reynar, J.C. (1998). Topic Segmentation: Algorithms and Applications. Ph.D. Thesis, Dept. of Computer Science, Univ. of Pennsylvania.

  • Reynar, J.C. (1994). An Automatic Method of Finding Topic Boundaries. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 331–333).

  • Reynar, J.C. and Ratnaparkhi, A. (1997). A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the 5th Conference on Applied Natural Language Processing (pp. 16–19).

  • Roget, P.M. (1911). Roget's International Thesaurus, 1st ed. Cromwell.

  • Roget, P.M. (1977). Roget's International Thesaurus, 4th ed. Harper and Row.

  • Utiyama, M. and Isahara, H. (2001). A Statistical Model for Domain-Independent Text Segmentation. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (pp. 491–498).

  • Xu, J. and Croft, W.B. (1996). Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International Association for Computing Machinery Special Interest Group on Information Retrieval (ACM/SIGIR) Conference on Research and Development in Information Retrieval (pp. 4–11).

  • Yaari, Y. (1997). Segmentation of Expository Texts by Hierarchical Agglomerative Clustering. In Proceedings of the Conference on Recent Advances in Natural Language Processing (pp. 59–65).

  • Yaari, Y. (1999). Intelligent Exploration of Expository Texts. Ph.D. Thesis, Dept. of Computer Science, Bar-Ilan University.

  • Yamron, J., Carp, I., Gillick, L., Lowe, S., and van Mulbregt, P. (1999). Topic Tracking in a News Stream. In Proceedings of DARPA Broadcast News Workshop (pp. 133–136).

  • Youmans, G. (1990). Measuring Lexical Style and Competence: The Type-Token Vocabulary Curve. Style, 24, 584–599.

  • Youmans, G. (1991). A New Tool for Discourse Analysis: The Vocabulary Management Profile. Language, 67, 763–789.

Cite this article

Fragkou, P., Petridis, V. & Kehagias, A. A Dynamic Programming Algorithm for Linear Text Segmentation. Journal of Intelligent Information Systems 23, 179–197 (2004). https://doi.org/10.1023/B:JIIS.0000039534.65423.00

  • DOI: https://doi.org/10.1023/B:JIIS.0000039534.65423.00