Subtopic Segmentation of Scientific Texts: Parameter Optimisation

  • Natalia Avdeeva
  • Galina Artemova
  • Kirill Boyarsky
  • Natalia Gusarova
  • Natalia Dobrenko
  • Eugeny Kanevsky
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 518)


Searching for information within a scientific text requires automatically partitioning the document into subtopics while taking the specifics of the text and the user's purposes into account. This task is important for selecting primary sources, for working with texts in foreign languages, and for getting acquainted with a research problem. This paper focuses on applying subtopic segmentation algorithms to real-life scientific texts. For this study we use monographs on the same subject written in three languages. The corpus includes several original and professionally translated fragments. The research is based on the TextTiling algorithm, which analyses how tightly adjacent parts of the text cohere. We examine how several parameters (the cutoff rate, the size of the moving window, and the shift from one block to the next) influence segmentation quality, and we determine the optimal combinations of these parameters for several languages. Experiments on Russian suggest that external lexical resources notably improve segmentation quality.
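
As a rough illustration of how these three parameters interact, the following Python sketch runs a simplified TextTiling-style pass over a list of tokens. It is not the authors' implementation: the function names, the cosine similarity over raw token counts, the restriction to strict valleys of the cohesion curve, the threshold rule `mean - cutoff * std`, and the default values are all assumptions made for this example. The arguments `window`, `shift` and `cutoff` stand in for the window size, the block shift and the cutoff rate named above.

```python
import math
import statistics
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(tokens, window, shift):
    """Cohesion at each candidate gap: similarity of the `window`
    tokens before the gap and the `window` tokens after it."""
    gaps, scores = [], []
    for gap in range(window, len(tokens) - window + 1, shift):
        left = Counter(tokens[gap - window:gap])
        right = Counter(tokens[gap:gap + window])
        gaps.append(gap)
        scores.append(cosine(left, right))
    return gaps, scores


def depth_scores(scores):
    """Depth of each valley: drop from the nearest peak on the left
    plus drop from the nearest peak on the right. Gaps that are not
    strict valleys get depth 0 (a simplification for this sketch)."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        left_drop = scores[left] - s
        right_drop = scores[right] - s
        is_valley = left_drop > 0 and right_drop > 0
        depths.append(left_drop + right_drop if is_valley else 0.0)
    return depths


def boundaries(tokens, window=20, shift=10, cutoff=0.5):
    """Return token offsets of hypothesised subtopic boundaries.
    The threshold rule (mean - cutoff * std) is an assumption."""
    gaps, scores = gap_scores(tokens, window, shift)
    if not scores:
        return []
    depths = depth_scores(scores)
    threshold = statistics.mean(depths) - cutoff * statistics.pstdev(depths)
    return [g for g, d in zip(gaps, depths) if d > 0 and d >= threshold]


# Toy input: two blocks of 40 tokens with disjoint vocabularies.
topic_a = ["cell", "gene", "protein", "dna", "rna"] * 8
topic_b = ["bank", "loan", "rate", "debt", "fund"] * 8
print(boundaries(topic_a + topic_b, window=10, shift=5, cutoff=0.5))
```

In this sketch, a larger `window` smooths the cohesion curve, a smaller `shift` tests more candidate gaps, and a larger `cutoff` lowers the threshold so that shallower valleys are also accepted as boundaries.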


Keywords: Text tiling, Classification, Parsing, Segmentation





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Natalia Avdeeva (1)
  • Galina Artemova (2)
  • Kirill Boyarsky (2)
  • Natalia Gusarova (2)
  • Natalia Dobrenko (2)
  • Eugeny Kanevsky (1)

  1. Saint Petersburg Institute for Economics and Mathematics, Russian Academy of Sciences, Saint Petersburg, Russia
  2. Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University), Saint Petersburg, Russia
