Subtopic Segmentation of Scientific Texts: Parameter Optimisation

Part of the Communications in Computer and Information Science book series (CCIS, volume 518)

Abstract

Information search within a scientific text requires automatic partitioning of the document into subtopics, taking the specifics of the text and the user's purposes into account. This task is important for selecting primary sources, for working with texts in foreign languages, and for getting acquainted with research problems. This paper focuses on applying subtopic segmentation algorithms to real-life scientific texts. To study this, we use monographs on the same subject written in three languages. The corpus includes several original and professionally translated fragments. The research is based on the TextTiling algorithm, which analyses how tightly adjoining parts of the text cohere. We examine how several parameters (the cutoff rate, the size of the moving window, and the shift from one block to the next) influence segmentation quality, and we define the optimal combinations of these parameters for several languages. The studies on Russian suggest that external lexical resources notably improve segmentation quality.
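To make the three parameters named in the abstract concrete, the following is a minimal Python sketch of a TextTiling-style segmenter. It is not the authors' implementation: the function names (`cosine`, `gap_scores`, `depth_scores`, `segment`), the regex tokeniser, the default values, and the threshold rule "mean depth minus `cutoff` times the standard deviation" are all assumptions made for illustration. Stemming, stop-word removal, smoothing, and the external lexical resources mentioned for Russian are omitted.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_scores(tokens, window, step):
    """Compare the `window` tokens before and after each candidate gap.

    Gaps are placed every `step` tokens (the shift between blocks).
    Returns a list of (token_position, similarity) pairs.
    """
    scores = []
    for pos in range(window, len(tokens) - window + 1, step):
        left = Counter(tokens[pos - window:pos])
        right = Counter(tokens[pos:pos + window])
        scores.append((pos, cosine(left, right)))
    return scores


def depth_scores(sims):
    """Depth of each similarity valley relative to its neighbouring peaks."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1  # climb to the peak on the left
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1  # climb to the peak on the right
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def segment(text, window=20, step=10, cutoff=0.5):
    """Return token positions of hypothesised subtopic boundaries.

    Assumed rule: a gap is a boundary when its depth is positive and
    exceeds mean(depths) - cutoff * std(depths).
    """
    tokens = re.findall(r"\w+", text.lower())
    gaps = gap_scores(tokens, window, step)
    if not gaps:
        return []
    depths = depth_scores([sim for _, sim in gaps])
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    threshold = mean - cutoff * std
    return [pos for (pos, _), d in zip(gaps, depths)
            if d > 0 and d > threshold]
```

In this sketch, a larger `window` smooths the similarity curve, a smaller `step` yields more candidate gaps, and a larger `cutoff` lowers the threshold so that more gaps are accepted as boundaries. These are the kinds of trade-offs that a parameter search such as the one described in the abstract would explore.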

Keywords

  • Text tiling
  • Classification
  • Parsing
  • Segmentation

Chapter length: 13 pages

Author information

Corresponding author

Correspondence to Natalia Gusarova.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Avdeeva, N., Artemova, G., Boyarsky, K., Gusarova, N., Dobrenko, N., Kanevsky, E. (2015). Subtopic Segmentation of Scientific Texts: Parameter Optimisation. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and Semantic Web. KESW 2015. Communications in Computer and Information Science, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-319-24543-0_1

  • DOI: https://doi.org/10.1007/978-3-319-24543-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24542-3

  • Online ISBN: 978-3-319-24543-0

  • eBook Packages: Computer Science, Computer Science (R0)