Abstract
Information retrieval within a scientific text requires automatically partitioning the document into subtopics, taking into account both the specifics of the text and the user's purposes. This task is important for selecting primary sources, for working with texts in foreign languages and for getting acquainted with research problems. This paper focuses on applying subtopic segmentation algorithms to real-life scientific texts. To study this, we use monographs on the same subject written in three languages. The corpus includes several original and professionally translated fragments. The research is based on the TextTiling algorithm, which analyses how tightly adjacent parts of a text cohere. We examine how several parameters (the cutoff rate, the size of the moving window and the size of the shift from one block to the next) influence segmentation quality, and we define the optimal combinations of these parameters for several languages. The studies on Russian suggest that external lexical resources notably improve segmentation quality.
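To make the three parameters named in the abstract concrete, the following is a minimal sketch of a TextTiling-style segmenter. It is not the authors' implementation: the function names, the parameter names `window`, `shift` and `cutoff`, the default values and the regular-expression tokeniser are all assumptions made for illustration, and the sketch omits stop-word removal, stemming and smoothing.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(tokens, window, shift):
    """Compare the `window` tokens before and after each candidate gap.

    Candidate gaps are spaced `shift` tokens apart.
    """
    gaps, scores = [], []
    for pos in range(window, len(tokens) - window + 1, shift):
        left = Counter(tokens[pos - window:pos])
        right = Counter(tokens[pos:pos + window])
        gaps.append(pos)
        scores.append(cosine(left, right))
    return gaps, scores


def depth_scores(scores):
    """Depth of each similarity valley relative to the nearest peaks."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def segment(text, window=20, shift=10, cutoff=0.5):
    """Return token offsets of hypothesised subtopic boundaries.

    A gap becomes a boundary when its depth exceeds
    mean - cutoff * std of all depth scores (assumed cutoff rule).
    """
    tokens = re.findall(r"\w+", text.lower())  # assumed naive tokeniser
    gaps, scores = gap_scores(tokens, window, shift)
    if not scores:
        return []
    depths = depth_scores(scores)
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    threshold = mean - cutoff * std
    return [g for g, d in zip(gaps, depths) if d > threshold and d > 0]
```

A call such as `segment(text, window=10, shift=10, cutoff=0.5)` returns the token positions at which lexical cohesion drops sharply. In this sketch, a larger `cutoff` lowers the threshold and admits more boundaries, while `window` and `shift` trade off context size against the granularity of candidate gaps.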
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Avdeeva, N., Artemova, G., Boyarsky, K., Gusarova, N., Dobrenko, N., Kanevsky, E. (2015). Subtopic Segmentation of Scientific Texts: Parameter Optimisation. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and Semantic Web. KESW 2015. Communications in Computer and Information Science, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-319-24543-0_1
DOI: https://doi.org/10.1007/978-3-319-24543-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24542-3
Online ISBN: 978-3-319-24543-0
eBook Packages: Computer Science (R0)