Classifying Written Texts Through Rhythmic Features
Rhythm analysis of written texts has focused on literary analysis and has mainly considered poetry. In this paper we investigate the relevance of rhythmic features for categorizing prose texts belonging to different genres. Our contribution is threefold. First, we define a set of rhythmic features for written texts. Second, we extract these features from three corpora: speeches, essays, and newspaper articles. Third, we perform feature selection by means of statistical analyses and determine a subset of features that efficiently discriminates between the three genres. We find that, using as few as eight rhythmic features, documents can be assigned to a given genre with an accuracy of around 80 %, significantly higher than the 33 % baseline that results from random assignment among three genres.
Keywords: Rhythm, Text classification, Natural language processing, Discourse analysis
The work presented in this paper was partially funded by the EC H2020 project RAGE (Realising an Applied Gaming Eco-System, http://www.rageproject.eu/), Grant agreement No 644187.
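The abstract does not list the rhythmic features themselves, so the following is only a minimal sketch of what feature extraction for prose might look like. The three features computed here (mean sentence length, sentence length variability, and commas per word) and the function names `rhythmic_features` and `chance_accuracy` are assumptions chosen for illustration; they are not the feature set defined in the paper. The second function reproduces the random-assignment baseline for equally likely genres.

```python
import re
from statistics import mean, pstdev


def rhythmic_features(text):
    """Return a few illustrative (hypothetical) rhythm-like features of a prose text."""
    # Split into sentences on terminal punctuation (a crude heuristic).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words per sentence, used here as a simple proxy for rhythmic unit length.
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    lengths = [n for n in lengths if n > 0]
    words = sum(lengths)
    return {
        "mean_sentence_length": mean(lengths) if lengths else 0.0,
        # Population standard deviation of sentence lengths (variability of the "beat").
        "sentence_length_std": pstdev(lengths) if lengths else 0.0,
        # Commas per word, a rough indicator of pause frequency.
        "commas_per_word": text.count(",") / words if words else 0.0,
    }


def chance_accuracy(n_classes):
    """Expected accuracy of uniform random assignment over equally likely classes."""
    return 1.0 / n_classes


if __name__ == "__main__":
    sample = "We shall fight. We shall never surrender, never."
    print(rhythmic_features(sample))
    print(round(chance_accuracy(3), 2))
```

With three genres, `chance_accuracy(3)` gives one third, which corresponds to the 33 % baseline quoted in the abstract. Feature vectors of this kind could then be passed to any standard classifier; the choice of classifier is left open here because the abstract does not name one.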