Stylistic Changes for Temporal Text Classification

  • Sanja Štajner
  • Marcos Zampieri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8082)

Abstract

This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17th to the early 20th century and presents a supervised method to classify them per century. Four stylistic features – average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR) – were automatically extracted for each sub-corpus. The initial analysis of diachronic changes in these four features revealed that the texts written in the 17th and 18th centuries have similar AWL, LD and LR, which differ significantly from those in the texts written in the 19th and 20th centuries. This information was later used in automatic classification of texts per century, leading to an F-Measure of 0.92.
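The abstract names the four features but does not define them. As a rough illustration only, the sketch below computes them for one text under assumed definitions: ASL as words per sentence, AWL as characters per word, LD as unique word forms over total words, and LR as unique lemmas over total words. The function name `stylistic_features`, the regex-based tokenization, and the caller-supplied `lemma_of` mapping are all hypothetical and stand in for whatever tokenizer and lemmatizer the authors actually used.

```python
import re


def stylistic_features(text, lemma_of=None):
    """Return ASL, AWL, LD and LR for one text (assumed definitions)."""
    # Naive sentence split on ., ! and ? (an assumption, not the paper's tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased runs of word characters; \w is Unicode-aware for str in Python 3.
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no words")

    # Hypothetical lemma lookup: words missing from the mapping stay unchanged.
    lemma_of = lemma_of or {}
    lemmas = [lemma_of.get(w, w) for w in words]

    return {
        "ASL": len(words) / len(sentences),               # words per sentence
        "AWL": sum(len(w) for w in words) / len(words),   # characters per word
        "LD": len(set(words)) / len(words),               # assumed: types / tokens
        "LR": len(set(lemmas)) / len(words),              # assumed: lemmas / tokens
    }


# Toy example with a hand-written lemma table.
toy_lemmas = {"come": "comer", "comem": "comer", "eles": "ele"}
features = stylistic_features("Ele come. Eles comem.", toy_lemmas)
```

For the toy input, the four word forms are all distinct, so LD is 1.0, while they reduce to two lemmas, so LR is 0.5. Per-text feature vectors of this kind could then be passed to any standard supervised classifier.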

Keywords

text classification; stylistic changes; historical corpora; Portuguese

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Sanja Štajner (1)
  • Marcos Zampieri (2)
  1. Research Group in Computational Linguistics, University of Wolverhampton, UK
  2. Romance Philology Department, University of Cologne, Germany
