Automatic Pragmatic Text Segmentation of Historical Letters

  • Iris Hendrickx
  • Michel Généreux
  • Rita Marquilhas
Conference paper
Part of the Theory and Applications of Natural Language Processing book series (NLP)


In this investigation we aim to reduce the manual workload of pragmatic research by automatically processing a corpus of historical letters. We focus on two consecutive subtasks. The first is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram based technique. The second is semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, which stems from the small size of the data set and is aggravated by the spelling variation present in the historical letters. We try to address the latter problem with a text normalization step based on dictionary lookup and edit distance. We achieve a micro-averaged F-score of 86% for the text segmentation task and 66.3% for the semantic labeling task. Even though these scores are not high enough to completely replace manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
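As an illustration of the normalization step mentioned in the abstract, the following is a minimal sketch of dictionary lookup combined with edit distance. It is not the authors' implementation: the function names, the distance threshold `max_dist`, the frequency-based tie-breaking, and the toy lexicon are all assumptions made for this example.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(token: str, lexicon: dict[str, int], max_dist: int = 2) -> str:
    """Map a historical spelling to a lexicon entry (hypothetical helper).

    The token is lowercased. If it is in the lexicon it is returned as is.
    Otherwise the closest entry within max_dist edits is returned, with ties
    broken in favour of the more frequent entry (an assumption of this
    sketch). If no entry is close enough, the lowercased token is returned.
    """
    word = token.lower()
    if word in lexicon:
        return word
    best, best_key = word, None
    for entry, freq in lexicon.items():
        dist = edit_distance(word, entry)
        if dist <= max_dist:
            key = (dist, -freq)
            if best_key is None or key < best_key:
                best, best_key = entry, key
    return best


# Toy lexicon with invented frequency counts, for illustration only.
LEXICON = {"muito": 120, "senhor": 95, "vossa": 80}

print(normalize("muyto", LEXICON))   # variant spelling one edit from "muito"
print(normalize("xyzzyq", LEXICON))  # nothing within two edits, kept as is
```

A small threshold keeps the lookup conservative: with sparse data, mapping an unknown word to a distant dictionary entry is likely to introduce more noise than it removes.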


Keywords: historical text, text segmentation, semantic labeling, text normalization





We would like to thank Mariana Gomes, Ana Rita Guilherme and Leonor Tavares for the manual annotation. We are grateful to João Paulo Silvestre for sharing his electronic version of the Bluteau Dictionary and frequency counts. This work is funded by the Portuguese Science Foundation, FCT (Fundação para a Ciência e a Tecnologia).



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Iris Hendrickx (1)
  • Michel Généreux (1)
  • Rita Marquilhas (1)
  1. Centro de Linguística da Universidade de Lisboa, Lisboa, Portugal
