Automatic Pragmatic Text Segmentation of Historical Letters

  • Iris Hendrickx
  • Michel Généreux
  • Rita Marquilhas
Conference paper

DOI: 10.1007/978-3-642-20227-8_8

Part of the book series Theory and Applications of Natural Language Processing (NLP)
Cite this paper as:
Hendrickx I., Généreux M., Marquilhas R. (2011) Automatic Pragmatic Text Segmentation of Historical Letters. In: Sporleder C., van den Bosch A., Zervanou K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg

Abstract

In this investigation we aim to reduce the manual workload by automatically processing a corpus of historical letters for pragmatic research. We focus on two consecutive subtasks. The first is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram-based technique. The second is semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, caused by the small size of the data set and aggravated by the spelling variation present in the historical letters. We try to address the latter problem with a text normalization step based on dictionary lookup and edit distance. We achieve a micro-averaged F-score of 86% for the text segmentation task and 66.3% for the semantic labeling task. Even though these scores are not high enough to completely replace manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
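The abstract does not spell out how the normalization step works, so the following is only a minimal sketch of one way a dictionary lookup combined with edit distance could be arranged. The function names, the `max_dist` threshold, and the toy lexicon are all assumptions made for illustration and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(token: str, lexicon: set[str], max_dist: int = 1) -> str:
    """Hypothetical normalizer: keep known tokens, otherwise map the token
    to the closest lexicon entry within max_dist edits, else leave it as-is."""
    if token in lexicon:
        return token
    best, best_dist = token, max_dist + 1
    for entry in sorted(lexicon):  # sorted so ties are broken deterministically
        d = edit_distance(token, entry)
        if d < best_dist:
            best, best_dist = entry, d
    return best


# Toy lexicon, invented for this example only.
lexicon = {"senhor", "carta", "merce"}
print(normalize("snhor", lexicon))  # one edit away from a lexicon entry
print(normalize("xyz", lexicon))    # nothing close enough, returned unchanged
```

A small threshold keeps the mapping conservative: with few training examples, wrongly merging two distinct words may cost more than leaving a spelling variant unnormalized.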

Keywords

  • historical text
  • text segmentation
  • semantic labeling
  • text normalization

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Iris Hendrickx (1)
  • Michel Généreux (1)
  • Rita Marquilhas (1)

  1. Centro de Linguística da Universidade de Lisboa, Lisboa, Portugal