Language Technology for Cultural Heritage pp 135-153
Automatic Pragmatic Text Segmentation of Historical Letters
- First Online:
In this investigation we aim to reduce the manual workload by automatic processing of the corpus of historical letters for pragmatic research. We focus on two consecutive sub tasks: the first task is automatic text segmentation of the letters in formal/informal parts using a statistical n-gram based technique. As a second task we perform semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity due to the small size of the data set and enlarged by the spelling variation present in the historical letters. We try to address the latter problem with a dictionary look up and edit distance text normalization step. We achieve results of 86% micro-averaged F-score for the text segmentation task and 66.3% for the semantic labeling task. Even though these scores are not high enough to completely replace the manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such small data set is feasible.
Keywordshistorical text text segmentation semantic labeling text normalization
Unable to display preview. Download preview PDF.