Modeling Information in Textual Data Combining Labeled and Unlabeled Data

  • Dunja Mladenić
Conference paper

DOI: 10.1007/3-540-45728-3_13

Volume 2447 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Mladenić D. (2002) Modeling Information in Textual Data Combining Labeled and Unlabeled Data. In: Hand D.J., Adams N.M., Bolton R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg

Abstract

The paper describes two approaches to modeling word normalization (such as replacing “wrote” or “writing” with “write”), based on recurring patterns in the word suffix and in the context of the word obtained from texts. To collect the patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word contexts. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels, and then enlarge the training set with the examples with predicted labels for which the model is the most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled rather than unlabeled examples, we give a comparison with the situation in which more labeled data is simply provided.
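
The loop described in the abstract (train on labeled pairs, label the unlabeled words, keep only the most certain predictions, retrain) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the learning algorithm, so a simple suffix-count model stands in for the suffix-based model, and the names `transformation`, `train`, `predict`, `self_train`, the suffix-length cap `MAX_SUFFIX`, and the smoothed confidence score are all hypothetical. The context-based feature set is not modeled here.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 4  # assumed cap on suffix length; not taken from the paper


def transformation(word, lemma):
    """Encode a (word, normalized word) pair as (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def train(pairs):
    """For every word suffix, count the transformations observed with it."""
    counts = defaultdict(Counter)
    for word, lemma in pairs:
        rule = transformation(word, lemma)
        for k in range(1, min(MAX_SUFFIX, len(word)) + 1):
            counts[word[-k:]][rule] += 1
    return counts


def predict(counts, word):
    """Return (normalized word, confidence), or (None, 0.0) if no suffix matches."""
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):  # longest suffix first
        seen = counts.get(word[-k:])
        if not seen:
            continue
        (strip, add), n = seen.most_common(1)[0]
        if word.endswith(strip):
            # Smoothed relative frequency, so better-supported rules score higher.
            confidence = n / (sum(seen.values()) + 1)
            return word[: len(word) - len(strip)] + add, confidence
    return None, 0.0


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently labeled words to the training set."""
    labeled, pool = list(labeled), list(dict.fromkeys(unlabeled))
    for _ in range(rounds):
        counts = train(labeled)
        scored = []
        for word in pool:
            lemma, confidence = predict(counts, word)
            if lemma is not None:
                scored.append((confidence, word, lemma))
        scored.sort(key=lambda item: item[0], reverse=True)
        chosen = scored[:per_round]
        if not chosen:
            break
        for _, word, lemma in chosen:
            labeled.append((word, lemma))
            pool.remove(word)
    return train(labeled), labeled
```

For example, starting from the pairs (writing, write), (making, make), (walked, walk), (talked, talk), calling `self_train` with the unlabeled words "taking" and "jumped" adds (jumped, jump) and (taking, take) to the training set. Because wrongly predicted labels are added in the same way as correct ones, errors can reinforce themselves, which is one plausible reading of why the abstract reports that the suffix-based approach is hurt.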

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Dunja Mladenić
    • 1
  1. J. Stefan Institute, Ljubljana, Slovenia, and Carnegie Mellon University, Pittsburgh, USA