Learning from Partially Annotated Sequences

  • Eraldo R. Fernandes
  • Ulf Brefeld
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6911)

Abstract

We study sequential prediction models in cases where only fragments of the sequences are annotated with the ground-truth. The task does not match the standard semi-supervised setting and is highly relevant in areas such as natural language processing, where completely labeled instances are expensive and require editorial data. We propose to generalize the semi-supervised setting and devise a simple transductive loss-augmented perceptron to learn from inexpensive partially annotated sequences that could, for instance, be provided by laymen, the wisdom of the crowd, or even automatically. Experiments on mono- and cross-lingual named entity recognition tasks with automatically generated partially annotated sentences from Wikipedia demonstrate the effectiveness of the proposed approach. Our results show that learning from partially labeled data is never worse than standard supervised and semi-supervised approaches trained on data with the same ratio of labeled and unlabeled tokens.
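The paper's transductive loss-augmented perceptron is not reproduced here, but the core idea of learning from partially annotated sequences can be sketched as a perceptron-style update in which the "target" path is the highest-scoring labeling that agrees with the annotated fragments, found by a Viterbi pass with annotated positions clamped. The function and variable names (`constrained_viterbi`, `perceptron_update`) and the simple token-indexed weight table below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def constrained_viterbi(emit, trans, partial):
    """Viterbi decoding where annotated positions are clamped.

    emit: (T, K) emission scores; trans: (K, K) transition scores.
    partial: length-T list holding a gold label index, or None where
             the token is unannotated.
    """
    T, K = emit.shape
    NEG = -1e18                                  # effectively -infinity
    delta = np.full((T, K), NEG)
    back = np.zeros((T, K), dtype=int)
    first = [partial[0]] if partial[0] is not None else range(K)
    for k in first:
        delta[0, k] = emit[0, k]
    for t in range(1, T):
        allowed = [partial[t]] if partial[t] is not None else range(K)
        for k in allowed:                        # only consistent labels score
            scores = delta[t - 1] + trans[:, k]
            back[t, k] = int(np.argmax(scores))
            delta[t, k] = scores[back[t, k]] + emit[t, k]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def perceptron_update(W, trans, tokens, partial, lr=1.0):
    """One structured-perceptron step on a partially annotated sentence.

    W: (vocab, K) per-token emission weights (a toy feature map).
    Moves weights toward the best path consistent with the fragments
    and away from the unconstrained prediction.
    """
    emit = W[tokens]                                       # (T, K)
    T = len(tokens)
    pred = constrained_viterbi(emit, trans, [None] * T)    # free decoding
    target = constrained_viterbi(emit, trans, partial)     # clamped decoding
    for t in range(T):
        if pred[t] != target[t]:
            W[tokens[t], target[t]] += lr
            W[tokens[t], pred[t]] -= lr
    return pred, target
```

The clamping trick is what distinguishes this from ordinary supervised training: unannotated positions are filled in by the model itself, so every partially labeled sentence still yields a full target path to update against. The actual algorithm in the paper is additionally transductive and loss-augmented, which this sketch omits.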

Keywords

Hidden Markov Model; Unlabeled Data; Neural Information Processing System; Entity Recognition; Annotated Sequence



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Eraldo R. Fernandes, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
  • Ulf Brefeld, Yahoo! Research, Barcelona, Spain
