Machine Learning, Volume 75, Issue 3, pp 297–325

Search-based structured prediction

Abstract

We present Searn, an algorithm for integrating search and learning to solve complex structured prediction problems such as those that occur in natural language, speech, computational biology, and vision. Searn is a meta-algorithm that transforms these complex problems into simple classification problems to which any binary classifier may be applied. Unlike current algorithms for structured learning that require decomposition of both the loss function and the feature functions over the predicted structure, Searn is able to learn prediction functions for any loss function and any class of features. Moreover, Searn comes with a strong, natural theoretical guarantee: good performance on the derived classification problems implies good performance on the structured prediction problem.
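To make the reduction described above concrete, the sketch below shows the general shape of a Searn-style training loop. It is only illustrative, not the authors' implementation: `optimal_policy`, `train_classifier`, `rollout_cost`, `enumerate_states`, and `actions` are hypothetical, problem-specific hooks that a user would supply.

```python
import random

def searn_train(examples, optimal_policy, train_classifier, rollout_cost,
                enumerate_states, actions, n_iterations=5, beta=0.3):
    """Illustrative Searn-style loop; all callables are assumed interfaces.

    optimal_policy:   state -> action, built from the true training outputs
    train_classifier: list of (state, {action: cost}) -> (state -> action)
    rollout_cost:     (state, action, policy) -> loss of completing the
                      structure with `policy` after taking `action`
    enumerate_states: (example, policy) -> states visited when rolling in
    actions:          state -> available actions
    """
    # The current policy is a stochastic mixture; start from the policy
    # defined by the true outputs.
    mixture = [(1.0, optimal_policy)]

    def sample_policy(state):
        # Sample one component of the mixture in proportion to its weight.
        r, acc = random.random(), 0.0
        for w, pi in mixture:
            acc += w
            if r <= acc:
                return pi(state)
        return mixture[-1][1](state)

    for _ in range(n_iterations):
        cs_examples = []
        for x in examples:
            # Roll in with the current policy to decide which states to visit.
            for state in enumerate_states(x, sample_policy):
                # The cost of each action is the task loss incurred when the
                # current policy finishes the structure after that action.
                costs = {a: rollout_cost(state, a, sample_policy)
                         for a in actions(state)}
                cs_examples.append((state, costs))
        # Reduce to cost-sensitive classification and learn a new policy.
        h_new = train_classifier(cs_examples)
        # Interpolate: blend the new classifier in with probability beta.
        mixture = [(w * (1 - beta), pi) for w, pi in mixture]
        mixture.append((beta, h_new))

    # At test time the optimal-policy component would be dropped (and the
    # remaining weights renormalized), since true outputs are unavailable.
    return mixture
```

The per-state rollout costs are what turn an arbitrary structured loss into ordinary cost-sensitive classification, and the geometric interpolation step is what gradually shifts weight from the true-output policy to the learned classifiers.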

Keywords

Structured prediction · Search · Reductions

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. School of Computing, University of Utah, Salt Lake City, USA
  2. Yahoo! Research Labs, New York, USA
  3. Information Sciences Institute, Marina del Rey, USA