Piecewise training for structured prediction

  • Open access
  • Published: 16 June 2009
  • Volume 77, pages 165–194 (2009)
  • Charles Sutton
  • Andrew McCallum

Abstract

A drawback of structured prediction methods is that parameter estimation requires repeated inference, which is intractable for general structures. In this paper, we present an approximate training algorithm called piecewise training (PW) that divides the factors into tractable subgraphs, which we call pieces, and trains each piece independently. Piecewise training can be interpreted as approximating the exact likelihood using belief propagation, and different ways of making this interpretation yield different insights into the method. We also present an extension to piecewise training, called piecewise pseudolikelihood (PWPL), designed for the case in which variables have large cardinality. On several real-world natural language processing tasks, piecewise training outperforms Besag’s pseudolikelihood and sometimes performs comparably to exact maximum likelihood. In addition, PWPL performs similarly to PW and outperforms standard pseudolikelihood, while being five to ten times more computationally efficient than batch maximum likelihood training.
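To make the two objectives concrete, the following is a minimal sketch (not the paper's implementation) of piecewise training and piecewise pseudolikelihood for a single pairwise factor template, assuming a shared K × K table W of pairwise log-potentials; the function names and the label_pairs representation are illustrative.

```python
import numpy as np

def pw_log_likelihood(W, label_pairs):
    """Piecewise (PW) objective for one pairwise factor template.

    W           -- (K, K) array of log-potentials over label pairs
    label_pairs -- observed (y_s, y_t) index pairs, one per factor (piece)

    Each piece is normalized locally over its own K*K assignments, so the
    objective decomposes over pieces and requires no global inference.
    """
    log_z_piece = np.logaddexp.reduce(W.ravel())  # local log-partition function
    return sum(W[i, j] - log_z_piece for i, j in label_pairs)

def pwpl_log_likelihood(W, label_pairs):
    """Piecewise pseudolikelihood (PWPL) objective.

    Within each piece, each variable is conditioned on the piece's other
    variable, so every local normalizer sums over only K labels rather than
    K*K assignments -- the source of the speedup when variables have large
    cardinality.
    """
    ll = 0.0
    for i, j in label_pairs:
        ll += W[i, j] - np.logaddexp.reduce(W[:, j])  # log p(y_s = i | y_t = j)
        ll += W[i, j] - np.logaddexp.reduce(W[i, :])  # log p(y_t = j | y_s = i)
    return ll
```

Maximizing either objective with a gradient-based optimizer trains the pieces independently; at test time the learned factors are recombined into the full model, on which standard decoding is run.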


References

  • Abbeel, P., Koller, D., & Ng, A. Y. (2005). Learning factor graphs in polynomial time and sample complexity. In Twenty-first conference on uncertainty in artificial intelligence (UAI05).

  • Bernal, A., Crammer, K., Hatzigeorgiou, A., & Pereira, F. (2007). Global discriminative learning for higher-accuracy computational gene prediction. PLoS Computational Biology, 3(3).

  • Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.

  • Besag, J. (1977). Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64(3), 616–618.

  • Choi, A., Chavira, M., & Darwiche, A. (2007). Node splitting: a scheme for generating upper bounds in Bayesian networks. In Conference on uncertainty in artificial intelligence (UAI).

  • Cox, D. R., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91, 729–737.

  • Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991.

  • Daumé III, H., & Marcu, D. (2005). Learning as search optimization: approximate large margin methods for structured prediction. In International conference on machine learning (ICML), Bonn, Germany.

  • Finkel, J. R., Manning, C. D., & Ng, A. Y. (2006). Solving the problem of cascading errors: approximate Bayesian inference for linguistic annotation pipelines. In Conference on empirical methods in natural language processing (EMNLP).

  • Freeman, W. T., Pasztor, E. C., & Carmichael, O. T. (2000). Learning low-level vision. International Journal of Computer Vision, 40(1), 24–57.

  • Freitag, D. (1998). Machine learning for information extraction in informal domains. PhD thesis, Carnegie Mellon University.

  • Ganapathi, V., Vickrey, D., Duchi, J., & Koller, D. (2008). Constrained approximate maximum entropy learning. In Conference on uncertainty in artificial intelligence (UAI).

  • Gidas, B. (1988). Consistency of maximum likelihood and pseudolikelihood estimators for Gibbs distributions. In W. Fleming & P.-L. Lions (Eds.), Stochastic differential systems, stochastic control theory and applications. New York: Springer.

  • Greiner, R., Guo, Y., & Schuurmans, D. (2005). Learning coordination classifiers. In International joint conference on artificial intelligence (IJCAI).

  • Huang, F., & Ogata, Y. (1999). Improvements of the maximum pseudo-likelihood estimators in various spatial statistical models. Journal of Computational and Graphical Statistics, 8(3), 510–530.

  • Hyvärinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18(10), 2283–2292.

  • Kakade, S., Teh, Y. W., & Roweis, S. (2002). An alternative objective function for Markovian fields. In Proceedings of the nineteenth international conference on machine learning.

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In International conference on machine learning (ICML).

  • Li, S. Z. (2001). Markov random field modeling in image analysis. New York: Springer.

  • Liang, P., Taskar, B., & Klein, D. (2006). Alignment by agreement. In Human language technology and North American association for computational linguistics (HLT/NAACL).

  • Liang, P., Klein, D., & Jordan, M. I. (2008). Agreement-based learning. In Advances in neural information processing systems (NIPS).

  • Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics, 221–239.

  • McCallum, A., & Wellner, B. (2005). Conditional models of identity uncertainty with application to noun coreference. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 905–912). Cambridge: MIT Press.

  • McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In International conference on machine learning (ICML) (pp. 591–598). San Francisco: Morgan Kaufmann.

  • McDonald, R., Crammer, K., & Pereira, F. (2005). Spanning tree methods for discriminative training of dependency parsers (Technical Report MS-CIS-05-11). University of Pennsylvania CIS.

  • Minka, T. (2001a). Expectation propagation for approximate Bayesian inference. In 17th conference on uncertainty in artificial intelligence (UAI) (pp. 362–369).

  • Minka, T. P. (2001b). The EP energy function and minimization schemes. http://research.microsoft.com/~minka/papers/ep/minka-ep-energy.pdf.

  • Minka, T. (2005). Divergence measures and message passing (Technical Report MSR-TR-2005-173). Microsoft Research.

  • Parise, S., & Welling, M. (2005). Learning in Markov random fields: an empirical study. In Joint Statistical Meeting (JSM2005).

  • Punyakanok, V., Roth, D., Yih, W.-T., & Zimak, D. (2005). Learning and inference over constrained output. In Proceedings of the international joint conference on artificial intelligence (IJCAI) (pp. 1124–1129).

  • Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Conference on empirical methods in natural language processing (EMNLP).

  • Rosen-Zvi, M., Yuille, A. L., & Jordan, M. I. (2005). The DLR hierarchy of approximate inference. In Conference on uncertainty in artificial intelligence (UAI).

  • Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision (ECCV).

  • Stern, D. H., Graepel, T., & MacKay, D. J. C. (2005). Modelling uncertainty in the game of go. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 1353–1360). Cambridge: MIT Press.

  • Sutton, C. (2008). Efficient training methods for conditional random fields. PhD thesis, University of Massachusetts.

  • Sutton, C., & McCallum, A. (2004). Collective segmentation and labeling of distant entities in information extraction. In ICML workshop on statistical relational learning and its connections to other fields.

  • Sutton, C., & McCallum, A. (2005). Piecewise training of undirected models. In Conference on uncertainty in artificial intelligence (UAI).

  • Sutton, C., & McCallum, A. (2007a). An introduction to conditional random fields for relational learning. In L. Getoor & B. Taskar (Eds.), Introduction to statistical relational learning. Cambridge: MIT Press.

  • Sutton, C., & McCallum, A. (2007b). Piecewise pseudolikelihood for efficient CRF training. In International conference on machine learning (ICML).

  • Sutton, C., & Minka, T. (2006). Local training and belief propagation (Technical Report TR-2006-121). Microsoft Research.

  • Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In International conference on machine learning (ICML).

  • Sutton, C., McCallum, A., & Rohanimanesh, K. (2007). Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research, 8, 693–723.

  • Taskar, B., Guestrin, C., & Koller, D. (2004a). Max-margin Markov networks. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems 16. Cambridge: MIT Press.

  • Taskar, B., Klein, D., Collins, M., Koller, D., & Manning, C. (2004b). Max-margin parsing. In Empirical methods in natural language processing (EMNLP04).

  • Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL.

  • Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. (2006). Accelerated training of conditional random fields with stochastic meta-descent. In International conference on machine learning (ICML) (pp. 969–976).

  • Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2002). A new class of upper bounds on the log partition function. In Uncertainty in artificial intelligence.

  • Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2003a). Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49(5), 1120–1146.

  • Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2003b). Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. In Ninth workshop on artificial intelligence and statistics.

  • Wellner, B., McCallum, A., Peng, F., & Hay, M. (2004). An integrated, conditional model of information extraction and coreference with application to citation graph construction. In 20th conference on uncertainty in artificial intelligence (UAI).

  • Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7), 2282–2312.

Author information

Author notes
  1. Charles Sutton

    Present address: Computer Science Division, University of California, Berkeley, CA, 94720, USA

Authors and Affiliations

  1. Department of Computer Science, University of Massachusetts, Amherst, MA, 01003, USA

    Charles Sutton & Andrew McCallum

Corresponding author

Correspondence to Charles Sutton.

Additional information

Editor: Charles Parker.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Sutton, C., McCallum, A. Piecewise training for structured prediction. Mach Learn 77, 165–194 (2009). https://doi.org/10.1007/s10994-009-5112-z

  • Received: 28 April 2008

  • Accepted: 13 April 2009

  • Published: 16 June 2009

  • Issue Date: December 2009

  • DOI: https://doi.org/10.1007/s10994-009-5112-z

Keywords

  • Graphical models
  • Conditional random fields
  • Local training
  • Belief propagation