
Margin-based active learning for structured predictions

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Margin-based active learning remains the most widely used active learning paradigm due to its simplicity and empirical successes. However, most works are limited to binary or multiclass prediction problems, thus restricting the applicability of these approaches to many complex prediction problems where active learning would be most useful. For example, machine learning techniques for natural language processing applications often require combining multiple interdependent prediction problems—generally referred to as learning in structured output spaces. In many such application domains, complexity is further managed by decomposing a complex prediction into a sequence of predictions where earlier predictions are used as input to later predictions—commonly referred to as a pipeline model. This work describes methods for extending existing margin-based active learning techniques to these two settings, thus increasing the scope of problems for which active learning can be applied. We empirically validate these proposed active learning techniques by reducing the annotated data requirements on multiple instances of synthetic data, a semantic role labeling task, and a named entity and relation extraction system.
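The selection criterion at the heart of margin-based active learning can be illustrated with a minimal sketch (this is not the authors' implementation; the linear model, the data, and the names `margin_scores` and `select_batch` are illustrative). For a multiclass linear classifier, the margin of an instance is taken as the gap between its two highest class scores, and the learner queries labels for the unlabeled instances with the smallest margins:

```python
import numpy as np

def margin_scores(weights, X):
    """Multiclass margin of a linear model: the gap between the two
    highest class scores for each row of X (smaller = more uncertain)."""
    scores = X @ weights.T                      # (n_instances, n_classes)
    top_two = np.sort(scores, axis=1)[:, -2:]   # two largest scores per row
    return top_two[:, 1] - top_two[:, 0]

def select_batch(weights, X_unlabeled, batch_size):
    """Query the batch_size unlabeled instances with the smallest margins."""
    return np.argsort(margin_scores(weights, X_unlabeled))[:batch_size]
```

Extending this idea to structured output spaces and pipeline models, as the paper does, amounts to defining an analogous margin over complete or partial structured assignments rather than over single labels.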


Notes

  1. \(I[\![ p ]\!]\) is an indicator function such that \(I[\![ p ]\!] = 1\) if p is true and 0 otherwise.

  2. Empirical discrepancies between the performance reported in this work and that of [54] are accounted for by the use of averaged Perceptron and smaller batch sizes during instance selection.

References

  1. Abney S (2002) Bootstrapping. In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 360–367

  2. Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141


  3. Anderson B, Moore A (2005) Active learning for hidden Markov models: objective functions and algorithms. In: Proceedings of the international conference on machine learning (ICML), pp 9–16

  4. Angluin D (1988) Queries and concept learning. Mach Learn 2(4):319–342


  5. Balcan M-F, Beygelzimer A, Langford J (2006) Agnostic active learning. In: Proceedings of the international conference on machine learning (ICML), pp 65–72

  6. Balcan M-F, Broder A, Zhang T (2007) Margin-based active learning. In: Proceedings of the annual ACM workshop on computational learning theory (COLT), pp 35–50

  7. Balcan MF, Hanneke S, Wortman J (2008) The true sample complexity of active learning. In: Proceedings of the annual ACM workshop on computational learning theory (COLT), pp 45–56

  8. Baldridge J, Osborne M (2004) Active learning and the total cost of annotation. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 9–16

  9. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. J Mach Learn Res 5:255–291


  10. Becker M (2008) Active learning: an explicit treatment of unreliable parameters. PhD thesis, University of Edinburgh

  11. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the annual ACM workshop on computational learning theory (COLT), pp 92–100

  12. Brinker K (2004) Active learning of label ranking functions. In: Proceedings of the international conference on machine learning (ICML), pp 129–136

  13. Bunescu RC (2008) Learning with probabilistic features for improved pipeline models. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 670–679

  14. Campbell C, Cristianini N, Smola A (2000) Query learning with large margin classifiers. In: Proceedings of the international conference on machine learning (ICML), pp 111–118

  15. Carreras X, Marquez L (2004) Introduction to the CoNLL-2004 shared task: semantic role labeling. In: Proceedings of the annual conference on computational natural language learning (CoNLL)

  16. Castro RM, Nowak RD (2007) Minimax bounds for active learning. In: Proceedings of the Annual ACM workshop on computational learning theory (COLT), pp 5–19

  17. Chan YS, Ng HT (2007) Domain adaptation with active learning for word sense disambiguation. In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 49–56

  18. Chang M-W, Do Q, Roth D (2006) Multilingual dependency parsing: a pipeline approach. In: Recent advances in natural language processing. Springer, Berlin, pp 195–204

  19. Chang M-W, Ratinov L, Rizzolo N, Roth D (2008) Learning and inference with constraints. In: Proceedings of the national conference on artificial intelligence (AAAI), pp 1513–1518

  20. Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–222


  21. Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. J Artif Intell Res 4:129–145


  22. Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 1–8

  23. Culotta A, McCallum A (2005) Reducing labeling effort for structured prediction tasks. In: Proceedings of the national conference on artificial intelligence (AAAI), pp 746–751

  24. Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Proceedings of the international conference on machine learning (ICML), pp 150–157

  25. Dasgupta S (2004) Analysis of a greedy active learning strategy. In: The conference on advances in neural information processing systems (NIPS), pp 337–344

  26. Dasgupta S, Hsu D, Monteleoni C (2007) A general agnostic active learning algorithm. In: The conference on advances in neural information processing systems (NIPS), vol 20, pp 353–360

  27. Dasgupta S, Kalai AT, Monteleoni C (2005) Analysis of perceptron-based active learning. In: Proceedings of the annual ACM workshop on computational learning theory (COLT), pp 249–263

  28. Daumé III H, Langford J, Marcu D (2009) Search-based structured prediction. Mach Learn 75(3):297–325


  29. Davis PC (2002) Stone soup translation: the linked automata model. PhD thesis, Ohio State University

  30. Donmez P, Carbonell J (2008) Optimizing estimated loss reduction for active sampling in rank learning. In: Proceedings of the international conference on machine learning (ICML), pp 248–255

  31. Donmez P, Carbonell JG, Bennett PN (2007) Dual strategy active learning. In: Proceedings of the European conference on machine learning (ECML), pp 116–127

  32. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley-Interscience, New York

  33. Finkel JR, Manning CD, Ng AY (2006) Solving the problem of cascading errors: approximate Bayesian inference for linguistic annotation pipelines. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 618–626

  34. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

  35. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296


  36. Godbole S, Harpale A, Sarawagi S, Chakrabarti S (2004) Document classification through interactive supervision of document and term labels. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD), pp 185–196

  37. Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the international conference on machine learning (ICML), pp 353–360

  38. Hanneke S (2007) Teaching dimension and the complexity of active learning. In: Proceedings of the annual ACM workshop on computational learning theory (COLT), pp 66–81

  39. Har-Peled S, Roth D, Zimak D (2002) Constraint classification for multiclass classification and ranking. In: The conference on advances in neural information processing systems (NIPS), pp 785–792

  40. Hinton G, Sejnowski TJ (1999) Unsupervised learning: foundations of neural computation. MIT Press, Cambridge

  41. Hwa R (2004) Sample selection for statistical parsing. Comput Linguist 30(3):253–276


  42. Kearns MJ, Schapire RE, Sellie LM (1994) Toward efficient agnostic learning. Mach Learn 17(2–3):115–141


  43. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML), pp 282–289

  44. Laws F, Schütze H (2008) Stopping criteria for active learning of named entity recognition. In: Proceedings of the international conference on computational linguistics (COLING), pp 465–472

  45. Luo T, Kramer K, Goldgof DB, Hall LO, Samson S, Remsen A, Hopkins T (2005) Active learning to recognize multiple types of plankton. J Mach Learn Res 6:589–613


  46. Nguyen HT, Smeulders A (2004) Active learning using pre-clustering. In: Proceedings of the international conference on machine learning (ICML), pp 623–630

  47. Och FJ, Tillmann C, Ney H (1999) Improved alignment models for statistical machine translation. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 20–28

  48. Olsson F (2009) A literature survey of active machine learning in the context of natural language processing. Technical report, Swedish Institute of Computer Science

  49. Punyakanok V, Roth D, Yih W-T, Zimak D (2005) Learning and inference over constrained output. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1124–1129

  50. Punyakanok V, Roth D, Yih W, Zimak D (2004) Semantic role labeling via integer linear programming inference. In: Proceedings of the international conference on computational linguistics (COLING)

  51. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco

  52. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286

  53. Rai P, Saha A, Daumé III H, Venkatasubramanian S (2010) Domain adaptation meets active learning. In: NAACL workshop on active learning for NLP (ALNLP)

  54. Roth D, Small K (2006) Margin-based active learning for structured output spaces. In: Proceedings of the European conference on machine learning (ECML), pp 413–424

  55. Roth D, Small K (2008) Active learning for pipeline models. In: Proceedings of the national conference on artificial intelligence (AAAI), pp 683–688

  56. Roth D, Small K, Titov I (2009) Sequential learning of classifiers for structured prediction problems. In: Proceedings of the international conference on artificial intelligence and statistics (AISTATS), pp 440–447

  57. Roth D, Yih W-T (2004) A linear programming formulation for global inference in natural language tasks. In: Proceedings of the annual conference on computational natural language learning (CoNLL), pp 1–8

  58. Roth D, Yih W-T (2005) Integer linear programming inference for conditional random fields. In: Proceedings of the international conference on machine learning (ICML), pp 737–744

  59. Roth D, Yih W-T (2007) Global inference for entity and relation identification via a linear programming formulation. In: Introduction to statistical relational learning

  60. Scheffer T, Wrobel S (2001) Active learning of partially hidden Markov models. In: Proceedings of the ECML/PKDD workshop on instance selection

  61. Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the international conference on machine learning (ICML), pp 839–846

  62. Sekine S, Sudo K, Nobata C (2002) Extended named entity hierarchy. In: Proceedings of the international conference on language resources and evaluation (LREC), pp 1818–1824

  63. Settles B (2009) Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison

  64. Settles B, Craven M (2008) An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 1069–1078

  65. Shen D, Zhang J, Su J, Zhou G, Tan C-L (2004) Multi-criteria-based active learning for named entity recognition. In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 589–596

  66. Small K (2005) Interactive learning protocols for natural language applications. PhD thesis, University of Illinois at Urbana-Champaign

  67. Tang M, Luo X, Roukos S (2002) Active learning for statistical natural language parsing. In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 120–127

  68. Taskar B, Guestrin C, Koller D (2003) Max-margin Markov networks. In: The conference on advances in neural information processing systems (NIPS)

  69. Thompson CA, Califf ME, Mooney RJ (1999) Active learning for natural language parsing and information extraction. In: Proceedings of the international conference on machine learning (ICML), pp 406–414

  70. Tomanek K, Hahn U (2009) Semi-supervised active learning for sequence labeling. In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 1039–1047

  71. Tomanek K, Wermter J, Hahn U (2007) An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In: Proceedings of the conference on empirical methods for natural language processing (EMNLP), pp 486–495

  72. Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66


  73. Tsochantaridis I, Hofmann T, Joachims T, Altun Y (2004) Support vector machine learning for interdependent and structured output spaces. In: Proceedings of the international conference on machine learning (ICML), pp 823–830

  74. Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142

  75. Vapnik VN (1999) The nature of statistical learning theory, 2nd edn. Springer, Berlin

  76. Vlachos A (2008) A stopping criterion for active learning. Comput Speech Lang 22(3):295–312


  77. Waterman DA (1986) A guide to expert systems. Addison-Wesley, Reading

  78. Yan R, Yang J, Hauptmann A (2003) Automatically labeling video data using multiclass active learning. In: Proceedings of the international conference on computer vision (ICCV), pp 516–523

  79. Zhu J, Wang H, Hovy EH (2008) Learning a stopping criterion for active learning for word sense disambiguation and text classification. In: Proceedings of the international joint conference on natural language processing (IJCNLP), pp 366–372

  80. Zhu J, Wang H, Hovy EH (2008) Multi-criteria-based strategy to stop active learning for data annotation. In: Proceedings of the international conference on computational linguistics (COLING), pp 1129–1136

  81. Zhu X (2005) Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison


Acknowledgments

The authors would like to thank Ming-Wei Chang, Alex Klementiev, Vasin Punyakanok, Nick Rizzolo, and the reviewers for their helpful comments regarding this work. This work has been partially funded by NSF grant ITR IIS-0428472, a research grant from Motorola Labs, DARPA funding under the Bootstrap Learning Program, and by MIAS, a DHS funded Center for Multimodal Information Access and Synthesis at UIUC.

Author information


Corresponding author

Correspondence to Kevin Small.


About this article

Cite this article

Small, K., Roth, D. Margin-based active learning for structured predictions. Int. J. Mach. Learn. & Cyber. 1, 3–25 (2010). https://doi.org/10.1007/s13042-010-0003-y

