Machine Learning, Volume 107, Issue 8–10, pp 1229–1256

Output Fisher embedding regression

  • Moussab Djerrab
  • Alexandre Garcia
  • Maxime Sangnier
  • Florence d’Alché-Buc
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


Abstract

We investigate the use of Fisher vector representations in the output space in the context of structured and multiple-output prediction. We introduce a novel, general and versatile method called output Fisher embedding regression. Based on a probabilistic model of the training output data and the minimization of a Fisher loss, it requires solving a pre-image problem in the prediction phase. For Gaussian mixture models and state-space models, we show that the pre-image problem admits a closed-form solution for an appropriate choice of embedding. Numerical experiments on a wide variety of tasks (time-series prediction, multi-output regression and multi-class classification) highlight the relevance of the approach for learning under limited supervision, such as learning with only a handful of examples per label and weakly supervised learning.
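The pipeline described in the abstract (a probabilistic model of the outputs, regression onto their Fisher embeddings, then a pre-image step at prediction time) can be sketched as follows. This is a minimal illustration on toy data, not the paper's implementation: the Fisher score is taken with respect to the GMM means only, and a simple nearest-training-output search stands in for the closed-form pre-image solution derived in the paper. All data and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data: inputs X and 2-D outputs Y (hypothetical stand-ins for a real task).
X = rng.normal(size=(200, 5))
Y = X[:, :2] @ np.array([[1.0, 0.5], [0.3, 2.0]]) + 0.1 * rng.normal(size=(200, 2))

# 1) Probabilistic model of the OUTPUT space: a diagonal-covariance GMM.
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(Y)

def fisher_embed(Y):
    """Fisher score of the GMM w.r.t. the component means
    (one common choice of embedding; others are possible)."""
    gamma = gmm.predict_proba(Y)                       # (n, K) responsibilities
    diffs = Y[:, None, :] - gmm.means_[None, :, :]     # (n, K, d)
    scores = gamma[:, :, None] * diffs / gmm.covariances_[None, :, :]
    return scores.reshape(len(Y), -1)                  # (n, K*d)

# 2) Regress from inputs to output Fisher embeddings.
Phi = fisher_embed(Y)
reg = Ridge(alpha=1.0).fit(X, Phi)

# 3) Pre-image step: decode a predicted embedding back to an output.
#    Nearest neighbour over training outputs is used here for simplicity.
def predict(X_new):
    Phi_hat = reg.predict(X_new)
    dists = ((Phi_hat[:, None, :] - Phi[None, :, :]) ** 2).sum(axis=-1)
    return Y[dists.argmin(axis=1)]

Y_pred = predict(X[:5])
```

The nearest-neighbour decoding restricts predictions to outputs seen during training; the closed-form pre-image of the paper removes that restriction for the GMM and state-space cases.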


Keywords

Fisher vector · Structured output prediction · Output kernel regression · Small data regime · Weak supervision



Acknowledgements

The authors are very grateful to Slim Essid, Chloé Clavel (LTCI, Télécom ParisTech) and Zoltán Szabó (CMAP, École Polytechnique) for fruitful discussions about this work. Moussab Djerrab is supported by the Télécom ParisTech Machine Learning for Big Data Chair.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Moussab Djerrab (1)
  • Alexandre Garcia (1)
  • Maxime Sangnier (2)
  • Florence d’Alché-Buc (1)
  1. Télécom ParisTech, Université Paris-Saclay, Paris, France
  2. UPMC Univ Paris 06, CNRS, Sorbonne Universités, Paris, France
