# Output Fisher embedding regression

## Abstract

We investigate the use of Fisher vector representations in the output space in the context of structured and multiple output prediction. A novel, general and versatile method called *output Fisher embedding regression* is introduced. Based on probabilistic modeling of the training output data and the minimization of a Fisher loss, it requires solving a pre-image problem in the prediction phase. For Gaussian Mixture Models and State-Space Models, we show that the pre-image problem admits a closed-form solution with an appropriate choice of the embedding. Numerical experiments on a wide variety of tasks (time series prediction, multi-output regression and multi-class classification) highlight the relevance of the approach for learning under limited supervision, such as learning with a handful of examples per label and weakly supervised learning.
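
To make the pipeline described in the abstract more concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a diagonal-covariance Gaussian mixture whose parameters have already been fitted on the training outputs, and it takes the gradient of the log-likelihood with respect to the component means as the output embedding. Inputs are mapped to embeddings by a plain ridge regression with a Euclidean loss (the Fisher information normalisation is left out), and the pre-image problem is solved by searching a finite candidate set instead of using the closed-form solution the abstract refers to. All function names are hypothetical.

```python
import numpy as np


def gmm_responsibilities(Y, weights, means, variances):
    """Posterior p(k | y) under a diagonal-covariance GMM.

    Y: (n, d) outputs; weights: (K,); means, variances: (K, d).
    """
    diff = Y[:, None, :] - means[None, :, :]  # (n, K, d)
    log_prob = -0.5 * np.sum(
        diff**2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )  # (n, K)
    log_joint = np.log(weights) + log_prob
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)


def fisher_score_means(Y, weights, means, variances):
    """Fisher score: gradient of log p(y) w.r.t. the means, shape (n, K*d)."""
    resp = gmm_responsibilities(Y, weights, means, variances)
    diff = Y[:, None, :] - means[None, :, :]
    grad = resp[:, :, None] * diff / variances[None, :, :]
    return grad.reshape(len(Y), -1)


def fit_ridge(X, Phi, lam=1e-2):
    """Ridge regression from inputs X (n, p) to embedded outputs Phi (n, m)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Phi)


def preimage_by_search(Phi_hat, candidates, weights, means, variances):
    """Assumed pre-image step: pick the candidate with the closest embedding."""
    Phi_c = fisher_score_means(candidates, weights, means, variances)
    dist2 = ((Phi_hat[:, None, :] - Phi_c[None, :, :]) ** 2).sum(axis=2)
    return candidates[dist2.argmin(axis=1)]
```

A prediction for new inputs `X_new` would then be `preimage_by_search(X_new @ W, candidates, ...)` with `W = fit_ridge(X, fisher_score_means(Y, ...))`. The candidate search is only a generic fallback; its cost grows with the size of the candidate set, which is one reason a closed-form pre-image is attractive.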

## Keywords

Fisher vector, Structured output prediction, Output kernel regression, Small data regime, Weak supervision

## Notes

### Acknowledgements

The authors are very grateful to Slim Essid, Chloé Clavel (LTCI, Télécom ParisTech) and Zoltán Szabó (CMAP, Ecole Polytechnique) for fruitful discussions about this work. Moussab Djerrab is supported by the Télécom ParisTech Machine Learning for Big Data Chair.
