Abstract
We consider neural network training in applications with many possible classes, where at test time the task is binary classification: determining whether a given example belongs to one specific class. We define the single logit classification (SLC) task: training the network so that at test time, it is possible to accurately and efficiently decide whether an example belongs to a given class, based only on the output logit for that class. We propose a natural principle, the Principle of Logit Separation, as a guideline for choosing and designing losses suitable for the SLC task. We show that the cross-entropy loss function is not aligned with the Principle of Logit Separation. In contrast, there are known loss functions, as well as novel batch loss functions that we propose, which are aligned with this principle. Our experiments show that in almost all cases, losses aligned with the Principle of Logit Separation obtain at least a 20% relative accuracy improvement in the SLC task over losses that are not aligned with it, and sometimes considerably more. Furthermore, we show that fast SLC does not cause any drop in binary classification accuracy compared to standard classification, in which all logits are computed, and yields a speedup which grows with the number of classes.
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 826506 (sustAGE). Sivan Sabato was supported in part by the Israel Science Foundation (Grant No. 555/15).
Appendices
Appendix
Omitted proofs: alignment with the Principle of Logit Separation
All the considered losses are a function of the output logits and the example labels. For a network model \(\theta \), denote the vector of logits it assigns to example x by \(z^\theta (x) = (z^\theta _1(x),\ldots ,z^\theta _k(x))\). When \(\theta \) and x are clear from context, we write \(z_j\) instead of \(z_j^\theta (x)\). Denote the logit output of the sample by \(S_\theta = ((z^\theta (x_1),y_1),\ldots ,(z^\theta (x_n),y_n))\). A loss function \(\ell :\cup _{n=1}^\infty ({\mathbb {R}}^k \times [k])^n \rightarrow {\mathbb {R}}_+\) assigns a loss to a training sample based on the output logits of the model and on the labels of the training examples. The goal of training is to find a model \(\theta \) which minimizes \(\ell (S_\theta ) \equiv \ell (S,\theta )\). In almost all the losses we study, the loss on the training sample is the sum over all examples of a loss defined on a single example: \(\ell (S_\theta ) \equiv \sum _{i=1}^n \ell (z^\theta (x_i),y_i)\), thus we only define \(\ell (z,y)\). We explicitly define \(\ell (S_\theta )\) below only when this is not the case.
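For intuition about why SLC is fast at test time: the logits \(z^\theta (x)\) are typically produced by a linear output layer over a hidden representation, so the single logit \(z^\theta _y(x)\) requires only one row of the output matrix. A minimal NumPy sketch (shapes and names are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 10000                      # hidden size, number of classes (illustrative)
h = rng.standard_normal(d)             # hidden representation of one example
W = rng.standard_normal((k, d))        # output weight matrix
b = rng.standard_normal(k)             # output biases

def all_logits(h):
    # standard classification: compute all k logits, cost O(k * d)
    return W @ h + b

def single_logit(h, y):
    # SLC at test time: only the logit for class y, cost O(d)
    return W[y] @ h + b[y]

y = 42
z = all_logits(h)
assert np.isclose(z[y], single_logit(h, y))
```

The per-example speedup of the single-logit computation grows linearly with the number of classes k.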
1.1 Self-normalization
We prove that the self-normalization loss satisfies the PoLS: fix a training sample S and a neural network model \(\theta \), and consider an example \((x,y) \in S\). We consider the two terms of the loss in order. First, consider \(-\log (p_y)\). From the definition of \(p_y\) (Eq. 1) we have that
\( -\log (p_y) = \log \big (1+\sum _{j\ne y} e^{z_j-z_y}\big ). \)
Set \(\epsilon _0 := \log (1+e^{-2})\). Then, if \(-\log (p_y) < \epsilon _0\), we have \(\sum _{j\ne y} e^{z_j-z_y} \le e^{-2}\), which implies that (a) \(\forall j \ne y, z_j \le z_y - 2\) and (b) \(e^{z_y} \ge \sum _{j=1}^k e^{z_j}/(1+e^{-2}) \ge \frac{1}{2} \sum _{j=1}^ke^{z_j}.\) Second, consider the second term. There is an \(\epsilon _1 > 0\) such that if \(\log ^2 (\sum _{j=1}^k e^{z_j}) < \epsilon _1\) then (c) \(2e^{-1}< \sum _{j=1}^k e^{z_j} < e\), which implies \(e^{z_y} < e\) and hence (d) \(z_y < 1\). Now, let \(\theta \) be such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then \(\forall (x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). From (b) and (c), \(e^{-1}< \frac{1}{2} \sum _{j=1}^k e^{z_j} \le e^{z_y}\), hence \(z_y > -1\). Combining with (d), we get \(-1< z_y < 1\). Combined with (a), we get that for \(j \ne y\), \(z_j < -1\). To summarize, \(\forall (x,y),(x',y') \in S\) and \(\forall y'' \ne y'\), we have that \(z_y^\theta (x)> -1 > z_{y''}^\theta (x')\), implying PoLS alignment.
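A numeric sanity check of this argument (a sketch: it assumes the per-example self-normalization loss is the sum of the two terms treated above, \(-\log (p_y) + \log ^2(\sum _{j=1}^k e^{z_j})\), and uses hand-picked logits):

```python
import numpy as np

def self_norm_loss(z, y):
    # assumed per-example self-normalization loss: the two terms the proof
    # analyzes, -log p_y plus the squared log-partition term
    log_Z = np.log(np.sum(np.exp(z)))
    return (log_Z - z[y]) + log_Z ** 2

eps0 = np.log(1 + np.exp(-2))   # threshold epsilon_0 from the proof

y = 0
# near-zero true logit, strongly negative false logits (hand-picked)
z = np.array([-0.001, -10.0, -10.0, -10.0, -10.0])
loss = self_norm_loss(z, y)
assert loss < eps0

# conclusions of the proof hold: true logit in (-1, 1), false logits below -1
assert -1 < z[y] < 1
assert all(z[j] < -1 for j in range(len(z)) if j != y)
```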
1.2 Noise-contrastive estimation
Recall the definition of the NCE loss from Eq. (8):
\( \ell (z,y) = -\log (g_y) - {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] , \)
where \(g_j := (1 + t\cdot q(j)\cdot e^{-z_j})^{-1}.\) We prove that the NCE loss satisfies the PoLS: \(g_j\) is monotonically increasing in \(z_j\). Hence, if the loss is small, \(g_y\) is large and \(g_j\), for \(j \ne y\), is small. Formally, fix t and a training sample S. There is an \(\epsilon _0 > 0\) such that if \(-\log g_y \le \epsilon _0\), then \(z_y > 0\). Also, there is an \(\epsilon _1 > 0\) (which depends on q) such that if \(- {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] \le \epsilon _1\) then \(\forall j \ne y\), \(-\log (1-g_j)\) must be small enough so that \(z_j < 0\). Now, consider \(\theta \) such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then for every \((x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). This implies that for every \((x,y),(x',y') \in S\) and \(y'' \ne y'\), we have that \(z^\theta _y(x)> 0 > z^\theta _{y''}(x')\), thus this loss is aligned with the PoLS.
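The following sketch illustrates the NCE argument numerically, with arbitrary hand-picked values for t, q, and the logits (and, for simplicity, the expectation over \(j \sim q\) approximated by a plain average over the false classes):

```python
import numpy as np

t = 5
k = 4
q = np.full(k, 1.0 / k)                  # uniform noise distribution (arbitrary)
z = np.array([3.0, -3.0, -3.0, -3.0])    # positive true logit, negative false logits
y = 0

# g_j as defined above; monotonically increasing in z_j
g = 1.0 / (1.0 + t * q * np.exp(-z))

# per-example NCE loss, with the expectation approximated over false classes
loss = -np.log(g[y]) - np.mean([np.log(1 - g[j]) for j in range(k) if j != y])
assert loss < 0.2                        # the loss is small for these logits

# as in the proof: small loss goes with g_y large, g_j small, z_y > 0 > z_j
assert g[y] > 0.5
assert all(g[j] < 0.5 for j in range(k) if j != y)
```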
1.3 Binary cross-entropy
This loss is similar in form to the NCE loss: for \(g_j\) as in Eq. (8), \(g_j = \sigma (z_j - \ln (t \cdot q(j)))\). Since \(\sigma \) is monotonic, the proof method for NCE carries over, and thus the binary cross-entropy loss satisfies the PoLS as well.
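The identity \(g_j = \sigma (z_j - \ln (t \cdot q(j)))\) is an exact algebraic rewriting, since \(1/(1+t\cdot q(j)\cdot e^{-z_j}) = 1/(1+e^{-(z_j - \ln (t\cdot q(j)))})\). A quick numerical confirmation (the values of t, q(j), and the logits are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

t = 5
q_j = 0.01
z_j = np.linspace(-4, 4, 9)   # a range of logit values

# g_j in the NCE form of Eq. (8)
g_nce = 1.0 / (1.0 + t * q_j * np.exp(-z_j))
# g_j in the binary cross-entropy (sigmoid) form
g_sigmoid = sigmoid(z_j - np.log(t * q_j))

# the two forms coincide, so the NCE alignment argument carries over
assert np.allclose(g_nce, g_sigmoid)
```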
1.4 Batch losses
Recall that the batch losses are defined as \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), where \(B_\theta \) is a random batch out of \(S_\theta \) and L is \(L_c\) for the batch cross-entropy (Definition 2) or \(L_m\) for the batch max-margin loss (Definition 3). If true logits are greater than false logits in every batch separately, then the PoLS is satisfied on the whole sample, since every pair of examples appears together in some batch. The following lemma formalizes this:
Lemma 1
If L is aligned with the PoLS, and \(\ell \) is defined by \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), then \(\ell \) is also aligned with the PoLS.
Proof
Assume a training sample S and a neural network model \(\theta \). Since L is aligned with the PoLS, there is some \(\epsilon ' > 0\) such that if \(L(B_\theta ) < \epsilon '\), then for each \((x,y),(x',y') \in B\) and \(y'' \ne y'\) we have that \(z_y^\theta (x) > z_{y''}^\theta (x')\). Let \(\epsilon = \epsilon '/\left( {\begin{array}{c}n\\ m\end{array}}\right) \), and assume \(\ell (S_\theta ) < \epsilon \). Since there are \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \) batches of size m in S, and the batch losses are non-negative, this implies that for every batch B of size m, \(L(B_\theta ) < \epsilon '\). For any \((x,y),(x',y') \in S\), there is a batch B that includes both examples. Thus, for \(y'' \ne y'\), \(z_y^\theta (x) > z_{y''}^\theta (x')\). Since this holds for any two examples in S, \(\ell \) is also PoLS-aligned. \(\square \)
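The counting step of the proof can be checked concretely: with non-negative batch losses, an average below \(\epsilon '/\left( {\begin{array}{c}n\\ m\end{array}}\right) \) forces every individual batch loss below \(\epsilon '\), and every pair of examples indeed shares a batch. A small sketch (n, m, and the loss values are arbitrary):

```python
from itertools import combinations
import random

n, m, eps_prime = 6, 3, 1.0
batches = list(combinations(range(n), m))
C = len(batches)                          # number of batches, C(n, m)

random.seed(0)
losses = [random.uniform(0, eps_prime / (2 * C)) for _ in batches]
avg = sum(losses) / C                     # plays the role of E_B[L(B_theta)]
assert avg < eps_prime / C

# since losses are non-negative, max(losses) <= sum(losses) = C * avg < eps'
assert max(losses) <= C * avg < eps_prime

# every pair of examples appears together in at least one batch
for i in range(n):
    for j in range(i + 1, n):
        assert any(i in B and j in B for B in batches)
```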
(1) Batch cross-entropy To show that the batch cross-entropy satisfies the PoLS, we show that \(L_c\) does, which by Lemma 1 implies this for \(\ell \). By the continuity of \(\mathrm {KL}\), and since for discrete distributions, \(\mathrm {KL}(P||Q) =0 \iff P \equiv Q\), there is an \(\epsilon > 0\) such that if \(L(B_\theta ) \equiv \mathrm {KL}(P_B || Q^\theta _B) < \epsilon \), then for all i, j, \(|P_B(i,j) - Q^\theta _B(i,j)| \le \frac{1}{2m}\). Therefore, for each example \((x,y) \in B\),
It follows that for any two examples \((x,y),(x',y') \in B\) and any \(y'' \ne y'\), \(z_y^\theta (x)> \frac{1}{2m} > z_{y''}^\theta (x')\). Therefore \(L_c\) satisfies the PoLS, which completes the proof.
(2) Batch max-margin To show that the batch max-margin loss satisfies the PoLS, we show this for \(L_m\) and invoke Lemma 1. Set \(\epsilon = \gamma /m\). If \(L(B_\theta ) < \epsilon \), then \(\gamma -z_{+}^B + z_{-}^B < \gamma \), implying \(z_{+}^B > z_{-}^B\). Hence, any \((x,y),(x',y') \in B\) and any \(y'' \ne y'\) satisfy \(z^\theta _y(x) \ge z_{+}^B > z_{-}^B \ge z^\theta _{y''}(x')\). Thus \(L_m\) is aligned with the PoLS, implying the same for \(\ell \).
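As a numerical sanity check (a sketch assuming the batch max-margin loss takes the hinge form \(\max (0, \gamma - z_{+}^B + z_{-}^B)\), with \(z_{+}^B\) the smallest true logit and \(z_{-}^B\) the largest false logit in the batch; logit values are hand-picked):

```python
import numpy as np

def batch_max_margin(z_batch, y_batch, gamma=1.0):
    # assumed hinge form of the batch max-margin loss over the gap between
    # the smallest true logit and the largest false logit in the batch
    m, k = z_batch.shape
    true_mask = np.zeros((m, k), dtype=bool)
    true_mask[np.arange(m), y_batch] = True
    z_plus = z_batch[true_mask].min()     # smallest true logit in the batch
    z_minus = z_batch[~true_mask].max()   # largest false logit in the batch
    return max(0.0, gamma - z_plus + z_minus), z_plus, z_minus

# batch of m=3 examples, k=4 classes, with well-separated logits
z = np.array([[ 2.0, -1.0, -1.5, -2.0],
              [-1.2,  2.5, -1.1, -3.0],
              [-2.0, -1.3,  1.8, -1.4]])
y = np.array([0, 1, 2])
gamma, m = 1.0, len(y)

loss, z_plus, z_minus = batch_max_margin(z, y, gamma)
# when the loss is below gamma/m, every true logit exceeds every false logit
assert loss < gamma / m
assert z_plus > z_minus
```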
Cite this article
Keren, G., Sabato, S. & Schuller, B. Analysis of loss functions for fast single-class classification. Knowl Inf Syst 62, 337–358 (2020). https://doi.org/10.1007/s10115-019-01395-6