
Analysis of loss functions for fast single-class classification

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

We consider neural network training in applications with many possible classes, where at test time the task is binary: determining whether a given example belongs to a specific class. We define the single logit classification (SLC) task: training the network so that at test time, it is possible to accurately identify whether an example belongs to a given class in a computationally efficient manner, based only on the output logit for this class. We propose a natural principle, the Principle of Logit Separation, as a guideline for choosing and designing losses suitable for the SLC task. We show that the cross-entropy loss function is not aligned with the Principle of Logit Separation. In contrast, there are known loss functions, as well as novel batch loss functions that we propose, which are aligned with this principle. Our experiments show that in almost all cases, losses aligned with the Principle of Logit Separation obtain at least a 20% relative accuracy improvement in the SLC task compared to losses that are not aligned with it, and sometimes considerably more. Furthermore, we show that fast SLC does not cause any drop in binary classification accuracy compared to standard classification in which all logits are computed, and it yields a speedup that grows with the number of classes.
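
To make the test-time setting concrete, here is a minimal sketch (not from the paper) of the difference between full classification and SLC inference with a linear output layer; the names W, b, h, c and the threshold tau are illustrative placeholders:

import numpy as np

# Illustrative only: standard classification computes all k logits,
# whereas SLC thresholds the single logit of the class of interest.
rng = np.random.default_rng(0)
k, d = 10_000, 512                         # number of classes, hidden dimension
W, b = rng.normal(size=(k, d)), rng.normal(size=k)
h = rng.normal(size=d)                     # hidden representation of one test example

all_logits = W @ h + b                     # O(k * d) work: needed for softmax classification
c, tau = 3, 0.0                            # class of interest and decision threshold (placeholders)
slc_decision = (W[c] @ h + b[c]) >= tau    # O(d) work: the single-logit (SLC) test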



Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 826506 (sustAGE). Sivan Sabato was supported in part by the Israel Science Foundation (Grant No. 555/15).

Author information

Corresponding author

Correspondence to Gil Keren.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Omitted proofs: alignment with the Principle of Logit Separation

All the considered losses are a function of the output logits and the example labels. For a network model \(\theta \), denote the vector of logits it assigns to example x by \(z^\theta (x) = (z^\theta _1(x),\ldots ,z^\theta _k(x))\). When \(\theta \) and x are clear from context, we write \(z_j\) instead of \(z_j^\theta (x)\). Denote the logit output of the sample by \(S_\theta = ((z^\theta (x_1),y_1),\ldots ,(z^\theta (x_n),y_n))\). A loss function \(\ell :\cup _{n=1}^\infty ({\mathbb {R}}^k \times [k])^n \rightarrow {\mathbb {R}}_+\) assigns a loss to a training sample based on the output logits of the model and on the labels of the training examples. The goal of training is to find a model \(\theta \) which minimizes \(\ell (S_\theta ) \equiv \ell (S,\theta )\). In almost all the losses we study, the loss on the training sample is the sum over all examples of a loss defined on a single example: \(\ell (S_\theta ) \equiv \sum _{i=1}^n \ell (z^\theta (x_i),y_i)\), thus we only define \(\ell (z,y)\). We explicitly define \(\ell (S_\theta )\) below only when this is not the case.
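
As a minimal illustration of this notation (the helper names below are placeholders, not the paper's code), a loss that decomposes over examples is evaluated on the sample as follows:

def sample_loss(logit_vectors, labels, per_example_loss):
    # ell(S_theta) = sum_i ell(z^theta(x_i), y_i), for losses that
    # decompose over examples; the batch losses below are instead
    # defined directly on batches.
    return sum(per_example_loss(z, y) for z, y in zip(logit_vectors, labels))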

1.1 Self-normalization

We prove that the self-normalization loss satisfies the PoLS: let S be a training sample and \(\theta \) a neural network model, and consider an example \((x,y) \in S\). We consider the two terms of the loss in order. First, consider \(-\log (p_y)\). From the definition of \(p_y\) (Eq. 1) we have that

$$\begin{aligned} - \log (p_y) = \log \left( \sum _{j=1}^k e^{z_j-z_y}\right) = \log \left( 1+ \sum _{j\ne y} e^{z_j-z_y}\right) . \end{aligned}$$

Set \(\epsilon _0 := \log (1+e^{-2})\). Then, if \(-\log (p_y) < \epsilon _0\), we have \(\sum _{j\ne y} e^{z_j-z_y} \le e^{-2}\), which implies that (a) \(\forall j \ne y, z_j \le z_y - 2\) and (b) \(e^{z_y} \ge \sum _{j=1}^k e^{z_j}/(1+e^{-2}) \ge \frac{1}{2} \sum _{j=1}^k e^{z_j}.\) Next, consider the second term. There is an \(\epsilon _1 > 0\) such that if \(\log ^2 (\sum _{j=1}^k e^{z_j}) < \epsilon _1\) then (c) \(2e^{-1}< \sum _{j=1}^k e^{z_j} < e\), which implies \(e^{z_y} < e\) and hence (d) \(z_y < 1\). Now, let \(\theta \) be such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then \(\forall (x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). From (b) and (c), \(e^{-1}< \frac{1}{2} \sum _{j=1}^k e^{z_j} \le e^{z_y}\), hence \(z_y > -1\). Combining with (d), we get \(-1< z_y < 1\). Combined with (a), we get that for \(j \ne y\), \(z_j < -1\). To summarize, \(\forall (x,y),(x',y') \in S\) and \(\forall y'' \ne y'\), we have that \(z_y^\theta (x)> -1 > z_{y''}^\theta (x')\), implying PoLS alignment.
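
As a numerical companion to this argument (illustrative only; the paper's definition of the self-normalization loss may weight the second term by a coefficient, which is immaterial to the proof), the two terms can be computed as follows:

import numpy as np

def self_normalization_loss(z, y):
    # First term: -log p_y, the cross-entropy term with p_y from Eq. (1).
    log_partition = np.log(np.sum(np.exp(z)))
    first_term = log_partition - z[y]
    # Second term: squared log-partition, driving log sum_j e^{z_j} towards 0.
    return first_term + log_partition ** 2

# A near-minimizer for y = 0: the true logit lies in (-1, 1) and all
# other logits lie below -1, exactly the separation the proof derives.
z = np.array([0.05, -3.0, -4.0, -3.5])
print(self_normalization_loss(z, y=0))   # small value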

1.2 Noise-contrastive estimation

Recall the definition of the NCE loss from Eq. (8):

$$\begin{aligned} \ell (z,y) = -\log g_y - t\cdot {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] \end{aligned}$$

where \(g_j := (1 + t\cdot q(j)\cdot e^{-z_j})^{-1}\). We prove that the NCE loss satisfies the PoLS: \(g_j\) is monotonically increasing in \(z_j\). Hence, if the loss is small, then \(g_y\) is large and \(g_j\), for \(j \ne y\), is small. Formally, fix t and let S be a training sample. There is an \(\epsilon _0 > 0\) such that if \(-\log g_j \le \epsilon _0\), then \(z_j > 0\). Also, there is an \(\epsilon _1 > 0\) (which depends on q) such that if \(- {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] \le \epsilon _1\), then for every \(j \ne y\), \(-\log (1-g_j)\) is small enough so that \(z_j < 0\). Now, consider \(\theta \) such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then for every \((x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). This implies that for every \((x,y),(x',y') \in S\) and \(y'' \ne y'\), we have that \(z^\theta _y(x)> 0 > z^\theta _{y''}(x')\), thus this loss is aligned with the PoLS.
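
For concreteness, a direct transcription of Eq. (8) follows (with the expectation over the noise distribution q computed exactly rather than estimated from sampled noise classes; the example values are placeholders):

import numpy as np

def nce_loss(z, y, q, t):
    # g_j = (1 + t * q(j) * exp(-z_j))^{-1}, as defined below Eq. (8)
    g = 1.0 / (1.0 + t * q * np.exp(-z))
    # ell(z, y) = -log g_y - t * E_{j ~ q}[log(1 - g_j)]
    return -np.log(g[y]) - t * np.sum(q * np.log(1.0 - g))

q = np.full(4, 0.25)                   # noise distribution over k = 4 classes
z = np.array([4.0, -4.0, -4.0, -4.0])  # true logit above 0, false logits below 0
print(nce_loss(z, y=0, q=q, t=5))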

1.3 Binary cross-entropy

This loss is similar in form to the NCE loss: for \(g_j\) as in Eq. (8), \(g_j = \sigma (z_j - \ln (t \cdot q(j)))\). Since \(\sigma \) is monotonically increasing, the proof method for NCE carries over, and thus the binary cross-entropy loss satisfies the PoLS as well.
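
For completeness, the identity behind this reformulation, writing \(\sigma (a) := (1+e^{-a})^{-1}\):

$$\begin{aligned} \sigma \left( z_j - \ln (t \cdot q(j))\right) = \frac{1}{1 + e^{-z_j + \ln (t \cdot q(j))}} = \frac{1}{1 + t\cdot q(j)\cdot e^{-z_j}} = g_j. \end{aligned}$$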

1.4 Batch losses

Recall that the batch losses are defined as \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), where \(B_\theta \) is a random batch out of \(S_\theta \), and L is the batch cross-entropy loss \(L_c\) (Definition 2) or the batch max-margin loss \(L_m\) (Definition 3). If, in every batch separately, the true logits are greater than the false logits, then the PoLS is satisfied on the whole sample, since every pair of examples appears together in some batch. The following lemma formalizes this:

Lemma 1

If L is aligned with the PoLS, and \(\ell \) is defined by \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), then \(\ell \) is also aligned with the PoLS.

Proof

Let S be a training sample and \(\theta \) a neural network model. Since L is aligned with the PoLS, there is some \(\epsilon ' > 0\) such that if \(L(B_\theta ) < \epsilon '\), then for each \((x,y),(x',y') \in B\) and \(y'' \ne y'\) we have that \(z_y^\theta (x) > z_{y''}^\theta (x')\). Let \(\epsilon = \epsilon '/\left( {\begin{array}{c}n\\ m\end{array}}\right) \), and assume \(\ell (S_\theta ) < \epsilon \). Since L is nonnegative and there are \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \) batches of size m in S, this implies that for every batch B of size m, \(L(B_\theta ) < \epsilon '\). For any \((x,y),(x',y') \in S\), there is a batch B that includes both examples. Thus, for \(y'' \ne y'\), \(z_y^\theta (x) > z_{y''}^\theta (x')\). Since this holds for any two examples in S, \(\ell \) is also aligned with the PoLS. \(\square \)
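
Spelling out the counting step in the proof (using that L is nonnegative and that the expectation is uniform over the \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \) batches of size m), for any batch B and with the sum ranging over all batches \(B'\) of size m:

$$\begin{aligned} L(B_\theta ) \le \sum _{B'} L(B'_\theta ) = \left( {\begin{array}{c}n\\ m\end{array}}\right) \cdot {\mathbb {E}}_{B'}[L(B'_\theta )] = \left( {\begin{array}{c}n\\ m\end{array}}\right) \cdot \ell (S_\theta ) < \left( {\begin{array}{c}n\\ m\end{array}}\right) \cdot \epsilon = \epsilon '. \end{aligned}$$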

(1) Batch cross-entropy To show that the batch cross-entropy loss satisfies the PoLS, we show that \(L_c\) does, which by Lemma 1 implies the same for \(\ell \). By the continuity of \(\mathrm {KL}\), and since for discrete distributions \(\mathrm {KL}(P||Q) =0 \iff P \equiv Q\), there is an \(\epsilon > 0\) such that if \(L_c(B_\theta ) \equiv \mathrm {KL}(P_B || Q^\theta _B) < \epsilon \), then for all \(i,j\), \(|P_B(i,j) - Q^\theta _B(i,j)| < \frac{1}{2m}\). Therefore, for each example \((x,y) \in B\),

$$\begin{aligned} \frac{e^{z_y^\theta (x)}}{Z(B)} > \frac{1}{2m}, \qquad \text { and }\qquad \forall j \ne y, \quad \frac{e^{z_j^\theta (x)}}{Z(B)} < \frac{1}{2m}. \end{aligned}$$

It follows that for any two examples \((x,y),(x',y') \in B\) and any \(y'' \ne y'\), we have \(e^{z_y^\theta (x)}/Z(B)> \frac{1}{2m} > e^{z_{y''}^\theta (x')}/Z(B)\), and hence \(z_y^\theta (x) > z_{y''}^\theta (x')\). Therefore \(L_c\) satisfies the PoLS, which completes the proof.
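
As an illustrative sketch only (Definition 2 is not reproduced in this appendix, so the exact form below is an assumption consistent with the proof: \(P_B\) uniform over the m true (example, label) pairs and \(Q^\theta _B\) a softmax over all logits in the batch):

import numpy as np

def batch_cross_entropy(Z, y):
    # Z: (m, k) batch logits, y: (m,) true labels. Assumed form, see note above.
    m = Z.shape[0]
    Q = np.exp(Z) / np.sum(np.exp(Z))    # Q_B: softmax over all m*k (example, class) pairs
    # KL(P_B || Q_B) with P_B(i, y_i) = 1/m and P_B(i, j) = 0 for j != y_i
    return np.mean(np.log((1.0 / m) / Q[np.arange(m), y]))

Z = np.array([[3.0, -3.0, -3.0], [-3.0, 3.0, -3.0]])
y = np.array([0, 1])
print(batch_cross_entropy(Z, y))   # small when the true logits dominate the batch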

(2) Batch max-margin To show that the batch max-margin loss satisfies the PoLS, we show this for \(L_m\) and invoke Lemma 1. Set \(\epsilon = \gamma /m\). If \(L_m(B_\theta ) < \epsilon \), then \(\gamma -z_{+}^B + z_{-}^B < \gamma \), implying \(z_{+}^B > z_{-}^B\). Hence, any \((x,y),(x',y') \in B\) and \(y'' \ne y'\) satisfy \(z^\theta _y(x) \ge z_{+}^B > z_{-}^B \ge z^\theta _{y''}(x')\). Thus \(L_m\) is aligned with the PoLS, implying the same for \(\ell \).
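
Similarly, a hedged sketch of the batch max-margin loss (Definition 3 is not reproduced here; the form below, with \(z_{+}^B\) the smallest true logit and \(z_{-}^B\) the largest false logit in the batch, is an assumption chosen only to be consistent with the inequality used in the proof):

import numpy as np

def batch_max_margin(Z, y, gamma=1.0):
    # Z: (m, k) batch logits, y: (m,) true labels. Assumed form, see note above.
    m = Z.shape[0]
    true_logits = Z[np.arange(m), y]
    false_mask = np.ones_like(Z, dtype=bool)
    false_mask[np.arange(m), y] = False
    z_plus = true_logits.min()      # smallest true logit in the batch
    z_minus = Z[false_mask].max()   # largest false logit in the batch
    return max(0.0, gamma - z_plus + z_minus) / m

Z = np.array([[3.0, -3.0, -3.0], [-3.0, 3.0, -3.0]])
y = np.array([0, 1])
print(batch_max_margin(Z, y))   # zero once the margin gamma is met across the batch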

About this article

Cite this article

Keren, G., Sabato, S. & Schuller, B. Analysis of loss functions for fast single-class classification. Knowl Inf Syst 62, 337–358 (2020). https://doi.org/10.1007/s10115-019-01395-6

