Abstract
We consider neural network training in applications with many possible classes, where at test time the task is binary classification: determining whether a given example belongs to one specific class. We define the single logit classification (SLC) task: training the network so that at test time, it is possible to accurately and efficiently decide whether an example belongs to a given class, based only on the output logit for that class. We propose a natural principle, the Principle of Logit Separation, as a guideline for choosing and designing losses suitable for the SLC task. We show that the cross-entropy loss function is not aligned with the Principle of Logit Separation. In contrast, there are known loss functions, as well as novel batch loss functions that we propose, which are aligned with this principle. Our experiments show that in almost all cases, losses aligned with the Principle of Logit Separation obtain at least a 20% relative accuracy improvement in the SLC task over losses that are not aligned with it, and sometimes considerably more. Furthermore, we show that fast SLC does not cause any drop in binary classification accuracy compared to standard classification, in which all logits are computed, and yields a speedup which grows with the number of classes.
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 826506 (sustAGE). Sivan Sabato was supported in part by the Israel Science Foundation (Grant No. 555/15).
Appendices
Appendix
Omitted proofs: alignment with the Principle of Logit Separation
All the considered losses are a function of the output logits and the example labels. For a network model \(\theta \), denote the vector of logits it assigns to example x by \(z^\theta (x) = (z^\theta _1(x),\ldots ,z^\theta _k(x))\). When \(\theta \) and x are clear from context, we write \(z_j\) instead of \(z_j^\theta (x)\). Denote the logit output of the sample by \(S_\theta = ((z^\theta (x_1),y_1),\ldots ,(z^\theta (x_n),y_n))\). A loss function \(\ell :\cup _{n=1}^\infty ({\mathbb {R}}^k \times [k])^n \rightarrow {\mathbb {R}}_+\) assigns a loss to a training sample based on the output logits of the model and on the labels of the training examples. The goal of training is to find a model \(\theta \) which minimizes \(\ell (S_\theta ) \equiv \ell (S,\theta )\). In almost all the losses we study, the loss on the training sample is the sum over all examples of a loss defined on a single example: \(\ell (S_\theta ) \equiv \sum _{i=1}^n \ell (z^\theta (x_i),y_i)\), thus we only define \(\ell (z,y)\). We explicitly define \(\ell (S_\theta )\) below only when this is not the case.
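For intuition about why SLC is fast at test time: the logits \(z^\theta (x)\) are typically produced by a linear output layer over a hidden representation, so the single logit \(z^\theta _y(x)\) requires only one row of the output matrix. A minimal NumPy sketch (shapes and names are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 10000                      # hidden size, number of classes (illustrative)
h = rng.standard_normal(d)             # hidden representation of one example
W = rng.standard_normal((k, d))        # output weight matrix
b = rng.standard_normal(k)             # output biases

def all_logits(h):
    # standard classification: compute all k logits, cost O(k * d)
    return W @ h + b

def single_logit(h, y):
    # SLC at test time: only the logit for class y, cost O(d)
    return W[y] @ h + b[y]

y = 42
z = all_logits(h)
assert np.isclose(z[y], single_logit(h, y))
```

The per-example speedup of the single-logit computation grows linearly with the number of classes k.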
1.1 Self-normalization
We prove that the self-normalization loss satisfies the PoLS: fix a training sample S and a neural network model \(\theta \), and consider an example \((x,y) \in S\). We consider the two terms of the loss in order. First, consider \(-\log (p_y)\). From the definition of \(p_y\) (Eq. 1) we have that
\( -\log (p_y) = \log \big (1+\sum _{j\ne y} e^{z_j-z_y}\big ). \)
Set \(\epsilon _0 := \log (1+e^{-2})\). Then, if \(-\log (p_y) < \epsilon _0\), we have \(\sum _{j\ne y} e^{z_j-z_y} \le e^{-2}\), which implies that (a) \(\forall j \ne y, z_j \le z_y - 2\) and (b) \(e^{z_y} \ge \sum _{j=1}^k e^{z_j}/(1+e^{-2}) \ge \frac{1}{2} \sum _{j=1}^ke^{z_j}.\) Second, consider the second term. There is an \(\epsilon _1 > 0\) such that if \(\log ^2 (\sum _{j=1}^k e^{z_j}) < \epsilon _1\) then (c) \(2e^{-1}< \sum _{j=1}^k e^{z_j} < e\), which implies \(e^{z_y} < e\) and hence (d) \(z_y < 1\). Now, let \(\theta \) be such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then \(\forall (x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). From (b) and (c), \(e^{-1}< \frac{1}{2} \sum _{j=1}^k e^{z_j} \le e^{z_y}\), hence \(z_y > -1\). Combining with (d), we get \(-1< z_y < 1\). Combined with (a), we get that for \(j \ne y\), \(z_j < -1\). To summarize, \(\forall (x,y),(x',y') \in S\) and \(\forall y'' \ne y'\), we have that \(z_y^\theta (x)> -1 > z_{y''}^\theta (x')\), implying PoLS alignment.
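A numeric sanity check of this argument (a sketch: it assumes the per-example self-normalization loss is the sum of the two terms treated above, \(-\log (p_y) + \log ^2(\sum _{j=1}^k e^{z_j})\), and uses hand-picked logits):

```python
import numpy as np

def self_norm_loss(z, y):
    # assumed per-example self-normalization loss: the two terms the proof
    # analyzes, -log p_y plus the squared log-partition term
    log_Z = np.log(np.sum(np.exp(z)))
    return (log_Z - z[y]) + log_Z ** 2

eps0 = np.log(1 + np.exp(-2))   # threshold epsilon_0 from the proof

y = 0
# near-zero true logit, strongly negative false logits (hand-picked)
z = np.array([-0.001, -10.0, -10.0, -10.0, -10.0])
loss = self_norm_loss(z, y)
assert loss < eps0

# conclusions of the proof hold: true logit in (-1, 1), false logits below -1
assert -1 < z[y] < 1
assert all(z[j] < -1 for j in range(len(z)) if j != y)
```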
1.2 Noise-contrastive estimation
Recall the definition of the NCE loss from Eq. (8):
\( \ell (z,y) = -\log (g_y) - {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] , \)
where \(g_j := (1 + t\cdot q(j)\cdot e^{-z_j})^{-1}.\) We prove that the NCE loss satisfies the PoLS: \(g_j\) is monotonically increasing in \(z_j\). Hence, if the loss is small, \(g_y\) is large and \(g_j\), for \(j \ne y\), is small. Formally, fix t and a training sample S. There is an \(\epsilon _0 > 0\) such that if \(-\log g_y \le \epsilon _0\), then \(z_y > 0\). Also, there is an \(\epsilon _1 > 0\) (which depends on q) such that if \(- {\mathbb {E}}_{j \sim q}\left[ \log (1-g_j)\right] \le \epsilon _1\) then \(\forall j \ne y\), \(-\log (1-g_j)\) must be small enough so that \(z_j < 0\). Now, consider \(\theta \) such that \(\ell (S_\theta ) \le \epsilon := \min (\epsilon _0, \epsilon _1)\). Then for every \((x,y) \in S\), \(\ell (z^\theta (x),y) \le \epsilon \). This implies that for every \((x,y),(x',y') \in S\) and \(y'' \ne y'\), we have that \(z^\theta _y(x)> 0 > z^\theta _{y''}(x')\), thus this loss is aligned with the PoLS.
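The following sketch illustrates the NCE argument numerically, with arbitrary hand-picked values for t, q, and the logits (and, for simplicity, the expectation over \(j \sim q\) approximated by a plain average over the false classes):

```python
import numpy as np

t = 5
k = 4
q = np.full(k, 1.0 / k)                  # uniform noise distribution (arbitrary)
z = np.array([3.0, -3.0, -3.0, -3.0])    # positive true logit, negative false logits
y = 0

# g_j as defined above; monotonically increasing in z_j
g = 1.0 / (1.0 + t * q * np.exp(-z))

# per-example NCE loss, with the expectation approximated over false classes
loss = -np.log(g[y]) - np.mean([np.log(1 - g[j]) for j in range(k) if j != y])
assert loss < 0.2                        # the loss is small for these logits

# as in the proof: small loss goes with g_y large, g_j small, z_y > 0 > z_j
assert g[y] > 0.5
assert all(g[j] < 0.5 for j in range(k) if j != y)
```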
1.3 Binary cross-entropy
This loss is similar in form to the NCE loss: for \(g_j\) as in Eq. (8), \(g_j = \sigma (z_j - \ln (t \cdot q(j)))\). Since \(\sigma \) is monotonic, the proof method for NCE carries over, and thus the binary cross-entropy loss satisfies the PoLS as well.
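The identity \(g_j = \sigma (z_j - \ln (t \cdot q(j)))\) is an exact algebraic rewriting, since \(1/(1+t\cdot q(j)\cdot e^{-z_j}) = 1/(1+e^{-(z_j - \ln (t\cdot q(j)))})\). A quick numerical confirmation (the values of t, q(j), and the logits are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

t = 5
q_j = 0.01
z_j = np.linspace(-4, 4, 9)   # a range of logit values

# g_j in the NCE form of Eq. (8)
g_nce = 1.0 / (1.0 + t * q_j * np.exp(-z_j))
# g_j in the binary cross-entropy (sigmoid) form
g_sigmoid = sigmoid(z_j - np.log(t * q_j))

# the two forms coincide, so the NCE alignment argument carries over
assert np.allclose(g_nce, g_sigmoid)
```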
1.4 Batch losses
Recall that the batch losses are defined as \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), where \(B_\theta \) is a random batch out of \(S_\theta \) and L is \(L_c\) for the batch cross-entropy (Definition 2) or \(L_m\) for the batch max-margin loss (Definition 3). If true logits are greater than false logits in every batch separately, then the PoLS is satisfied on the whole sample, since every pair of examples appears together in some batch. The following lemma formalizes this:
Lemma 1
If L is aligned with the PoLS, and \(\ell \) is defined by \(\ell (S_\theta ) := {\mathbb {E}}_B[L(B_\theta )]\), then \(\ell \) is also aligned with the PoLS.
Proof
Assume a training sample S and a neural network model \(\theta \). Since L is aligned with the PoLS, there is some \(\epsilon ' > 0\) such that if \(L(B_\theta ) < \epsilon '\), then for each \((x,y),(x',y') \in B\) and \(y'' \ne y'\) we have that \(z_y^\theta (x) > z_{y''}^\theta (x')\). Let \(\epsilon = \epsilon '/\left( {\begin{array}{c}n\\ m\end{array}}\right) \), and assume \(\ell (S_\theta ) < \epsilon \). Since there are \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \) batches of size m in S, and the batch losses are non-negative, this implies that for every batch B of size m, \(L(B_\theta ) < \epsilon '\). For any \((x,y),(x',y') \in S\), there is a batch B that includes both examples. Thus, for \(y'' \ne y'\), \(z_y^\theta (x) > z_{y''}^\theta (x')\). Since this holds for any two examples in S, \(\ell \) is also PoLS-aligned. \(\square \)
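The counting step of the proof can be checked concretely: with non-negative batch losses, an average below \(\epsilon '/\left( {\begin{array}{c}n\\ m\end{array}}\right) \) forces every individual batch loss below \(\epsilon '\), and every pair of examples indeed shares a batch. A small sketch (n, m, and the loss values are arbitrary):

```python
from itertools import combinations
import random

n, m, eps_prime = 6, 3, 1.0
batches = list(combinations(range(n), m))
C = len(batches)                          # number of batches, C(n, m)

random.seed(0)
losses = [random.uniform(0, eps_prime / (2 * C)) for _ in batches]
avg = sum(losses) / C                     # plays the role of E_B[L(B_theta)]
assert avg < eps_prime / C

# since losses are non-negative, max(losses) <= sum(losses) = C * avg < eps'
assert max(losses) <= C * avg < eps_prime

# every pair of examples appears together in at least one batch
for i in range(n):
    for j in range(i + 1, n):
        assert any(i in B and j in B for B in batches)
```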
(1) Batch cross-entropy To show that the batch cross-entropy satisfies the PoLS, we show that \(L_c\) does, which by Lemma 1 implies this for \(\ell \). By the continuity of \(\mathrm {KL}\), and since for discrete distributions, \(\mathrm {KL}(P||Q) =0 \iff P \equiv Q\), there is an \(\epsilon > 0\) such that if \(L(B_\theta ) \equiv \mathrm {KL}(P_B || Q^\theta _B) < \epsilon \), then for all i, j, \(|P_B(i,j) - Q^\theta _B(i,j)| \le \frac{1}{2m}\). Therefore, for each example \((x,y) \in B\),
It follows that for any two examples \((x,y),(x',y') \in B\) and any \(y'' \ne y'\), \(z_y^\theta (x)> \frac{1}{2m} > z_{y''}^\theta (x')\). Therefore \(L_c\) satisfies the PoLS, which completes the proof.
(2) Batch max-margin To show that the batch max-margin loss satisfies the PoLS, we show this for \(L_m\) and invoke Lemma 1. Set \(\epsilon = \gamma /m\). If \(L(B_\theta ) < \epsilon \), then \(\gamma -z_{+}^B + z_{-}^B < \gamma \), implying \(z_{+}^B > z_{-}^B\). Hence, any \((x,y),(x',y') \in B\) and any \(y'' \ne y'\) satisfy \(z^\theta _y(x) \ge z_{+}^B > z_{-}^B \ge z^\theta _{y''}(x')\). Thus \(L_m\) is aligned with the PoLS, implying the same for \(\ell \).
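As a numerical sanity check (a sketch assuming the batch max-margin loss takes the hinge form \(\max (0, \gamma - z_{+}^B + z_{-}^B)\), with \(z_{+}^B\) the smallest true logit and \(z_{-}^B\) the largest false logit in the batch; logit values are hand-picked):

```python
import numpy as np

def batch_max_margin(z_batch, y_batch, gamma=1.0):
    # assumed hinge form of the batch max-margin loss over the gap between
    # the smallest true logit and the largest false logit in the batch
    m, k = z_batch.shape
    true_mask = np.zeros((m, k), dtype=bool)
    true_mask[np.arange(m), y_batch] = True
    z_plus = z_batch[true_mask].min()     # smallest true logit in the batch
    z_minus = z_batch[~true_mask].max()   # largest false logit in the batch
    return max(0.0, gamma - z_plus + z_minus), z_plus, z_minus

# batch of m=3 examples, k=4 classes, with well-separated logits
z = np.array([[ 2.0, -1.0, -1.5, -2.0],
              [-1.2,  2.5, -1.1, -3.0],
              [-2.0, -1.3,  1.8, -1.4]])
y = np.array([0, 1, 2])
gamma, m = 1.0, len(y)

loss, z_plus, z_minus = batch_max_margin(z, y, gamma)
# when the loss is below gamma/m, every true logit exceeds every false logit
assert loss < gamma / m
assert z_plus > z_minus
```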
Cite this article
Keren, G., Sabato, S. & Schuller, B. Analysis of loss functions for fast single-class classification. Knowl Inf Syst 62, 337–358 (2020). https://doi.org/10.1007/s10115-019-01395-6