Abstract
Supervised learning in neural nets means optimizing synaptic weights W such that outputs y(x;W) for inputs x match the corresponding targets t from the training data set as closely as possible. This optimization means minimizing a loss function \({\mathscr{L}}(\mathbf {W})\) that is usually motivated by maximum-likelihood principles, silently making prior assumptions on the distribution of the output errors y − t. While classical crossentropy loss assumes triangular error distributions, it has recently been shown that generalized power error loss functions can be adapted to more realistic error distributions by fitting the exponent q of a power function used for initializing the backpropagation learning algorithm. This approach can significantly improve performance, but computing the loss function requires the antiderivative of the function \(f(y) := y^{q-1}/(1-y)\), which has previously been determined only for natural \(q\in \mathbb {N}\). In this work I extend this approach to rational \(q = n/2^{m}\) where the denominator is a power of two. I give closed-form expressions for the antiderivative \({\int \limits } f(y)\, dy\) and the corresponding loss function. The benefits of such an approach are demonstrated by experiments showing that optimal exponents q are often non-natural, and that the error exponents q best fitting output error distributions vary continuously during learning, typically decreasing from large q > 1 to small q < 1 as learning converges. These results suggest new adaptive learning methods where loss functions could be continuously adapted to output error distributions during learning.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016)
AlAhmad, R., Almefleh, H.: Antiderivatives and integrals involving incomplete beta functions with applications. Aust. J. Math. Anal. Appl. 17(2), 11 (2020)
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Bryson, A.E., Ho, Y.C.: Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell, New York (1969)
Chollet, F.: Keras. https://github.com/fchollet/keras (2015)
Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)
Ferreira, C., Lopez, J., Perez Sinusia, E.: Uniform representations of the incomplete beta function in terms of elementary functions. Electron. Trans. Numer. Anal. 48, 450–461 (2018)
Gers, F., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y., Titterington, M. (eds.) Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp 249–256. JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy (2010)
Good, I.: Some terminology and notation in information theory. Proceedings of the IEE - Part C: Monographs 103(3), 200–204 (1956)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http://www.deeplearningbook.org (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.90 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp 448–456. PMLR, Lille, France (2015)
Janocha, K., Czarnecki, W.: On loss functions for deep neural networks in classification. arXiv:1702.05659 (2017)
Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) Proceedings of the 3rd International Conference on Learning Representations (ICLR), arXiv:1412.6980v9 (2015)
Knoblauch, A.: Power function error initialization can improve convergence of backpropagation learning in neural networks for classification. Neural Comput. 33(8), 2193–2225 (2021)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., Department of Computer Science University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf, vol. 25. Curran Associates, Inc. (2012)
Krotov, D., Hopfield, J.: Dense associative memory for pattern recognition. arXiv:1606.01164 (2016)
Linnainmaa, S.: Taylor expansion of the accumulated rounding error. BIT Numer. Math. 16(2), 146–160 (1976)
Mathematica: Version 12.3.1. Wolfram Research, Inc., Champaign, IL (2021). https://www.wolfram.com/mathematica
MATLAB: version 9.7.0.1247435 (R2019b). The MathWorks Inc., Natick, Massachusetts (2019)
Maxima: Maxima, a computer algebra system. version 5.43.2 (2020). http://maxima.sourceforge.net/. See also www.integral-calculator.com
Palm, G.: Novelty, Information and Surprise. Springer, Berlin (2012)
Parker, D.: Learning-logic: casting the cortex of the human brain in silicon. Tech. Rep. Tr-47, Center for Computational Research in Economics and Management Science. MIT Cambridge MA (1985)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’AlchéBuc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf, pp 8024–8035. Curran Associates, Inc. (2019)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, vol. 9351, pp 234–241. Springer, Berlin (2015)
Ruder, S.: An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016)
Rumelhart, D., Hinton, G., Williams, R.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
Rumelhart, D., McClelland, J., the PDP Research Group (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. MIT Press, Cambridge (1986)
Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana/Chicago (1949)
Smith, S.: The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, San Diego (1997)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)
Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp 6105–6114. PMLR (2019)
Temme, N.: Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function. Math. Comput. 29(132), 1109–1114 (1975)
Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University (1974)
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of Interests
The author declares that he has no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by Deutsches Bundesministerium für Verkehr und digitale Infrastruktur (Modernitätsfonds/mFUND) via project AI4Infra (FKZ 19F2112C), and by Ministerium für Wirtschaft, Arbeit und Wohnungsbau Baden-Württemberg (KI-Innovationswettbewerb BW für Verbundforschungsprojekte) via project KI4Audio (FKZ 36-3400.7/98). The author also acknowledges support by the state of Baden-Württemberg through bwHPC.
Appendices
Appendix A: Proofs and supplements of Sections 2 and 3
Proof of Theorem 2:
By iterated polynomial division it is easy to verify that for \(n\in \mathbb {N}\) we have \(\frac{y^{n-1}}{1-y}=-{\sum }_{i=0}^{n-2}y^{i}+\frac{1}{1-y}\),
and for \(q\in \mathbb {N}\) and y ∈ (0; 1) therefore \(F(y):={{\int \limits }_{0}^{y}} \frac {y^{q-1}}{1-y}\,dy={{\int \limits }_{0}^{y}}-{\sum }_{i=0}^{q-2}y^{i}+\frac {1}{1-y}\,dy=-{\sum }_{i=0}^{q-2}\frac {y^{i+1}}{i+1}-\log (1-y)\) showing (12). The second form (13) has been used previously [19] and is given here for completeness. The two forms are equivalent as (13) satisfies F(0) = 0 and, with the binomial sum, has the correct derivative \(F^{\prime }(y)=\frac {1}{1-y}+{\sum }_{r=0}^{q-2}{q-1\choose r}(-1)^{q-r}(-(1-y)^{q-2-r})=\frac {1}{1-y}-\frac {(-1)^{q}}{1-y}({\sum }_{r=0}^{q-1}{q-1\choose r}(-1)^{r}(1-y)^{q-1-r}-(-1)^{q-1})=\frac {1-(-1)^{q}((-y)^{q-1}+(-1)^{q})}{1-y}\)\(=\frac {y^{q-1}}{1-y}\). Then (14) follows from inserting (12) into (10),
With the binomial sum \((1-y)^{i}={\sum }_{j=0}^{i}{i\choose j}(-y)^{j}\), the sum in (14) can be written as the polynomial
from which we can read the polynomial coefficients \(a^{(q)}_{0}={\sum }_{i=1}^{q-1}\frac {t}{i}\) and
for i = 1,…,q − 1. Table 1 gives examples for the alternative coefficients \(b_{i}^{(q)}\) of (16). □
Proof of Proposition 1:
It is well known that \(\frac {d}{dz}\,{}_{2}F_{1}(a,b;a+1;z)=a\cdot \frac {(1-z)^{-b}-{}_{2}F_{1}(a,b;a+1;z)}{z}\). Defining \(G(y) := {}_{2}F_{1}(q, 1; q + 1; y)\) for brevity we get from this \((y^{q}G(y)/q)'=y^{q-1}G(y)+y^{q}(q\frac {(1-y)^{-1}-G(y)}{y})/q=y^{q-1}/(1-y)\). □
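Proposition 1 can also be checked numerically; a minimal sketch (assuming SciPy is available) compares the hypergeometric antiderivative with direct numerical integration of f:

from scipy.integrate import quad
from scipy.special import hyp2f1

# Proposition 1: y^q * 2F1(q,1;q+1;y) / q is an antiderivative of f(y) = y^(q-1)/(1-y) with F(0)=0
q, y = 5.0/8.0, 0.7
F_hyp = y**q * hyp2f1(q, 1.0, q + 1.0, y) / q                 # closed form via the hypergeometric function
F_num, _ = quad(lambda t: t**(q - 1.0) / (1.0 - t), 0.0, y)   # numerical integral of f over [0, y]
print(F_hyp, F_num)                                           # both values should agree up to quadrature error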
Proof of Proposition 2:
We have to show that \(\tilde {F}^{\prime }(y)=f(y)\). In fact, it is
where the last equation involves the DFT of the discrete N-periodic signal
Using some well known facts of the DFT [36] for signals and
it follows . Thus, inserting \(U[N-n]=cNe^{\alpha (N-n)}=cNe^{-(N-n)\log (y)/N}=cNy^{-(N-n)/N}\) in (2) with \(e^{\alpha N}=e^{-\log (y)}=1/y\) yields
proving (19), (20). We still have to prove (21) because determining the generalized loss function (10) with (17) involves subtracting \(\tilde {F}(0)\): With the geometric-type sum
(which can easily be proved by induction), the complex logarithm \(\log (jr) \in (-\pi ;\pi ]\) in the primary sheet
using \(z := e^{j2\pi n/N}\) and DFT{1}[n] = δ[n] = 0 for n = 1,…,N − 1, where δ[n] is the discrete Dirac impulse (that is, δ[n] = 1 for n = 0, and δ[n] = 0 for n≠ 0), we get from (20)
□
Proof of Proposition 3:
We can rewrite (20) as
where \(\tilde {F}[k]\) is a discrete N-periodic signal. In polar form,
where the last two cases are necessary to distinguish between different sheets of the complex logarithm, and thus
As (3) obviously implies conjugate complex symmetry \(\tilde {F}[-k]=\tilde {F}[N-k]={\tilde {F}}^{\ast }[k]\), we get for 0 ≤ y < 1
where, in the sum \(\tilde {G}(y):={\sum }_{k=1}^{N/2-1}\cos \limits (2\pi \frac {kn}{N})\cdot \log (r_{k})-\sin \limits (2\pi \frac {kn}{N})\cdot \varphi _{k}\), the terms with index \(k=1,...,\frac {N}{4}-1\) are similar to those with index \(\frac {N}{2}-k\). Specifically,
where the last equation for φN/2−k follows from y ≥ 0 and \(\cos \limits \left (2\pi \frac {N/2-k}{N}\right )<0\) for \(k=1,\ldots ,\frac {N}{4}-1\). Thus for N ≥ 4
such that inserting (41) in (39) and using (21) we obtain (22). The remaining special case (23) is easily shown by the derivative \(\left (\log \frac {1+\sqrt {y}}{1-\sqrt {y}}\right )'=\frac {1-\sqrt {y}}{1+\sqrt {y}}\cdot \frac {\frac {1}{2\sqrt {y}}(1-\sqrt {y})+(1+\sqrt {y})\frac {1}{2\sqrt {y}}}{(1-\sqrt {y})^{2}}=\frac {1}{(1-y)\sqrt {y}}\). □
Proof of Proposition 4:
For irreducible \(q = n/N = n/2^{m} \in (0; 1)\), we have odd n for m ≥ 1 and \((-1)^{n} = -1\). From (7) we then obtain for \(k=1,\ldots ,\frac {N}{4}-1\)
with the Heaviside function H(y) = 1 if y ≥ 0 and H(y) = 0 otherwise. We can further simplify the last equation using the addition theorem of the \(\arctan \) function
Specifically, for \(a:=\frac {\sin \limits \left (2\pi \frac {k}{N}\right )}{y^{1/N}-\cos \limits \left (2\pi \frac {k}{N}\right )}\) and \(b:=\frac {\sin \limits \left (2\pi \frac {k}{N}\right )}{y^{1/N}+\cos \limits \left (2\pi \frac {k}{N}\right )}\) we get
Note that for \(k\in \{1,\ldots ,\frac {N}{4}-1\}\) both \(\sin \limits \left (2\pi \frac {k}{N}\right )>0\) and \(\cos \limits \left (2\pi \frac {k}{N}\right )>0\). Thus, for 0 < y < 1 the conditions ab < 1 and a + b < 0 are both equivalent to \(y^{1/N}<\cos \limits \left (2\pi \frac {k}{N}\right )\). Therefore the addition theorem implies
such that with (9), the identity \(\arctan (\frac {1}{x})=\frac {\pi }{2}-\arctan (x)\), and \((-1)^{n} = -1\) the antiderivative (22) becomes for odd n with \(\cos \limits (\pi \frac {n}{2})=0\) and \(\sin \limits (\pi \frac {n}{2})=(-1)^{(n-1)/2}\)
Simplifying this result yields Proposition 4: As \(\arctan (0)=0\), \(\log (1)=0\), and F(0) = 0 by definition (see (22)), the trigonometric identity (24) follows from (45) for y = 0. Inserting (24) back into (45) yields (25). □
Proof of Proposition 5:
Recapitulating our results so far, we find that Proposition 2 (including all equations in the proof) and (3)–(5) hold also for y > 1. With this we can easily verify that for y > 1 we have real-valued \(\log (y^{1/N}-1)\in \mathbb {R}\) and the antiderivative (6) becomes
Then, because of \(y>1\ge \cos \limits (2\pi \frac {k}{N})\), we see that (7) simplifies to
whereas all equations of (7) and (8) still hold true. Thus, in correspondence to (22) we get for y > 1 and any n = 1, 2, 3, 4,…,N − 1
employing real-valued computations only (whereas (22) is still a valid antiderivative for y > 1, but includes imaginary numbers). For irreducible \(q = n/2^{m}\) with odd n = 1, 3, 5,…,N − 1, it is easy to verify that (9) and (10) remain valid, whereas \(y^{1/N}-\cos \limits \left (2\pi \frac {k}{N}\right )>0\) for y > 1 always implies ab < 1 and therefore (11) simplifies to
Thus, (14) becomes for odd n with (9), (11), \(\arctan (\frac {1}{x})=\frac {\pi }{2}-\arctan (x)\), \((-1)^{n} = -1\), \(\cos \limits (\pi \frac {n}{2})=0\), and \(\sin \limits (\pi \frac {n}{2})=(-1)^{(n-1)/2}\)
Merging this with Proposition 4 almost immediately gives Proposition 5: Skipping the constant C in (49) and then comparing to (25) reveals that, after taking absolute values \(|1 - y^{1/N}|\), both (25) for 0 ≤ y < 1 and (49) for y > 1 represent the same function that is unified by (26). The case N = 2 corresponding to (27) can be shown as in (23). □
Proof of Proposition 6:
We assume irreducible q = n/N > 1 with \(N := 2^{m} > 1\) and odd n > N. Using odd η := n − N > 0 and substituting \(u := y^{1/N}\) with \(du/dy=\frac {y^{1/N-1}}{N}=\frac {1}{Ny^{(N-1)/N}}\), \(y = u^{N}\), \(dy = Ny^{(N-1)/N}du = Nu^{N-1}du\) we obtain
By polynomial division we get for M := (η − 1) div N ≥ 0 and even R := (η − 1) mod N = η − 1 − M ⋅ N ∈{0,…,N − 2}
and therefore, with re-substituting \(u := y^{1/N}\), η := n − N, and \(du=\frac {dy}{Ny^{(N-1)/N}}\), (18) becomes
where (20) uses η div N = (η − 1) div N and η mod N = ((η − 1) mod N) + 1 = R + 1 as η := n − N is odd while \(N := 2^{m}\) is even, and so η mod N ∈{1, 3, 5,…,N − 1} is odd. Using (20) completes the proof: As \(\tilde {n}:=R+1=\eta \ \text {mod}\ N=n\ \text {mod}\ N\in \{1,3,5,\ldots ,N-1\}\) is odd, \(\tilde {q}:=(R+1)/N=\tilde {n}/N\) is irreducible and \(0<\tilde {q}<1\), thus the remaining integral in (53) follows from (26) for N ≥ 4 or from (27) for N = 2, replacing q and n by \(\tilde {q}\) and \(\tilde {n}\), respectively. Thus we get (28) for N ≥ 4 and (29) for N = 2. Both antiderivatives already have the correct offset 0 for y = 0, so they correspond to continuations of F(y) as defined earlier. □
Proof of Theorem 3:
First, let us show that (30) contains Proposition 6 (as special case for irreducible q = n/N > 1 with N ≥ 2 and \(y\in \mathbb {R}^{+}_{0}\backslash \{1\}\)): Indeed, for \(M:=(n-N-1)\ \text {div}\ N=(n \ \text {div}\ N)-1=\tilde {q}-1\), the first sums in (6) and (29) become after substituting i by M − i
which, for the case N ≥ 2 with H(N − 2) = H(q − 1) = 1, equals the first line of (30) when the first summand (for i = 0) is written as a separate term for later case distinctions. It is easy to verify that the remaining terms in (6) and (29) are also equivalent to those in (30), as here H(N − 2) and H(N − 4) select the necessary terms, respectively.
Second, we show that (30) contains Theorem 2 (as special case for \(q\in \mathbb {N}\), 0 ≤ y < 1) and extends it for y > 1: As here \(q=n/N=n\in \mathbb {N}\) with N = 1, and thus \(\tilde {n}=0\), \(\tilde {q}=n=q\), M = q − 1, and H(N − 2) = H(N − 4) = 0,H(q − 1) = 1, the antiderivative (30) becomes
which equals (12) for 0 ≤ y < 1 and, also for the case y > 1, is still a proper antiderivative of f(y), as can easily be seen by recapitulating the proof of (12) using (34).
Third, it is easy to verify that (30) contains Proposition 5 (as special case for irreducible 0 < q = n/N < 1 with N ≥ 2 and \(y\in \mathbb {R}^{+}_{0}\backslash \{1\}\)) using H(q − 1) = 0, H(N − 2) = 1, and either H(N − 4) = 1 (for (26)) or H(N − 4) = 0 (for (27)).
Finally, the monotonicity of F(y) follows from f(y) > 0 for 0 < y < 1 and f(y) < 0 for y > 1, and the limits and asymptotic expressions (31) are easily verified by inspecting each case. In particular, F(0) = 0 follows from \(0^{i} = 0\), \(\log (1)=0\), and \(\arctan (0)=0\). \(F(1)=\infty \) follows from \(\lim _{y\rightarrow 1}\log (\frac {1+y^{\frac {1}{N}}\cdot H(N-2)}{|1-y^{1/N}|})=\infty \) and all other terms remaining finite for \(y\rightarrow 1\). For \(y\rightarrow \infty \), approximation (31) follows because \(\arctan (\frac {2\sin \limits (\frac {2\pi k}{N})y^{\frac {1}{N}}}{1-y^{\frac {2}{N}}})\rightarrow 0\), \(\log (\frac {1+2y^{\frac {1}{N}}\cos \limits (\frac {2\pi k}{N})+y^{\frac {2}{N}}}{1-2y^{\frac {1}{N}}\cos \limits (\frac {2\pi k}{N})+y^{\frac {2}{N}}})\rightarrow 0\), \(\arctan (y^{\frac {1}{N}})\rightarrow \frac {\pi }{2}\), and \(\log (\frac {1+y^{\frac {1}{N}}\cdot H(N-2)}{|1-y^{\frac {1}{N}}|})\rightarrow 0\) for N ≥ 2. From this the asymptotics can easily be seen; in particular, for q > 1 the dominating term in (31) is \(-\frac {N}{n-N}y^{\tilde {n}/N+\tilde {q}-1}=-\frac {1}{(n-N)/N}y^{(n\ \text {mod}\ N)/N+n\ \text {div}\ N-1}=-\frac {1}{q-1}y^{(n-N)/N}=-\frac {1}{q-1}y^{q-1}\). □
Appendix B: Implementation details
Figures 2–3 used the Simple RNN of Fig. 2A with D inputs, M = 10 hidden units, and K outputs. The layers are linked by dense connections A, U, B and also include bias weights for layers u, y. Activation functions are \(\tanh \) for u and the logistic sigmoid σ for y. Synaptic connections are initialized by uniform Xavier initialization [10]. Training used the ADAM optimizer [18, 31] with standard parameters β1 = 0.9, β2 = 0.999. Experiments used either Keras 2.2.5 with a Tensorflow 1.14.0 backend [1, 6] or a custom neural network library for backpropagation. While Keras computes gradients based on automatic differentiation [23] of the loss function (see the Python code below for the power error loss (10) with (30)), the custom implementation uses (5) with the power error initialization (7).
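A minimal Keras sketch of this setup is given below for orientation (an illustrative reconstruction, not the original training script; in particular, return_sequences, the variable-length input shape, and binary crossentropy as a placeholder loss are assumptions):

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, TimeDistributed
from keras.optimizers import Adam

D, M, K_out = 7, 10, 7                  # input symbols, hidden units, output symbols
model = Sequential([
    SimpleRNN(M, activation='tanh', return_sequences=True,
              kernel_initializer='glorot_uniform',               # uniform Xavier
              recurrent_initializer='glorot_uniform',
              input_shape=(None, D)),                             # connections A, U and bias of layer u
    TimeDistributed(Dense(K_out, activation='sigmoid',
                          kernel_initializer='glorot_uniform'))   # connection B and bias of layer y
])
model.compile(optimizer=Adam(beta_1=0.9, beta_2=0.999),
              loss='binary_crossentropy')   # placeholder; the experiments instead pass
                                            # powererrorloss_wrapper(n, N) from the listing below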
The Embedded Reber Grammar Problem is to predict the next output symbol of a finite automaton with non-deterministic state transitions [15]. Inputs are symbol sequences \(\mathbf {x}_{n}(1),\mathbf {x}_{n}(2),\ldots ,\mathbf {x}_{n}(T)\) generated by the automaton, representing each of the D = 7 symbols by a one-hot input vector \(\mathbf {x}_{n}(\tau )\). The output to be predicted at time τ is the next symbol \(\mathbf {x}_{n}(\tau + 1)\) generated by the automaton in the next time step (K = 7). Due to the non-determinism, target vectors \(\mathbf {t}_{n}(\tau )\) can have multiple one-entries, one for each possible output symbol. Network decision \(\hat {\mathbf {y}}_{n}(\tau )\) is evaluated as correct if \(\hat {\mathbf {y}}_{n}(\tau )=\mathbf {t}_{n}(\tau )\) at decision threshold 0.5. Learning used N = 2048 sequences (90% training, 10% validation/testing). Average sequence length is T = 12 (maximum 40).
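This evaluation rule can be sketched as follows (an illustration with hypothetical array shapes, not code from the original experiments):

import numpy as np

def fraction_correct(y, t, threshold=0.5):
    """y, t: arrays of shape (time_steps, K) with sigmoid outputs and 0/1 (possibly multi-hot) targets."""
    y_hat = (y >= threshold).astype(int)           # network decision at threshold 0.5
    return np.mean(np.all(y_hat == t, axis=-1))    # fraction of time steps predicted exactly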
Remarks to Fig. 3: Reaching zero median errors suggests that all 16 learning trials of an experiment converged to zero error, whereas previous works reported problems of simple RNNs solving this data set [9, 15]. However, as a zero median only means that at least half of the trials reached zero errors, I have also analyzed mean test errors (instead of medians), which also reached zero within the total learning duration of 65 epochs for all exponents 0.5 ≤ q ≤ 15 (data not shown). The profile of minimal epoch numbers until zero test errors was similar to Fig. 3B, although absolute values were larger by a factor of about 2–2.5 (best was 9 epochs for q = 1.5 vs. 26 epochs for q = 1). The discrepancy may be due to earlier works using a suboptimal initialization of synaptic weights, causing either vanishing or exploding gradients [3, 15]. Additional experiments (data not shown) revealed that at least q > 1 can still reach zero error, even if initial weights deviate substantially (by factors between 0.125 and 3) from Xavier initialization. This is consistent with the idea that, for q > 1, the power error loss provides a better gradient-to-loss ratio and, thereby, avoids flat loss landscapes and vanishing gradients [19].
Remarks to Fig. 2D: Panel (D) shows estimated power exponents \(\tilde {q}\) of absolute output error distributions |𝜖| = |t − y| corresponding to panels (B) and (C). Estimates \(\tilde {q}\) are obtained by selecting the minimum Euclidean distance of histogram vectors (10 equally spaced bins of the interval |𝜖|∈ [0; 1]) between experimental and theoretical distributions (3) evaluated for \(q\in \{\frac {1}{8},\ldots ,\frac {7}{8},1,1.25,1.5,1.75,2,2.5,3,4,\ldots ,20\}\). As expected, \(\tilde {q}\) decreases gradually with epoch number, and may therefore be used to characterize learning progress as discussed in Section 4. Again, there are some deviations between Keras and the custom implementation, in particular in later learning phases (but note the logarithmic scale).
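The estimation procedure can be sketched as follows (an illustrative reconstruction; theoretical_histogram is a hypothetical helper standing in for the theoretical error distribution (3), which is not reproduced here):

import numpy as np

Q_GRID = [i/8 for i in range(1, 8)] + [1, 1.25, 1.5, 1.75, 2, 2.5, 3] + list(range(4, 21))

def estimate_q(abs_errors, theoretical_histogram, bins=10):
    """abs_errors: absolute output errors |t - y| in [0; 1];
    theoretical_histogram(q, bins): hypothetical helper returning the normalized bin
    probabilities predicted by the theoretical distribution (3) for exponent q."""
    h_emp, _ = np.histogram(abs_errors, bins=bins, range=(0.0, 1.0))
    h_emp = h_emp / h_emp.sum()                   # normalize the empirical histogram
    dists = [np.linalg.norm(h_emp - theoretical_histogram(q, bins)) for q in Q_GRID]
    return Q_GRID[int(np.argmin(dists))]          # exponent with minimum Euclidean distance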
Figure 5 used a sequential 2D Convolutional Neural Network (CNN) architecture implemented with PyTorch 1.7.0+cu101 [29], including 6 CNN layers (kernel size 3, stride 1, “same” padding), 3 Max-Pooling layers (MaxPool; kernel size 2; stride 2), 3 Batch-Normalization (BN) layers, 3 Fully Connected (FC) layers, and 3 Dropout layers: Input → CNN(32) → BN(32) → ReLU → CNN(64) → ReLU → MaxPool → CNN(128) → BN → ReLU → CNN(128) → ReLU → MaxPool → Dropout(p = 0.05) → CNN(256) → BN → ReLU → CNN(256) → ReLU → MaxPool → Dropout(p = 0.1) → FC(1024) → ReLU → FC(512) → Dropout(p = 0.1) → FC(512) → σ. Numbers in brackets correspond to channels (CNN/BN) and size (FC). Activation functions were rectified linear units (ReLU) except σ for outputs (or softmax for CCE loss). The CIFAR10 dataset consists of 50000 training and 10000 test images (RGB) of size 32 × 32 from 10 classes [20]. Accuracies were computed from maximum decisions averaged (mean value) over 16 learning trials. Standard parameters were used without any further optimizations (ADAM optimizer as above, but initial learning rate 0.0001, minibatch size 64).
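For reference, a PyTorch sketch of this architecture is given below (an illustrative reconstruction, not the published code; the final fully connected layer is assumed here to map to the K = 10 classes):

import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 32x32 -> 16x16
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 16x16 -> 8x8
    nn.Dropout(p=0.05),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 8x8 -> 4x4
    nn.Dropout(p=0.1),
    nn.Flatten(),                                               # 256*4*4 = 4096 features
    nn.Linear(4096, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.Dropout(p=0.1),
    nn.Linear(512, 10),                                         # output size assumed: 10 classes
    nn.Sigmoid(),                                               # softmax instead for CCE loss
)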
Backpropagation involving automatic differentiation (as with Keras and PyTorch) requires implementing the power error loss function (10) with (30). All experiments involving Keras used the following Python code (for PyTorch replace K and tf by torch and clip_by_value by clamp):
import numpy as np, tensorflow as tf
from keras import backend as K

def powererrorloss_wrapper(n=3,N=1,_eps=1e-7):
    """ computes power error loss function eq.10 for q=n/N using eq.30
        it is assumed that n and N are positive integers with N=2^m being a power of two
        _eps: clip y and powers of y to interval [_eps;1-_eps] to avoid NaN etc. """
    def F(y):
        # eq.30: assumes that n is odd, N=2^m is a power of two for m>=0, and 0<y<1
        y=tf.clip_by_value(y,_eps,1.0-_eps)
        n_tilde = n%N                 # reduced numerator n mod N (\tilde{n} in eq.30)
        q_tilde = n//N                # n div N (\tilde{q} in eq.30)
        y_1divN = tf.clip_by_value(K.pow(y,1.0/N),_eps,1.0-_eps)
        y_2divN = tf.clip_by_value(K.pow(y,2.0/N),_eps,1.0-_eps)
        y_ntildedivN = tf.clip_by_value(K.pow(y,n_tilde/N),_eps,1.0-_eps)
        # (i) first line of F(y) in eq.30
        L = 0
        if n>=N:                      # if q>=1
            if N>=2: L=L+1.0/(n-q_tilde*N)
            for i in range(1,q_tilde): L=L+K.pow(y,i)*(1.0/(n-(q_tilde-i)*N))
            L=L*(-N*y_ntildedivN)
        # (ii) second line of eq.30
        tmp=1.0/tf.clip_by_value(1.0-y_1divN,_eps,1.0-_eps)     # abs(.) not necessary (clipping)
        if N>=2: tmp=tmp*(1.0+y_1divN)
        L=L+K.log(tmp)
        # (iii) third line of eq.30
        if N>=4:
            tmp=(n_tilde-1)//2                                  # exponent of (-1)^((n_tilde-1)/2)
            if ((tmp//2)*2) == tmp: sg=1                        # sign of (-1)^((n_tilde-1)/2)
            else: sg=-1
            L=L+sg*2.0*tf.atan(y_1divN)
        # (iv) fourth line of eq.30
        if N>=4:
            for k in range(1,N//4):
                tmp=tf.clip_by_value(1.0+2.0*y_1divN*np.cos(2.0*np.pi*k/N)+y_2divN,_eps,4.)
                tmp=tmp/tf.clip_by_value(1.0-2.0*y_1divN*np.cos(2.0*np.pi*k/N)+y_2divN,_eps,4.)
                L=L+np.cos(2.0*np.pi*k*n_tilde/N)*K.log(tmp)
            # (v) fifth line of eq.30
            for k in range(1,N//4):
                tmp=2.0*np.sin(2.0*np.pi*k/N)*y_1divN
                tmp=tf.atan(tmp/tf.clip_by_value(1.0-y_2divN,_eps,1.0-_eps))
                L=L+2.0*np.sin(2.0*np.pi*k*n_tilde/N)*tmp
        return L

    def powererrorloss(t,y):
        loss=(1.0-t)*F(y)+t*F(1.0-y)             # eq.10
        return K.mean(loss)                      # averaging to be consistent with other Keras losses

    # main program of powererrorloss_wrapper
    assert n>0 and isinstance(n,int),'n must be positive integer!'
    assert N>0 and isinstance(N,int),'N must be positive integer!'
    m,N_=0,N
    while N_>1:                                  # check if N is power of 2
        N_old=N_
        N_=N_//2
        assert N_*2==N_old,'N must be power of two!'
        m=m+1
    while ((n//2)*2==n) and (N>1):               # reduce n/N to an irreducible fraction
        n=n//2
        N=N//2
    return powererrorloss
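For example, a Keras model with sigmoid outputs can then be trained with the power error loss for q = 3/2 via model.compile(optimizer='adam', loss=powererrorloss_wrapper(n=3, N=2)).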
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Knoblauch, A. On the antiderivatives of xp/(1 − x) with an application to optimize loss functions for classification with neural networks. Ann Math Artif Intell 90, 425–452 (2022). https://doi.org/10.1007/s10472-022-09786-2
Keywords
- Supervised learning
- Classification
- Crossentropy
- Power error loss function
- Deep learning
- Incomplete beta function
- Hypergeometric function