On the antiderivatives of x^p/(1 − x) with an application to optimize loss functions for classification with neural networks

Supervised learning in neural nets means optimizing synaptic weights W such that outputs y(x; W) for inputs x match the corresponding targets t from the training data set as closely as possible. This optimization means minimizing a loss function L(W) that is usually motivated from maximum-likelihood principles, silently making prior assumptions on the distribution of output errors y − t. While classical cross-entropy loss assumes triangular error distributions, it has recently been shown that generalized power error loss functions can be adapted to more realistic error distributions by fitting the exponent q of a power function used for initializing the backpropagation learning algorithm. This approach can significantly improve performance, but computing the loss function requires the antiderivative of the function f(y) := y^{q−1}/(1 − y), which has previously been determined only for natural q ∈ ℕ. In this work I extend this approach to rational q = n/2^m where the denominator is a power of 2. I give closed-form expressions for the antiderivative ∫ f(y) dy and the corresponding loss function.
The benefits of such an approach are demonstrated by experiments showing that optimal exponents q are often non-natural, and that the error exponents q best fitting the output error distributions vary continuously during training, typically decreasing from large q > 1 to small q < 1 as learning converges. These results suggest new adaptive learning methods where loss functions are continuously adapted to the output error distributions during learning.


Introduction
Special functions like the beta, gamma, or hypergeometric functions have many applications in various domains including probability theory, computational chemistry, and statistical physics [2,8,39]. They are often used to express antiderivatives that are otherwise difficult to compute. Here I focus on the antiderivative being a limit case of the incomplete beta function B(y; a, b) := ∫₀^y t^{a−1} (1 − t)^{b−1} dt, defined for Re(a), Re(b) > 0 and y ∈ (0; 1). It is easy to see that F(y) can also be written in terms of the hypergeometric function (see Proposition 1), but for many applications it is desirable to have expressions in closed form using only common functions that can be automatically differentiated and efficiently computed around the poles (here y → 1). One such application is machine learning, where the antiderivative F(y) relates to loss functions like cross entropy (for q = 1) that are minimized to solve classification and related AI tasks [4]. This typically needs large amounts of annotated training data D := {(x_n, t_n) | n = 1, ..., N} for supervised learning of a prediction model with y_n := y(x_n; W) for inputs x_n ∈ ℝ^D and targets or labels t_n. Specifically, the learning task is to find "good" parameters W such that the model function y(x; W) applied to the inputs x_n reproduces the annotation labels t_n as closely as possible. The model performance can be quantified by a loss function L({(y_n, t_n) | n = 1, ..., N}) evaluating the differences between model outputs y_n and targets t_n. For example, for binary or multi-label classification tasks with binary labels t_nk ∈ {0, 1} a good choice is binary cross entropy (BCE), whereas for multi-class classification or regression problems we may use categorical cross entropy L_CCE := −Σ_{n=1}^N Σ_{k=1}^K t_nk log(y_nk) with one-hot coding (t_nk ∈ {0, 1}, Σ_k t_nk = 1), or sum-of-squared-error L_SSE := (1/2) Σ_{n=1}^N ||y_n − t_n||² with t_nk ∈ ℝ [4,7,11,17,22,27,35].
In the last decade, deep neural network models have become dominant for applications related to classification including object recognition and detection, image segmentation, speech understanding, autonomous driving, or robot control [12,14,21,30,34,38]. This success can be attributed to an improved understanding of large-scale deep neural architectures and solving earlier problems like vanishing gradients blocking learning progress [3]. For example, to overcome such problems, improved activation function, weight initialization, regularization, and optimization methods have been developed [10,13,16,18,31,37]. The current work complements these efforts by proposing a new family of improved loss functions based on the antiderivative (1) that enables continuous adaptation to training data or learning progress as explained in the following.
In deep neural networks, loss functions are typically minimized by gradient descent using the error backpropagation algorithm [4,5,28,32,33,40]: After forward-propagating an input vector x_n through the network, the backpropagation algorithm initializes so-called error signals

δ_nk := ∂L_n/∂a_nk    (4)

for each output unit with firing rate y_nk := σ_k(a_nk) and dendritic activation potential a_nk := Σ_j W_kj z_nj, computed from a typically sigmoidal activation function σ_k. Similarly, for each hidden neuron j, the firing rate z_nj := σ_j(a_nj) with a_nj := Σ_i W_ji z_ni is computed recursively in the forward pass, where z_ni := a_ni := x_ni for input units. After the initialization (4), error signals are backpropagated through the transposed synaptic weights Wᵀ towards the input layer using the recursion δ_nj = h′(a_nj) Σ_k W_kj δ_nk. After this backward pass, each neuron j knows both its firing rate z_nj and its error signal δ_nj. With this, each synapse W_ji can compute its partial derivative ∂L_n/∂W_ji = (∂L_n/∂a_nj)(∂a_nj/∂W_ji) = δ_nj z_ni as the product of postsynaptic error signal and presynaptic firing rate, and thus the corresponding weight change according to (stochastic) gradient descent is

ΔW_nji := −η ∂L_n/∂W_ji = −η δ_nj z_ni    (5)

where η > 0 is the learning rate. Thus, the initialization (4) determines synaptic learning (5) and should therefore be chosen as simple as possible for the sake of biological plausibility and computational efficiency.
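The forward pass, the error-signal initialization (4), the backward recursion, and the weight update (5) can be sketched for a tiny 1-2-1 network in plain Python. This is a minimal illustration with SSE loss and sigmoid units; the network shape, the names, and the loss choice are my own assumptions, not the paper's implementation:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2):
    a1 = [w * x for w in W1]               # hidden pre-activations a_j
    z1 = [sigmoid(a) for a in a1]          # hidden firing rates z_j
    a2 = sum(w * z for w, z in zip(W2, z1))
    y = sigmoid(a2)                        # output firing rate
    return a1, z1, a2, y

def backprop_step(x, t, W1, W2, eta=0.1):
    a1, z1, a2, y = forward(x, W1, W2)
    # error-signal initialization: delta_k = dL/da_k; for SSE + sigmoid
    # this is (y - t) * sigma'(a) with sigma'(a) = y (1 - y)
    delta2 = (y - t) * y * (1.0 - y)
    # backward recursion through the transposed weights:
    # delta_j = h'(a_j) * sum_k W_kj delta_k  (here a single output k)
    delta1 = [z * (1.0 - z) * w * delta2 for z, w in zip(z1, W2)]
    # weight change = -eta * postsynaptic error signal * presynaptic rate
    W2_new = [w - eta * delta2 * z for w, z in zip(W2, z1)]
    W1_new = [w - eta * d * x for w, d in zip(W1, delta1)]
    return W1_new, W2_new
```

A finite-difference check on such a toy network confirms that the product δ_nj z_ni indeed equals the partial derivative ∂L_n/∂W_ji.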
Indeed, for the three most commonly employed settings of I) regression with L_SSE and linear outputs y_nk = a_nk ∈ ℝ, II) binary classification with L_BCE and sigmoidal outputs y_nk = σ(a_nk) ∈ (0; 1) with the logistic sigmoid σ(a) := 1/(1 + e^{−a}), and III) categorical (multi-class) classification with L_CCE and softmax outputs y_nk = S_k(a_n) := exp(a_nk)/Σ_j exp(a_nj) ∈ (0; 1), the initialization (4) becomes simply the difference

δ_nk = y_nk − t_nk = −ε_nk  for  ε_nk := t_nk − y_nk    (6)

between model outputs and targets, that is (up to the sign), the output error ε_nk. However, as I have argued in previous work [19], such settings do not always maximize likelihood or other relevant performance measures like accuracy, as they rely on unrealistic prior assumptions like a triangular distribution of output errors, which is often not fulfilled. Therefore, a novel, more general initialization of error signals in the output layer has been suggested (see eq. 3.1 in [19]) using powers of the output errors with exponent q > 0. Interestingly, this new initialization method can significantly speed up learning and improve convergence of the backpropagation algorithm by adapting the exponent q to the true distribution of output errors [19]. However, many software platforms for machine learning like Keras, Tensorflow, and PyTorch [1,6,29] do not directly initialize error signals as in (6), (7), but instead compute gradients via automatic differentiation [23] of the loss function. For this we require an explicit symbolic representation of the corresponding loss functions which, as we will see, involves integrating (4) for (7), i.e., determining appropriate expressions for the antiderivative (1). While for the special case q ∈ ℕ this problem is easy to solve, and corresponding loss functions have already been determined previously [19], it is more demanding to integrate (7) for general q ∈ ℝ⁺, and the corresponding loss functions have been unknown so far.
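The cancellation behind (6), where the sigmoid derivative y(1 − y) cancels against the factor 1/(y(1 − y)) from the BCE gradient so that δ = y − t, can be checked numerically. A self-contained sketch (function names are illustrative):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bce(y, t):
    # binary cross entropy for a single output unit
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def delta_numeric(a, t, eps=1e-7):
    # dL/da by central differences, i.e. the error signal of (4)
    return (bce(sigmoid(a + eps), t) - bce(sigmoid(a - eps), t)) / (2 * eps)

# for BCE + logistic sigmoid the error signal reduces to y - t, as in (6)
for a in (-2.0, 0.3, 1.5):
    for t in (0.0, 1.0):
        y = sigmoid(a)
        assert abs(delta_numeric(a, t) - (y - t)) < 1e-5
```

The same cancellation happens for SSE with linear outputs and for CCE with softmax outputs, which is why all three classical settings share the simple initialization (6).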
However, continually adapting q to training data and learning progress with arbitrary distributions of output errors also requires non-natural exponents q ∉ ℕ, including the case 0 < q < 1.
In this paper I compute the loss functions that correspond to the power function initialization of error signals (7) for rational exponents of the form q = n/2^m, where the numerator is a positive integer and the denominator is a power of 2 (n ∈ ℕ, m ∈ ℕ₀). With this it becomes possible to approximate the loss functions corresponding to (7) for any q ∈ ℝ⁺ with arbitrary precision. To this end, the paper is organized as follows: Section 2 shows that determining the loss functions corresponding to (7) for binary classification problems involves the antiderivative (1) and briefly recapitulates the solution for q ∈ ℕ. Section 3 determines (1) for the more general case of positive rational exponents q = n/2^m, where the most convenient final form is given by (30) in Theorem 3. Section 4 shows results from numerical learning experiments verifying correctness and demonstrating the benefits of the new loss functions. Finally, Section 5 gives a summary and discussion of the results.

Generalized loss functions for exponents q ∈ ℕ
In order to determine the generalized loss function that is minimized by the backpropagation algorithm, we have to integrate (4) using the generalized error signal initialization (7). Here it is sufficient to consider the loss contribution L_nk of one output unit y_nk = σ_k(a_nk) after presenting the training vector tuple (x_n, t_n). Thus, up to an additive constant, the loss function corresponding to the generalized initialization (7) with exponent q > 0 follows by integrating over a_nk, where the substitution y_nk = σ_k(a_nk) with the derivative dy_nk/da_nk = σ′_k(a_nk) turns this into an integral over y_nk. For the most common case of logistic sigmoids in the output layer, σ_k(a_nk) = σ(a_nk) ∈ (0; 1) for σ(a) := 1/(1 + e^{−a}) with σ′(a) = σ(a)(1 − σ(a)), we have σ′_k(a_nk) = σ(a_nk)(1 − σ(a_nk)) = y_nk(1 − y_nk), and thus, skipping variable indices y := y_nk and t := t_nk for brevity, the loss reduces to integrals of the form (1), where the expression for t = 1 follows with the substitution y → 1 − y and corresponds to an improper integral with diverging F(1) → ∞. This shows that in order to determine the loss function for generalized error signal initialization, we have to find the antiderivative (1), where we choose the additive constants in each case such that the resulting loss is zero for correct predictions:

Theorem 1 (Loss for power function error initialization) For feed-forward neural networks using the logistic sigmoid function y = σ(a) ∈ (0; 1) in the output layer, the loss function corresponding to the power function initialization (7) of the error signals for backpropagation with exponent q > 0 is given by (10), where F(y) = ∫₀^y y′^{q−1}/(1 − y′) dy′ is the antiderivative (1), a limit case of the incomplete beta function.
In particular, the resulting loss function has zero baseline and is symmetric.

Proof: The theorem follows immediately from (9) by merging the two cases t = 0 and t = 1 and noting that F(0) = 0, such that we just have to skip the offset F(1) for t = 1 in order to get zero loss in case the neural network makes a correct prediction y = t ∈ {0, 1}.
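For intuition, Theorem 1's loss can be sketched numerically: expanding 1/(1 − y′) as a geometric series and integrating term by term gives F(y) = Σ_{k≥0} y^{q+k}/(q + k) for 0 ≤ y < 1, and a zero-baseline, symmetric loss is obtained as F(y) for t = 0 and F(1 − y) for t = 1. The truncation level and tolerances below are illustrative choices, not part of the paper:

```python
import math

def F(y, q, terms=5000):
    # truncated series for F(y) = integral_0^y t^(q-1)/(1-t) dt, 0 <= y < 1
    return sum(y ** (q + k) / (q + k) for k in range(terms))

def power_error_loss(y, t, q):
    # zero for a correct prediction y = t in {0, 1}, symmetric in (y, t)
    return F(1.0 - y, q) if t == 1 else F(y, q)

# for q = 1 the loss must reduce to binary cross entropy,
# since F(y) = -log(1 - y) in that case
for y in (0.1, 0.5, 0.9):
    assert abs(power_error_loss(y, 0, 1.0) - (-math.log(1.0 - y))) < 1e-6
    assert abs(power_error_loss(y, 1, 1.0) - (-math.log(y))) < 1e-6
```

The q = 1 check confirms that classical BCE is recovered as the special case of the power error loss.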
Therefore, the remainder of the paper deals mainly with determining closed-form expressions for the antiderivative F(y). This is particularly easy for natural exponents q ∈ ℕ [19]:

Theorem 2 (Power error loss for natural exponent q ∈ ℕ) For q ∈ ℕ the antiderivative (1) and the corresponding loss function of Theorem 1 are given by (12), (13) and (14), (15), respectively.

For the proof of this and the following Theorems and Propositions see Appendix A. It is convenient to rewrite (15) in terms of another set of coefficients b_i^{(q)} as in (16), where the coefficients b_0^{(q)}, ..., b_{q−1}^{(q)} can be precomputed as shown by Table 1 in Appendix A for q = 1, 2, ..., 12.

The antiderivative ∫ y^{q−1}/(1 − y) dy for real and rational exponents q > 0

We see from (10) that computing the generalized loss functions L^{(q)}_nk(y, t) requires the antiderivative (1). It is easy to verify that F(y) can be expressed in terms of the hypergeometric function ₂F₁(a, b; c; z) := Σ_{n=0}^∞ (a)_n (b)_n / (c)_n · z^n / n!, with (x)_n := ∏_{i=0}^{n−1} (x + i) being rising Pochhammer symbols:

Proposition 1 (F(y) for real q and outputs |y| < 1) For q ∈ ℝ \ {0, −1, −2, ...} and |y| < 1 we have F(y) = (y^q/q) · ₂F₁(q, 1; q + 1; y) = Σ_{k=0}^∞ y^{q+k}/(q + k)  (18).

Like with the incomplete beta function in (1), expressing F(y) in terms of the limit of an infinite sum is not viable, as current software libraries employing automatic differentiation (like Tensorflow or PyTorch) cannot efficiently handle such expressions. For example, using (18) to approximate F(y) → ∞ for y → 1 would require summing a very large number of terms (as each term except the first one is < 1). Instead, we have to find finite expressions for F(y) in terms of common functions that have simple derivatives. With computer algebra systems (CAS) like Mathematica or Maxima [24,26] it is possible to further explore (17). Trying some particular values shows that for rational exponents q = n/N with n, N ∈ ℕ appropriate antiderivatives exist in closed form. However, for larger values of the denominator N the results of the CAS are inconvenient and involve complicated sums over complex roots.
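The convergence problem near the pole can be made concrete by counting series terms. An illustrative sketch (the tolerance and the term cap are arbitrary choices):

```python
# Why the series (18) is impractical near the pole y -> 1: count how many
# terms of sum_k y^(q+k)/(q+k) remain above a tolerance before truncation.

def terms_needed(y, q=0.5, tol=1e-6, cap=10**7):
    k = 0
    # terms y^(q+k)/(q+k) decrease monotonically in k for 0 < y < 1
    while y ** (q + k) / (q + k) > tol:
        k += 1
        if k >= cap:
            break
    return k

n_mid = terms_needed(0.5)     # far from the pole: a handful of terms
n_near = terms_needed(0.999)  # near the pole: thousands of terms
assert n_near > 50 * n_mid
```

The term count explodes as y → 1, which is why finite closed-form expressions in common functions are needed instead of the series.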
Still, the results are relatively simple if N = 2^m is a power of 2. Therefore we focus on the case q = n/2^m ∈ ℚ⁺ for n ∈ ℕ, m ∈ ℕ₀, which is sufficient to approximate F(y) for any q ∈ ℝ₀⁺ with arbitrary precision. We start with the case 0 < q < 1 and 0 ≤ y < 1 and then generalize to y ≥ 0 and q > 0. Trying the CAS for some special cases q = n/2^m ∈ (0; 1) with n, m ∈ ℕ and 0 < n < 2^m =: N yields the correct hypothesis (19) for the antiderivative F̃(y), where j is the imaginary unit and the sum is over the N := 2^m-th complex unit roots Z = e^{j2πk/N} for k = 0, 1, ..., N − 1. Thus F̃(y) corresponds to the (N − n)-th value of the Discrete Fourier Transform (DFT) of the discrete N-periodic signal s[k] := log(y^{1/N} − e^{j2πk/N}).
Via (17), Proposition 2 provides closed-form expressions for F(y) for 0 < q, y < 1 that involve complex numbers. However, by definition (1), there should exist an equivalent real-valued representation for F(y) ∈ ℝ. In the following we simplify the computation of F(y) by reducing the number of terms and eliminating complex-valued expressions:

Proposition 3 (Real-valued representation of F(y) for 0 < q < 1 and 0 ≤ y < 1) Let q = n/2^m ∈ (0; 1) with n, m ∈ ℕ and 0 ≤ y < 1. Then (17) holds with (20) and (21), where r_k and φ_k must be computed from (37) or (40). For the remaining case N = 2, corresponding to q = 1/2, a separate closed-form expression holds.

If q = n/N = n/2^m ∈ (0; 1) is in reduced form, we gain a considerable simplification: then n is odd for any m ≥ 1, we have (−1)^n = −1, and the case distinctions involved in computing r_k and φ_k become aligned, yielding terms of the form 2 sin(2πkn/N) · arctan(2 sin(2πk/N) y^{1/N} / (1 − y^{2/N})).

So far we have determined the antiderivative F(y) for 0 ≤ y < 1, which is sufficient for computing loss functions for binary classification, where y corresponds to a class probability. For other applications it may be useful to include also the case y > 1, again with a separate expression for the remaining case N = 2 corresponding to q = 1/2. Let us now re-address the antiderivative ∫ y^{q−1}/(1 − y) dy for q = n/N > 1 and y ≥ 0: this yields (28), whereas for N = 2 a simpler expression holds. To account for the different ranges of the variable y and the exponent q, we can finally merge the results of Theorem 2 and Propositions 5 and 6 into a unifying theorem using the Heaviside function H(y) and the discrete Dirac function:

Theorem 3 The antiderivative F(y) of f(y) = y^{q−1}/(1 − y) is given by (30), where the last two sums are relevant only for N ≥ 2³ = 8. F(y) is strictly increasing for 0 ≤ y < 1 and strictly decreasing for y > 1.
Note that, by adding a constant C := −(−1)^{(ñ−1)/2} · π = (−1)^{(ñ+1)/2} · π for N ≥ 4 and y > 1, we obtain an equivalent antiderivative F̂(y) for f(y) with the same properties as F(y) in Theorem 3, except that F̂(y) has a unique limit lim_{y→∞} F̂(y) = 0 for 0 < q < 1. Note also that C corresponds to the constant (50) that we have skipped previously to get F(0) = 0 for all q > 0. Figure 1 illustrates f(y) and F(y) from (30) for different values of the exponent q. Note that the dependency of F(y) on q is monotonic for 0 ≤ y < 1, but non-monotonic and even discontinuous for finite 1 < y < ∞, where monotonicity and continuity are restored in the limit of very large y → ∞ due to (31).
I have also verified F(y) from (30) by numerical differentiation using Matlab [25] with variable-precision arithmetic (function vpa with a precision of 500 decimal digits) to compute relative errors between the numerical derivative F′_num(y) := (F(y + Δy) − F(y))/Δy and f(y) := y^{q−1}/(1 − y), sampling from y ∈ [y₀; 1 − y₀] ∪ [1 + y₀; 1000]. The relative errors were largest around the poles at y ≈ 0 and y ≈ 1, whereas apart from the poles they generally decreased for larger y and increased for larger q. For minimal pole distance y₀ = 10⁻⁶ and difference Δy = 10⁻²⁰, the relative error for q ≤ 50 was always below 10⁻¹², thus confirming Theorem 3.
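A small pure-Python analogue of this finite-difference check can be written using the series form of F away from the poles instead of the closed form (30); the step size, truncation, and tolerance below are illustrative:

```python
# Verify F'(y) = f(y) = y^(q-1)/(1-y) by central differences,
# with F computed from its truncated series (valid for 0 <= y < 1).

def F(y, q, terms=2000):
    return sum(y ** (q + k) / (q + k) for k in range(terms))

def f(y, q):
    return y ** (q - 1) / (1.0 - y)

q, y, h = 0.5, 0.5, 1e-6
F_num = (F(y + h, q) - F(y - h, q)) / (2 * h)
rel_err = abs(F_num - f(y, q)) / abs(f(y, q))
assert rel_err < 1e-5
```

As in the Matlab experiment, the agreement degrades near the poles y ≈ 0 and y ≈ 1, where the series converges slowly and the difference quotient amplifies rounding errors.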

Applying the generalized loss function to neural networks
The purpose of the following experiments is, first, to verify Theorem 1 with (30) as used for implementing L^{(q)}_nk in Appendix B, and, second, to demonstrate the usefulness of the loss function L^{(q)}_nk for general q > 0. For the verification, I have implemented a simple recurrent neural network (Simple RNN; see Fig. 2A) for sequence classification and trained it on the Embedded Reber Grammar data set (see Appendix B for details; cf. [19]). Figure 2 shows results from single-trial learning experiments using fixed standard parameters (initial learning rate η₀ = 0.001 and minibatch size MBS = 4) without any further hyperparameter optimization: Fig. 2B shows test error as a function of learning epoch for different error exponents q ∈ {1/4, 1/2, 1, 711/256 ≈ 2.777, 3157/512 ≈ 6.166}. The exponents q have been chosen to test all (conditional) terms of (30) in sufficient detail. It can be seen that the two implementations (Keras vs. custom) yield very similar, but not identical results. At least in the initial phase of learning, test errors are virtually identical for both implementations, suggesting the correctness of (10), (30), and the implementation in Appendix B. To further confirm correctness, Fig. 2C shows for the custom implementation the maximum relative error of gradients estimated from backpropagation compared to computing gradients from the partial derivatives of the loss function with respect to all synaptic weights (where the maximum is over all partial gradient vectors for synaptic connections A, B, U, and the two bias vectors of u and y). It can be seen that, at least initially, all relative gradient errors are below 10⁻⁶, which finally confirms the correctness of (10), (30), and the implementation in Appendix B. During learning, relative errors typically increase, but are always below 10⁻⁴. The increase is most pronounced for large q ≫ 1 or large denominators of q. This increase may be explained by steeper loss surfaces for q ≫ 1 and increasing numerical errors due to increasing numbers of mutually canceling terms in (30).

Fig. 2 Single-trial learning experiments to verify the formulas for the generalized power error loss functions (10) and (30). A: Architecture of the Simple Recurrent Neural Network model. B: Test errors for the Embedded Reber Grammar data set as obtained from neural network implementations using either Keras (solid; automatic differentiation of (10) with (30)) or a custom neural network library (dotted; backpropagation algorithm using (5), (7)). C: Maximal relative error between gradients computed with backpropagation (as in (B)) and a naive estimation of partial derivatives from differential quotients (adding δ = 10⁻⁸ to each synaptic weight). D: Estimated power error exponent q̂ obtained from the (absolute) output error distributions of the experiments in (B), represented as histograms with 10 equally spaced bins. For each histogram, q̂ is estimated by selecting the best-fitting theoretical histogram (minimal Euclidean distance) obtained from (33) for q ∈ {1/8, ..., 7/8, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4, ..., 20} (cf. Fig. 4). All experiments employ identical non-optimized standard hyperparameters (ADAM optimizer, η₀ = 0.001, minibatch size 8), identical initial synaptic weights (Glorot/Xavier uniform), and identical presentation order of training data.
Although hyperparameters have not yet been optimized, Fig. 2B shows the existence of an optimal error exponent somewhere between q = 1 and q = 7. In particular, learning for q = 711/256 ≈ 2.777 and q = 3157/512 ≈ 6.166 reaches an error count < 300 faster by a factor of 1.5-2 than for binary cross entropy loss (q = 1). This is consistent with previous results evaluating more complex network models involving LSTM layers and integer q ∈ ℕ (see Figs. 8 and 9A in [19]). Figure 3 shows corresponding results after optimizing the hyperparameters of initial learning rate (η₀) and minibatch size (MBS), and averaging over 16 learning trials (by taking medians, similar to previous works [19]). For all tested q ∈ {0.25, 0.5, ..., 2.75, 3, 3.5, ..., 9, 12, 15} it was possible to reach zero average test error (see also remarks in Appendix B). Therefore, the first epoch number reaching zero average errors was used as a criterion for optimizing hyperparameters. The best exponent q = 1.75 reached zero errors after 4.12 learning epochs, whereas q = 1 (BCE) required 10.63 epochs. Thus, optimizing the exponent q of the generalized power error loss function (10) yields a factor > 2.5 improvement in learning time. This demonstrates that optimal exponents q may in general be non-integer. Still, there is a broad range of q between 1.25 and 9 where learning performance improves significantly compared to classical BCE. Note also that optimal hyperparameters are quite independent of q, mostly being η₀ = 0.005 and MBS = 1. This suggests that optimizing q may cause only little additional cost during hyperparameter optimization.
To understand the potential usefulness of the general case q > 0, let us reconsider a relationship found in [19] between the loss functions L^{(q)}_nk(y, t) from (10) and the corresponding distributions r(ε) of output errors ε := t − y ∈ (−1; 1) defined by (6): Specifically, L^{(q)}_nk(y, t) turns out to be optimal in maximizing the likelihood of the classification model if output errors are distributed with the density function (33) (see [19], eq. 5.7), where p_t are the prior class probabilities that an input belongs to class t, r_t(ε) are the conditional output error densities given t, and C_t := p_t / ∫₀¹ e^{−L^{(q)}_nk(y,0)} dy are the corresponding normalization constants. While [19] has computed r(ε) only for q ∈ ℕ, we can now use (30) to approximate r(ε) for any q ∈ ℝ⁺ with arbitrary precision. Figure 4 shows the output error distributions r(ε) for some values of q. It can be seen that q ≫ 1 corresponds to a uniform (rectangular) distribution, q = 1 to a linear (triangular) distribution, and q < 1 to distributions where most output errors are close to zero. This suggests two hypotheses about the relation between learning progress, error distributions, and an optimal choice for the exponent q of L^{(q)}_nk. First, for any reasonable loss function, the exponent parameter q̂ best fitting the current error distribution should decrease with learning progress from large values q̂ > 1 towards small values q̂ < 1. This is confirmed by Fig. 2D: For all investigated loss functions, the best fit q̂ decreases with training epochs. While q̂ = 4 for initial synaptic weights, most error distributions have q̂ < 1 after 5 learning epochs (see Appendix B for further details). Second, adapting the error exponent q of the loss function L^{(q)}_nk during learning to the distribution of output errors should improve learning performance.
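The q̂ estimation used for Fig. 2D can be sketched as follows, assuming a conditional error density proportional to exp(−F(|ε|)) in the spirit of (33) for t = 0; the series truncation, integration grid, bin count, and candidate set are my own illustrative choices:

```python
import math

def F(y, q, terms=1000):
    # truncated series F(y) = sum_k y^(q+k)/(q+k), adequate away from y = 1
    return sum(y ** (q + k) / (q + k) for k in range(terms))

def theoretical_hist(q, bins=10, grid=20):
    # bin probabilities of a density proportional to exp(-F(|e|)) on [0, 1),
    # integrated per bin with a simple midpoint rule and then normalized
    probs = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        xs = [lo + (hi - lo) * (i + 0.5) / grid for i in range(grid)]
        probs.append(sum(math.exp(-F(x, q)) for x in xs) / grid)
    s = sum(probs)
    return [p / s for p in probs]

def fit_q(empirical, candidates=(0.25, 0.5, 1.0, 2.0, 4.0, 8.0)):
    # pick the candidate exponent whose theoretical histogram is closest
    # in Euclidean distance to the empirical absolute-error histogram
    def dist(q):
        th = theoretical_hist(q)
        return sum((a - b) ** 2 for a, b in zip(empirical, th))
    return min(candidates, key=dist)
```

For q = 1 the density exp(−F(y)) = 1 − y is triangular, matching classical cross entropy, and fitting that histogram recovers q̂ = 1; large q gives a nearly uniform histogram, and q < 1 concentrates the mass near zero error.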
Although a thorough investigation of the latter hypothesis is beyond the scope of the current work, Fig. 5 shows results for a simplified setting employing a Convolutional Neural Network of moderate depth classifying the CIFAR-10 data set after 15 training epochs (see Appendix B for details). In previous works, employing the power error loss function with fixed q in similar networks improved learning only marginally [19]. In the current experiments, the exponent parameter q of L^{(q)}_nk can be adapted once after 5 training epochs. For the control experiments with fixed q the results are in line with the previous findings: The case q > 1 improves learning to some degree, whereas q < 1 typically impairs learning performance. However, loss functions with adaptation, employing q > 1 in the early learning phase (epochs 1-5) and q < 1 in a later phase (epochs 6-15), can significantly improve accuracy (e.g., from 0.855 for fixed BCE or q = 1 to 0.863 for early q = 60/8 = 7.5 and late q = 0.5). By contrast, employing the reverse order (early q < 1 and late q > 1) impairs learning. This confirms the second hypothesis and shows that the case q < 1 can be useful if employed in a later training phase.

Summary and discussion
Motivated from classification applications with neural networks, this work gives closed-form expressions for the antiderivative F(y) of the function f(y) = y^{q−1}/(1 − y) defined in (1), where the exponent q = n/N should be rational with n ∈ ℕ and N a power of 2. The most general and convenient form for F(y) is given by (30) in Theorem 3. The special case for q ∈ ℕ simplifies to (12) or (13) in Theorem 2 and has already been discussed in prior work [19]. Other intermediate representations involving complex roots and further special cases are given by Propositions 2-6. In principle, it would be possible to extend the range of exponents q = n/N to more general forms with N ∈ ℕ being an arbitrary integer, but this seems to lead to much more inconvenient formulas. As q = n/N for N = 2^m can approximate any rational or real-valued exponent with arbitrary precision, the current results seem sufficient for most applications.

Fig. 5 Results for BCE and CCE are given for reference. B: Test accuracy after 15 training epochs as a function of q for the cases of fixed q (qx15), changing q at epoch 5 for the remaining epochs to 4/8 (qx5+4/8x10), and changing q in the reverse order (4/8x5+qx10). Note that starting with large q > 1 and then changing to small q < 1 (but not the reverse order) can significantly improve performance.
Here I have considered a neural network application involving binary classification with logistic sigmoidal output units. For this network type, maximum-likelihood optimization is equivalent to minimizing the power error loss function (10) of Theorem 1 with the antiderivative F(y) from Theorem 3. There, the exponent q can be related to the distribution of output errors (Fig. 4) and the initialization of error signals (7) for backpropagation learning [19]. Although knowing the correct loss function is actually not necessary for a custom gradient descent implementation based on error backpropagation with the power error initialization (7), modern neural network libraries like Keras, Tensorflow, and PyTorch [1,6,29] employ automatic differentiation [23] of the loss function to determine gradients for learning synaptic weights. Therefore, Theorems 1 and 3, together with the Python-based implementation of the power error loss function in Appendix B, enable using such libraries for neural network learning with power error initialization.
For the special case of natural exponents q ∈ ℕ, this power error loss function has been derived and evaluated already in previous work [19]. There it has also been shown that optimizing q can significantly improve learning performance and convergence over various classical loss functions (like BCE, CCE, SSE), in particular for binary classification tasks in deep or recurrent networks. The current work extends these previous results to rational error exponents q = n/2^m > 0. Numerical and learning experiments have verified the correctness of Theorems 1 and 3 and the implementation of (10) and (30) in Python for Keras given in Appendix B. The experiments confirm that the usual outcome of optimizing error exponents is at least a moderate improvement of learning performance and convergence compared to cross entropy (q = 1), where the optimal q is typically larger than one and non-integer. Moreover, they show that adaptive loss functions decreasing q to values below 1 during learning may provide significant further improvements. A more thorough investigation of a continuous adaptation of q to the current distribution of output errors should be done in future work.

Appendix A: Proofs and supplements of Sections 2 and 3
Proof of Theorem 2: By iterated polynomial division it is easy to verify that for n ∈ ℕ

y^n/(y − 1) = y^{n−1} + y^{n−2} + ... + y + 1 + 1/(y − 1)    (34)

and for q ∈ ℕ and y ∈ (0; 1) term-wise integration therefore gives F(y) := ∫₀^y y′^{q−1}/(1 − y′) dy′ = −log(1 − y) − Σ_{i=1}^{q−1} y^i/i, showing (12). The second form (13) has been used previously [19] and is given here for completeness. The two forms are equivalent, as (13) satisfies F(0) = 0 and, with the binomial sum, has the correct derivative F′(y). Then (14) follows from inserting (12) into (10), and, expanding the powers of (−y) via binomial coefficients, the sum in (14) can be written as a polynomial Σ_i a_i^{(q)} y^i, from which we can read off the polynomial coefficients a_i^{(q)}. Table 1 gives examples for the alternative coefficients b_i^{(q)} of (16).

Proof of Proposition 2: We have to show that F̃′(y) = f(y), which follows by direct term-wise differentiation,
proving (19), (20). We still have to prove (21), because determining the generalized loss function (10) with (17) involves subtracting F̃(0): this follows with a geometric-type sum (which can easily be proved by induction) and the complex logarithm log(jr) taken in the principal sheet with imaginary part in (−π; π].
Merging this with Proposition 4 gives almost immediately Proposition 5: Skipping the constant C in (49) and then comparing to (25) reveals that, after taking absolute values |1 − y^{1/N}|, both (25) for 0 ≤ y < 1 and (49) for y > 1 represent the same function, which is unified by (26). The case N = 2 corresponding to (27) can be shown as in (23).
Finally, the monotonicity of F(y) follows from f(y) > 0 for 0 < y < 1 and f(y) < 0 for y > 1, and the limits and asymptotic expressions (31) are easily verified by inspecting each case. In particular, F(0) = 0 follows from 0^i = 0, log(1) = 0, and arctan(0) = 0.

Appendix B: Implementation details

The Simple RNN of Fig. 2A has D inputs, M = 10 hidden units, and K outputs. The layers are linked by dense connections A, U, B and include also bias weights for layers u, y. Activation functions are tanh for u and the logistic sigmoid σ for y. Synaptic connections are initialized by uniform Xavier [10]. Training used the ADAM optimizer [18,31] with standard parameters β₁ = 0.9, β₂ = 0.999. Experiments used either Keras 2.2.5 with a Tensorflow 1.14.0 backend [1,6] or a custom neural network library for backpropagation. While Keras computes gradients based on automatic differentiation [23] of the loss function (see the Python code below for the power error loss (10) with (30)), the custom implementation uses (5) with the power error initialization (7). The Embedded Reber Grammar problem is to predict the next output symbol of a finite automaton with non-deterministic state transitions [15]. Inputs are symbol sequences x_n(1), x_n(2), ..., x_n(T) generated by the automaton, representing each of the D = 7 symbols by a one-hot input vector x_n(τ). The output to be predicted at time τ is the next symbol x_n(τ + 1) generated by the automaton in the next time step (K = 7). Due to the non-determinism, target vectors t_n(τ) can have multiple one-entries, one for each possible output symbol. The network decision ŷ_n(τ) is evaluated as correct if ŷ_n(τ) = t_n(τ) at decision threshold 0.5. Learning used N = 2048 sequences (90% training, 10% validation/testing). The average sequence length is T = 12 (maximum 40).
Remarks to Fig. 3: Reaching zero median errors suggests that all 16 learning trials of an experiment converged to zero error, while previous works reported problems of simple RNNs solving this data set [9,15]. However, as a zero median only means that at least half of the trials reached zero errors, I have also analyzed mean test errors (instead of medians), which likewise reached zero within the total learning duration of 65 epochs for all exponents 0.5 ≤ q ≤ 15 (data not shown). The profile of minimal epoch numbers until zero test errors was similar to Fig. 3B, although absolute values were larger by a factor of about 2-2.5 (best was 9 epochs for q = 1.5 vs. 26 epochs for q = 1). The discrepancy with earlier works may be explained by their suboptimal initialization of synaptic weights, causing either vanishing or exploding gradients [3,15]. Additional experiments (data not shown) revealed that at least for q > 1 learning can still reach zero error, even if initial weights deviate substantially (by factors between 0.125 and 3) from Xavier initialization. This is consistent with the idea that, for q > 1, the power error loss provides a better gradient-to-loss ratio and thereby avoids flat loss landscapes and vanishing gradients [19].

Conflict of Interests
The author declares that he has no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.