Abstract
Supervised learning in neural nets means optimizing synaptic weights W such that outputs y(x;W) for inputs x match the corresponding targets t from the training data set as closely as possible. This optimization means minimizing a loss function \({\mathscr{L}}(\mathbf {W})\) that is usually motivated by maximum-likelihood principles, silently making prior assumptions on the distribution of the output errors y − t. While classical crossentropy loss assumes triangular error distributions, it has recently been shown that generalized power error loss functions can be adapted to more realistic error distributions by fitting the exponent q of a power function used for initializing the backpropagation learning algorithm. This approach can significantly improve performance, but computing the loss function requires the antiderivative of the function \(f(y) := y^{q-1}/(1-y)\), which has previously been determined only for natural \(q\in \mathbb {N}\). In this work I extend this approach to rational \(q = n/2^{m}\) where the denominator is a power of two. I give closed-form expressions for the antiderivative \({\int \limits } f(y)\, dy\) and the corresponding loss function. The benefits of such an approach are demonstrated by experiments showing that optimal exponents q are often non-natural, and that the error exponents q best fitting output error distributions vary continuously during learning, typically decreasing from large q > 1 to small q < 1 as learning converges. These results suggest new adaptive learning methods where loss functions could be continuously adapted to output error distributions during learning.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016)
AlAhmad, R., Almefleh, H.: Antiderivatives and integrals involving incomplete beta functions with applications. Aust. J. Math. Anal. Appl. 17(2), 11 (2020)
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Bryson, A.E., Ho, Y.C.: Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell, New York (1969)
Chollet, F.: Keras. https://github.com/fchollet/keras (2015)
Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)
Ferreira, C., Lopez, J., Perez Sinusia, E.: Uniform representations of the incomplete beta function in terms of elementary functions. Electron. Trans. Numer. Anal. 48, 450–461 (2018)
Gers, F., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y., Titterington, M. (eds.) Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp 249–256. JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy (2010)
Good, I.: Some terminology and notation in information theory. Proceedings of the IEE - Part C: Monographs 103(3), 200–204 (1956)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http://www.deeplearningbook.org (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.90 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp 448–456. PMLR, Lille, France (2015)
Janocha, K., Czarnecki, W.: On loss functions for deep neural networks in classification. arXiv:1702.05659 (2017)
Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) Proceedings of the 3rd International Conference on Learning Representations (ICLR), arXiv:1412.6980v9 (2015)
Knoblauch, A.: Power function error initialization can improve convergence of backpropagation learning in neural networks for classification. Neural Comput. 33(8), 2193–2225 (2021)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., Department of Computer Science University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf, vol. 25. Curran Associates, Inc. (2012)
Krotov, D., Hopfield, J.: Dense associative memory for pattern recognition. arXiv:1606.01164 (2016)
Linnainmaa, S.: Taylor expansion of the accumulated rounding error. BIT Numer. Math. 16(2), 146–160 (1976)
Mathematica: Version 12.3.1. Wolfram Research, Inc., Champaign, IL (2021). https://www.wolfram.com/mathematica
MATLAB: version 9.7.0.1247435 (R2019b). The MathWorks Inc., Natick, Massachusetts (2019)
Maxima: Maxima, a computer algebra system. version 5.43.2 (2020). http://maxima.sourceforge.net/. See also www.integral-calculator.com
Palm, G.: Novelty, Information and Surprise. Springer, Berlin (2012)
Parker, D.: Learning-logic: casting the cortex of the human brain in silicon. Tech. Rep. Tr-47, Center for Computational Research in Economics and Management Science. MIT Cambridge MA (1985)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’AlchéBuc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf, pp 8024–8035. Curran Associates, Inc. (2019)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, vol. 9351, pp 234–241. Springer, Berlin (2015)
Ruder, S.: An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016)
Rumelhart, D., Hinton, G., Williams, R.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
Rumelhart, D., McClelland, J., the PDP Research Group (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. MIT Press, Cambridge (1986)
Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana/Chicago (1949)
Smith, S.: The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, San Diego (1997)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)
Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp 6105–6114. PMLR (2019)
Temme, N.: Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function. Math. Comput. 29(132), 1109–1114 (1975)
Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University (1974)
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of Interests
The author declares that he has no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by Deutsches Bundesministerium für Verkehr und digitale Infrastruktur (Modernitätsfonds/mFUND) via project AI4Infra (FKZ 19F2112C), and by Ministerium für Wirtschaft, Arbeit und Wohnungsbau Baden-Württemberg (KI-Innovationswettbewerb BW für Verbundforschungsprojekte) via project KI4Audio (FKZ 36-3400.7/98). The author also acknowledges support by the state of Baden-Württemberg through bwHPC.
Appendices
Appendix A: Proofs and supplements of Sections 2 and 3
Proof of Theorem 2:
By iterated polynomial division it is easy to verify that for \(n\in \mathbb {N}\) we have \(\frac{y^{n-1}}{1-y}=-{\sum }_{i=0}^{n-2}y^{i}+\frac{1}{1-y}\),
and for \(q\in \mathbb {N}\) and y ∈ (0; 1) therefore \(F(y):={{\int \limits }_{0}^{y}} \frac {y^{q-1}}{1-y}\,dy={{\int \limits }_{0}^{y}}-{\sum }_{i=0}^{q-2}y^{i}+\frac {1}{1-y}\,dy=-{\sum }_{i=0}^{q-2}\frac {y^{i+1}}{i+1}-\log (1-y)\) showing (12). The second form (13) has been used previously [19] and is given here for completeness. The two forms are equivalent as (13) satisfies F(0) = 0 and, with the binomial sum, has the correct derivative \(F^{\prime }(y)=\frac {1}{1-y}+{\sum }_{r=0}^{q-2}{q-1\choose r}(-1)^{q-r}(-(1-y)^{q-2-r})=\frac {1}{1-y}-\frac {(-1)^{q}}{1-y}({\sum }_{r=0}^{q-1}{q-1\choose r}(-1)^{r}(1-y)^{q-1-r}-(-1)^{q-1})=\frac {1-(-1)^{q}((-y)^{q-1}+(-1)^{q})}{1-y}\)\(=\frac {y^{q-1}}{1-y}\). Then (14) follows from inserting (12) into (10),
With the binomial sum \((1-y)^{i}={\sum }_{j=0}^{i}{i\choose j}(-y)^{j}\), the sum in (14) can be written as the polynomial
from which we can read the polynomial coefficients \(a^{(q)}_{0}={\sum }_{i=1}^{q-1}\frac {t}{i}\) and
for i = 1,…,q − 1. Table 1 gives examples for the alternative coefficients \(b_{i}^{(q)}\) of (16). □
Proof of Proposition 1:
It is well known that \(\frac {d}{dz}\,{}_{2}F_{1}(a,b;a+1;z)=a\cdot \frac {(1-z)^{-b}-{}_{2}F_{1}(a,b;a+1;z)}{z}\). Defining \(G(y) := {}_{2}F_{1}(q, 1; q + 1; y)\) for brevity we get from this \((y^{q}G(y)/q)'=y^{q-1}G(y)+y^{q}(q\frac {(1-y)^{-1}-G(y)}{y})/q=y^{q-1}/(1-y)\). □
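Proposition 1 can also be checked numerically; a minimal sketch (assuming SciPy is available) compares the hypergeometric antiderivative with direct numerical integration of f:

from scipy.integrate import quad
from scipy.special import hyp2f1

# Proposition 1: y^q * 2F1(q,1;q+1;y) / q is an antiderivative of f(y) = y^(q-1)/(1-y) with F(0)=0
q, y = 5.0/8.0, 0.7
F_hyp = y**q * hyp2f1(q, 1.0, q + 1.0, y) / q                 # closed form via the hypergeometric function
F_num, _ = quad(lambda t: t**(q - 1.0) / (1.0 - t), 0.0, y)   # numerical integral of f over [0, y]
print(F_hyp, F_num)                                           # both values should agree up to quadrature error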
Proof of Proposition 2:
We have to show that \(\tilde {F}^{\prime }(y)=f(y)\). In fact, it is
where the last equation involves the DFT of the discrete N-periodic signal
Using some well known facts of the DFT [36] for signals and
it follows . Thus, inserting \(U[N-n]=cNe^{\alpha (N-n)}=cNe^{-(N-n)\log (y)/N}=cNy^{-(N-n)/N}\) in (2) with \(e^{\alpha N}=e^{-\log (y)}=1/y\) yields
proving (19), (20). We still have to prove (21) because determining the generalized loss function (10) with (17) involves subtracting \(\tilde {F}(0)\): With the geometric-type sum
(which can easily be proved by induction), the complex logarithm \(\log (jr) \in (-\pi ;\pi ]\) in the primary sheet
using \(z := e^{j2\pi n/N}\) and DFT{1}[n] = δ[n] = 0 for n = 1,…,N − 1, where δ[n] is the discrete Dirac impulse (that is, δ[n] = 1 for n = 0, and δ[n] = 0 for n≠ 0), we get from (20)
□
Proof of Proposition 3:
We can rewrite (20) as
where \(\tilde {F}[k]\) is a discrete N-periodic signal. In polar form,
where the last two cases are necessary to distinguish between different sheets of the complex logarithm, and thus
As (3) obviously implies conjugate complex symmetry \(\tilde {F}[-k]=\tilde {F}[N-k]={\tilde {F}}^{\ast }[k]\), we get for 0 ≤ y < 1
where, in the sum \(\tilde {G}(y):={\sum }_{k=1}^{N/2-1}\cos \limits (2\pi \frac {kn}{N})\cdot \log (r_{k})-\sin \limits (2\pi \frac {kn}{N})\cdot \varphi _{k}\), the terms with index \(k=1,...,\frac {N}{4}-1\) are similar to those with index \(\frac {N}{2}-k\). Specifically,
where the last equation for φN/2−k follows from y ≥ 0 and \(\cos \limits \left (2\pi \frac {N/2-k}{N}\right )<0\) for \(k=1,\ldots ,\frac {N}{4}-1\). Thus for N ≥ 4
such that inserting (41) in (39) and using (21) we obtain (22). The remaining special case (23) is easily shown by the derivative \(\left (\log \frac {1+\sqrt {y}}{1-\sqrt {y}}\right )'=\frac {1-\sqrt {y}}{1+\sqrt {y}}\cdot \frac {\frac {1}{2\sqrt {y}}(1-\sqrt {y})+(1+\sqrt {y})\frac {1}{2\sqrt {y}}}{(1-\sqrt {y})^{2}}=\frac {1}{(1-y)\sqrt {y}}\). □
Proof of Proposition 4:
For irreducible \(q = n/N = n/2^{m} \in (0; 1)\), we have odd n for m ≥ 1 and \((-1)^{n} = -1\). From (7) we then obtain for \(k=1,\ldots ,\frac {N}{4}-1\)
with the Heaviside function H(y) = 1 if y ≥ 0 and H(y) = 0 otherwise. We can further simplify the last equation using the addition theorem of the \(\arctan \) function
Specifically, for \(a:=\frac {\sin \limits \left (2\pi \frac {k}{N}\right )}{y^{1/N}-\cos \limits \left (2\pi \frac {k}{N}\right )}\) and \(b:=\frac {\sin \limits \left (2\pi \frac {k}{N}\right )}{y^{1/N}+\cos \limits \left (2\pi \frac {k}{N}\right )}\) we get
Note that for \(k\in \{1,\ldots ,\frac {N}{4}-1\}\) both \(\sin \limits \left (2\pi \frac {k}{N}\right )>0\) and \(\cos \limits \left (2\pi \frac {k}{N}\right )>0\). Thus, for 0 < y < 1 the conditions ab < 1 and a + b < 0 are both equivalent to \(y^{1/N}<\cos \limits \left (2\pi \frac {k}{N}\right )\). Therefore the addition theorem implies
such that with (9), the identity \(\arctan (\frac {1}{x})=\frac {\pi }{2}-\arctan (x)\), and \((-1)^{n} = -1\) the antiderivative (22) becomes for odd n with \(\cos \limits (\pi \frac {n}{2})=0\) and \(\sin \limits (\pi \frac {n}{2})=(-1)^{(n-1)/2}\)
Simplifying this result yields Proposition 4: As \(\arctan (0)=0\), \(\log (1)=0\), and F(0) = 0 by definition (see (22)), the trigonometric identity (24) follows from (45) for y = 0. Inserting (24) back into (45) yields (25). □
Proof of Proposition 5:
Recapitulating our results so far, we find that Proposition 2 (including all equations in the proof) and (3)–(5) hold also for y > 1. With this we can easily verify that for y > 1 we have real-valued \(\log (y^{1/N}-1)\in \mathbb {R}\) and the antiderivative (6) becomes
Then, because of \(y>1\ge \cos \limits (2\pi \frac {k}{N})\), we see that (7) simplifies to
whereas all equations of (7) and (8) still hold true. Thus, in correspondence to (22) we get for y > 1 and any n = 1, 2, 3, 4,…,N − 1
employing real-valued computations only (whereas (22) is still a valid antiderivative for y > 1, but includes imaginary numbers). For irreducible \(q = n/2^{m}\) with odd n = 1, 3, 5,…,N − 1, it is easy to verify that (9) and (10) remain valid, whereas \(y^{1/N}-\cos \limits \left (2\pi \frac {k}{N}\right )>0\) for y > 1 always implies ab < 1 and therefore (11) simplifies to
Thus, (14) becomes for odd n with (9), (11), \(\arctan (\frac {1}{x})=\frac {\pi }{2}-\arctan (x)\), \((-1)^{n} = -1\), \(\cos \limits (\pi \frac {n}{2})=0\), and \(\sin \limits (\pi \frac {n}{2})=(-1)^{(n-1)/2}\)
Merging this with Proposition 4 almost immediately gives Proposition 5: Skipping the constant C in (49) and then comparing to (25) reveals that, after taking absolute values \(|1 - y^{1/N}|\), both (25) for 0 ≤ y < 1 and (49) for y > 1 represent the same function that is unified by (26). The case N = 2 corresponding to (27) can be shown as in (23). □
Proof of Proposition 6:
We assume irreducible q = n/N > 1 with \(N := 2^{m} > 1\) and odd n > N. Using odd η := n − N > 0 and substituting \(u := y^{1/N}\) with \(du/dy=\frac {y^{1/N-1}}{N}=\frac {1}{Ny^{(N-1)/N}}\), \(y = u^{N}\), \(dy = Ny^{(N-1)/N}du = Nu^{N-1}du\) we obtain
By polynomial division we get for M := (η − 1) div N ≥ 0 and even R := (η − 1) mod N = η − 1 − M ⋅ N ∈{0,…,N − 2}
and therefore, with re-substituting \(u := y^{1/N}\), η := n − N, and \(du=\frac {dy}{Ny^{(N-1)/N}}\), (18) becomes
where (20) uses η div N = (η − 1) div N and η mod N = ((η − 1) mod N) + 1 = R + 1 as η := n − N is odd while \(N := 2^{m}\) is even, and so η mod N ∈{1, 3, 5,…,N − 1} is odd. Using (20) completes the proof: As \(\tilde {n}:=R+1=\eta \ \text {mod}\ N=n\ \text {mod}\ N\in \{1,3,5,\ldots ,N-1\}\) is odd, \(\tilde {q}:=(R+1)/N=\tilde {n}/N\) is irreducible and \(0<\tilde {q}<1\), thus the remaining integral in (53) follows from (26) for N ≥ 4 or from (27) for N = 2, replacing q and n by \(\tilde {q}\) and \(\tilde {n}\), respectively. Thus we get (28) for N ≥ 4 and (29) for N = 2. Both antiderivatives already have the correct offset 0 for y = 0, so they correspond to continuations of F(y) as defined earlier. □
Proof of Theorem 3:
First, let us show that (30) contains Proposition 6 (as special case for irreducible q = n/N > 1 with N ≥ 2 and \(y\in \mathbb {R}^{+}_{0}\backslash \{1\}\)): Indeed, for \(M:=(n-N-1)\ \text {div}\ N=(n \ \text {div}\ N)-1=\tilde {q}-1\), the first sums in (6) and (29) become after substituting i by M − i
which, for the case N ≥ 2 with H(N − 2) = H(q − 1) = 1, equals the first line of (30) when the first summand (for i = 0) is written as a separate term for later case distinctions. It is easy to verify that the remaining terms in (6) and (29) are also equivalent to those in (30), as here H(N − 2) and H(N − 4) select the necessary terms, respectively.
Second, we show that (30) contains Theorem 2 (as special case for \(q\in \mathbb {N}\), 0 ≤ y < 1) and extends it for y > 1: As here \(q=n/N=n\in \mathbb {N}\) with N = 1, and thus \(\tilde {n}=0\), \(\tilde {q}=n=q\), M = q − 1, and H(N − 2) = H(N − 4) = 0,H(q − 1) = 1, the antiderivative (30) becomes
which equals (12) for 0 ≤ y < 1 and, also for the case y > 1, is still a proper antiderivative of f(y), as can easily be seen by recapitulating the proof of (12) using (34).
Third, it is easy to verify that (30) contains Proposition 5 (as special case for irreducible 0 < q = n/N < 1 with N ≥ 2 and \(y\in \mathbb {R}^{+}_{0}\backslash \{1\}\)) using H(q − 1) = 0, H(N − 2) = 1, and either H(N − 4) = 1 (for (26)) or H(N − 4) = 0 (for (27)).
Finally, the monotonicity of F(y) follows from f(y) > 0 for 0 < y < 1 and f(y) < 0 for y > 1, and the limits and asymptotic expressions (31) are easily verified by inspecting each case. In particular, F(0) = 0 follows from \(0^{i} = 0\), \(\log (1)=0\), and \(\arctan (0)=0\). \(F(1)=\infty \) follows from \(\lim _{y\rightarrow 1}\log (\frac {1+y^{\frac {1}{N}}\cdot H(N-2)}{|1-y^{1/N}|})=\infty \) and all other terms remaining finite for \(y\rightarrow 1\). For \(y\rightarrow \infty \), approximation (31) follows because \(\arctan (\frac {2\sin \limits (\frac {2\pi k}{N})y^{\frac {1}{N}}}{1-y^{\frac {2}{N}}})\rightarrow 0\), \(\log (\frac {1+2y^{\frac {1}{N}}\cos \limits (\frac {2\pi k}{N})+y^{\frac {2}{N}}}{1-2y^{\frac {1}{N}}\cos \limits (\frac {2\pi k}{N})+y^{\frac {2}{N}}})\rightarrow 0\), \(\arctan (y^{\frac {1}{N}})\rightarrow \frac {\pi }{2}\), and \(\log (\frac {1+y^{\frac {1}{N}}\cdot H(N-2)}{|1-y^{\frac {1}{N}}|})\rightarrow 0\) for N ≥ 2. From this the asymptotics can easily be seen; in particular, for q > 1 the dominating term in (31) is \(-\frac {N}{n-N}y^{\tilde {n}/N+\tilde {q}-1}=-\frac {1}{(n-N)/N}y^{(n\ \text {mod}\ N)/N+n\ \text {div}\ N-1}=-\frac {1}{q-1}y^{(n-N)/N}=-\frac {1}{q-1}y^{q-1}\). □
Appendix B: Implementation details
Figures 2–3 used the Simple RNN of Fig. 2A with D inputs, M = 10 hidden units, and K outputs. The layers are linked by dense connections A, U, B and also include bias weights for layers u, y. Activation functions are \(\tanh \) for u and the logistic sigmoid σ for y. Synaptic connections are initialized by uniform Xavier initialization [10]. Training used the ADAM optimizer [18, 31] with standard parameters β1 = 0.9, β2 = 0.999. Experiments used either Keras 2.2.5 with a Tensorflow 1.14.0 backend [1, 6] or a custom neural network library for backpropagation. While Keras computes gradients based on automatic differentiation [23] of the loss function (see the Python code below for the power error loss (10) with (30)), the custom implementation uses (5) with the power error initialization (7).
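A minimal Keras sketch of this setup is given below for orientation (an illustrative reconstruction, not the original training script; in particular, return_sequences, the variable-length input shape, and binary crossentropy as a placeholder loss are assumptions):

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, TimeDistributed
from keras.optimizers import Adam

D, M, K_out = 7, 10, 7                  # input symbols, hidden units, output symbols
model = Sequential([
    SimpleRNN(M, activation='tanh', return_sequences=True,
              kernel_initializer='glorot_uniform',               # uniform Xavier
              recurrent_initializer='glorot_uniform',
              input_shape=(None, D)),                             # connections A, U and bias of layer u
    TimeDistributed(Dense(K_out, activation='sigmoid',
                          kernel_initializer='glorot_uniform'))   # connection B and bias of layer y
])
model.compile(optimizer=Adam(beta_1=0.9, beta_2=0.999),
              loss='binary_crossentropy')   # placeholder; the experiments instead pass
                                            # powererrorloss_wrapper(n, N) from the listing below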
The Embedded Reber Grammar Problem is to predict the next output symbol of a finite automaton with non-deterministic state transitions [15]. Inputs are symbol sequences \(\mathbf {x}_{n}(1),\mathbf {x}_{n}(2),\ldots ,\mathbf {x}_{n}(T)\) generated by the automaton, representing each of the D = 7 symbols by a one-hot input vector \(\mathbf {x}_{n}(\tau )\). The output to be predicted at time τ is the next symbol \(\mathbf {x}_{n}(\tau + 1)\) generated by the automaton in the next time step (K = 7). Due to the non-determinism, target vectors \(\mathbf {t}_{n}(\tau )\) can have multiple one-entries, one for each possible output symbol. Network decision \(\hat {\mathbf {y}}_{n}(\tau )\) is evaluated as correct if \(\hat {\mathbf {y}}_{n}(\tau )=\mathbf {t}_{n}(\tau )\) at decision threshold 0.5. Learning used N = 2048 sequences (90% training, 10% validation/testing). Average sequence length is T = 12 (maximum 40).
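This evaluation rule can be sketched as follows (an illustration with hypothetical array shapes, not code from the original experiments):

import numpy as np

def fraction_correct(y, t, threshold=0.5):
    """y, t: arrays of shape (time_steps, K) with sigmoid outputs and 0/1 (possibly multi-hot) targets."""
    y_hat = (y >= threshold).astype(int)           # network decision at threshold 0.5
    return np.mean(np.all(y_hat == t, axis=-1))    # fraction of time steps predicted exactly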
Remarks to Fig. 3: Reaching zero median errors suggests that all 16 learning trials of an experiment converged to zero error, whereas previous works reported problems of simple RNNs solving this data set [9, 15]. However, as a zero median only means that at least half of the trials reached zero errors, I have also analyzed mean test errors (instead of medians), which also reached zero within the total learning duration of 65 epochs for all exponents 0.5 ≤ q ≤ 15 (data not shown). The profile of minimal epoch numbers until zero test errors was similar to Fig. 3B, although absolute values were larger by a factor of about 2–2.5 (best was 9 epochs for q = 1.5 vs. 26 epochs for q = 1). The discrepancy may be due to earlier works using a suboptimal initialization of synaptic weights, causing either vanishing or exploding gradients [3, 15]. Additional experiments (data not shown) revealed that at least q > 1 can still reach zero error, even if initial weights deviate substantially (by factors between 0.125 and 3) from Xavier initialization. This is consistent with the idea that, for q > 1, the power error loss provides a better gradient-to-loss ratio and, thereby, avoids flat loss landscapes and vanishing gradients [19].
Remarks to Fig. 2D: Panel (D) shows estimated power exponents \(\tilde {q}\) of absolute output error distributions |𝜖| = |t − y| corresponding to panels (B) and (C). Estimates \(\tilde {q}\) are obtained by selecting the minimum Euclidean distance of histogram vectors (10 equally spaced bins of the interval |𝜖|∈ [0; 1]) between experimental and theoretical distributions (3) evaluated for \(q\in \{\frac {1}{8},\ldots ,\frac {7}{8},1,1.25,1.5,1.75,2,2.5,3,4,\ldots ,20\}\). As expected, \(\tilde {q}\) decreases gradually with epoch number, and may therefore be used to characterize learning progress as discussed in Section 4. Again, there are some deviations between Keras and the custom implementation, in particular in later learning phases (but note the logarithmic scale).
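The estimation procedure can be sketched as follows (an illustrative reconstruction; theoretical_histogram is a hypothetical helper standing in for the theoretical error distribution (3), which is not reproduced here):

import numpy as np

Q_GRID = [i/8 for i in range(1, 8)] + [1, 1.25, 1.5, 1.75, 2, 2.5, 3] + list(range(4, 21))

def estimate_q(abs_errors, theoretical_histogram, bins=10):
    """abs_errors: absolute output errors |t - y| in [0; 1];
    theoretical_histogram(q, bins): hypothetical helper returning the normalized bin
    probabilities predicted by the theoretical distribution (3) for exponent q."""
    h_emp, _ = np.histogram(abs_errors, bins=bins, range=(0.0, 1.0))
    h_emp = h_emp / h_emp.sum()                   # normalize the empirical histogram
    dists = [np.linalg.norm(h_emp - theoretical_histogram(q, bins)) for q in Q_GRID]
    return Q_GRID[int(np.argmin(dists))]          # exponent with minimum Euclidean distance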
Figure 5 used a sequential 2D Convolutional Neural Network (CNN) architecture implemented with PyTorch 1.7.0+cu101 [29], including 6 CNN layers (kernel size 3, stride 1, “same” padding), 3 Max-Pooling layers (MaxPool; kernel size 2; stride 2), 3 Batch-Normalization (BN) layers, 3 Fully Connected (FC) layers, and 3 Dropout layers: Input → CNN(32) → BN(32) → ReLU → CNN(64) → ReLU → MaxPool → CNN(128) → BN → ReLU → CNN(128) → ReLU → MaxPool → Dropout(p = 0.05) → CNN(256) → BN → ReLU → CNN(256) → ReLU → MaxPool → Dropout(p = 0.1) → FC(1024) → ReLU → FC(512) → Dropout(p = 0.1) → FC(512) → σ. Numbers in brackets correspond to channels (CNN/BN) and size (FC). Activation functions were rectified linear units (ReLU) except σ for outputs (or softmax for CCE loss). The CIFAR10 dataset consists of 50000 training and 10000 test images (RGB) of size 32 × 32 from 10 classes [20]. Accuracies were computed from maximum decisions averaged (mean value) over 16 learning trials. Standard parameters were used without any further optimizations (ADAM optimizer as above, but initial learning rate 0.0001, minibatch size 64).
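For reference, a PyTorch sketch of this architecture is given below (an illustrative reconstruction, not the published code; the final fully connected layer is assumed here to map to the K = 10 classes):

import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 32x32 -> 16x16
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 16x16 -> 8x8
    nn.Dropout(p=0.05),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                      # 8x8 -> 4x4
    nn.Dropout(p=0.1),
    nn.Flatten(),                                               # 256*4*4 = 4096 features
    nn.Linear(4096, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.Dropout(p=0.1),
    nn.Linear(512, 10),                                         # output size assumed: 10 classes
    nn.Sigmoid(),                                               # softmax instead for CCE loss
)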
Backpropagation involving automatic differentiation (as with Keras and PyTorch) requires implementing the power error loss function (10) with (30). All experiments involving Keras used the following Python code (for PyTorch replace K and tf by torch and clip_by_value by clamp):
import numpy as np, tensorflow as tf
from keras import backend as K

def powererrorloss_wrapper(n=3,N=1,_eps=1e-7):
    """ computes power error loss function eq.10 for q=n/N using eq.30
        it is assumed that n and N are positive integers with N=2^m being a power of two
        _eps: clip y and powers of y to interval [_eps;1-_eps] to avoid NaN etc. """
    def F(y):
        # eq.30: assumes that n is odd, N=2^m is a power of two for m>=0, and 0<y<1
        y=tf.clip_by_value(y,_eps,1.0-_eps)
        n_tilde = n%N                 # reduced numerator n mod N (\tilde{n} in eq.30)
        q_tilde = n//N                # n div N (\tilde{q} in eq.30)
        y_1divN = tf.clip_by_value(K.pow(y,1.0/N),_eps,1.0-_eps)
        y_2divN = tf.clip_by_value(K.pow(y,2.0/N),_eps,1.0-_eps)
        y_ntildedivN = tf.clip_by_value(K.pow(y,n_tilde/N),_eps,1.0-_eps)
        # (i) first line of F(y) in eq.30
        L = 0
        if n>=N:                      # if q>=1
            if N>=2: L=L+1.0/(n-q_tilde*N)
            for i in range(1,q_tilde): L=L+K.pow(y,i)*(1.0/(n-(q_tilde-i)*N))
            L=L*(-N*y_ntildedivN)
        # (ii) second line of eq.30
        tmp=1.0/tf.clip_by_value(1.0-y_1divN,_eps,1.0-_eps)     # abs(.) not necessary (clipping)
        if N>=2: tmp=tmp*(1.0+y_1divN)
        L=L+K.log(tmp)
        # (iii) third line of eq.30
        if N>=4:
            tmp=(n_tilde-1)//2                                  # exponent of (-1)^((n_tilde-1)/2)
            if ((tmp//2)*2) == tmp: sg=1                        # sign of (-1)^((n_tilde-1)/2)
            else: sg=-1
            L=L+sg*2.0*tf.atan(y_1divN)
        # (iv) fourth line of eq.30
        if N>=4:
            for k in range(1,N//4):
                tmp=tf.clip_by_value(1.0+2.0*y_1divN*np.cos(2.0*np.pi*k/N)+y_2divN,_eps,4.)
                tmp=tmp/tf.clip_by_value(1.0-2.0*y_1divN*np.cos(2.0*np.pi*k/N)+y_2divN,_eps,4.)
                L=L+np.cos(2.0*np.pi*k*n_tilde/N)*K.log(tmp)
            # (v) fifth line of eq.30
            for k in range(1,N//4):
                tmp=2.0*np.sin(2.0*np.pi*k/N)*y_1divN
                tmp=tf.atan(tmp/tf.clip_by_value(1.0-y_2divN,_eps,1.0-_eps))
                L=L+2.0*np.sin(2.0*np.pi*k*n_tilde/N)*tmp
        return L

    def powererrorloss(t,y):
        loss=(1.0-t)*F(y)+t*F(1.0-y)             # eq.10
        return K.mean(loss)                      # averaging to be consistent with other Keras losses

    # main program of powererrorloss_wrapper
    assert n>0 and isinstance(n,int),'n must be positive integer!'
    assert N>0 and isinstance(N,int),'N must be positive integer!'
    m,N_=0,N
    while N_>1:                                  # check if N is power of 2
        N_old=N_
        N_=N_//2
        assert N_*2==N_old,'N must be power of two!'
        m=m+1
    while ((n//2)*2==n) and (N>1):               # reduce n/N to an irreducible fraction
        n=n//2
        N=N//2
    return powererrorloss
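For example, a Keras model with sigmoid outputs can then be trained with the power error loss for q = 3/2 via model.compile(optimizer='adam', loss=powererrorloss_wrapper(n=3, N=2)).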
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Knoblauch, A. On the antiderivatives of xp/(1 − x) with an application to optimize loss functions for classification with neural networks. Ann Math Artif Intell 90, 425–452 (2022). https://doi.org/10.1007/s10472-022-09786-2
Keywords
- Supervised learning
- Classification
- Crossentropy
- Power error loss function
- Deep learning
- Incomplete beta function
- Hypergeometric function