Abstract
We explore the potential for using a nonsmooth loss function based on the max-norm in the training of an artificial neural network without hidden layers. We hypothesise that this may lead to superior classification results in some special cases where the training data are either very small or the class size is disproportional. Our numerical experiments performed on a simple artificial neural network with no hidden layer appear to confirm our hypothesis.
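The loss in question can be illustrated with a short sketch (this code is not from the article; the data, weights, and the plain affine model standing in for the no-hidden-layer network are hypothetical placeholders). For a network without hidden layers the model reduces to an affine map, and the uniform (max-norm) loss is the largest absolute residual over the training set, in contrast to the smooth mean squared error:

```python
import numpy as np

def mse_loss(w, b, X, y):
    # Smooth mean squared error over the training residuals.
    r = X @ w + b - y
    return np.mean(r ** 2)

def uniform_loss(w, b, X, y):
    # Nonsmooth uniform (Chebyshev / max-norm) loss: the largest
    # absolute residual over the whole training set.
    r = X @ w + b - y
    return np.max(np.abs(r))

# Toy data standing in for a two-class training set (labels 0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)

w, b = np.zeros(3), 0.0
print(mse_loss(w, b, X, y), uniform_loss(w, b, X, y))
```

Because the uniform loss is driven by a single worst-case residual, it is nonsmooth and weights every training point equally regardless of class frequency, which is the intuition behind its use on small or imbalanced training sets.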
Acknowledgements
The authors would like to thank the anonymous referees for their helpful recommendations on improving this manuscript.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. This research was supported by the Australian Research Council (ARC) through the Discovery Project "Solving hard Chebyshev approximation problems through nonsmooth analysis" (DP180100602).
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by: Andrew C. Eberhard
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Datasets
1.1.1 TwoLeadECG
The MIT-BIH Long-Term ECG data collection comes from the well-known PhysioNet database and comprises seven long-term ECG recordings with carefully evaluated beat annotations. We use the TwoLeadECG dataset, which comes from the final (seventh) set of recordings in this collection. It has two signal classes: Class 1 contains signals of type signal 0, and Class 2 contains signals of type signal 1. The basic purpose is to discriminate between these two groups of signals.
1.1.2 SonyAIBORobotSurface1
The SONY AIBO Robot is a small, dog-shaped robot equipped with multiple sensors. In the experimental setting, the robot walked on two different surfaces: carpet and cement. Class 1 consists of the data recorded when the robot walked on the carpet, and Class 2 consists of the data recorded when it walked on the cement floor. The main goal is to identify the type of floor that the robot walked on.
1.1.3 ToeSegmentation1
The ToeSegmentation data are derived from the CMU Graphics Lab Motion Capture Database (CMU). Motions in the database containing the keyword walk are classified by their motion descriptions into two categories. The first is the normal walk (Class 1), with only walk in the motion descriptions. The other is the abnormal walk (Class 2), with the motion descriptions containing: hobble walk, walk wounded leg, walk on toes bent forward, hurt leg walk, drag bad leg walk or hurt stomach walk. In the abnormal walks, the actors are pretending to have difficulty walking normally. ToeSegmentation1 contains the coordinates of the x-axis.
1.1.4 WormsTwoClass
Caenorhabditis elegans is a roundworm commonly used as a model organism in the study of genetics. The movement of these worms is known to be a useful indicator for understanding behavioural genetics. There are five variants of worms: N2, goa-1, unc-1, unc-38 and unc-63. N2 is the wild type (i.e. normal), and the other four are mutant strains. This dataset comprises 258 traces of worm movements, and each worm is classified as either wild type (Class 1) or one of the four mutant types (Class 2).
1.1.5 PhalangesOutlinesCorrect
This dataset is designed to test the efficacy of hand and bone outline detection and whether these outlines could be helpful in bone age prediction. Algorithms were applied to images to automatically extract the hand outlines and then the outlines of three bones of the middle finger (proximal, middle and distal phalanges), and three human evaluators labelled the output of the image outlining as correct or incorrect. If all three evaluators agreed that a data point was valid, it was labelled as correct; hence, Class 2 contains correctly identified data points, whereas Class 1 contains incorrectly identified data points.
1.1.6 Strawberry
Food spectrographs are used in chemometrics to classify food types, a task that has obvious applications in food safety and quality assurance. The classes are strawberry (authentic samples) and non-strawberry (adulterated strawberries and other fruits), obtained using Fourier transform infrared (FTIR) spectroscopy with attenuated total reflectance (ATR) sampling.
1.1.7 Earthquakes
The earthquake classification problem involves predicting whether a major event is about to occur based on the most recent readings in the surrounding area. The data are taken from the Northern California Earthquake Data Center; each data point is an averaged reading over 1 h, with the first reading taken on 1 December 1967 and the last in 2003. This single time series was then transformed into a classification problem by first defining a major event as any reading of over 5 on the Richter scale. Major events are often followed by aftershocks; the physics of these is well understood, and their detection is not the objective of this exercise. Hence, a positive case is taken to be one where a major event is not preceded by another major event for at least 512 hours. To construct a negative case, instances are considered where there is a reading below 4 (to avoid blurring the boundary between major and non-major events) that is preceded by at least 20 non-zero readings in the previous 512 hours (to avoid trivial negative cases). None of the cases overlap in time. This dataset consists of 368 negative cases (Class 1) and 93 positive cases (Class 2).
1.1.8 PowerCons
The PowerCons dataset contains the individual household electric power consumption over 1 year, distributed into two season classes: warm (Class 1) and cold (Class 2), depending on whether the power consumption is recorded during the warm seasons (from April to September) or the cold seasons (from October to March).
1.1.9 Computers
These problems were taken from data recorded as part of a government-sponsored study called Powering the Nation. The intention was to collect behavioural data about how consumers use electricity within the home, in order to help reduce the UK's carbon footprint. The data contain readings from 250 households, sampled at two-minute intervals over a month. Classes are Desktop (Class 1) and Laptop (Class 2).
1.2 Experiments and results
We start the experiments with the original training and testing sets. We compare the classification accuracy computed by the MATLAB Deep Learning Toolbox, which uses the mean squared error (MSE) loss, with the classification accuracy computed by the uniform approximation-based loss function. The results are given in Table 15.
One can see that the uniform approximation is more accurate for the TwoLeadECG, SonyAIBORobotSurface1 and ToeSegmentation1 datasets, whose training sets are smaller than their test sets, while MSE is much more accurate for all the other datasets. We now swap the training and testing sets for datasets 4, 5, 6 and 7, since their training sets are larger than their testing sets. The results are presented in Table 16.
In the rest of the experiments, the original training and testing sets remain swapped for datasets 4, 5, 6 and 7. For all other datasets (that is, datasets 1, 2, 3, 8 and 9), the original training set is used as the training set.
Now, we consider reduced training sets which contain an equal number of representatives from each class. In particular, the first 10 points from each class were chosen to create a training set of size 20. However, since the training sets of datasets 1, 2 and 3 are small, for these datasets only the first 5 points from each class were chosen, creating a training set of size 10. The results are presented in Table 17.
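The balanced reduced training sets described above can be sketched as follows (an illustrative helper, not the authors' code; the toy labels below are hypothetical):

```python
import numpy as np

def first_k_per_class(X, y, k):
    # Reduced training set: the first k points from each class,
    # giving a balanced subset of size 2*k.
    idx = np.concatenate([np.flatnonzero(y == c)[:k] for c in (1, 2)])
    return X[idx], y[idx]

# Toy labelled data standing in for one of the datasets.
y = np.array([1, 2, 1, 1, 2, 2, 1, 2])
X = np.arange(len(y)).reshape(-1, 1)

Xs, ys = first_k_per_class(X, y, 2)
print(ys)  # two points from each class
```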
Our next step is to reduce the training set by taking unequal numbers of points from each class. The size of the training set is 20 for datasets 4, 5, 6, 7, 8 and 9: we take 18 points from Class 1 and 2 points from Class 2. The results are presented in Table 18. For datasets 2 and 4, the size of the training set is reduced to 10: 8 points from Class 1 and 2 points from Class 2. For dataset 3, the training set contains only 8 points: 6 points from Class 1 and 2 points from Class 2. These different configurations are due to the varying sizes of the training sets and to the number of points representing each class. The results are in Table 19.
Now, we consider the symmetric situation where the training set contains 20 points for datasets 4, 5, 6, 7, 8 and 9: 2 points from Class 1 and 18 points from Class 2. The results are presented in Table 20. For the same reasons as above, we use different configurations for the remaining datasets. For datasets 2 and 4, the size of the training set is 10: 2 points from Class 1 and 8 points from Class 2. For dataset 3, the training set contains only 8 points: 2 points from Class 1 and 6 points from Class 2. Results are in Table 21.
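The unbalanced reduced training sets can be built in the same spirit (again an illustrative sketch under assumed toy data, not the paper's code), taking unequal numbers of leading points from the two classes:

```python
import numpy as np

def unbalanced_subset(X, y, n1, n2):
    # First n1 points labelled Class 1 followed by the first n2
    # points labelled Class 2 (e.g. 18 and 2, or 2 and 18).
    idx = np.concatenate([np.flatnonzero(y == 1)[:n1],
                          np.flatnonzero(y == 2)[:n2]])
    return X[idx], y[idx]

# Hypothetical alternating labels for demonstration.
y = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])
X = np.arange(10).reshape(-1, 1)

Xs, ys = unbalanced_subset(X, y, 3, 1)  # 3 from Class 1, 1 from Class 2
print(ys)
```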
Now, we present the results when the training set points are chosen randomly. In Table 22, we present the results for datasets 5, 6, 7, 8 and 9 when the training set contains 50 randomly selected points. In Table 23, 20 points are selected randomly to generate the training set; these results apply only to the above-mentioned datasets whose training set (the original testing set) is larger than 20.
We finally present the results of the experiments when the training set contains 10 randomly selected points. This experiment applies to all the datasets that we considered. The results are in Table 24.
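A reproducible way to draw such random training subsets is sketched below; the paper does not specify the sampling scheme or seed, so both are assumptions made purely for illustration:

```python
import numpy as np

def random_training_set(X, y, n, seed=0):
    # Draw n training points uniformly at random without replacement.
    # The fixed seed is only for reproducibility of this sketch.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=n, replace=False)
    return X[idx], y[idx]

# Hypothetical two-class data of 60 points.
y = np.repeat([1, 2], 30)
X = np.arange(60).reshape(-1, 1)

Xs, ys = random_training_set(X, y, 10)
print(len(ys))
```

Sampling without replacement guarantees 10 distinct training points, but, unlike the stratified constructions above, it does not control how many points come from each class.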
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Peiris, V., Roshchina, V. & Sukhorukova, N. Artificial neural networks with uniform norm-based loss functions. Adv Comput Math 50, 31 (2024). https://doi.org/10.1007/s10444-024-10124-9