Abstract
In this paper, we treat testing for normality as a binary classification problem and construct a feedforward neural network that can act as a powerful normality test. We show that by changing its decision threshold, we can control the frequency of false non-normal predictions and thus make the network behave more like standard statistical tests. We also find the optimal decision thresholds that minimize the total error probability for each sample size. The experiments conducted on samples with at most 100 elements suggest that our method is more accurate and more powerful than the selected standard tests of normality for almost all types of alternative distributions and sample sizes. In particular, the neural network was the most powerful method for testing normality of samples with fewer than 30 elements, regardless of the alternative distribution type. Its total accuracy increased with the sample size. Additionally, when the optimal decision thresholds were used, the network was very accurate for larger samples with 250–1000 elements. With an AUROC of almost 1, the network was the most accurate method overall. Since normality of data is an assumption of numerous statistical techniques, the network constructed in this study has a very high potential for use in the everyday practice of statistics, data analysis and machine learning.
Availability of data and material
The data and the code are available in the following GitHub repository: https://github.com/milos-simic/neural-normality.
References
Razali Nornadiah Mohd and Wah Yap Bee (2011) Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J Statistical Model Anal 2(1):21–33
Thode HC (2002) Testing For Normality. Statistics, textbooks and monographs. Taylor & Francis. ISBN 9780203910894
Sigut J, Piñeiro J, Estévez J, and Toledo P (2006) A neural network approach to normality testing. Intell Data Anal, 10(6):509–519, 12
Esteban MD, Castellanos ME, Morales D, Vajda I (2001) Monte Carlo comparison of four normality tests using different entropy estimates. Commun Statistics - Simul Comput 30(4):761–785
Noughabi Hadi Alizadeh and Arghami Naser Reza (2011) Monte Carlo comparison of seven normality tests. J Statistical Comput Simul 81(8):965–972
Hain Johannes (August 2010) Comparison of common tests for normality. Diploma thesis, Julius-Maximilians-Universität Würzburg Institut für Mathematik und Informatik
Yap BW and Sim CH (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81 (12):2141–2155, 12
Ahmad Fiaz, Khan Rehan (2015) A power comparison of various normality tests. Pakistan Journal of Statistics and Operation Research 11(3):331–345
Marmolejo-Ramos Fernando and González-Burgos Jorge (2013) A power comparison of various tests of univariate normality on ex-gaussian distributions. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9(4):137
Mbah Alfred K, Paothong Arnut (2015) Shapiro-francia test compared to other normality test using expected p-value. Journal of Statistical Computation and Simulation 85(15):3002–3016
binti Yusoff S and Bee Wah Y (Sept 2012) Comparison of conventional measures of skewness and kurtosis for small sample size. In 2012 International Conference on Statistics in Science, Business and Engineering (ICSSBE), pages 1–6. 10.1109/ICSSBE.2012.6396619
Patrício Miguel, Ferreira Fábio, Oliveiros Bárbara, and Caramelo Francisco (2017) Comparing the performance of normality tests with roc analysis and confidence intervals. Communications in Statistics-Simulation and Computation, pages 1–17
Wijekularathna Danush K, Manage Ananda BW, and Scariano Stephen M (September 2019) Power analysis of several normality tests: A monte carlo simulation study. Communications in Statistics - Simulation and Computation, pages 1–17
Wilson PR and Engel AB (1990) Testing for normality using neural networks. In [1990] Proceedings. First International Symposium on Uncertainty Modeling and Analysis, pages 700–704
Gel Yulia, Miao Weiwen, and Gastwirth Joseph L (2005) The importance of checking the assumptions underlying statistical analysis: Graphical methods for assessing normality. Jurimetrics, 46(1):3–29. ISSN 08971277, 21544344. http://www.jstor.org/stable/29762916
Stehlík M, Střelec L, and Thulin M (2014) On robust testing for normality in chemometrics. Chemometrics and Intelligent Laboratory Systems, 130: 98–108. ISSN 0169-7439. https://doi.org/10.1016/j.chemolab.2013.10.010. https://www.sciencedirect.com/science/article/pii/S0169743913001913
Lopez-Paz David and Oquab Maxime (2017) Revisiting classifier two-sample tests. In ICLR
Ojala Markus and Garriga Gemma C (2010) Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11 (Jun):1833–1863
Al-Rawi Mohammed Sadeq and Paulo Silva Cunha João (2012) Using permutation tests to study how the dimensionality, the number of classes, and the number of samples affect classification analysis. In Aurélio Campilho and Mohamed Kamel, editors, Image Analysis and Recognition, pages 34–42, Berlin, Heidelberg. Springer Berlin Heidelberg. ISBN 978-3-642-31295-3
Kim Ilmun, Ramdas Aaditya, Singh Aarti, and Wasserman Larry (Feb 2016) Classification accuracy as a proxy for two sample testing. arXiv e-prints, art. arXiv:1602.02210
Blanchard Gilles, Lee Gyemin, and Scott Clayton (2010) Semi-supervised novelty detection. Journal of Machine Learning Research, 11 (Nov):2973–3009
Rosenblatt Jonathan D, Benjamini Yuval, Gilron Roee, Mukamel Roy, and Goeman Jelle J (2019) Better-than-chance classification for signal detection. Biostatistics, 10 . kxz035
Gretton Arthur, Borgwardt Karsten M, Rasch Malte J, Schölkopf Bernhard, and Smola Alexander (March 2012) A kernel two-sample test. J. Mach. Learn. Res., 13:723–773
Borgwardt Karsten M, Gretton Arthur, Rasch Malte J, Kriegel Hans-Peter, Schölkopf Bernhard, and Smola Alex J (2006) Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14):e49–e57, 7
Gretton Arthur, Borgwardt Karsten, Rasch Malte, Schölkopf Bernhard, and Smola Alex J (2007a) A kernel method for the two-sample-problem. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press
Gretton Arthur, Borgwardt Karsten M, Rasch Malte J, Schölkopf Bernhard, and Smola Alexander J (2007b) A kernel approach to comparing distributions. In AAAI, pages 1637–1641
Smola Alex, Gretton Arthur, Song Le, and Schölkopf Bernhard (2007) A hilbert space embedding for distributions. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7
Gretton Arthur, Fukumizu Kenji, Harchaoui Zaïd, and Sriperumbudur Bharath K (2009) A fast, consistent kernel two-sample test. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 673–681. Curran Associates, Inc.
Schölkopf Bernhard and Smola Alexander J (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press. ISBN 0262194759
Steinwart Ingo, Christmann Andreas (2008) Support Vector Machines. Information Science and Statistics. Springer-Verlag, New York, NY, USA
Gretton Arthur, Fukumizu Kenji, Teo Choon H, Song Le, Schölkopf Bernhard, and Smola Alex J (2008) A kernel statistical test of independence. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. Curran Associates, Inc.
Gretton Arthur, Bousquet Olivier, Smola Alex, and Schölkopf Bernhard (2005) Measuring statistical dependence with hilbert-schmidt norms. In Sanjay Jain, Hans Ulrich Simon, and Etsuji Tomita, editors, Algorithmic Learning Theory, pages 63–77, Berlin, Heidelberg. Springer Berlin Heidelberg. ISBN 978-3-540-31696-1
Chwialkowski Kacper and Gretton Arthur (Jun 2014) A kernel independence test for random processes. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1422–1430, Bejing, China, 22–24 . PMLR
Pfister Niklas, Bühlmann Peter, Schölkopf Bernhard, and Peters Jonas (2017) Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31, 5
Chwialkowski Kacper, Strathmann Heiko, and Gretton Arthur (2016) A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 2606–2615. JMLR.org
Stein Charles (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pages 583–602, Berkeley, Calif. University of California Press
Shao Xiaofeng (2010) The dependent wild bootstrap. Journal of the American Statistical Association 105(489):218–235
Leucht Anne, Neumann Michael H (2013) Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis 117:257–280
Liu Qiang, Lee Jason, and Jordan Michael (2016) A kernelized stein discrepancy for goodness-of-fit tests. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 276–284, New York, New York, USA, 20–22 Jun . PMLR
Arcones Miguel A and Gine Evarist (1992) On the bootstrap of \(u\) and \(v\) statistics. Ann. Statist., 20(2):655–674, 06
Huskova Marie and Janssen Paul (1993) Consistency of the generalized bootstrap for degenerate \(u\)-statistics. Ann. Statist., 21(4):1811–1823, 12
Jitkrittum Wittawat, Xu Wenkai, Szabo Zoltan, Fukumizu Kenji, and Gretton Arthur (2017) A linear-time kernel goodness-of-fit test. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 262–271. Curran Associates, Inc.
Chwialkowski Kacper, Ramdas Aaditya, Sejdinovic Dino, and Gretton Arthur (2015) Fast two-sample testing with analytic representations of probability measures. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, page 1981–1989, Cambridge, MA, USA. MIT Press
Lloyd James R and Ghahramani Zoubin (2015) Statistical model criticism using kernel two sample tests. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 829–837. Curran Associates, Inc.
Jérémie Kellner and Alain Celisse (2019) A one-sample test for normality with kernel methods. Bernoulli, 25(3):1816–1837, 08
Kojadinovic Ivan, Yan Jun (2012) Goodness-of-fit testing based on a weighted bootstrap: A fast large-sample alternative to the parametric bootstrap. Canadian Journal of Statistics 40(3):480–500
Johnson NL (1949a) Bivariate distributions based on simple translation systems. Biometrika 36(3/4):297–304
Johnson NL (1949b) Systems of frequency curves generated by methods of translation. Biometrika 36(1/2):149–176
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3/4):591–611
Lin Ching-Chuong, Mudholkar Govind S (1980) A simple test for normality against asymmetric alternatives. Biometrika 67(2):455–461
Vasicek Oldrich (1976) A test for normality based on sample entropy. Journal of the Royal Statistical Society. Series B (Methodological), 38(1):54–59
Henderson A. Ralph (2006) Testing experimental data for univariate normality. Clinica Chimica Acta, 366(1–2):112 – 129
Gel Yulia R, Miao Weiwen, and Gastwirth Joseph L (2007) Robust directed tests of normality against heavy-tailed alternatives. Computational Statistics & Data Analysis, 51(5):2734–2746. ISSN 0167-9473. https://doi.org/10.1016/j.csda.2006.08.022. https://www.sciencedirect.com/science/article/pii/S0167947306002805
Seier Edith (2011) Normality tests: Power comparison. In International Encyclopedia of Statistical Science. Springer. ISBN 978-3-642-04897-5
Dufour Jean-Marie, Farhat Abdeljelil, Gardiol Lucien, Khalaf Lynda (1998) Simulation-based finite sample normality tests in linear regressions. The Econometrics Journal 1(1):154–173
Lilliefors Hubert W (1967) On the Kolmogorov–Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association 62(318):399–402
Anderson TW and Darling DA (1952) Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Statist., 23(2):193–212, 06
Anderson TW, Darling DA (1954) A test of goodness of fit. Journal of the American Statistical Association 49(268):765–769
Jarque Carlos M, Bera Anil K (1980) Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters 6(3):255–259
Jarque Carlos M, Bera Anil K (1987) A test for normality of observations and regression residuals. International Statistical Review / Revue Internationale de Statistique 55(2):163–172
Shapiro SS, Francia RS (1972) An approximate analysis of variance test for normality. Journal of the American Statistical Association 67(337):215–216
Cramér Harald (1928) On the composition of elementary errors. Scandinavian Actuarial Journal 1928(1):13–74
von Mises Richard (1928) Wahrscheinlichkeit Statistik und Wahrheit. Springer, Berlin Heidelberg
von Mises Richard (1931) Wahrscheinlichkeitsrechnung und Ihre Anwendung in der Statistik und Theoretischen Physik. F. Deuticke
Nikolai Vasilyevich Smirnov (1936) Sur la distribution de \(\omega ^2\). CR Acad. Sci. Paris, 202(S 449)
D’Agostino Ralph and Pearson ES (1973) Tests for departure from normality. Empirical results for the distributions of \(b_2\) and \(\sqrt{b_1}\). Biometrika, 60(3):613–622. ISSN 00063444. http://www.jstor.org/stable/2335012
D’agostino Ralph B, Belanger Albert, and D’agostino Jr Ralph B (1990) A suggestion for using powerful and informative tests of normality. The American Statistician, 44(4):316–321. 10.1080/00031305.1990.10475751. https://www.tandfonline.com/doi/abs/10.1080/00031305.1990.10475751
Gel Yulia R and Gastwirth Joseph L (2008) A robust modification of the Jarque–Bera test of normality. Economics Letters, 99(1):30–32. ISSN 0165-1765. https://doi.org/10.1016/j.econlet.2007.05.022. https://www.sciencedirect.com/science/article/pii/S0165176507001838
Brys Guy, Hubert Mia, Struyf Anja (2007) Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics 23(3):429–442
Stehlík Milan, Fabián Zdeněk, Střelec Luboš (2012) Small sample robust testing for normality against pareto tails. Communications in Statistics - Simulation and Computation 41(7):1167–1194
Richter Wolf-Dieter, Střelec Luboš, Ahmadinezhad Hamid, and Stehlík Milan (2017) Geometric aspects of robust testing for normality and sphericity. Stochastic Analysis and Applications, 35(3):511–532. 10.1080/07362994.2016.1273785. https://doi.org/10.1080/07362994.2016.1273785
Hampshire John B and Pearlmutter Barak (1991) Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function. In Connectionist Models, pages 159–172. Elsevier
Richard Michael D, Lippmann Richard P (1991) Neural network classifiers estimate bayesian a posteriori probabilities. Neural Computation 3(4):461–483
Grinstead Charles M and Snell J Laurie (2003) Introduction to Probability. AMS
Pearson Karl (1895) Contributions to the mathematical theory of evolution. ii. skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 186:343–414
Pearson Karl (1901) Mathematical contributions to the theory of evolution. x. supplement to a memoir on skew variation. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 197 (287-299):443–459
Pearson Karl (1916) Mathematical contributions to the theory of evolution. xix. second supplement to a memoir on skew variation. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 216 (538-548):429–457
Juha Karvanen, Jan Eriksson, and Visa Koivunen (2000) Pearson system based method for blind separation. In Proceedings of Second International Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000), Helsinki, Finland, pages 585–590
Howell Nancy (2010) Life Histories of the Dobe !Kung: food, fatness and well-being over the life span, volume 4 of Origins of Human Behavior and Culture. University of California Press
Howell Nancy (2017) Demography of the Dobe !Kung. Taylor & Francis. ISBN 9781351522694
Northern California Earthquake Data Center (2014a) Berkeley Digital Seismic Network (BDSN)
Northern California Earthquake Data Center (2014b) Northern California Earthquake Data Center
Kursa Miron B, Rudnicki Witold R (2010) Feature selection with the Boruta package. Journal of Statistical Software 36(11)
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330
Kingma Diederik P and Ba Jimmy (December 2014) Adam: A Method for Stochastic Optimization. arXiv e-prints, art. arXiv:1412.6980
Jarrett Kevin, Kavukcuoglu Koray, Ranzato Marc’Aurelio, and LeCun Yann (2009) What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153. 10.1109/ICCV.2009.5459469
Vinod Nair and Geoffrey E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 807–814, Madison, WI, USA. Omnipress. ISBN 9781605589077
Xavier Glorot, Antoine Bordes, and Yoshua Bengio (Apr 2011) Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 . PMLR
Murphy Allan H, Winkler Robert L (1977) Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society: Series C (Applied Statistics) 26(1):41–47
Murphy Allan H, Winkler Robert L (1987) A general framework for forecast verification. Monthly Weather Review 115(7):1330–1338
Bröcker Jochen (2008) Some remarks on the reliability of categorical probability forecasts. Monthly Weather Review 136(11):4488–4502
Fenlon Caroline, O’Grady Luke, Doherty Michael L, Dunnion John (2018) A discussion of calibration techniques for evaluating binary and categorical predictive models. Preventive Veterinary Medicine 149:107–114
John Shawe-Taylor and Nello Cristianini (june 2004) Kernel Methods for Pattern Analysis. Cambridge University Press,
Goodman Steven (2008) A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3):135–140. Interpretation of Quantitative Research
Platt John C (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press
Bianca Zadrozny and Charles Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 02, page 694-699, New York, NY, USA. Association for Computing Machinery. ISBN 158113567X
Rosenberg Alex and McIntyre Lee (2019) Philosophy of Science: A Contemporary Introduction. Routledge, 4th edition. ISBN 9781138331518
Royston JP (1982a) An extension of shapiro and wilk’s w test for normality to large samples. Journal of the Royal Statistical Society. Series C (Applied Statistics), 31(2):115–124
Royston Patrick (1995) Remark AS R94: A remark on algorithm AS 181: The W-test for normality. Journal of the Royal Statistical Society. Series C (Applied Statistics), 44(4):547–551
Royston JP (1982b) Algorithm AS 177: Expected normal order statistics (exact and approximate). Journal of the Royal Statistical Society. Series C (Applied Statistics), 31(2):161–165. ISSN 00359254, 14679876. http://www.jstor.org/stable/2347982
van Soest J (1967) Some experimental results concerning tests of normality*. Statistica Neerlandica 21(1):91–97
Kolmogorov AN (1933) Sulla determinazione empirica di una legge di distribuzione. Giornale dell’ Instituto Italiano degli Attuari 4:83–91
Gross Juergen and Ligges Uwe (2015) nortest: Tests for normality. R package version 1.0-4
Stephens MA (1986) Tests based on edf statistics. In: D’Agostino RB, Stephens MA (eds) Goodness-of-Fit Techniques. Marcel Dekker, New York
Bowman KO and Shenton LR (1975) Omnibus test contours for departures from normality based on \(\sqrt{b}_1\) and \(b_2\). Biometrika, 62(2):243–250
Urzua Carlos (1996) On the correct use of omnibus tests for normality. Economics Letters 53(3):247–251
Carmeli C, de Vito E, Toigo A, Umanità V (2010) Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications 08(01):19–61
Garreau Damien, Jitkrittum Wittawat, and Kanagawa Motonobu (July 2017) Large sample analysis of the median heuristic. arXiv e-prints, art. arXiv:1707.07269
Acknowledgements
The author would like to thank his advisor, Dr. Miloš Stanković (Innovation Center, School of Electrical Engineering, University of Belgrade), for useful discussions and advice, and Dr. Wittawat Jitkrittum (Google Research) for advice on the kernel tests of goodness-of-fit. The earthquake data for this study come from the Berkeley Digital Seismic Network (BDSN), doi:10.7932/BDSN, operated by the UC Berkeley Seismological Laboratory, which is archived at the Northern California Earthquake Data Center (NCEDC), doi:10.7932/NCEDC, and were accessed through NCEDC.
Funding
No funding has been received for this research.
Author information
Authors and Affiliations
Contributions
The whole study was conducted and the paper written by Miloš Simić.
Corresponding author
Ethics declarations
Conflict of interest/Competing interests
There are no conflicts of interest and no competing interests regarding this study.
Code availability
The data and the code are available in the following GitHub repository: https://github.com/milos-simic/neural-normality.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Standard statistical tests of normality
Throughout this Appendix, \({\mathbf {x}}=[x_1, x_2, \ldots , x_n]\) will denote a sample drawn from the distribution whose normality we want to test. The same holds for other Appendices.
A.1 The Shapiro–Wilk Test (SW)
Let \({\mathbf {b}} = [b_1, b_2, \ldots , b_n]^T\) denote the vector of the expected values of order statistics of independent and identically distributed random variables sampled from the standard normal distribution. Let \({\mathbf {V}}\) denote the corresponding covariance matrix.
The intuition behind the SW test is as follows. If a random variable X follows normal distribution \(N(\mu ,\sigma ^2)\), and \(Z\sim N(0, 1)\), then \(X=\mu +\sigma Z\) [49]. For the ordered random samples \({\mathbf {x}}=[x_1,x_2,\ldots ,x_n]\sim N(\mu , \sigma ^2)\) and \({\mathbf {z}}=[z_1,z_2,\ldots ,z_n]\sim N(0,1)\), the best linear unbiased estimate of \(\sigma\) is [49]:
In that case, \({\hat{\sigma }}^2\) should be equal to the usual estimate of variance \(\text {S}({\mathbf {x}})^2\):
The value of the test statistic, W, is a scaled ratio of \({\hat{\sigma }}^2\) and \(\text {S}({\mathbf {x}})^2\):
where:
and
The range of W is [0, 1], with higher values indicating stronger evidence in support of normality. The original formulation required use of the tables of the critical values of W [52] at the most common levels of \(\alpha\) and was limited to smaller samples with \(n \in [3, 20]\) elements because the values of \({\mathbf {b}}\) and \({\mathbf {V}}^{-1}\) were known only for small samples at the time [6]. Royston [98] extended the upper limit of n to 2000 and presented a normalizing transformation algorithm suitable for computer implementation. The upper limit was further improved by Royston [99] who formulated algorithm AS 194 which allowed the test to be used for the samples with \(n \in [3, 5000]\).
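As an illustration, the SW test with Royston's extensions is implemented in standard statistical software. The following sketch applies SciPy's implementation to a synthetic sample; the sample itself is ours and purely illustrative.

```python
# A minimal sketch of applying the SW test with SciPy's implementation,
# which is based on Royston's approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)  # synthetic sample

W, p = stats.shapiro(x)
# W lies in [0, 1]; values close to 1 together with a large p-value
# are consistent with normality.
```

A small p-value would lead to rejecting the null hypothesis of normality at the chosen level \(\alpha\).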
A.2 The Shapiro–Francia Test (SF)
The Shapiro–Francia test is a simplification of the SW test. It was initially proposed for larger samples, for which the matrix \({\mathbf {V}}\) in the SW test was unknown [61]. By assuming that the order statistics are independent, we can substitute the identity matrix \({\mathbf {I}}\) for \({\mathbf {V}}\) in Equation (16) to obtain the Shapiro–Francia test statistic:
where:
The SF statistic is thus the squared Pearson correlation between \({\mathbf {a}}\) and \({\mathbf {x}}\), i.e., the \(R^2\) of the regression of \({\mathbf {x}}\) on \({\mathbf {a}}\) [2]. Only the expected values of the order statistics are needed to conduct this test; they can be calculated using algorithm AS 177 proposed by Royston [100].
As in the SW test, \({\mathbf {x}}\) is assumed to be ordered.
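A minimal Python sketch of the SF statistic follows. For simplicity it approximates the expected order statistics with Blom's scores instead of the exact AS 177 values, so it is an illustration rather than the exact test; the function name is ours.

```python
import numpy as np
from scipy.stats import norm, pearsonr

def sf_statistic(x):
    """Squared Pearson correlation between the ordered sample and
    (approximate) expected standard-normal order statistics.
    Blom's scores are used in place of the exact AS 177 values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    a = norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))  # Blom's approximation
    r, _ = pearsonr(a, x)
    return r ** 2
```

Values close to 1 support normality, mirroring the behavior of the W statistic in the SW test.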
A.3 The Lilliefors Test (LF)
This test was introduced independently by Lilliefors [56] and van Soest [101]. The LF test checks if the given sample \({\mathbf {x}}\) comes from a normal distribution whose parameters \(\mu\) and \(\sigma\) are taken to be the sample mean (\(\overline{x}\)) and standard deviation (\(\text {sd}({\mathbf {x}})\)). It is equivalent to the Kolmogorov–Smirnov test of goodness-of-fit [102] for those particular choices of \(\mu\) and \(\sigma\) [13]. The LF test is conducted as follows. If the sample at hand comes from \(N(\overline{x},\text {sd}({\mathbf {x}})^2)\), then its transformation \({\mathbf {z}}=[z_1,z_2,\ldots ,z_n]\), where:
should follow N(0, 1). The difference between the EDF of \({\mathbf {z}}\), \(edf_{{\mathbf {z}}}\), and the CDF of N(0, 1), \(\Phi\), quantifies how well the sample \({\mathbf {x}}\) fits the normal distribution. In the LF test, that difference is calculated as follows
Higher values of D indicate greater deviation from normality.
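The statistic D can be computed directly from its definition as the sup-distance between the EDF of the standardized sample and \(\Phi\); a minimal numpy/scipy sketch (the function name is ours) is:

```python
import numpy as np
from scipy.stats import norm

def lilliefors_statistic(x):
    """Sup-distance between the EDF of the standardized sample and the
    standard normal CDF, with mu and sigma estimated from the sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)               # standardize with sample estimates
    cdf = norm.cdf(z)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # EDF just above each point
    d_minus = np.max(cdf - np.arange(0, n) / n)      # EDF just below each point
    return max(d_plus, d_minus)
```

Because \(\mu\) and \(\sigma\) are estimated from the sample, the p-values must come from Lilliefors' tables or Monte Carlo simulation rather than from the Kolmogorov distribution.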
A.4 The Anderson–Darling Test (AD)
Whereas the Lilliefors test focuses on the largest difference between the sample’s EDF and the CDF of the hypothesized model distribution, the AD test calculates the expected weighted difference between those two [58], with the weighting function designed to make use of the specific properties of the model. The AD statistic is:
where \(\psi\) is the weighting function, edf is the sample’s EDF, and \(F^*\) is the CDF of the distribution we want to test. When testing for normality, the weighting function is chosen to be sensitive to the tails [13, 58]:
Then, the statistic (21) can be calculated in the following way [2, 103]:
where \([z_{(1)},z_{(2)},\ldots ,z_{(n)}]\) (\(z_{(i)} \le z_{(i+1)}, i=1,2,\ldots ,n-1\)) is the ordered permutation of \({\mathbf {z}}=[z_1,z_2,\ldots ,z_n]\) that is obtained from \({\mathbf {x}}\) as in the LF test. The p-values are computed from the modified statistic [103, 104]:
Larger values of the AD statistic indicate stronger evidence against normality.
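In practice, the AD test for normality is available in SciPy, which reports the modified statistic together with critical values; a brief sketch on a synthetic sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=80)  # synthetic sample

res = stats.anderson(x, dist='norm')
# res.statistic is compared against res.critical_values at the significance
# levels listed in res.significance_level; exceeding a critical value
# rejects normality at that level.
```

This interface returns critical values instead of a p-value, which matches the decision rule described above.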
A.5 The Cramér–von Mises Test (CVM)
As noted by Wijekularathna et al. [13], the AD test is a generalization of the Cramér–von Mises test [62,63,64,65]. When \(\psi (\cdot )=1\), the AD test’s statistic reduces to that of the CVM test:
Because \(\psi (\cdot )\) takes into account the specific properties of the model distribution, the AD test may be more sensitive than the CVM test [13]. As is the case with the AD test, larger values of the CVM statistic (25) are more compatible with departures from normality.
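For illustration, the CVM statistic against a normal distribution with estimated parameters can be computed via the standard probability-integral-transform formula; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def cvm_statistic(x):
    """Cramér–von Mises statistic for normality, with mu and sigma
    estimated from the sample (the AD statistic with the weight
    function set to 1)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = norm.cdf((x - x.mean()) / x.std(ddof=1))   # probability integral transform
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)
```

As with the LF test, estimating the parameters from the sample means the usual CVM critical values for a fully specified distribution do not apply directly.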
A.6 The Jarque–Bera Test (JB)
The Jarque–Bera test [59, 60] checks how well the sample’s skewness (\(\sqrt{\beta _1}\)) and kurtosis (\(\beta _2\)) match those of normal distributions. Namely, for every normal distribution, \(\sqrt{\beta _1}=0\) and \(\beta _2=3\). The statistic of the test, as originally defined by Jarque and Bera [59], is computed as follows:
We see that higher values of J indicate greater deviation from the skewness and kurtosis of normal distributions. The same idea was examined by Bowman and Shenton [105]. The asymptotic expected values of the estimators of skewness and kurtosis are 0 and 3, while the asymptotic variances are 6/n and 24/n for the sample of size n [105]. The J statistic is then a sum of two asymptotically independent standardized normals. However, the estimator of kurtosis slowly converges to normality, which is why the original statistic is not useful for small and medium-sized samples [106]. Urzua [106] adjusted the statistic by using the exact expressions for the means and variances of the estimators of skewness and kurtosis:
where:
which allowed the test to be applied to smaller samples.
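The classical, asymptotic JB test is available in SciPy; a short sketch follows, keeping in mind that the asymptotic p-value is unreliable for small samples, as discussed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=500)  # JB relies on asymptotics, so a larger n is used

J, p = stats.jarque_bera(x)
# Under normality, J is approximately chi-squared with 2 degrees of freedom,
# so a small J and a large p-value are consistent with the null hypothesis.
```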
A.7 The D’Agostino–Pearson Test (DP)
The DP test is a combination of the individual skewness and kurtosis tests of normality [66, 67]. Its statistic is:
where \(Z_1\) and \(Z_2\) are normal approximations to the skewness and kurtosis. The exact formulas for \(Z_1\) and \(Z_2\) can be found in D’Agostino et al. [67]. Under the null hypothesis of normality, \(K^2\) has an approximate \(\chi ^2\) distribution with two degrees of freedom [67].
The rationale behind the DP test is that the statistic combining both the sample skewness and kurtosis can detect departures from normality in terms of both moments, unlike the tests based on only one standardized moment.
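The DP omnibus test is implemented in SciPy as `scipy.stats.normaltest`; a minimal sketch on a synthetic sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)  # synthetic sample

K2, p = stats.normaltest(x)
# Under the null hypothesis of normality, K2 approximately follows a
# chi-squared distribution with two degrees of freedom.
```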
B Robustified tests of normality
B.1 The Robustified Jarque–Bera Tests (RJB)
The statistic of the classical JB test, introduced in Section A.6, is based on the classical estimators of the first four moments. Those estimators are sensitive to outliers, which is why the JB test is not robust. The robustified Jarque–Bera tests are obtained when the non-robust moment estimators are replaced with their robust alternatives. Since there are multiple ways to define a robust estimator of a moment, there is not one, but a plethora of the RJB tests, which were defined by Stehlík et al. [70] and named the RT tests.
Let \({\mathbf {x}}=[x_1, x_2, \ldots , x_n]\) be the ordered sample under consideration. Let \(T_0\) be the sample mean, \(T_1\) its median, \(T_2 \equiv T_{(2, s)}=\frac{1}{n-2s}\sum _{i=s+1}^{n-s}x_i\) the trimmed sample mean, and \(T_3=\text {median}_{i\le j}\{(x_i+x_j)/2\}\) the Hodges–Lehmann pseudo-median of \({\mathbf {x}}\). All of these location estimators except \(T_0\) are insensitive to outliers. The robust central moment estimators can then be defined as follows:
with
With the notation set up, the general statistic of the robustified Jarque–Bera tests can be defined as follows [16, 70]:
With the right choice of the parameters (\(C_1, j_1, a_1 \ldots\)), the RJB statistic can be reduced to that of the classical Jarque–Bera test.
For the purpose of this study, we used the same four RJB tests that Stehlík et al. [16] evaluated in their study: MMRT\(_1\), MMRT\(_2\), TTRT\(_1\), TTRT\(_2\). We refer readers to Stehlík et al. [16] for more details on those particular tests’ parameters.
B.2 The Robustified Lin–Mudholkar Test (RLM)
The Lin–Mudholkar test (LM) is based on the fact that the estimators of mean and variance are independent if and only if the sample at hand comes from a normal distribution [50]. The correlation coefficient between \(\overline{x}\) and \(\text {sd}({\mathbf {x}})\) serves as the statistic of the LM test. Stehlík et al. [16] use the following bootstrap estimator of the coefficient:
The robustified LM tests rely on robust estimation of moments to obtain robust estimators of the skewness and kurtosis. As is the case with the RJB tests, RLM is a class of tests each defined by the particular choice of the estimators’ parameters. The RLM test considered in this study is the same as the one used by Stehlík et al. [16], in which
estimates skewness and
estimates kurtosis.
1.3 B.3 The Robustified Shapiro–Wilk Test (RSW)
Let \(J({\mathbf {x}})\) be the scaled average absolute deviation from the sample median \(M=\text {median}({\mathbf {x}})\):
The RSW test statistic is the ratio of the usual standard deviation estimate \(\text {sd}({\mathbf {x}})\) and \(J(\mathbf{x})\) [53]:
Under the null hypothesis of normality, J is asymptotically normally distributed and a consistent estimator of the true standard deviation [53]. Therefore, values of RSW close to 1 are expected when the sample at hand does come from a normal distribution.
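The RSW statistic can be sketched as follows. The exact scaling constant should be checked against [53]; the sketch assumes the common \(\sqrt{\pi /2}\) factor that makes \(J\) consistent for the standard deviation under normality, and uses the sample standard deviation with \(n-1\) in the denominator:

```python
import numpy as np

def rsw_statistic(x):
    """Ratio of sd(x) to J(x), the scaled mean absolute deviation from the
    median. The sqrt(pi/2) scaling is an assumption (see [53]); it makes
    J consistent for sigma when the data are normal, so RSW ~ 1 under H0."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    j = np.sqrt(np.pi / 2.0) * np.mean(np.abs(x - m))
    return np.std(x, ddof=1) / j

# For a large normal sample the ratio should be close to 1
rng = np.random.default_rng(42)
r = rsw_statistic(rng.normal(size=5000))
```

For heavy-tailed alternatives the sample standard deviation grows faster than \(J\), pushing the ratio away from 1.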
C Machine-learning methods for normality testing
1.1 C.1 The FSSD kernel test of normality
As mentioned in Section 2, there are several kernel tests of goodness-of-fit. Like our approach, they represent a blend of machine learning and statistics. We evaluate the test of Jitkrittum et al. [42] against our network because it is computationally less complex than the original kernel goodness-of-fit tests proposed by Chwialkowski et al. [35], yet comparable to them in terms of statistical power. Of the other candidates, the approach of Kellner and Celisse [45] has quadratic complexity and requires the bootstrap, while a drawback of the approach of Lloyd and Ghahramani [44], as Jitkrittum et al. [42] point out, is that it requires a model to be fitted and new data to be simulated from it. That approach also fails to exploit our prior knowledge of the characteristics of the distribution for which goodness-of-fit is being determined. Therefore, we chose the test formulated by Jitkrittum et al. [42] as the representative of the kernel goodness-of-fit tests. It is a distribution-free test, so we first describe its general version before showing how we used it to test for normality. The test is defined for multidimensional distributions, but we present it for the one-dimensional case because we are interested in one-dimensional Gaussians.
Let \({\mathcal {F}}\) be the RKHS of real-valued functions over \({\mathcal {X}}\in \mathrm{I\!R}\) with the reproducing kernel k. Let q be the density of model \(\Psi\). As in Chwialkowski et al. [35], a Stein operator \(T_{q}\) [36] can be defined over \({\mathcal {F}}\):
Let us note that for
it holds that:
If \(Z \sim \Psi\), then \({\mathrm{I\!E}}(T_{q}f)(Z)= 0\) [35]. Let X be the random variable which follows the distribution from which the sample \({\mathbf {x}}\) was drawn. The Stein discrepancy \(S_q\) between X and Z is defined as follows [35]:
where \(g(\cdot )={\mathrm{I\!E}}\xi _q(X,\cdot )\) is called the Stein witness function and belongs to \({\mathcal {F}}\). Chwialkowski et al. [35] show that if k is a cc-universal kernel [107], then \(S_{q}(X)=0\) if and only if \(X \sim \Psi\), provided a couple of mathematical conditions are satisfied.
Jitkrittum et al. [42] follow the same approach as Chwialkowski et al. [43] for kernel two-sample tests and present the statistic that is comparable to the original one of Chwialkowski et al. [35] in terms of power, but faster to compute. The idea is to use a real analytic kernel k that makes the witness function g real analytic. In that case, the values of \(g(v_1),g(v_2),\ldots ,g(v_m)\) for a sample of points \(\{v_j\}_{j=1}^{m}\), drawn from X, are almost surely zero w.r.t. the density of X if \(X\sim \Psi\). Jitkrittum et al. [42] define the following statistic which they call the finite set Stein discrepancy:
If \(X\sim \Psi\), \(FSSD^2=0\) almost surely. Jitkrittum et al. [42] use the following estimate of \(FSSD^2\):
where
In our case, we want to test if \(\Psi\) is equal to any normal distribution. Similarly to the LF test, we can use the sample estimates of the mean and variance as the parameters of the normal model. Then, we can randomly draw m numbers from \(N(\overline{x}, \text {sd}({\mathbf {x}})^2)\) and use them as points \(\{v_j\}_{j=1}^{m}\), calculating the estimate (42) for \({\mathbf {x}}\) and \(N(\overline{x}, \text {sd}({\mathbf {x}})^2)\). For the kernel, we chose the Gaussian kernel, as it fulfills the conditions laid out by Jitkrittum et al. [42]. To set its bandwidth, we used the median heuristic [108], which sets it to the median of the absolute differences \(|x_i-x_j|\) (\(x_i, x_j \in {\mathbf {x}}, 1\le i < j \le n\)). The number of locations, m, was set to 10.
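The median heuristic for the kernel bandwidth is simple enough to state in a few lines of numpy (a minimal sketch of the rule just described, not the paper's original code):

```python
import numpy as np

def median_heuristic(x):
    """Bandwidth = median of the pairwise absolute differences
    |x_i - x_j| over all index pairs with i < j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)  # strictly upper triangle: i < j
    return np.median(np.abs(x[i] - x[j]))

# For [0, 1, 3] the pairwise differences are {1, 3, 2}, so the median is 2
bw = median_heuristic([0.0, 1.0, 3.0])
```

This choice adapts the kernel scale to the spread of the sample without any tuning.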
Since \(g\) is identically zero when the sample comes from the hypothesized normal distribution, the larger the value of \(FSSD^2\), the more likely it is that the sample came from a non-normal distribution. We refer to this test as the FSSD test.
1.2 C.2 The statistic-based neural network
Since neural networks are designed with a fixed-size input in mind, while samples can have any number of elements, Sigut et al. [3] represent the samples with the statistics of several normality tests chosen in advance. The rationale behind this method is that, taken together, the statistics of different normality tests examine samples from complementary perspectives, so a neural network that combines them could be more accurate than the individual tests. Sigut et al. [3] use the estimates of the following statistics:
1.

skewness:

$$\begin{aligned} \sqrt{\beta _1} = \frac{m_3}{m_2^{3/2}},\quad \text {where } m_u=\frac{1}{n}\sum _{i=1}^{n}(x_i-\overline{x})^u \end{aligned}$$(44)

2.

kurtosis:

$$\begin{aligned} \beta _2 = \frac{m_4}{m_2^2} \end{aligned}$$(45)

3.

the W statistic of the Shapiro–Wilk test (see Equation (14)),

4.

the statistic of the test proposed by Lin and Mudholkar [50]:

$$\begin{aligned} \begin{aligned} Z_p&= \frac{1}{2}\ln \frac{1+r}{1-r} \\ r&= \frac{\sum _{i=1}^{n}(x_i-\overline{x})(h_i-\overline{h})}{\sqrt{\left( \sum _{i=1}^{n}(x_i-\overline{x})^2\right) \left( \sum _{i=1}^{n}(h_i-\overline{h})^2\right) }}\\ h_i&= \left( \frac{\sum _{j\ne i}^{n}x_j^2-\frac{1}{n-1}\left( \sum _{j\ne i}^{n}x_j\right) ^2}{n}\right) ^{\frac{1}{3}}\\ \overline{h}&= \frac{1}{n}\sum _{i=1}^{n}h_i \end{aligned} \end{aligned}$$(46)

5.

and the statistic of the Vasicek test [51]:

$$\begin{aligned} K_{m,n} = \frac{n}{2m\times \text {sd}({\mathbf {x}})}\left( \prod _{i=1}^{n}(x_{(i+m)}-x_{(i-m)})\right) ^{\frac{1}{n}} \end{aligned}$$(47)

where m is a positive integer smaller than n/2, \([x_{(1)},x_{(2)},\ldots ,x_{(n)}]\) is the non-decreasingly sorted sample \({\mathbf {x}}\), \(x_{(i)}=x_{(1)}\) for \(i<1\), and \(x_{(i)}=x_{(n)}\) for \(i>n\).
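The Vasicek statistic (47) is fully specified by the text above, so it can be sketched directly. A minimal numpy implementation; the use of the sample standard deviation with \(n-1\) in the denominator is our assumption:

```python
import numpy as np

def vasicek_statistic(x, m):
    """K_{m,n} from eq. (47): a spacing-based statistic. Order statistics
    with out-of-range indices are clamped to the sample min/max, exactly as
    the boundary convention in the text prescribes."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    hi = np.clip(np.arange(1, n + 1) + m, 1, n) - 1  # 0-based index of x_{(i+m)}
    lo = np.clip(np.arange(1, n + 1) - m, 1, n) - 1  # 0-based index of x_{(i-m)}
    spacings = x[hi] - x[lo]
    # sd with ddof=1 is an assumption; check against the paper's convention
    return n / (2 * m * np.std(x, ddof=1)) * np.prod(spacings) ** (1.0 / n)

# For x = [1, 2, 3, 4] and m = 1 the spacings are [1, 2, 2, 1],
# so K = 4 / (2 * sd) * 4^(1/4) with sd = sqrt(5/3)
k = vasicek_statistic([1.0, 2.0, 3.0, 4.0], 1)
```

Because the geometric mean of the spacings estimates the sample entropy, and the normal distribution maximizes entropy for a fixed variance, small values of \(K_{m,n}\) indicate non-normality.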
It is not clear which activation function Sigut et al. [3] use. They train three networks, each with a single hidden layer containing 3, 5, and 10 neurons, respectively. One of the networks additionally takes the sample size as an input, which makes it more flexible. Like the other two, it proved capable of modeling the posterior Bayesian probability that an input sample is normal. Sigut et al. [3] focus on samples with no more than 200 elements.
In addition to our network, presented in Section 4, we trained one that follows the approach of Sigut et al. [3]. We refer to that network as Statistic-Based Neural Network (SBNN) because it expects an array of statistics as its input. More precisely, prior to being fed to the network, each sample \({\mathbf {x}}\) is transformed to the following array:
just as in Sigut et al. [3] (n is the sample size). ReLU was used as the activation function. To make the comparison fair, we trained SBNN in the same way as our own network, which we design in Section 4.
D Results for set \({\mathcal {F}}\)
Detailed results for each n in each subset of \({\mathcal {F}}\) are presented in Tables 15, 16, 17 and 18.
Simić, M. Testing for normality with neural networks. Neural Comput & Applic 33, 16279–16313 (2021). https://doi.org/10.1007/s00521-021-06229-7