Visualization and statistical modeling of financial big data: double-log modeling with skew-symmetric error distributions
Abstract
This study considers the visualization and statistical modeling of financial data (e.g., sales and assets) for a large data set of global firms, both listed and delisted. We present an exploratory data analysis carried out in the R programming language. The results show that a double-log model with a skew-t error distribution is useful for modeling a firm's total sales volume (in thousands of U.S. dollars) as a function of its number of employees and total assets (in thousands of U.S. dollars). This result is obtained by comparing the Akaike information criteria of several double-log models whose independent and identically distributed random error terms follow skew-symmetric distributions, and by further evaluating the models using cross-validation.
Keywords
Financial big data; Exploratory data analysis; Data visualization; Double-log model; Skew-symmetric distributions; SparkR
1 Introduction
In a previous study on financial big data, Jimichi and Maeda (2014) analyzed a data set of Nikkei NEEDS financial data extracted from a database system created by Jimichi (2010). This data set includes over 1500 Japanese firms that are listed in the first section of the Tokyo Stock Market. The approach of Jimichi and Maeda (2014) is based on the exploratory data analysis method of Tukey (1977): guided by the results of data visualization, a double-log model with a normal error distribution is fitted to sales as the response variable, with the number of employees and total assets as the explanatory variables.
In this analysis, we use a financial data set that is extracted from the “Osiris” database system^{1} and includes information on over 80,000 listed and delisted firms. See Appendix 1 and Saka and Jimichi (2017) for further details on the data set. Because a double-log model with a normal error distribution, like that of Jimichi and Maeda (2014), is not appropriate for modeling sales when the size of the data set increases, we construct a double-log model with a non-normal error distribution.
The remainder of this paper proceeds as follows. In Sect. 2, we introduce the data set used throughout this analysis, and we then visualize the data in Sect. 3 (cf. Unwin (2015)). We describe some properties of the distribution; specifically, the distribution of the logarithm of sales is shown to be slightly more skewed than expected under a normal distribution. We then discuss the use of the log-skew-normal and log-skew-t distributions to model the data set in Sect. 4. In Sect. 5, using the knowledge obtained in Sect. 4, we fit several double-log models with normal, skew-normal, and skew-t error distributions to the sales data (see Azzalini and Capitanio (2014)). Model selection is then performed based on the Akaike information criterion in Sect. 6, and the models are then evaluated using the K-fold cross-validation method in Sect. 7.
2 Data set and statistical packages
Summary of the Osiris data set
firmID            country                          sales              employees          assets.total
Length: 26,682    China: 6,085                     Min.: 1            Min.: 1.0          Min.: 1
Class: character  Japan: 3,219                     1st Qu.: 15,613    1st Qu.: 121.0     1st Qu.: 26,946
Mode: character   United States of America: 3,161  Median: 87,588     Median: 506.5      Median: 139,410
                  India: 1,409                     Mean: 1,363,292    Mean: 4,790.9      Mean: 2,219,307
                  United Kingdom: 1,126            3rd Qu.: 448,075   3rd Qu.: 2,171.0   3rd Qu.: 699,097
                  Cayman Islands: 1,006            Max.: 482,130,000  Max.: 2,300,000.0  Max.: 877,789,728
                  (Other): 10,676

firmID: Combination of firm name and BvD firm code

country: Country name

sales: Sales (Units: U.S. dollars in thousands)

employees: Number of employees (Units: persons)

assets.total: Total assets (Units: U.S. dollars in thousands)
In the next section, we visualize the data and try to gain insights that will help to inform the statistical modeling.
3 Data visualization
In Fig. 1, the off-diagonal plots indicate that the untransformed variables (left panel) do not follow bivariate normal distributions, and the histograms on the diagonal show that all variables are strongly right-skewed. We apply log transformations to sales, employees, and assets.total to normalize the data (as in, for example, Tukey (1977), Mosteller and Tukey (1977), and Fox and Weisberg (2011)). The log-scale plots (right panel) appear to be closer to a bivariate normal distribution, but they are slightly modulated. Figure 5.1 in Azzalini and Capitanio (2014) indicates a similar result.
Thus, we can draw the following conclusions from the plots. In the pairwise scatter plots on the original axes, each pair of sales, employees, and assets.total is skewed, whereas the scatter plots on the log scale indicate a left skew. The variable sales is right-skewed, and \(\log (\texttt {{sales}})\) is almost normally distributed with a slight left skew. In the next section, we consider some models for \(\log (\texttt {{sales}})\) using the above insights.
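The effect of the log transformation on skewness can be illustrated with a small sketch. The data below are synthetic log-normal draws standing in for a right-skewed variable such as sales; they are not the Osiris data, and the parameters (location 11, scale 2, sample size 20,000) are assumptions chosen only for illustration.

```python
import math
import random


def sample_skewness(values):
    """Return the moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic stand-in for a right-skewed variable (assumed parameters, not real data).
rng = random.Random(2015)
synthetic_sales = [rng.lognormvariate(11.0, 2.0) for _ in range(20000)]

skew_raw = sample_skewness(synthetic_sales)
skew_log = sample_skewness([math.log(v) for v in synthetic_sales])

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

For exactly log-normal data the skewness of the logged values is close to zero; the slight left skew reported for \(\log (\texttt {{sales}})\) is what motivates the skew-symmetric models considered next.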
4 Statistical modeling of sales
Azzalini (1985) presented a generalization of the normal distribution with nonzero skewness called the skew-normal distribution. This distribution has been used for modeling and analyzing skewed data. Related families of distributions include the skew-t distribution, which is also studied in Azzalini and Capitanio (2014).
In this section, we fit these skew-symmetric distributions to \(\log (\texttt {{sales}})\). We provide expressions for these distributions in the appendices.
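As a reminder of what the skew-normal family looks like, the following sketch evaluates the standard skew-normal density \(f(x) = (2/\omega )\,\phi (z)\,\Phi (\alpha z)\) with \(z = (x-\xi )/\omega \), as given by Azzalini (1985). The parameter values used in the check (\(\xi = 11\), \(\omega = 3\), \(\alpha = -2\)) are hypothetical and are not estimates from the paper.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, xi=0.0, omega=1.0, alpha=0.0):
    """Density of SN(xi, omega^2, alpha): (2/omega) * phi(z) * Phi(alpha * z)."""
    z = (x - xi) / omega
    return 2.0 / omega * normal_pdf(z) * normal_cdf(alpha * z)


def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


# Hypothetical parameters with a negative slant (left skew), for illustration only.
area = integrate(lambda x: skew_normal_pdf(x, 11.0, 3.0, -2.0), -40.0, 40.0)
print(f"total probability: {area:.6f}")
```

Setting \(\alpha = 0\) recovers the normal density, and a negative \(\alpha \) shifts mass to the left tail, which is the direction of skew observed for \(\log (\texttt {{sales}})\).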
4.1 Log-skew-normal distribution
The tail of the distribution deviates from the fitted distribution in the Q–Q plot in Fig. 4. Thus, we now fit another skew-symmetric distribution to \(\log (\texttt {{sales}})\).
4.2 Log-skew-t distribution
Other skew-symmetric distributions beyond the typical skew-t distribution are given in Azzalini and Capitanio (2014) (see also Appendix 1). We consider the fit of the skew-t distribution \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\) to \(\log (\texttt {{sales}})\). Note that \((\xi , \omega , \alpha , \nu )\) are the direct parameters (DPs), where \(\xi , \alpha \in \mathbb {R}\), \(\omega \in \mathbb {R}^{+}\), and \(\nu > 1\).
The statistical model in (1) appears to provide a good fit for \(\log (\texttt {{sales}})\). The Q–Q plot (Fig. 5) does show some deviation from the skew-t distribution in the tails. Nevertheless, the skew-t distribution (Fig. 6) appears to provide a better fit than the skew-normal distribution does (Fig. 4).
5 Double-log modeling
Normal case: \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {N}(0,\sigma ^{2})\)
Skew-normal case: \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {SN}(0,\omega ^{2}, \alpha )\)
Skew-t case: \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {ST}(0,\omega ^{2}, \alpha , \nu )\)
Remark 1
Note that Eq. (2) can be viewed as a Cobb–Douglas production function \(P=b L^{\alpha } C^{\beta }\), where P, L, C represent production, labor, and capital, respectively (see Cobb and Douglas (1928)).
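The link between the Cobb–Douglas form and a double-log regression can be written out as follows. This is a sketch under assumed notation: the coefficient symbols \(\beta _0, \beta _1, \beta _2\) and the multiplicative error \(\varepsilon _i\) are chosen here to match the error specifications listed above, and the exact form of the paper's numbered equations may differ.

```latex
\begin{align*}
\texttt{sales}_i
  &= e^{\beta_0}\,\texttt{employees}_i^{\beta_1}\,
     \texttt{assets.total}_i^{\beta_2}\,\varepsilon_i, \\
\log(\texttt{sales}_i)
  &= \beta_0 + \beta_1 \log(\texttt{employees}_i)
     + \beta_2 \log(\texttt{assets.total}_i) + \log(\varepsilon_i).
\end{align*}
```

Taking logarithms turns the multiplicative model into a linear one, so \(\beta _1\) and \(\beta _2\) can be read as elasticities of sales with respect to employees and total assets, and the distributional assumption is placed on \(\log (\varepsilon _i)\).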
5.1 Normal case
t-table: normal case
Parameter  Estimate  Std. error  t value  Pr(>|t|)
(Intercept)  0.5803  0.0320  18.13  0.0000
log(employees)  0.4673  0.0045  104.36  0.0000
log(assets.total)  0.6559  0.0040  162.80  0.0000
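To make the normal-case estimates concrete, the sketch below plugs them into the fitted double-log equation. The coefficients are copied from the table above; the example firm (employees and total assets at the sample medians from the data summary) is hypothetical, and natural logarithms are assumed, as is the default for log() in R.

```python
import math

# Coefficient estimates copied from the normal-case table.
INTERCEPT = 0.5803
B_EMPLOYEES = 0.4673
B_ASSETS = 0.6559


def predict_log_sales(employees, assets_total):
    """Fitted value of log(sales); natural logarithms are assumed."""
    return (INTERCEPT
            + B_EMPLOYEES * math.log(employees)
            + B_ASSETS * math.log(assets_total))


# Hypothetical firm placed at the sample medians (506.5 employees, 139,410 in assets).
log_sales_hat = predict_log_sales(506.5, 139_410)
sales_hat = math.exp(log_sales_hat)  # back-transformed, in thousands of U.S. dollars

# Sum of the two elasticities; a value above 1 suggests increasing returns to scale.
returns_to_scale = B_EMPLOYEES + B_ASSETS

print(f"fitted log(sales): {log_sales_hat:.3f}")
print(f"back-transformed sales: {sales_hat:,.0f}")
print(f"sum of elasticities: {returns_to_scale:.4f}")
```

Exponentiating the fitted log value gives a prediction on the original scale, although under a normal error on the log scale this corresponds to a median rather than a mean of sales.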
5.2 Skew-normal case
Table of z-ratio values: skew-normal case
Parameter  Estimate  Std.err  z-ratio  Pr{>|z|}
(Intercept.DP)  1.6644  0.0308  54.08  0.0000
log(employees)  0.3621  0.0047  77.33  0.0000
log(assets.total)  0.7039  0.0040  178.00  0.0000
omega  1.4114  0.0088  160.55  0.0000
alpha  −2.3201  0.0393  −59.04  0.0000
The Q–Q plot in Fig. 9 reveals clear problems with the fit, because many of the residuals lie far from the dashed line.
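One point worth keeping in mind when comparing the intercepts across cases is that a skew-normal error with location 0 does not have mean 0. The sketch below uses the standard formula \(\mathrm {E}[\mathsf {SN}(\xi ,\omega ^{2},\alpha )] = \xi + \omega \delta \sqrt{2/\pi }\) with \(\delta = \alpha /\sqrt{1+\alpha ^{2}}\) and the estimates from the table above. Interpreting the shifted value as a mean-scale intercept is an illustration added here, not a computation reported in the paper.

```python
import math


def skew_normal_mean(xi, omega, alpha):
    """Mean of SN(xi, omega^2, alpha): xi + omega * delta * sqrt(2/pi)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)


# Direct-parameter (DP) estimates copied from the skew-normal table.
INTERCEPT_DP = 1.6644
OMEGA = 1.4114
ALPHA = -2.3201

# Mean of the error term log(eps_i) ~ SN(0, omega^2, alpha).
error_mean = skew_normal_mean(0.0, OMEGA, ALPHA)

# Intercept on the mean scale: the DP intercept shifted by the error mean.
intercept_mean_scale = INTERCEPT_DP + error_mean

print(f"mean of the error term: {error_mean:.4f}")
print(f"intercept on the mean scale: {intercept_mean_scale:.4f}")
```

The shifted intercept comes out at roughly 0.63, much closer to the normal-case intercept of 0.5803 than the DP value of 1.6644, which suggests that the large gap between the two tables is mostly a matter of parameterization.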
5.3 Skew-t case
Table of z-ratio values: skew-t case
Parameter  Estimate  Std.err  z-ratio  Pr{>|z|}
(Intercept.DP)  1.3258  0.0288  45.96  0.0000
log(employees)  0.3531  0.0043  81.64  0.0000
log(assets.total)  0.7017  0.0036  195.52  0.0000
omega  0.7637  0.0105  72.40  0.0000
alpha  −1.0210  0.0405  −25.24  0.0000
nu  3.4664  0.0803  43.17  0.0000
Figure 12 indicates that some scaled DP residuals do not lie along the dashed line in the tails of the Q–Q plot, but the P–P plot suggests a good fit, and these results are better than those in the skew-normal case.
In the next section, we perform model selection using the Akaike information criterion (AIC).
6 Model selection with the AIC
In this section, we perform model selection with respect to the distribution of \(\log (\texttt {{sales}})\) and the error distributions of the double-log model using the AIC. See Akaike (1973) and Konishi and Kitagawa (2008) for details regarding the AIC.
6.1 Distributions of \(\log (\texttt {{sales}})\)
AIC table: distributions for the log of sales
Model  df  AIC
lm.log.sales2015  2  127,076.06
selm.log.sales2015  3  126,627.07
selm.ST.log.sales2015  4  126,546.63
The minimum \(\mathrm {AIC}\) is obtained from the fitted skew-t distribution (selm.ST.log.sales2015), implying that this distribution provides the best fit. Note that this result is consistent with the visualization results (Fig. 5) in Sect. 4.
6.2 Double-log models
AIC table: double-log models
Model  df  AIC
lm.log.firmfin2015  4  74,980.13
selm.log.firmfin2015  5  71,972.08
selm.ST.log.firmfin2015  6  67,897.56
We again observe that the minimum \(\mathrm {AIC}\) is obtained from the skew-t model (selm.ST.log.firmfin2015) (see Table 6). Note that this result is consistent with the visualization results (Fig. 12).
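The comparison in the table can be reproduced mechanically. The sketch below states the usual definition \(\mathrm {AIC} = -2\log L + 2k\) and ranks the three double-log models by their AIC values as reported above; the helper function and variable names are illustrative and are not taken from the authors' code.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


# AIC values copied from the table of double-log models.
aic_table = {
    "lm.log.firmfin2015": 74980.13,
    "selm.log.firmfin2015": 71972.08,
    "selm.ST.log.firmfin2015": 67897.56,
}

# The preferred model is the one with the smallest AIC.
best = min(aic_table, key=aic_table.get)

# Differences relative to the best model; larger gaps mean weaker support.
delta = {name: value - aic_table[best] for name, value in aic_table.items()}

for name in sorted(delta, key=delta.get):
    print(f"{name}: delta AIC = {delta[name]:.2f}")
```

The gaps of several thousand AIC units are far larger than the penalty for the one or two extra parameters, so the ranking does not hinge on the penalty term.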
7 Cross-validation
We evaluate the double-log models in Eq. (3) using the K-fold cross-validation method. We set \(K=10\) and adopt the mean squared error of the prediction (MSEP) and the AIC as the evaluation criteria. See Stone (1974), Efron and Tibshirani (1993), and James et al. (2013) for details on general cross-validation methods, and see Efron and Hastie (2016) for specific details on the AIC.
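The K-fold procedure with the MSEP criterion can be sketched as follows. This is a simplified illustration, not the authors' implementation: it fits the double-log model by ordinary least squares only (no skew-normal or skew-t likelihood), and the data are synthetic draws with assumed coefficients and unit error variance.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def kfold_msep(rows, y, k=10, seed=0):
    """Average held-out mean squared prediction error over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        beta = fit_ols([rows[i] for i in train_idx], [y[i] for i in train_idx])
        sq = [(y[i] - sum(b * v for b, v in zip(beta, rows[i]))) ** 2
              for i in test_idx]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k


# Synthetic log-scale data with assumed coefficients (not the Osiris data).
rng = random.Random(1)
rows, y = [], []
for _ in range(2000):
    log_emp = rng.gauss(6.0, 2.0)
    log_assets = rng.gauss(12.0, 2.0)
    rows.append([1.0, log_emp, log_assets])
    y.append(0.6 + 0.45 * log_emp + 0.65 * log_assets + rng.gauss(0.0, 1.0))

msep = kfold_msep(rows, y, k=10)
print(f"10-fold MSEP: {msep:.3f}")
```

Each observation is predicted exactly once by a model that did not see it, so the averaged MSEP estimates out-of-sample error; for the synthetic data it should land near the assumed error variance of 1.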
7.1 MSEP discrepancy
7.2 AIC discrepancy
8 Concluding remarks and discussion
In this study, we conducted an exploratory data analysis to visualize the financial data of listed and delisted firms around the world. We investigated the distribution of sales and constructed statistical models that explain total sales in terms of the number of employees and total assets. The total sales of Japan’s listed firms usually follow a log-normal distribution (e.g., Jimichi and Maeda (2014)). However, we observe that the log-skew-t distribution provides a good fit for the sales of the global firms in our data set; that is, the skew-t distribution fits the logarithm of sales. One reason is that when the population is expanded to include global firms, it contains some firms of extremely small scale, so the distribution of the logarithm of sales remains left-skewed.
The double-log model with a skew-t error distribution is also useful for modeling the logarithm of sales as a linear function of the logarithms of the number of employees and total assets. The estimated regression plane (8) is a better predictor of the logarithm of sales than the model with a normal error distribution is. The double-log models were evaluated using the K-fold cross-validation method, and the model with the skew-t error distribution had the best performance when the AIC was used for evaluation. As we mentioned in Remark 1, a model that accurately predicts sales can be constructed by fitting a Cobb–Douglas production function to the financial (accounting) data of the set of global firms. Note that these results are based on the data for 2015; similar results (not shown here) were obtained for the decade 2006–2015.
Finally, we used the Apache Spark™^{4} (hereafter, Spark) environment to manipulate our data set. Our next analysis will consider a data set extracted from the “Orbis” database^{5}, which contains information on over 20,000,000 firms and exceeds one hundred gigabytes in size, so Spark or a similar tool for handling big data will be required.
Acknowledgements
The authors wish to thank the reviewers for their helpful comments. This work is partially supported by a Grant-in-Aid for Scientific Research (KAKENHI: No. 16K04022) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN Project ID: jh171002NWJ, jh181001NWJ) in Japan. We would like to thank Mr. Ayumu Masuda of Bureau van Dijk KK for extracting some dataset files from the Osiris database system.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Arellano-Valle, R. B., & Azzalini, A. (2013). The centred parameterization and related quantities of the skew-t distribution. Journal of Multivariate Analysis, 113, 73–90.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178.
Azzalini, A., & Capitanio, A. (2014). The skew-normal and related families. Institute of Mathematical Statistics monographs. Cambridge: Cambridge University Press.
Cobb, C. W., & Douglas, P. H. (1928). A theory of production. American Economic Review, 18, 139–165.
Efron, B., & Hastie, T. (2016). Computer age statistical inference: algorithms, evidence, and data science. Cambridge: Cambridge University Press.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall/CRC.
Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Thousand Oaks: Sage.
Healy, M. J. R. (1968). Multivariate normal plotting. Applied Statistics, 17, 157–161.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Berlin: Springer.
Jimichi, M. (2010). Building of Financial Database Servers, ISBN: 9784990553005. https://kwansei.repo.nii.ac.jp/ (in Japanese).
Jimichi, M., & Maeda, S. (2014). Visualization and statistical modeling of financial data with R, Poster at the R user conference 2014. http://user2014.stat.ucla.edu/abstracts/posters/48_Jimichi.pdf
Konishi, S., & Kitagawa, G. (2008). Information criteria and statistical modeling. Berlin: Springer.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: a second course in statistics. Reading, Mass: Addison-Wesley.
Ryza, S., Laserson, U., Owen, S., & Wills, J. (2016). Advanced analytics with Spark. Newton: O’Reilly.
Saka, C., & Jimichi, M. (2017). Evidence of inequality from accounting data visualization. Taiwan Accounting Review, 13(2), 193–234.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147.
Tukey, J. W. (1977). Exploratory data analysis. Boston: Addison-Wesley Publishing Co.
Unwin, A. (2015). Graphical data analysis with R. London: Chapman and Hall/CRC.