Japanese Journal of Statistics and Data Science, Volume 1, Issue 2, pp 347–371

Visualization and statistical modeling of financial big data: double-log modeling with skew-symmetric error distributions

  • Masayuki Jimichi
  • Daisuke Miyamoto
  • Chika Saka
  • Shuichi Nagata

Abstract

This study considers the visualization and statistical modeling of financial data (e.g., sales, assets, etc.) for a large data set of global firms that are listed and delisted. We present exploratory data analysis carried out in the R programming language. The results show that a double-log model with a skew-t error distribution is useful for modeling a firm’s total sales volume (in thousands of U.S. dollars) as a function of its number of employees and total assets (in thousands of U.S. dollars). This result is obtained by comparing the Akaike information criteria of several double-log models with independent and identically distributed random error terms with skew-symmetric distributions and by further evaluating the models using cross-validation.

Keywords

Financial big data; Exploratory data analysis; Data visualization; Double-log model; Skew-symmetric distributions; SparkR

1 Introduction

In a previous study on financial big data, Jimichi and Maeda (2014) analyzed a data set of Nikkei NEEDS financial data extracted from a database system created by Jimichi (2010). This data set includes over 1500 Japanese firms that are listed in the first section of the Tokyo Stock Market. The approach of Jimichi and Maeda (2014) is based on the exploratory data analysis method of Tukey (1977): guided by the results of data visualization, a double-log model with a normal error distribution is fitted to sales as the response variable, with the number of employees and total assets as the explanatory variables.

In this analysis, we use a financial data set that is extracted from the “Osiris” database system (see footnote 1) and includes information on over 80,000 listed and delisted firms. See Appendix 1 and Saka and Jimichi (2017) for further details on the data set. Because a double-log model with a normal error distribution, like that of Jimichi and Maeda (2014), is not appropriate for modeling sales when the size of the data set increases, we construct a double-log model with a non-normal error distribution.

The remainder of this paper proceeds as follows. In Sect.  2, we introduce the data set used throughout this analysis, and we then visualize the data in Sect.  3 (cf. Unwin (2015)). We describe some properties of the distribution; specifically, the distribution of the logarithm of sales is shown to be slightly more skewed than expected under a normal distribution. We then discuss the use of the log-skew-normal and log-skew-t distributions to model the data set in Sect.  4. In Sect. 5, using the knowledge obtained in Sect.  4, we fit several double-log models with normal, skew-normal, and skew-t error distributions to the sales data (see Azzalini and Capitanio (2014)). Model selection is then performed based on the Akaike information criterion in Sect.  6, and the models are then evaluated using the K-fold cross-validation method in Sect.  7.

2 Data set and statistical packages

Table 1 shows a summary of the data set, which was extracted from the Bureau van Dijk (BvD) Osiris database system and is related to the financial indexes of 26,682 global listed and delisted firms that used consolidated accounting in 2015. The R packages SparkR, magrittr, and dplyr are used to manipulate the data. See Appendix 1 for details.
Table 1 Summary of the Osiris data set

| firmID | country | sales | employees | assets.total |
|---|---|---|---|---|
| Length: 26,682 | China: 6,085 | Min.: 1 | Min.: 1.0 | Min.: 1 |
| Class: character | Japan: 3,219 | 1st Qu.: 15,613 | 1st Qu.: 121.0 | 1st Qu.: 26,946 |
| Mode: character | United States of America: 3,161 | Median: 87,588 | Median: 506.5 | Median: 139,410 |
| | India: 1,409 | Mean: 1,363,292 | Mean: 4,790.9 | Mean: 2,219,307 |
| | United Kingdom: 1,126 | 3rd Qu.: 448,075 | 3rd Qu.: 2,171.0 | 3rd Qu.: 699,097 |
| | Cayman Islands: 1,006 | Max.: 482,130,000 | Max.: 2,300,000.0 | Max.: 877,789,728 |
| | (Other): 10,676 | | | |

The variables are as follows:
  • firmID: Combination of firm name and BvD firm code

  • country: Country name

  • sales: Sales (Units: U.S. dollars in thousands)

  • employees: Number of employees (Units: persons)

  • assets.total: Total assets (Units: U.S. dollars in thousands)

In this work, we use the R statistical programming language (see footnote 2) and the packages SparkR, magrittr, and dplyr for data loading and manipulation; ggplot2, GGally, and rgl for visualization; and sn for the visualization and inference of the skew-symmetric distributions. See Appendix 1 for more details.

In the next section, we visualize the data and try to gain insights that will help to inform the statistical modeling.

3 Data visualization

First, we visualize the data with pairwise scatter plots for sales, the number of employees, and assets to gain information about each pair of variables.
Fig. 1

Pairwise scatter plots: untransformed variables (left panel) and log scale (right panel)

From Fig. 1, the plots off the diagonal indicate that the untransformed variables (left panel) do not follow bivariate normal distributions, and the plots on the diagonal show that the histograms of all variables are very right-skewed. We apply log transformations to sales, employees, and assets.total to normalize the data (as in, for example, Tukey (1977), Mosteller and Tukey (1977), and Fox and Weisberg (2011)). The log-scale plots (right panel) appear to be closer to a bivariate normal distribution, but they are slightly modulated. Figure 5.1 in Azzalini and Capitanio (2014) indicates a similar result.

Next, we investigate the distribution of sales. Figure  2 shows the histograms of sales plotted on the original axes and on the log scale. The histogram for the untransformed data (left panel) is extremely right-skewed. In contrast, the log-scale (right panel) variable seems to be normally distributed, but the left tail is slightly longer than the right tail, so it is not completely symmetric.
Fig. 2

Histograms of sales: Untransformed (left panel) and log scale (right panel) data

We can further investigate the normality of the data with a normal Q–Q plot of \(\log (\texttt {{sales}})\).
Fig. 3

Normal Q–Q plot of \(\log (\texttt {{sales}})\)

We can clearly observe fit issues in both the upper and lower tails in Fig. 3. Additionally, the skewness of \(\log (\texttt {{sales}})\) is given as follows:
$$\begin{aligned} g_{1}:= \frac{m_{3}}{m_{2}^{3/2}} = -0.35 (< 0), \end{aligned}$$
where \(m_{j}:=\sum _{i=1}^{n}(x_{i}-\bar{x})^{j}/n\) is the j-th central moment of the data \(\left\{ x_{1},\dots ,x_{n}\right\}\), that is, the moment around the mean \(\bar{x}=\sum _{i=1}^{n}x_{i}/n\). The skewness is negative, so the data are skewed left.
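The skewness statistic defined above can be computed directly from the central moments. The following is a minimal sketch in Python (the paper's own analysis was carried out in R); the function name `sample_skewness` and the toy values are illustrative assumptions, not the Osiris data.

```python
import math


def sample_skewness(xs):
    """Return g1 = m3 / m2**1.5, with central moments that use divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Hypothetical toy values (not the Osiris data): one very small firm
# produces a long left tail on the log scale, so g1 is negative.
log_sales_toy = [math.log(v) for v in (2, 900, 1500, 2200, 3000, 4100, 5200)]
g1 = sample_skewness(log_sales_toy)
print(f"g1 = {g1:.3f}")
```

A negative value of `g1` corresponds to the left skew reported for the logarithm of sales.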

Thus, we can draw the following conclusions from the plots. On the original axes, the pairwise scatter plots show that each pair of sales, employees, and assets.total is skewed, whereas the scatter plots on the log scale indicate a left skew. The variable sales is skewed right, and \(\log (\texttt {{sales}})\) is almost normally distributed with a slight left skew. In the next section, we consider some models for \(\log (\texttt {{sales}})\) using the above insights.

4 Statistical modeling of sales

Azzalini (1985) presented a generalization of the normal distribution with non-zero skewness called the skew-normal distribution. This distribution has been used for modeling and analyzing skewed data. Additionally, related families of distributions include the skew-t distribution, which was also studied by Azzalini and Capitanio (2014).

In this section, we fit these skew-symmetric distributions to \(\log (\texttt {{sales}})\). Expressions for these distributions are provided in the appendices.

4.1 Log-skew-normal distribution

Suppose that sales is distributed according to the log-skew-normal distribution \(\mathsf {LSN}(\xi , \omega ^{2}, \alpha )\). We fit the skew-normal distribution \(\mathsf {SN}(\xi , \omega ^{2}, \alpha )\) to \(\log (\texttt {{sales}})\). Note that \((\xi , \omega , \alpha )\) are the direct parameters (DP), where \(\xi \in \mathbb {R}:=(-\infty , \infty )\), \(\omega \in \mathbb {R}^{+}:=(0,\infty )\), and \(\alpha \in \mathbb {R}\). See Azzalini and Capitanio (2014). The maximum likelihood estimates (MLE) are given as follows:
$$\begin{aligned} (\widehat{\xi }, \widehat{\omega }, \widehat{\alpha }) =(13.54,3.46,-1.44) \end{aligned}$$
This result leads to the following estimate for the probability density function (p.d.f.), which we refer to as the statistical model:
$$\begin{aligned} f _{\mathsf {SN}}(\log (\texttt {{sales}}) \mid \widehat{\xi }, \widehat{\omega }, \widehat{\alpha }) := \frac{2}{\widehat{\omega }} \phi \left( \frac{\log (\texttt {{sales}})-\widehat{\xi }}{\widehat{\omega }}\right) \Phi \left( \widehat{\alpha } \frac{\log (\texttt {{sales}})-\widehat{\xi }}{\widehat{\omega }}\right) , \end{aligned}$$
where
$$\begin{aligned} \phi (z) := \frac{1}{\sqrt{2 \pi }} \exp \left( -\frac{z^{2}}{2}\right) , \quad \Phi (z) := \int _{-\infty }^{z} \phi (x) dx \quad (z \in \mathbb {R}) \end{aligned}$$
are the p.d.f. and the cumulative distribution function (c.d.f.), respectively, of the standard normal distribution. The histogram of \(\log (\texttt {{sales}})\) and the Q–Q plot of the squared scaled DP residuals
$$\begin{aligned} \widehat{z}_{i}^{2} := \left( \frac{\log (\texttt {{sales}}_{i})-\widehat{\xi }}{\widehat{\omega }}\right) ^{2} \end{aligned}$$
are shown in Fig.  4.
Fig. 4

Histogram of \(\log (\texttt {{sales}})\) fitted to a statistical model based on the skew-normal distribution (left panel) and Q–Q plot of the squared scaled DP residuals (right panel)

The tail of the distribution deviates from the fitted distribution in the Q–Q plot in Fig.  4. Thus, we now fit another skew-symmetric distribution to \(\log (\texttt {{sales}})\).
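The fitted skew-normal density of this subsection can be evaluated with the standard library alone. The sketch below is written in Python for illustration (the paper uses the R package sn); the helper names `norm_pdf`, `norm_cdf`, `sn_pdf`, and `integrate` are assumptions introduced here, and the parameter values are the DP estimates reported above.

```python
import math


def norm_pdf(z):
    """Standard normal p.d.f. phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal c.d.f. Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sn_pdf(x, xi, omega, alpha):
    """Skew-normal density: (2/omega) * phi(z) * Phi(alpha * z), z = (x - xi)/omega."""
    z = (x - xi) / omega
    return 2.0 / omega * norm_pdf(z) * norm_cdf(alpha * z)


def integrate(f, lo, hi, steps=20000):
    """Composite trapezoid rule, used only as a sanity check on the density."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


# DP estimates reported in the text for log(sales).
XI, OMEGA, ALPHA = 13.54, 3.46, -1.44

area = integrate(lambda x: sn_pdf(x, XI, OMEGA, ALPHA),
                 XI - 12 * OMEGA, XI + 12 * OMEGA)
print(f"area under the fitted density: {area:.6f}")
```

Because the estimated slant parameter is negative, the density places more mass to the left of the location parameter than to the right.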

4.2 Log-skew-t distribution

Other skew-symmetric distributions beyond the typical skew-t distribution are given in Azzalini and Capitanio (2014) (see also Appendix 1). We consider the fit of the skew-t distribution \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\) to \(\log (\texttt {{sales}})\). Note that \((\xi , \omega , \alpha , \nu )\) are the DPs, where \(\xi , \alpha \in \mathbb {R}\), \(\omega \in \mathbb {R}^{+}\), and \(\nu > 1\).

The MLEs of the DPs \((\xi , \omega , \alpha , \nu )\) are given by
$$\begin{aligned} (\widehat{\xi }, \widehat{\omega }, \widehat{\alpha }, \widehat{\nu }) =( 13.17, 3.05, -1.12, 19.1 ), \end{aligned}$$
and the statistical model is
$$\begin{aligned} f _{\mathsf {ST}}(\log (\texttt {{sales}}) \mid \widehat{\xi }, \widehat{\omega }, \widehat{\alpha }, \widehat{\nu }) =\frac{2}{\widehat{\omega }} \ f _{\mathsf {t}}\left( \left. \frac{\log (\texttt {{sales}})-\widehat{\xi }}{\widehat{\omega }}\right| \widehat{\nu } \right) \nonumber \\ \times \quad F _{\mathsf {t}}\left( \left. \widehat{\alpha } \frac{\log (\texttt {{sales}})-\widehat{\xi }}{\widehat{\omega }} \sqrt{\frac{\widehat{\nu }+1}{\left( \frac{\log (\texttt {{sales}})-\widehat{\xi }}{\widehat{\omega }}\right) ^{2}+\widehat{\nu }}} \right| \widehat{\nu }+1 \right) , \end{aligned}$$
(1)
where
$$\begin{aligned} f _{\mathsf {t}}(z \mid \nu ) := \frac{\Gamma \left( \frac{\nu +1}{2}\right) }{\Gamma \left( \frac{\nu }{2}\right) \sqrt{\pi \nu }} \left( 1+\frac{z^{2}}{\nu }\right) ^{-\frac{\nu +1}{2}}, \quad F _{\mathsf {t}}(z \mid \nu ) = \int _{-\infty }^{z} f _{\mathsf {t}}(x \mid \nu ) \text{ d } x \end{aligned}$$
are the p.d.f. and the c.d.f., respectively, of the t distribution with \(\nu\) degrees of freedom.
The histogram of \(\log (\texttt {{sales}})\) compared with the statistical model and the Q–Q plot (see footnote 3) of the squared scaled DP residuals are shown in Fig. 5.
Fig. 5

Histogram of \(\log (\texttt {{sales}})\) fitted to the skew-t distribution (left panel) and Q–Q plot of the squared scaled DP residuals (right panel)

The statistical model in (1) appears to provide a good fit for \(\log (\texttt {{sales}})\), although the Q–Q plot (Fig. 5) shows some deviation from the skew-t distribution in the tails. Nevertheless, the skew-t distribution (Fig. 5) appears to provide a better fit than the skew-normal distribution does (Fig. 4).
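The skew-t density in (1) can be sketched in the same way. This Python version is an illustration only (the paper uses the R package sn); the names `t_pdf`, `t_cdf`, and `st_pdf` are assumptions, and the t c.d.f. is approximated here by Simpson's rule rather than by a library routine.

```python
import math


def t_pdf(z, nu):
    """Student t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z * z / nu))


def t_cdf(z, nu, steps=2000):
    """F_t(z | nu) = 1/2 + integral of f_t from 0 to z (composite Simpson, steps even)."""
    if z == 0.0:
        return 0.5
    h = z / steps
    total = t_pdf(0.0, nu) + t_pdf(z, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3.0


def st_pdf(x, xi, omega, alpha, nu):
    """Skew-t density as written in Eq. (1)."""
    z = (x - xi) / omega
    w = alpha * z * math.sqrt((nu + 1) / (z * z + nu))
    return 2.0 / omega * t_pdf(z, nu) * t_cdf(w, nu + 1)


# DP estimates reported in the text for log(sales).
XI, OMEGA, ALPHA, NU = 13.17, 3.05, -1.12, 19.1
density_at_13 = st_pdf(13.0, XI, OMEGA, ALPHA, NU)
print(f"fitted skew-t density at log(sales) = 13: {density_at_13:.4f}")
```

Setting `alpha` to zero reduces `st_pdf` to a scaled t density, which gives a simple check of the implementation.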

5 Double-log modeling

Let us consider the following model:
$$\begin{aligned} \texttt {{sales}}_{i} = \gamma \times \texttt {{employees}}_{i}^{\alpha _{1}} \times \texttt {{assets.total}}_{i}^{\alpha _{2}} \times \varepsilon _{i}, \quad i = 1,\dots ,n. \end{aligned}$$
(2)
We take the natural logarithm of both sides of (2) to obtain the following model:
$$\begin{aligned} \log (\texttt {{sales}}_{i}) = \alpha _{0} + \alpha _{1} \log (\texttt {{employees}}_{i})+ \alpha _{2} \log (\texttt {{assets.total}}_{i}) + \log (\varepsilon _{i}), \end{aligned}$$
(3)
where \(\alpha _{0}:=\log \gamma\). Equation  (3) is called the double-log model or the log-log model.
We consider the following three cases for the error distributions:
  1. Normal Case:

    \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {N}(0,\sigma ^{2})\)

     
  2. Skew-normal Case:

    \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {SN}(0,\omega ^{2}, \alpha )\)

     
  3. Skew-t Case:

    \(\log (\varepsilon _{i}) {\mathop {\sim }\limits ^\mathrm{i.i.d.}} \mathsf {ST}(0,\omega ^{2}, \alpha , \nu )\),

     
where \(i=1,\dots , n\) and the notation “\({\mathop {\sim }\limits ^\mathrm{i.i.d.}}\)” denotes independent and identically distributed.

Remark 1

Note that Eq. (2) can be viewed as a Cobb-Douglas production function \(P=b L^{\alpha } C^{\beta }\), where P, L, C represent production, labor, and capital, respectively (see Cobb and Douglas (1928)).

5.1 Normal case

The t values of the regression coefficients \(\alpha _{j}\) (\(j=0,1,2\)) are given in Table  2. Note that all of the coefficients are statistically significant.
Table 2 t-Table: normal case

| | Estimate | Std. error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.5803 | 0.0320 | 18.13 | 0.0000 |
| log(employees) | 0.4673 | 0.0045 | 104.36 | 0.0000 |
| log(assets.total) | 0.6559 | 0.0040 | 162.80 | 0.0000 |

The estimated regression plane is given as follows:
$$\begin{aligned} \widehat{\eta }_{\mathsf {LNL}}= & {} \widehat{\alpha }_{0} + \widehat{\alpha }_{1} \log (\texttt {{employees}}) + \widehat{\alpha }_{2} \log (\texttt {{assets.total}}) \nonumber \\= & {} 0.58 + 0.467 \log (\texttt {{employees}}) + 0.656 \log (\texttt {{assets.total}}). \end{aligned}$$
(4)
See Fig.  15 in Appendix 1. The estimated variance of the error is \(\widehat{\sigma }^{2} = 0.986^{2}\), and the coefficient of determination and its adjusted version are given by \(R^{2} = 0.858\) and \(\bar{R}^{2} = 0.858\), respectively. From these results, the double-log model with a normal error distribution provides a reasonable fit, but the normal Q–Q plot of the residuals \(e_{\mathsf {LNL} i} := \log (\texttt {{sales}}_{i}) - \widehat{\eta }_{\mathsf {LNL} i}\) reveals that the residuals are not normally distributed, especially in the tails.
Fig. 6

Normal Q–Q plots of the residuals for the double-log model with a normal error distribution
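The normal-case fit amounts to ordinary least squares on the log-transformed variables of Eq. (3). The following Python sketch illustrates this on simulated data, not on the Osiris data; the helper names `solve` and `fit_double_log`, the simulation settings, and the use of the rounded estimates from Eq. (4) as "true" values are all assumptions made for the example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_double_log(sales, employees, assets):
    """Least squares fit of log(sales) on log(employees) and log(assets)."""
    rows = [(1.0, math.log(e), math.log(a)) for e, a in zip(employees, assets)]
    y = [math.log(s) for s in sales]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


# Simulate from the multiplicative model (2) with log-normal errors.
rng = random.Random(0)
gamma, a1, a2, sigma = math.exp(0.58), 0.467, 0.656, 0.986
employees = [math.exp(rng.gauss(6, 2)) for _ in range(2000)]
assets = [math.exp(rng.gauss(12, 2)) for _ in range(2000)]
sales = [gamma * e ** a1 * a ** a2 * math.exp(rng.gauss(0, sigma))
         for e, a in zip(employees, assets)]

coef = fit_double_log(sales, employees, assets)
print("intercept, log(employees), log(assets.total):", [round(c, 3) for c in coef])
```

The recovered slopes should lie close to the values used in the simulation, which shows how taking logarithms turns the multiplicative model into a linear one.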

5.2 Skew-normal case

The z-ratio values for the MLEs of the parameters are given in Table 3.
Table 3 Table of z-ratio values: skew-normal case

| | Estimate | Std.err | z-ratio | Pr{>\|z\|} |
|---|---|---|---|---|
| (Intercept.DP) | 1.6644 | 0.0308 | 54.08 | 0.0000 |
| log(employees) | 0.3621 | 0.0047 | 77.33 | 0.0000 |
| log(assets.total) | 0.7039 | 0.0040 | 178.00 | 0.0000 |
| omega | 1.4114 | 0.0088 | 160.55 | 0.0000 |
| alpha | −2.3201 | 0.0393 | −59.04 | 0.0000 |

Note that all parameter estimates are statistically significant. The estimated regression plane is as follows:
$$\begin{aligned} \widehat{\eta }_{\mathsf {LSNL}}= & {} \widehat{\alpha }_{0} + \widehat{\alpha }_{1} \log (\texttt {{employees}}) + \widehat{\alpha }_{2} \log (\texttt {{assets.total}}) \nonumber \\= & {} 1.664 + 0.362 \log (\texttt {{employees}}) + 0.704 \log ( \texttt {{assets.total}}). \end{aligned}$$
(5)
See the left-hand panel of Fig.  16 in Appendix 1.
We obtain the DP residuals from the estimated regression coefficients as follows:
$$\begin{aligned} e_{\mathsf {LSNL.DP} i}:= & {} \log (\texttt {{sales}}_{i}) - \widehat{\eta }_{\mathsf {LSNL} i} \\= & {} \log (\texttt {{sales}}_{i}) - \widehat{\alpha }_{0} - \widehat{\alpha }_{1} \log (\texttt {{employees}}_{i}) -\widehat{\alpha }_{2} \log (\texttt {{assets.total}}_{i}). \end{aligned}$$
The plot of the DP residuals \(e_{\mathsf {LSNL.DP} i}\) versus the fitted values \(\widehat{\eta }_{\mathsf {LSNL} i}\) is given in Fig.  7.
Fig. 7

Plot of the DP residuals vs. fitted values: skew-normal case

The DP residuals are not centered at zero (see Fig. 7). To reduce this bias, we can use the following adjusted regression plane:
$$\begin{aligned} \widetilde{\eta }_{\mathsf {LSNL}} := \widehat{\eta }_{\mathsf {LSNL}} + \widehat{\omega } b \widehat{\delta }, \end{aligned}$$
where \(b:=\sqrt{2/\pi }\) and \(\widehat{\delta }:=\widehat{\alpha }/\sqrt{1+\widehat{\alpha }^{2}}\). See p. 67 of Azzalini and Capitanio (2014) and Appendix 1.
The adjusted regression plane is then given as follows:
$$\begin{aligned} \widetilde{\eta }_{\mathsf {LSNL}}= & {} \widehat{\eta }_{\mathsf {LSNL}} + \widehat{\omega } b \widehat{\delta } = (\widehat{\alpha }_{0}+ \widehat{\omega } b \widehat{\delta }) + \widehat{\alpha }_{1} \log (\texttt {{employees}}) + \widehat{\alpha }_{2} \log (\texttt {{assets.total}}) \nonumber \\= & {} (1.664+ 1.411 \times 0.798 \times (-0.918) ) + 0.362 \log (\texttt {{employees}}) + 0.704 \log ( \texttt {{assets.total}}) \nonumber \\= & {} 0.63 + 0.362 \log (\texttt {{employees}}) + 0.704 \log ( \texttt {{assets.total}}). \end{aligned}$$
(6)
See the right-hand panel of Fig.  16 in Appendix 1.
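The arithmetic behind the adjusted intercept in Eq. (6) can be reproduced from the Table 3 estimates. This is a small Python sketch for checking the numbers; it takes the constant as the square root of 2/π, which matches the factor 0.798 appearing in Eq. (6), and the variable names are illustrative.

```python
import math

# DP estimates from Table 3 (skew-normal case).
alpha0_hat, omega_hat, alpha_hat = 1.6644, 1.4114, -2.3201

b = math.sqrt(2.0 / math.pi)                          # constant b
delta_hat = alpha_hat / math.sqrt(1.0 + alpha_hat ** 2)
shift = omega_hat * b * delta_hat                     # omega * b * delta
adjusted_intercept = alpha0_hat + shift

print(f"b = {b:.3f}, delta = {delta_hat:.3f}, "
      f"shift = {shift:.3f}, adjusted intercept = {adjusted_intercept:.2f}")
```

The shift is negative because the estimated slant parameter is negative, so the adjusted plane lies below the DP plane.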
The centered parameter (CP) residuals are defined by
$$\begin{aligned} e_{\mathsf {LSNL.CP} i}:= & {} \log (\texttt {{sales}}_{i}) - \widetilde{\eta }_{\mathsf {LSNL} i} = \log (\texttt {{sales}}_{i}) - \widehat{\eta }_{\mathsf {LSNL}i} - \widehat{\omega } b \widehat{\delta } \\= & {} \log (\texttt {{sales}}_{i}) - (\widehat{\alpha }_{0}+\widehat{\omega } b \widehat{\delta }) - \widehat{\alpha }_{1} \log (\texttt {{employees}}_{i}) - \widehat{\alpha }_{2} \log (\texttt {{assets.total}}_{i}), \end{aligned}$$
and a plot of the CP residuals \(e_{\mathsf {LSNL.CP} i}\) versus the fitted values \(\widetilde{\eta }_{\mathsf {LSNL} i}\) is given in Fig.  8. Thus, the adjusted regression plane has reduced the biases of the residuals.
Fig. 8

Plot of the CP residuals vs. fitted values: skew-normal case

The Q–Q and P–P plots of the scaled DP residuals
$$\begin{aligned} z_{\mathsf {LSNL} i}= & {} \frac{\log (\texttt {{sales}}_{i}) - \widehat{\eta }_{\mathsf {LSNL} i}}{\widehat{\omega }} \\= & {} \frac{\log (\texttt {{sales}}_{i})- \widehat{\alpha }_{0}- \widehat{\alpha }_{1} \log (\texttt {{employees}}_{i}) -\widehat{\alpha }_{2} \log (\texttt {{assets.total}}_{i})}{\widehat{\omega }} \end{aligned}$$
are shown in Fig. 9. These plots are based on the following property of the squared scaled DP residuals:
$$\begin{aligned} z_{\mathsf {LSNL} i}^{2}{\mathop {\sim }\limits ^\mathrm{a}} \chi _{1}^{2} \quad (\text {chi-square distribution with 1 degree of freedom}). \end{aligned}$$
See p. 61 of Azzalini and Capitanio (2014) as well as Eq.  (9) in the Appendix.
Fig. 9

Q–Q (left panel) and P–P (right panel) plots of the scaled DP residuals: skew-normal case

The Q–Q plot in Fig.  9 clearly has some fit issues because many of the residuals are far from the dashed line.
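The chi-square property used for the Q–Q and P–P plots in this subsection can be checked by simulation. The Python sketch below draws standard skew-normal variates through the well-known representation based on two independent standard normal variables and inspects their squares; the function name `draw_skew_normal`, the seed, and the sample size are assumptions for illustration.

```python
import math
import random


def draw_skew_normal(rng, alpha):
    """Draw Z ~ SN(0, 1, alpha) as delta*|U0| + sqrt(1 - delta^2)*U1."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = abs(rng.gauss(0.0, 1.0))
    u1 = rng.gauss(0.0, 1.0)
    return delta * u0 + math.sqrt(1.0 - delta * delta) * u1


rng = random.Random(42)
alpha_hat = -2.3201  # estimate of alpha from Table 3
z = [draw_skew_normal(rng, alpha_hat) for _ in range(50000)]
z_squared = [v * v for v in z]

mean_z2 = sum(z_squared) / len(z_squared)
share_below_one = sum(v <= 1.0 for v in z_squared) / len(z_squared)
chi2_cdf_at_one = math.erf(1.0 / math.sqrt(2.0))  # P(chi-square_1 <= 1)

print(f"mean of z^2: {mean_z2:.3f} (chi-square_1 mean is 1)")
print(f"P(z^2 <= 1): {share_below_one:.3f} vs {chi2_cdf_at_one:.3f}")
```

Although the simulated values themselves are skewed, their squares behave like chi-square variates with one degree of freedom, which is why the squared scaled residuals are compared with that reference distribution.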

5.3 Skew-t case

The z-ratio values with respect to the MLEs of the parameters are given in Table 4.
Table 4 Table of z-ratio values: skew-t case

| | Estimate | Std.err | z-ratio | Pr{>\|z\|} |
|---|---|---|---|---|
| (Intercept.DP) | 1.3258 | 0.0288 | 45.96 | 0.0000 |
| log(employees) | 0.3531 | 0.0043 | 81.64 | 0.0000 |
| log(assets.total) | 0.7017 | 0.0036 | 195.52 | 0.0000 |
| omega | 0.7637 | 0.0105 | 72.40 | 0.0000 |
| alpha | −1.0210 | 0.0405 | −25.24 | 0.0000 |
| nu | 3.4664 | 0.0803 | 43.17 | 0.0000 |

All parameters are significant, and the estimated regression plane is given by
$$\begin{aligned} \widehat{\eta }_{\mathsf {LSTL}}= & {} \widehat{\alpha }_{0} + \widehat{\alpha }_{1} \log (\texttt {{employees}}) + \widehat{\alpha }_{2} \log (\texttt {{assets.total}}) \nonumber \\= & {} 1.326 + 0.353 \log (\texttt {{employees}}) + 0.702 \log (\texttt {{assets.total}}). \end{aligned}$$
(7)
See the left-hand panel of Fig.  17 in Appendix 1.
Figure  10 plots the DP residuals \(e_{\mathsf {LSTL.DP} i}:= \log (\texttt {{sales}}_{i})- \widehat{\eta }_{\mathsf {LSTL} i}\) versus the fitted values \(\widehat{\eta }_{\mathsf {LSTL} i} := \widehat{\alpha }_{0}+ \widehat{\alpha }_{1} \log (\texttt {{employees}}_{i})+ \widehat{\alpha }_{2} \log (\texttt {{assets.total}}_{i}).\)
Fig. 10

Plot of DP residuals vs. fitted values: skew-t case

The DP residuals are not centered at zero in Fig. 10. Using a method similar to that in the skew-normal case, we can correct the estimated regression plane as follows:
$$\begin{aligned} \widetilde{\eta }_{\mathsf {LSTL}} := \widehat{\eta }_{\mathsf {LSTL}} + \widehat{\omega } b_{\widehat{\nu }+1} \widehat{\delta }, \end{aligned}$$
where \(b_{\widehat{\nu }+1}:=\sqrt{(\widehat{\nu }+1)/\pi } \ \Gamma (\widehat{\nu }/2)/\Gamma ((\widehat{\nu }+1)/2)\) and \(\widehat{\delta }=\widehat{\alpha }/\sqrt{1+\widehat{\alpha }^{2}}\). We refer to the definition of the pseudo-CP given by Eq.  (11) in Appendix  1 as well as Arellano-Valle and Azzalini (2013) for details.
The adjusted regression plane is obtained as follows:
$$\begin{aligned} \widetilde{\eta }_{\mathsf {LSTL}}= & {} \widehat{\eta }_{\mathsf {LSTL}} + \widehat{\omega } b_{\widehat{\nu }+1} \widehat{\delta } = (\widehat{\alpha }_{0}+ \widehat{\omega } b_{\widehat{\nu }+1} \widehat{\delta }) + \widehat{\alpha }_{1} \log (\texttt {{employees}}) + \widehat{\alpha }_{2} \log (\texttt {{assets.total}}) \nonumber \\= & {} (1.326 + 0.764 \times 0.973 \times (-0.714)) + 0.353 \log (\texttt {{employees}}) + 0.702 \log ( \texttt {{assets.total}}) \nonumber \\= & {} 0.795 + 0.353 \log (\texttt {{employees}}) + 0.702 \log ( \texttt {{assets.total}}). \end{aligned}$$
(8)
See the right-hand panel of Fig.  17 in Appendix 1.
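The numbers in Eq. (8) follow from the Table 4 estimates and the definition of the constant given above. The Python sketch below simply evaluates that definition as stated in the text; the variable names are illustrative.

```python
import math

# DP estimates from Table 4 (skew-t case).
alpha0_hat, omega_hat, alpha_hat, nu_hat = 1.3258, 0.7637, -1.0210, 3.4664

# b_{nu+1} := sqrt((nu + 1)/pi) * Gamma(nu/2) / Gamma((nu + 1)/2), as defined in the text.
b_nu1 = (math.sqrt((nu_hat + 1.0) / math.pi)
         * math.gamma(nu_hat / 2.0) / math.gamma((nu_hat + 1.0) / 2.0))
delta_hat = alpha_hat / math.sqrt(1.0 + alpha_hat ** 2)
shift = omega_hat * b_nu1 * delta_hat
adjusted_intercept = alpha0_hat + shift

print(f"b = {b_nu1:.4f}, delta = {delta_hat:.4f}, "
      f"adjusted intercept = {adjusted_intercept:.4f}")
```

The result agrees, up to rounding, with the factors 0.973 and −0.714 and with the adjusted intercept 0.795 shown in Eq. (8).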
The plot of the pseudo-CP residuals \(e_{\mathsf {LSTL.PCP} i}:= \log (\texttt {{sales}}_{i})- \widetilde{\eta }_{\mathsf {LSTL} i}\) versus the fitted values \(\widetilde{\eta }_{\mathsf {LSTL}i}\) is given in Fig.  11. This adjustment has reduced the biases of the residuals.
Fig. 11

Plot of pseudo-CP residuals vs. fitted values: skew-t case

The Q–Q and P–P plots of the scaled DP residuals \(z_{\mathsf {LSTL} i} := (\log (\texttt {{sales}}_{i}) - \widehat{\eta }_{\mathsf {LSTL} i})/\widehat{\omega }\) are given in Fig.  12.
Fig. 12

Q–Q (left panel) and P–P (right panel) plots of scaled DP residuals: skew-t case

These plots are based on the following property of the squared scaled DP residuals:
$$\begin{aligned} z_{\mathsf {LSTL} i}^{2}{\mathop {\sim }\limits ^\mathrm{a}} \mathsf {F}_{\nu }^{1} \quad (\text {F distribution with } (1,\nu ) \text { degrees of freedom}). \end{aligned}$$
Property (10) in Appendix  1 as well as p. 102 of Azzalini and Capitanio (2014) provide further details.

Figure 12 indicates that some scaled DP residuals do not lie along the dashed line in the tails of the Q–Q plot, but the P–P plot suggests a good fit, and these results are better than those in the skew-normal case.

In the next section, we perform model selection using the Akaike information criterion (AIC).

6 Model selection with the AIC

In this section, we perform model selection with respect to the distribution of \(\log (\texttt {{sales}})\) and the error distributions of the double-log model using the AIC. See Akaike (1973) and Konishi and Kitagawa (2008) for details regarding the AIC.

6.1 Distributions of \(\log (\texttt {{sales}})\)

We now compare the different models according to the AIC. The column “AIC” in Table 5 gives the AIC values for the fitted normal distribution (lm.log.sales2015), the skew-normal distribution (selm.log.sales2015), and the skew-t distribution (selm.ST.log.sales2015) for \(\log (\texttt {{sales}})\). The column “df” in Table 5 represents the number of parameters for each model.
Table 5 AIC table: distributions for the log of sales

| | df | AIC |
|---|---|---|
| lm.log.sales2015 | 2 | 127,076.06 |
| selm.log.sales2015 | 3 | 126,627.07 |
| selm.ST.log.sales2015 | 4 | 126,546.63 |

The minimum \(\mathrm {AIC}\) is obtained from the fitted skew-t distribution (selm.ST.log.sales2015), implying that this distribution provides the best fit. Note that this result is consistent with the visualization results (Fig.  5) in Sect.  4.

6.2 Double-log models

We now compare the fitted models under the three assumptions with respect to the error term of the double-log model given by Eq.  (3), that is, the normal case (lm.log.firmfin2015), the skew-normal case (selm.log.firmfin2015), and the skew-t case (selm.ST.log.firmfin2015).
Table 6 AIC table: double-log models

| | df | AIC |
|---|---|---|
| lm.log.firmfin2015 | 4 | 74,980.13 |
| selm.log.firmfin2015 | 5 | 71,972.08 |
| selm.ST.log.firmfin2015 | 6 | 67,897.56 |

We again observe that the minimum \(\mathrm {AIC}\) is obtained from the skew-t model (selm.ST.log.firmfin2015) (see Table 6). Note that this result is consistent with the visualization results (Fig. 12).
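The selection rule used in this section is to prefer the model with the smallest AIC, where AIC is minus twice the maximized log-likelihood plus twice the number of parameters. A minimal Python sketch using the values reported in Table 6 follows; the function name `aic` and the dictionary layout are assumptions for illustration.

```python
def aic(log_likelihood, n_params):
    """AIC = -2 * (maximized log-likelihood) + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params


# (df, AIC) pairs as reported in Table 6.
aic_table = {
    "lm.log.firmfin2015": (4, 74980.13),
    "selm.log.firmfin2015": (5, 71972.08),
    "selm.ST.log.firmfin2015": (6, 67897.56),
}

best = min(aic_table, key=lambda name: aic_table[name][1])
delta_aic = {name: round(values[1] - aic_table[best][1], 2)
             for name, values in aic_table.items()}

print("minimum-AIC model:", best)
for name, diff in delta_aic.items():
    print(f"{name}: AIC difference from best = {diff}")
```

The differences from the minimum show how much worse each competing model is on the AIC scale.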

7 Cross-validation

We evaluate the double-log models in Eq.  (3) using the K-fold cross-validation method. We set \(K=10\) and adopt the mean squared error of the prediction (MSEP) and the AIC as the evaluation criteria. See Stone (1974), Efron and Tibshirani (1993), and James et al. (2013) for details on general cross-validation methods, and see Efron and Hastie (2016) for specific details on the AIC.
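The K-fold procedure with the MSEP criterion can be sketched as follows. This Python example uses simulated data and an ordinary least squares fit only (the normal error case); the helper names `solve`, `fit_ols`, `predict`, and `kfold_msep`, the seeds, and the simulation settings are assumptions, and the sketch is not the authors' R code.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(x, y):
    """Least squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in x]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def kfold_msep(x, y, k=10, seed=0):
    """Return the mean squared error of prediction on each of k held-out folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    mseps = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        coef = fit_ols([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - predict(coef, x[i])) ** 2 for i in held_out]
        mseps.append(sum(errors) / len(errors))
    return mseps


# Simulated log-scale data (hypothetical, not the Osiris data).
rng = random.Random(1)
x = [(rng.gauss(6, 2), rng.gauss(12, 2)) for _ in range(1000)]
y = [0.58 + 0.467 * e + 0.656 * a + rng.gauss(0, 0.986) for e, a in x]

mseps = kfold_msep(x, y, k=10)
print("MSEP per fold:", [round(v, 3) for v in mseps])
```

Each fold yields one MSEP value, so a boxplot of the ten values summarizes the out-of-sample performance of a model.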

7.1 MSEP discrepancy

The correspondence between the labels of the different cases and the predictors for the MSEP criterion is given in Table  7.
Table 7 Correspondence table

| Label | Predictor (regression plane) | Error distribution |
|---|---|---|
| MSEP.log.Normal | (4) | Normal |
| MSEP.log.SN | (5) | Skew-normal |
| MSEP.log.SN.adj | (6) | Skew-normal |
| MSEP.log.ST | (7) | Skew-t |
| MSEP.log.ST.adj | (8) | Skew-t |

The cross-validation results are displayed in Fig. 13. The standard predictors (5) and (7) for the skew-normal and skew-t error distributions perform poorly. The standard predictor (4) for the normal error distribution performs best, and the adjusted predictors (6) and (8) for the skew-normal and skew-t error cases perform similarly to the normal case. The MSEP discrepancy function is based on the squared error, so it naturally favors the normal error case, whose parameters are estimated by the least squares method. Therefore, it seems that the skew-normal and the skew-t models, once adjusted, provide reasonable fits.
Fig. 13

Boxplot of evaluation criteria: \(K=10\), MSEP
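The poor MSEP of the unadjusted predictors has a simple explanation: a skew-normal error with location zero does not have mean zero, so the DP plane is biased as a predictor. The Python sketch below simulates errors with the Table 3 estimates and compares the two predictors; the simulation setup and names are assumptions for illustration, and the regression plane itself is treated as known.

```python
import math
import random

# Scale and slant estimates from Table 3 (skew-normal case).
OMEGA, ALPHA = 1.4114, -2.3201
delta = ALPHA / math.sqrt(1.0 + ALPHA ** 2)
shift = OMEGA * math.sqrt(2.0 / math.pi) * delta   # mean of the error term

rng = random.Random(1)


def draw_error():
    """Draw log(eps) ~ SN(0, OMEGA^2, ALPHA) via the two-normal representation."""
    u0, u1 = abs(rng.gauss(0.0, 1.0)), rng.gauss(0.0, 1.0)
    return OMEGA * (delta * u0 + math.sqrt(1.0 - delta ** 2) * u1)


errors = [draw_error() for _ in range(50000)]

# Unadjusted predictor: the DP plane, so the prediction error is the raw error.
msep_unadjusted = sum(e * e for e in errors) / len(errors)
# Adjusted predictor: the plane shifted by omega * b * delta.
msep_adjusted = sum((e - shift) ** 2 for e in errors) / len(errors)

print(f"MSEP unadjusted: {msep_unadjusted:.3f}")
print(f"MSEP adjusted:   {msep_adjusted:.3f}")
print(f"difference:      {msep_unadjusted - msep_adjusted:.3f} "
      f"(squared shift = {shift ** 2:.3f})")
```

The gap between the two MSEP values is approximately the squared shift, which is the squared bias removed by the intercept adjustment.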

7.2 AIC discrepancy

It is natural to adopt a criterion related to the maximum likelihood method to evaluate the cross-validation because the parameters of the double-log model with a skew-normal error distribution and a skew-t error distribution are estimated using this method. Here, we use the discrepancy function based on the AIC. The cross-validation results are displayed in Fig. 14. The evaluation based on the AIC indicates that the double-log model with a skew-t error distribution performs the best.
Fig. 14

Boxplot of evaluation criteria: \(K=10\), AIC

8 Concluding remarks and discussion

In this study, we conducted an exploratory data analysis to visualize the financial data of listed and delisted firms around the world, and we investigated the distribution of sales and constructed statistical models to explain total sales based on the number of employees and total assets. The total sales of Japan’s listed firms usually follow a log-normal distribution (e.g., Jimichi and Maeda (2014)). However, we observe that the log-skew-t distribution provides a good fit for the logarithm of sales of the global firms in our data set. One reason is that if the population of firms is expanded to include global firms, some of these firms have extremely small scales, so the distribution of the logarithm of sales is still skewed left.

The double-log model with a skew-t error distribution is also useful for modeling the logarithm of sales as a linear function of the logarithm of the number of employees and total assets. The estimated regression plane (8) is a better predictor of the logarithm of sales than the model with a normal error distribution is. The double-log models were evaluated using the K-fold cross-validation method, and the model with the skew-t error distribution had the best performance when the AIC was used for evaluation. As we mentioned in Remark  1, a model that accurately predicts sales can be constructed by fitting a Cobb–Douglas production function to the financial (accounting) data of the set of global firms. Note that these results are based on the data for 2015; similar results (not shown here) were obtained for the decade 2006–2015.

Finally, we used the Apache Spark (see footnote 4) environment (hereafter, Spark) to manipulate our data set. Our next analysis will consider a data set extracted from the “Orbis” database (see footnote 5), which contains information on over 20,000,000 firms and is over one hundred gigabytes in size, so Spark or a similar tool for handling big data will be required.

Footnotes

  1. The Osiris system is produced by Bureau van Dijk (BvD) KK.

  2. R version 3.4.4 (2018-03-15)

  3. Note that the Q–Q plot in Fig. 5 is based on an F distribution. See also Appendix 1.

  4.

  5. The Orbis database is produced by BvD KK.

Notes

Acknowledgements

The authors wish to thank the reviewers for their helpful comments. This work is partially supported by a Grant-in-Aid for Scientific Research (KAKENHI: No. 16K04022) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN Project ID: jh171002-NWJ, jh181001-NWJ) in Japan. We would like to thank Mr. Ayumu Masuda of Bureau van Dijk KK for extracting some dataset files from the Osiris database system.

References

  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
  2. Arellano-Valle, R. B., & Azzalini, A. (2013). The centred parameterization and related quantities of the skew-t distribution. Journal of Multivariate Analysis, 113, 73–90.
  3. Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178.
  4. Azzalini, A., & Capitanio, A. (2014). The skew-normal and related families. Institute of mathematical statistics monographs. Cambridge: Cambridge University Press.
  5. Cobb, C. W., & Douglas, P. H. (1928). A theory of production. American Economic Review, 18, 139–165.
  6. Efron, B., & Hastie, T. (2016). Computer age statistical inference: algorithms, evidence, and data science. Cambridge: Cambridge University Press.
  7. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall/CRC.
  8. Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Thousand Oaks: Sage.
  9. Healy, M. J. R. (1968). Multivariate normal plotting. Applied Statistics, 17, 157–161.
  10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Berlin: Springer.
  11. Jimichi, M. (2010). Building of Financial Database Servers, ISBN: 978-4-9905530-0-5. https://kwansei.repo.nii.ac.jp/ (in Japanese).
  12. Jimichi, M., & Maeda, S. (2014). Visualization and statistical modeling of financial data with R, Poster at the R user conference 2014. http://user2014.stat.ucla.edu/abstracts/posters/48_Jimichi.pdf
  13. Konishi, S., & Kitagawa, G. (2008). Information Criteria and Statistical Modeling. Berlin: Springer.
  14. Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: a second course in statistics. Reading, Mass: Addison-Wesley.
  15. Ryza, S., Laserson, U., Owen, S., & Wills, J. (2016). Advanced analytics with spark. Newton: O’Reilly.
  16. Saka, C., & Jimichi, M. (2017). Evidence of inequality from accounting data visualization. Taiwan Accounting Review, 13(2), 193–234.
  17. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147.
  18. Tukey, J. W. (1977). Exploratory data analysis. Boston: Addison-Wesley Publishing Co.
  19. Unwin, A. (2015). Graphical data analysis with R. London: Chapman and Hall/CRC.

Copyright information

© Japanese Federation of Statistical Science Associations 2018

Authors and Affiliations

  1. School of Business Administration, Kwansei Gakuin University, Nishinomiya, Japan
  2. Division of Computer Science, Information Science, Nara Institute of Science and Technology, Ikoma, Japan
