Visualization and statistical modeling of financial big data: double-log modeling with skew-symmetric error distributions

Abstract

This study considers the visualization and statistical modeling of financial data (e.g., sales, assets, etc.) for a large data set of global firms that are listed and delisted. We present exploratory data analysis carried out in the R programming language. The results show that a double-log model with a skew-t error distribution is useful for modeling a firm’s total sales volume (in thousands of U.S. dollars) as a function of its number of employees and total assets (in thousands of U.S. dollars). This result is obtained by comparing the Akaike information criteria of several double-log models with independent and identically distributed random error terms with skew-symmetric distributions and by further evaluating the models using cross-validation.

Notes

  1. The Osiris system is produced by Bureau van Dijk (BvD) KK.
  2. R version 3.4.4 (2018-03-15).
  3. Note that the Q–Q plot in Fig. 5 is based on an F distribution. See also Appendix 1.
  4. http://spark.apache.org/docs/latest/index.html
  5. The Orbis database is produced by BvD KK.

References

  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
  2. Arellano-Valle, R. B., & Azzalini, A. (2013). The centred parameterization and related quantities of the skew-t distribution. Journal of Multivariate Analysis, 113, 73–90.
  3. Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178.
  4. Azzalini, A., & Capitanio, A. (2014). The skew-normal and related families. Institute of mathematical statistics monographs. Cambridge: Cambridge University Press.
  5. Cobb, C. W., & Douglas, P. H. (1928). A theory of production. American Economic Review, 18, 139–165.
  6. Efron, B., & Hastie, T. (2016). Computer age statistical inference: algorithms, evidence, and data science. Cambridge: Cambridge University Press.
  7. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall/CRC.
  8. Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Thousand Oaks: Sage.
  9. Healy, M. J. R. (1968). Multivariate normal plotting. Applied Statistics, 17, 157–161.
  10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Berlin: Springer.
  11. Jimichi, M. (2010). Building of Financial Database Servers, ISBN: 978-4-9905530-0-5. https://kwansei.repo.nii.ac.jp/ (in Japanese).
  12. Jimichi, M., & Maeda, S. (2014). Visualization and statistical modeling of financial data with R, Poster at the R user conference 2014. http://user2014.stat.ucla.edu/abstracts/posters/48_Jimichi.pdf
  13. Konishi, S., & Kitagawa, G. (2008). Information criteria and statistical modeling. Berlin: Springer.
  14. Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: a second course in statistics. Reading, Mass: Addison-Wesley.
  15. Ryza, S., Laserson, U., Owen, S., & Wills, J. (2016). Advanced analytics with Spark. Newton: O’Reilly.
  16. Saka, C., & Jimichi, M. (2017). Evidence of inequality from accounting data visualization. Taiwan Accounting Review, 13(2), 193–234.
  17. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147.
  18. Tukey, J. W. (1977). Exploratory data analysis. Boston: Addison-Wesley Publishing Co.
  19. Unwin, A. (2015). Graphical data analysis with R. London: Chapman and Hall/CRC.

Acknowledgements

The authors wish to thank the reviewers for their helpful comments. This work is partially supported by a Grant-in-Aid for Scientific Research (KAKENHI: No. 16K04022) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN Project ID: jh171002-NWJ, jh181001-NWJ) in Japan. We would like to thank Mr. Ayumu Masuda of Bureau van Dijk KK for extracting some dataset files from the Osiris database system.

Author information

Corresponding author

Correspondence to Masayuki Jimichi.

Appendices

A Full data set and data manipulation

Our data set is extracted from the “Osiris” database provided by BvD KK. It contains 86 financial indices (e.g., sales, assets, etc.) for 82,878 listed and delisted firms from 1985 to 2017. The data file (firmfin.csv) is in CSV format, and its size is over 1.3 GB. We use R for exploratory data analysis and load the data into R through Spark, because Spark responds quickly to user queries by scanning large in-memory data sets. The SparkR package provides an R frontend for Spark (cf. Ryza et al. (2016)). Its function read.df loads the data set into R as a Spark DataFrame (sdf) in RStudio:

[Code listing a: call to read.df that loads firmfin.csv as the Spark DataFrame firmfin.sdf]

We use the pipe operator %>% from the magrittr package together with the R function collect to transform the Spark DataFrame object firmfin.sdf into an R data frame firmfin2015:

[Code listing b: pipeline with %>% and collect that builds the R data frame firmfin2015 from firmfin.sdf]

Note that only firm names with a BvD ID (firmID), a country, and positive values for sales, employees, and assets.total in 2015 are selected.
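The selection rule in the note above can be illustrated without Spark. The following is a minimal Python sketch, not the authors' R/SparkR code. Only the column names firmID, country, sales, employees, and assets.total come from the text; the `name` and `year` columns, the sample rows, and the helper `select_2015` are hypothetical.

```python
import csv
import io

# Hypothetical miniature stand-in for firmfin.csv (the real file has 86 indices).
SAMPLE_CSV = """name,firmID,country,year,sales,employees,assets.total
Alpha Corp,JP0001,JP,2015,1200,30,5000
Beta Ltd,GB0002,GB,2015,-50,12,800
Gamma Inc,,US,2015,300,5,900
Delta SA,FR0004,FR,2014,700,40,2000
Epsilon AG,DE0005,DE,2015,950,0,1500
Zeta KK,JP0006,JP,2015,410,8,
"""

NUMERIC = ("sales", "employees", "assets.total")


def select_2015(rows, year="2015"):
    """Keep rows for `year` that have an ID, a country, and positive numeric fields."""
    kept = []
    for row in rows:
        if row["year"] != year or not row["firmID"] or not row["country"]:
            continue
        try:
            values = [float(row[col]) for col in NUMERIC]
        except ValueError:  # missing or non-numeric entry
            continue
        if all(v > 0 for v in values):
            kept.append(row)
    return kept


firms = select_2015(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print([r["name"] for r in firms])
```

Requiring strictly positive values matters for the double-log model, since the logarithm of sales, employees, or total assets is undefined otherwise.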

B Computing environments

Our computing environment and software are as follows:

C Estimated regression planes

Fig. 15 Estimated regression plane with a three-dimensional scatter plot: normal error distribution

Fig. 16 Estimated regression planes with three-dimensional scatter plots: skew-normal error distribution, DP version (left panel) and CP (adjusted) version (right panel)

Fig. 17 Estimated regression planes with three-dimensional scatter plots: skew-t error distribution, DP version (left panel) and pseudo-CP (adjusted) version (right panel)

D Skew-normal distribution and log-skew-normal distribution

D.1 Skew-normal distribution

If the distribution of the random variable X is skew-normal, then we express it as follows:

$$\begin{aligned} X \sim \mathsf {SN}(\xi , \omega ^{2}, \alpha ), \end{aligned}$$

where \((\xi , \omega ^{2}, \alpha )\) are the direct parameters (DPs). The parameter \(\alpha\) is called the slant parameter.

The p.d.f.s of the skew-normal distribution for \((\xi , \omega , \alpha )=(0,1,5)\) and \((0,1,-5)\) are plotted in Fig. 18.

Fig. 18 Skew-normal density functions with \((\xi , \omega , \alpha )=(0,1,5)\) (left panel) and \((\xi , \omega , \alpha )=(0,1,-5)\) (right panel)
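The two densities in Fig. 18 can be evaluated numerically. The sketch below is in Python rather than the R used in the paper. It assumes the standard skew-normal density \((2/\omega )\phi (z)\Phi (\alpha z)\) with \(z=(x-\xi )/\omega\), a formula that this appendix does not write out; the function names are illustrative.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sn_pdf(x, xi=0.0, omega=1.0, alpha=0.0):
    """Skew-normal density (2/omega) * phi(z) * Phi(alpha * z), z = (x - xi) / omega."""
    z = (x - xi) / omega
    return 2.0 / omega * norm_pdf(z) * norm_cdf(alpha * z)


def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Left panel of Fig. 18: (xi, omega, alpha) = (0, 1, 5).
total = integrate(lambda x: sn_pdf(x, 0.0, 1.0, 5.0), -10.0, 10.0)
mass_right = integrate(lambda x: sn_pdf(x, 0.0, 1.0, 5.0), 0.0, 10.0)
print(f"total mass = {total:.6f}, mass on x > 0 = {mass_right:.4f}")
```

A positive slant parameter puts most of the mass to the right of \(\xi\). The right panel is the mirror image, since `sn_pdf(x, 0, 1, -5)` equals `sn_pdf(-x, 0, 1, 5)`.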

If the distribution of the random variable X is skew-normal \(\mathsf {SN}(\xi , \omega ^{2}, \alpha )\), then the expectation parameter \(\mu _{\mathsf {SN}} := \text{ E }(X)= \xi + \omega b \delta\) is called a centred parameter (CP) of the skew-normal distribution (see Azzalini and Capitanio (2014), p. 66), where \(b:=\sqrt{2/\pi }\) and \(\delta :=\alpha /\sqrt{1+\alpha ^{2}}\). Note that \(Z:=(X - \xi )/\omega \sim \mathsf {SN}(0,1,\alpha )\) and that \(Z^{2}\) follows the chi-square distribution with 1 degree of freedom:

$$\begin{aligned} Z^{2} \sim \chi _{1}^{2} \end{aligned}$$
(9)

[see Azzalini and Capitanio (2014), Proposition 2.1(e)]. The Q–Q and P–P plots for the skew-normal distribution are based on this property, and they are typical examples of Healy’s plots (see Healy (1968)).
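Both the CP formula and property (9) can be checked by simulation. The Python sketch below is an illustration, not the authors' R code. It assumes the well-known stochastic representation \(Z=\delta |U_{0}|+\sqrt{1-\delta ^{2}}\,U_{1}\) with independent standard normal \(U_{0}, U_{1}\), which this appendix does not state; the sample size, the seed, and the helper names are arbitrary choices.

```python
import math
import random


def rsn(n, xi=0.0, omega=1.0, alpha=0.0, rng=random):
    """Draw n skew-normal variates via Z = delta*|U0| + sqrt(1 - delta^2)*U1."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    scale = math.sqrt(1.0 - delta * delta)
    return [xi + omega * (delta * abs(rng.gauss(0, 1)) + scale * rng.gauss(0, 1))
            for _ in range(n)]


def chi2_1_cdf(q):
    """Distribution function of the chi-square distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(q / 2.0))


rng = random.Random(2018)
xi, omega, alpha = 1.0, 2.0, 5.0
x = rsn(100_000, xi, omega, alpha, rng)

# Centred parameter mu_SN = xi + omega * b * delta versus the sample mean.
b = math.sqrt(2.0 / math.pi)
delta = alpha / math.sqrt(1.0 + alpha ** 2)
mu_sn = xi + omega * b * delta
sample_mean = sum(x) / len(x)

# Empirical P(Z^2 <= q) versus the chi-square(1) distribution function.
z2 = [((v - xi) / omega) ** 2 for v in x]
max_gap = max(abs(sum(v <= q for v in z2) / len(z2) - chi2_1_cdf(q))
              for q in (0.1, 0.5, 1.0, 2.0, 4.0))
print(f"mu_SN = {mu_sn:.4f}, sample mean = {sample_mean:.4f}, max CDF gap = {max_gap:.4f}")
```

The distribution of \(Z^{2}\) does not depend on \(\alpha\), which is why a single chi-square reference distribution serves the Q–Q and P–P plots for every fitted slant.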

D.2 Log-skew-normal distribution

If \(\log (Y )\) follows the skew-normal distribution \(\mathsf {SN}(\xi , \omega ^{2}, \alpha )\), then the distribution of Y is called log-skew-normal \(\mathsf {LSN}(\xi , \omega ^{2}, \alpha )\):

$$\begin{aligned} Y \sim \mathsf {LSN}(\xi , \omega ^{2}, \alpha ) {\mathop {\Longleftrightarrow }\limits ^\mathrm{def.}} \log (Y) \sim \mathsf {SN}(\xi , \omega ^{2}, \alpha ), \end{aligned}$$

where

$$\begin{aligned} y \in \mathbb {R}^{+}, \quad \xi \in \mathbb {R}, \quad \omega \in \mathbb {R}^{+}, \quad \alpha \in \mathbb {R}. \end{aligned}$$

See also Azzalini and Capitanio (2014), p. 53.
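For reference, the density implied by this definition follows from the change of variables \(x=\log y\). The display below is a standard derivation rather than a formula quoted from the paper; \(\phi\) and \(\Phi\) denote the standard normal p.d.f. and c.d.f., and \(f_{\mathsf {SN}}\) and \(f_{\mathsf {LSN}}\) are labels introduced here for the two densities.

$$\begin{aligned} f_{\mathsf {LSN}}(y; \xi , \omega ^{2}, \alpha ) = \frac{1}{y}\, f_{\mathsf {SN}}(\log y; \xi , \omega ^{2}, \alpha ) = \frac{2}{\omega y}\, \phi \left( \frac{\log y - \xi }{\omega }\right) \Phi \left( \alpha \, \frac{\log y - \xi }{\omega }\right) , \quad y \in \mathbb {R}^{+}. \end{aligned}$$

The factor \(1/y\) is the Jacobian of the logarithm; setting \(\alpha =0\) recovers the ordinary log-normal density.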

E Skew-t distribution and log-skew-t distribution

E.1 Skew-t distribution

Recall that the following result holds with respect to the ratio of the independent random variables \(Z \sim \mathsf {N}(0,1)\) and \(V \sim \chi ^{2}_{\nu }/\nu\):

$$\begin{aligned} \frac{Z}{\sqrt{V}} \sim \mathsf {t}_{\nu } \quad (\text {the } t \text { distribution with } \nu \text { degrees of freedom}). \end{aligned}$$

By analogy, if the random variable \(Z_{0}\) has the distribution \(\mathsf {SN}(0,1,\alpha )\) and is independent of V, then the distribution of

$$\begin{aligned} T:=\frac{Z_{0}}{\sqrt{V}} \end{aligned}$$

is called the skew-t distribution \(\mathsf {ST}(0,1,\alpha ,\nu )\) with \(\nu\) degrees of freedom.

From (9), if the random variable \(Z_{0}\) follows a skew-normal distribution \(\mathsf {SN}(0,1,\alpha )\), then \(Z_{0}^{2} \sim \chi ^{2}_{1}\), and if \(T=Z_{0}/\sqrt{V} \sim \mathsf {ST}(0,1,\alpha ,\nu )\), then

$$\begin{aligned} T^{2} = \frac{Z_{0}^{2}}{V} \sim \frac{\chi ^{2}_{1}/1}{\chi ^{2}_{\nu }/\nu } {\mathop {=}\limits ^{\mathrm {d}}} \mathsf {F}_{\nu }^{1} \quad (\text {the } F \text { distribution with } (1, \nu ) \text { degrees of freedom}). \end{aligned}$$
(10)

The Q–Q and P–P plots of the skew-t distribution are based on this property.
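Property (10) can also be checked by simulation. The Python sketch below is an illustration, not the authors' code. It builds \(T=Z_{0}/\sqrt{V}\) exactly as defined above and compares the empirical distribution of \(T^{2}\) with the \(\mathsf {F}_{\nu }^{1}\) distribution function. The closed forms used are for \(\nu =1\) and \(\nu =2\) only, and they come from the Cauchy and \(t_{2}\) distribution functions; all names, the seed, and the sample size are assumptions.

```python
import math
import random


def rst(n, alpha, nu, rng):
    """Draw n variates T = Z0 / sqrt(V), Z0 ~ SN(0,1,alpha), V ~ chi2_nu / nu."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    scale = math.sqrt(1.0 - delta * delta)
    out = []
    for _ in range(n):
        z0 = delta * abs(rng.gauss(0, 1)) + scale * rng.gauss(0, 1)
        # chi-square(nu) is a gamma distribution with shape nu/2 and scale 2.
        v = rng.gammavariate(nu / 2.0, 2.0) / nu
        out.append(z0 / math.sqrt(v))
    return out


def f_cdf_1_nu(x, nu):
    """Closed-form distribution function of F(1, nu), for nu = 1 and nu = 2 only."""
    if nu == 1:
        return 2.0 / math.pi * math.atan(math.sqrt(x))
    if nu == 2:
        return math.sqrt(x / (2.0 + x))
    raise ValueError("closed form implemented for nu in {1, 2} only")


rng = random.Random(1)
gaps = {}
for nu in (1, 2):
    t2 = [t * t for t in rst(100_000, alpha=5.0, nu=nu, rng=rng)]
    gaps[nu] = max(abs(sum(v <= q for v in t2) / len(t2) - f_cdf_1_nu(q, nu))
                   for q in (0.25, 1.0, 4.0, 16.0))
    print(f"nu = {nu}: max CDF gap = {gaps[nu]:.4f}")
```

For general \(\nu\) the \(\mathsf {F}_{\nu }^{1}\) distribution function requires the regularized incomplete beta function, which the Python standard library does not provide, so the sketch restricts itself to the two closed-form cases.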

If \(Z \sim \mathsf {ST}(0,1,\alpha ,\nu )\), then \(X := \xi + \omega Z \sim \mathsf {ST}(\xi ,\omega ^{2},\alpha ,\nu )\). The parameters \((\xi , \omega ^{2}, \alpha , \nu )\) are the DPs.

The p.d.f.s of the skew-t distribution for \((\xi , \omega , \alpha , \nu )=(0,1,5,1)\) and \((0,1,-5,1)\) are plotted in Fig. 19.

Fig. 19 Skew-t density functions with \((\xi , \omega , \alpha , \nu )=(0,1,5,1)\) (left panel) and \((\xi , \omega , \alpha , \nu )= (0,1,-5,1)\) (right panel)

If the distribution of the random variable X is \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\) with \(\nu > 1\), then the expectation parameter \(\mu _{\mathsf {ST}} := \text{ E }(X) = \xi + \omega b_{\nu } \delta\) is a CP, where \(b_{\nu }:=\sqrt{\nu /\pi }\, \Gamma ((\nu -1)/2)/\Gamma (\nu /2)\) and \(\delta :=\alpha /\sqrt{1+\alpha ^{2}}\). Note that if \(\nu \le 4\), then the following corrected parameter is called the pseudo-CP:

$$\begin{aligned} \widetilde{\mu }_{\mathsf {ST}} := \xi + \omega b_{\nu + a} \delta , \end{aligned}$$
(11)

where \(a (\ge 1)\) is constant, and we often take \(a=1\). See Arellano-Valle and Azzalini (2013) for details.
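The quantities \(b_{\nu }\), \(\mu _{\mathsf {ST}}\), and the pseudo-CP of (11) are straightforward to compute. The Python sketch below is an illustration with hypothetical function names, not the authors' implementation. It uses `math.gamma`, which overflows for very large \(\nu\), so a log-gamma version would be needed there.

```python
import math


def b_nu(nu):
    """b_nu = sqrt(nu/pi) * Gamma((nu-1)/2) / Gamma(nu/2), defined for nu > 1."""
    if nu <= 1:
        raise ValueError("b_nu requires nu > 1")
    return math.sqrt(nu / math.pi) * math.gamma((nu - 1) / 2.0) / math.gamma(nu / 2.0)


def st_mean(xi, omega, alpha, nu):
    """Expectation xi + omega * b_nu * delta of ST(xi, omega^2, alpha, nu)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * b_nu(nu) * delta


def st_pseudo_mean(xi, omega, alpha, nu, a=1.0):
    """Pseudo-CP of Eq. (11): the mean formula with nu replaced by nu + a."""
    return st_mean(xi, omega, alpha, nu + a)


print(b_nu(2.0))                              # mathematically sqrt(2)
print(b_nu(100.0), math.sqrt(2.0 / math.pi))  # b_nu approaches b as nu grows
print(st_pseudo_mean(0.0, 1.0, 5.0, nu=1.0))  # finite although the mean does not exist
```

With \(a=1\) the pseudo-CP stays finite even for \(\nu =1\), where \(b_{\nu }\) and hence \(\mu _{\mathsf {ST}}\) are undefined.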

The skew-t distribution \(\mathsf {ST}(0,1,\alpha ,1)\) is called the skew-Cauchy distribution. Note that Fig.  19 shows the p.d.f.s of the skew-Cauchy distributions.

E.2 Log-skew-t distribution

If the distribution of \(\log (Y)\) is the skew-t distribution \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\), then the distribution of Y is called the log-skew-t distribution \(\mathsf {LST}(\xi , \omega ^{2}, \alpha , \nu )\):

$$\begin{aligned} Y \sim \mathsf {LST}(\xi , \omega ^{2}, \alpha , \nu ) {\mathop {\Longleftrightarrow }\limits ^\mathrm{def.}} \log (Y) \sim \mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu ), \end{aligned}$$

where

$$\begin{aligned} y \in \mathbb {R}^{+}, \quad \xi \in \mathbb {R}, \quad \omega \in \mathbb {R}^{+}, \quad \alpha \in \mathbb {R}, \quad \nu \in \mathbb {R}^{+}. \end{aligned}$$

About this article

Cite this article

Jimichi, M., Miyamoto, D., Saka, C. et al. Visualization and statistical modeling of financial big data: double-log modeling with skew-symmetric error distributions. Jpn J Stat Data Sci 1, 347–371 (2018). https://doi.org/10.1007/s42081-018-0019-1

Keywords

  • Financial big data
  • Exploratory data analysis
  • Data visualization
  • Double-log model
  • Skew-symmetric distributions
  • SparkR