Abstract
This study considers the visualization and statistical modeling of financial data (e.g., sales and assets) for a large data set of listed and delisted global firms. We present an exploratory data analysis carried out in the R programming language. The results show that a double-log model with a skew-t error distribution is useful for modeling a firm’s total sales volume (in thousands of U.S. dollars) as a function of its number of employees and total assets (in thousands of U.S. dollars). This result is obtained by comparing the Akaike information criteria of several double-log models whose independent and identically distributed error terms follow skew-symmetric distributions, and by further evaluating the models using cross-validation.
Notes
1. The Osiris system is produced by Bureau van Dijk (BvD) KK.
2. R version 3.4.4 (2018-03-15).
5. The Orbis database is produced by BvD KK.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Arellano-Valle, R. B., & Azzalini, A. (2013). The centred parameterization and related quantities of the skew-t distribution. Journal of Multivariate Analysis, 113, 73–90.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178.
Azzalini, A., & Capitanio, A. (2014). The skew-normal and related families. Institute of mathematical statistics monographs. Cambridge: Cambridge University Press.
Cobb, C. W., & Douglas, P. H. (1928). A theory of production. American Economic Review, 18, 139–165.
Efron, B., & Hastie, T. (2016). Computer age statistical inference: algorithms, evidence, and data science. Cambridge: Cambridge University Press.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall/CRC.
Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Thousand Oaks: Sage.
Healy, M. J. R. (1968). Multivariate normal plotting. Applied Statistics, 17, 157–161.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Berlin: Springer.
Jimichi, M. (2010). Building of Financial Database Servers, ISBN: 9784990553005. https://kwansei.repo.nii.ac.jp/ (in Japanese).
Jimichi, M., & Maeda, S. (2014). Visualization and statistical modeling of financial data with R, Poster at the R user conference 2014. http://user2014.stat.ucla.edu/abstracts/posters/48_Jimichi.pdf
Konishi, S., & Kitagawa, G. (2008). Information Criteria and Statistical Modeling. Berlin: Springer.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: a second course in statistics. Reading, Mass: Addison-Wesley.
Ryza, S., Laserson, U., Owen, S., & Wills, J. (2016). Advanced analytics with spark. Newton: O’Reilly.
Saka, C., & Jimichi, M. (2017). Evidence of inequality from accounting data visualization. Taiwan Accounting Review, 13(2), 193–234.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147.
Tukey, J. W. (1977). Exploratory data analysis. Boston: Addison-Wesley Publishing Co.
Unwin, A. (2015). Graphical data analysis with R. London: Chapman and Hall/CRC.
Acknowledgements
The authors wish to thank the reviewers for their helpful comments. This work is partially supported by a Grant-in-Aid for Scientific Research (KAKENHI: No. 16K04022) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN Project ID: jh171002NWJ, jh181001NWJ) in Japan. We would like to thank Mr. Ayumu Masuda of Bureau van Dijk KK for extracting some data set files from the Osiris database system.
Appendices
A Full data set and data manipulation
Our data set is extracted from the “Osiris” database provided by BvD KK. It contains 86 financial indices (e.g., sales and assets) for 82,878 listed and delisted firms from 1985 to 2017. The data file (firmfin.csv) is in CSV format and is over 1.3 GB in size. We use R for exploratory data analysis and load the data through Spark, which responds quickly to user queries by scanning large in-memory data sets. The SparkR package provides an R front end for Spark (cf. Ryza et al. (2016)). Its function read.df loads the data set into R as a Spark DataFrame (sdf) in RStudio.
We use the pipe operator %>% from the magrittr package and the function collect to transform the Spark DataFrame object firmfin.sdf into an R data frame firmfin2015.
Note that only firms with a BvD ID (firmID), a country, and positive values of sales, employees, and assets.total in 2015 are selected.
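The selection rule can be illustrated on an ordinary R data frame. The following is a minimal sketch, not the authors' pipeline: the table `toy`, all of its values, and the column name `year` are assumptions made for illustration, and base R `subset` stands in for the SparkR query.

```r
# Toy stand-in for the collected data; every value is invented for illustration.
toy <- data.frame(
  firmID       = c("JP001", "US002", NA, "DE004", "FR005"),
  country      = c("JP", "US", "GB", NA, "FR"),
  year         = c(2015, 2015, 2015, 2015, 2014),  # hypothetical column name
  sales        = c(1200, 0, 530, 880, 450),
  employees    = c(35, 12, 8, 20, 15),
  assets.total = c(5000, 300, 900, 1500, 700),
  stringsAsFactors = FALSE
)

# Keep rows with an ID, a country, year 2015, and strictly positive
# sales, employees, and total assets.
firmfin2015.toy <- subset(
  toy,
  !is.na(firmID) & !is.na(country) & year == 2015 &
    sales > 0 & employees > 0 & assets.total > 0
)
print(firmfin2015.toy)
```

Strict positivity matters here because a double-log model takes logarithms of all three variables.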
B Computing environments
Our computing environment and software are as follows:

R (R. Ihaka, R. Gentleman, R Core Team, https://www.r-project.org/)

R Packages

dplyr (H. Wickham, http://dplyr.tidyverse.org/)

GGally::ggpairs (B. Schloerke, http://ggobi.github.io/ggally/)

ggplot2 (H. Wickham, http://had.co.nz/ggplot2/)

magrittr (H. Wickham, https://github.com/tidyverse/magrittr)

rgl (D. Murdoch, https://cran.r-project.org/web/packages/rgl/vignettes/rgl.html)

sn (A. Azzalini, http://azzalini.stat.unipd.it/SN/)

SparkR (http://spark.apache.org/)

xtable (D. B. Dahl, http://xtable.r-forge.r-project.org/)


RStudio (RStudio, https://www.rstudio.com/)

Spark 2.2.0 (http://spark.apache.org/)

Sweave (F. Leisch, https://leisch.userweb.mwn.de/Sweave/)
C Estimated regression planes
D Skew-normal distribution and log-skew-normal distribution
D.1 Skew-normal distribution
If the distribution of the random variable X is skew-normal, then we express it as follows:
\[ X \sim \mathsf {SN}(\xi , \omega ^{2}, \alpha ), \]
where \((\xi , \omega ^{2}, \alpha )\) are the direct parameters (DPs). Note that the parameter \(\alpha\) is called the slant parameter.
The plots of the p.d.f.s of the skew-normal distribution for \((\xi , \omega , \alpha )=(0,1,5), (0,1,-5)\) are given in Fig. 18.
If the distribution of the random variable X is skew-normal \(\mathsf {SN}(\xi , \omega ^{2}, \alpha )\), then the expectation parameter \(\mu _{\mathsf {SN}} := \text{ E }(X)= \xi + \omega b \delta\) is called a centred parameter (CP) of the skew-normal distribution (see Azzalini and Capitanio (2014), p. 66), where \(b:=\sqrt{2/\pi }\) and \(\delta :=\alpha /\sqrt{1+\alpha ^{2}}\). Note that \(Z:=(X - \xi )/\omega \sim \mathsf {SN}(0,1,\alpha )\) and \(Z^{2}\) follows the chi-square distribution with 1 degree of freedom:
\[ Z^{2} \sim \chi ^{2}_{1} \tag{9} \]
[see Azzalini and Capitanio (2014), Proposition 2.1(e)]. The Q–Q and P–P plots for the skew-normal distribution are based on this property, and they are typical examples of Healy’s plots (see Healy (1968)).
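Both properties can be checked numerically in base R. The sketch below is an illustration under stated assumptions rather than the authors' code: the helper name `dsn_manual`, the parameter values, the seed, and the sample size are all hypothetical. It uses the standard skew-normal p.d.f. \(f(x) = (2/\omega )\,\phi (z)\,\Phi (\alpha z)\) with \(z=(x-\xi )/\omega\), and draws samples through the representation \(Z = \delta |U_{0}| + \sqrt{1-\delta ^{2}}\,U_{1}\) with independent standard normal \(U_{0}, U_{1}\). For real work, the sn package provides dsn and rsn.

```r
# Skew-normal p.d.f. written out from its definition (hypothetical helper).
dsn_manual <- function(x, xi = 0, omega = 1, alpha = 0) {
  z <- (x - xi) / omega
  2 / omega * dnorm(z) * pnorm(alpha * z)
}

xi <- 1; omega <- 2; alpha <- 5          # illustrative DP values
b     <- sqrt(2 / pi)
delta <- alpha / sqrt(1 + alpha^2)

# The density should integrate to 1, and its mean should equal xi + omega * b * delta.
total   <- integrate(dsn_manual, -Inf, Inf, xi = xi, omega = omega, alpha = alpha)$value
mean_sn <- integrate(function(x) x * dsn_manual(x, xi, omega, alpha), -Inf, Inf)$value
mu_sn   <- xi + omega * b * delta
c(total = total, numerical_mean = mean_sn, formula_mean = mu_sn)

# Draw Z ~ SN(0, 1, alpha) and compare quantiles of Z^2 with chi-square(1).
set.seed(1)
n <- 1e5
z <- delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)
p <- c(0.25, 0.5, 0.75, 0.9)
rbind(empirical = quantile(z^2, p, names = FALSE), chisq1 = qchisq(p, df = 1))
```

Plotting the sorted values of `z^2` against `qchisq(ppoints(n), df = 1)` would give the corresponding Q–Q plot; points close to the diagonal support the skew-normal assumption.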
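D.2 Log-skew-normal distribution
If \(\log (Y )\) follows the skew-normal distribution \(\mathsf {SN}(\xi , \omega ^{2}, \alpha )\), then the distribution of Y is called log-skew-normal \(\mathsf {LSN}(\xi , \omega ^{2}, \alpha )\), and its p.d.f. is
\[ f_{\mathsf {LSN}}(y; \xi , \omega ^{2}, \alpha ) = \frac{2}{\omega y}\, \phi (z)\, \Phi (\alpha z), \quad y > 0, \]
where
\[ z := \frac{\log (y) - \xi }{\omega }, \]
and \(\phi\) and \(\Phi\) denote the p.d.f. and c.d.f. of \(\mathsf {N}(0,1)\), respectively.
See also Azzalini and Capitanio (2014), p. 53.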
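The log-skew-normal density follows from the skew-normal one by the change of variables \(x = \log (y)\), which contributes the factor \(1/y\). A minimal base R sketch of this step is given below; the helper name `dlsn_manual`, the parameter values, and the cut-off `y0` are assumptions chosen for illustration.

```r
# Log-skew-normal p.d.f. obtained by change of variables (hypothetical helper).
dlsn_manual <- function(y, xi = 0, omega = 1, alpha = 0) {
  out <- numeric(length(y))              # density is 0 for y <= 0
  pos <- y > 0
  z   <- (log(y[pos]) - xi) / omega
  out[pos] <- 2 / (omega * y[pos]) * dnorm(z) * pnorm(alpha * z)
  out
}

alpha <- 5
y0    <- 3

# Total mass on (0, Inf) should be 1.
total_lsn <- integrate(dlsn_manual, 0, Inf, alpha = alpha)$value

# P(Y <= y0) computed on the Y scale and on the log scale should agree.
p_y <- integrate(dlsn_manual, 0, y0, alpha = alpha)$value
p_x <- integrate(function(x) 2 * dnorm(x) * pnorm(alpha * x), -Inf, log(y0))$value
c(total = total_lsn, p_on_y_scale = p_y, p_on_log_scale = p_x)
```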
E Skew-t distribution and log-skew-t distribution
E.1 Skew-t distribution
Recall that the following result holds for the ratio of the independent random variables \(Z \sim \mathsf {N}(0,1)\) and \(V \sim \chi ^{2}_{\nu }/\nu\):
\[ \frac{Z}{\sqrt{V}} \sim t_{\nu }. \]
By analogy, if the random variable \(Z_{0}\) has the distribution \(\mathsf {SN}(0,1,\alpha )\) and is independent of V, then the distribution of
\[ T := \frac{Z_{0}}{\sqrt{V}} \]
is called the skew-t distribution \(\mathsf {ST}(0,1,\alpha ,\nu )\) with \(\nu\) degrees of freedom.
From (9), if the random variable \(Z_{0}\) follows a skew-normal distribution \(\mathsf {SN}(0,1,\alpha )\), then \(Z_{0}^{2} \sim \chi ^{2}_{1}\), and if \(T=Z_{0}/\sqrt{V} \sim \mathsf {ST}(0,1,\alpha ,\nu )\), then
\[ T^{2} = \frac{Z_{0}^{2}}{V} \sim F_{1,\nu }, \]
that is, \(T^{2}\) follows the F distribution with 1 and \(\nu\) degrees of freedom.
The Q–Q and P–P plots of the skew-t distribution are based on this property.
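The construction of the skew-t variable as a ratio, and the fact that its square follows the F distribution with 1 and \(\nu\) degrees of freedom, can be checked by simulation. The following base R sketch is an illustration only: the seed, sample size, and the values of `alpha` and `nu` are assumptions, and the skew-normal draw reuses the representation \(Z_{0} = \delta |U_{0}| + \sqrt{1-\delta ^{2}}\,U_{1}\) with independent standard normal \(U_{0}, U_{1}\). The sn package offers rst for direct sampling.

```r
set.seed(2)
n     <- 1e5
alpha <- 5      # illustrative slant parameter
nu    <- 5      # illustrative degrees of freedom
delta <- alpha / sqrt(1 + alpha^2)

z0   <- delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)  # Z0 ~ SN(0, 1, alpha)
v    <- rchisq(n, df = nu) / nu                               # V ~ chi-square(nu) / nu
t_st <- z0 / sqrt(v)                                          # T ~ ST(0, 1, alpha, nu)

# Share of T^2 values below each F(1, nu) quantile; should be close to p.
p        <- c(0.25, 0.5, 0.75, 0.9)
coverage <- sapply(p, function(q) mean(t_st^2 <= qf(q, df1 = 1, df2 = nu)))
rbind(nominal = p, empirical = coverage)
```

Plotting sorted `t_st^2` against `qf(ppoints(n), 1, nu)` gives the Q–Q plot based on this property. Note that the slant parameter drops out of the distribution of the square, so this diagnostic checks the tail weight rather than the direction of skewness.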
If \(Z \sim \mathsf {ST}(0,1,\alpha ,\nu )\), then \(X := \xi + \omega Z \sim \mathsf {ST}(\xi ,\omega ^{2},\alpha ,\nu )\). The parameters \((\xi , \omega ^{2}, \alpha , \nu )\) are the DPs.
The plots of the p.d.f.s of the skew-t distribution for \((\xi , \omega , \alpha , \nu )=(0,1,5,1), (0,1,-5,1)\) are given in Fig. 19.
If the distribution of the random variable X is \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\), then the expectation parameter \(\mu _{\mathsf {ST}} := \text{ E }(X) = \xi + \omega b_{\nu } \delta\) is the CP, where \(b_{\nu }:=\sqrt{\nu /\pi }\, \Gamma ((\nu - 1)/2)/\Gamma (\nu /2)\) and \(\delta =\alpha /\sqrt{1+\alpha ^{2}}\). Note that if \(\nu \le 4\), then the following corrected parameter is called the pseudo-CP:
where \(a (\ge 1)\) is constant, and we often take \(a=1\). See Arellano-Valle and Azzalini (2013) for details.
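The coefficient \(b_{\nu } = \sqrt{\nu /\pi }\, \Gamma ((\nu - 1)/2)/\Gamma (\nu /2)\) is easy to evaluate directly. The sketch below is a hedged illustration: the function name `b_nu` and the chosen values of `nu` and `alpha` are assumptions. It works on the log-gamma scale so that large \(\nu\) does not overflow, and it shows \(b_{\nu }\) approaching the skew-normal constant \(b=\sqrt{2/\pi }\) as \(\nu\) grows.

```r
# b_nu from its definition, computed via lgamma for numerical stability
# (hypothetical helper; requires nu > 1, since the mean does not exist otherwise).
b_nu <- function(nu) {
  stopifnot(all(nu > 1))
  sqrt(nu / pi) * exp(lgamma((nu - 1) / 2) - lgamma(nu / 2))
}

alpha <- 5
delta <- alpha / sqrt(1 + alpha^2)
nu    <- c(2, 5, 10, 100, 1e6)

# Mean of ST(0, 1, alpha, nu) for each nu, next to the skew-normal limit.
data.frame(nu = nu, b_nu = b_nu(nu), mean_ST = b_nu(nu) * delta)
c(b_limit = sqrt(2 / pi), mean_SN = sqrt(2 / pi) * delta)
```

For small \(\nu\) the coefficient is noticeably larger than \(\sqrt{2/\pi }\), which reflects how strongly heavy tails pull the mean away from the location parameter \(\xi\).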
The skew-t distribution \(\mathsf {ST}(0,1,\alpha ,1)\) is called the skew-Cauchy distribution. Note that Fig. 19 shows the p.d.f.s of the skew-Cauchy distributions.
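E.2 Log-skew-t distribution
If the distribution of \(\log (Y)\) is the skew-t distribution \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\), then the distribution of Y is called the log-skew-t distribution \(\mathsf {LST}(\xi , \omega ^{2}, \alpha , \nu )\), and its p.d.f. is
\[ f_{\mathsf {LST}}(y; \xi , \omega ^{2}, \alpha , \nu ) = \frac{1}{y}\, f_{\mathsf {ST}}(\log (y); \xi , \omega ^{2}, \alpha , \nu ), \quad y > 0, \]
where \(f_{\mathsf {ST}}\) denotes the p.d.f. of \(\mathsf {ST}(\xi , \omega ^{2}, \alpha , \nu )\).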
Jimichi, M., Miyamoto, D., Saka, C. et al. Visualization and statistical modeling of financial big data: double-log modeling with skew-symmetric error distributions. Jpn J Stat Data Sci 1, 347–371 (2018). https://doi.org/10.1007/s42081-018-0019-1
Keywords
 Financial big data
 Exploratory data analysis
 Data visualization
 Double-log model
 Skew-symmetric distributions
 SparkR