Abstract
The Pareto type I distribution (also known as the power law distribution and Zipf’s law) appears to be the main distribution used to model heavy-tailed phenomena in the big data literature. As one of the oldest heavy-tailed distributions, however, it is not very flexible. Here, we demonstrate the flexibility of four other heavy-tailed distributions in modeling four big data sets from social networks. The Pareto type I distribution is shown not to provide the best fit, or even an adequate fit, for any of the data sets.
Cite this article
Zhang, Y., Nadarajah, S. Flexible Heavy Tailed Distributions for Big Data. Ann. Data. Sci. 4, 421–432 (2017). https://doi.org/10.1007/s40745-017-0113-4