Flexible Heavy Tailed Distributions for Big Data

Zhang, Yuanyuan; Nadarajah, Saralees

doi:10.1007/s40745-017-0113-4

Published: 10 June 2017

Volume 4, pages 421–432, (2017)

Annals of Data Science

Abstract

The Pareto type I distribution (also known as the power law distribution and Zipf’s law) appears to be the main distribution used to model heavy tailed phenomena in the big data literature. The Pareto type I distribution, being one of the oldest heavy tailed distributions, is not very flexible. Here, we show the flexibility of four other heavy tailed distributions for modeling four big data sets from social networks. The Pareto type I distribution is shown not to provide the best, or even an adequate, fit for any of the data sets.
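
As a rough illustration of what fitting the Pareto type I distribution involves, the sketch below computes its closed-form maximum likelihood estimates and an AIC value that could be compared against other candidate distributions. It is not the authors' code: the function names are hypothetical, the data are synthetic, and the use of AIC as the comparison criterion is an assumption made for this example.

```python
import math
import random


def fit_pareto_type1(data):
    """Closed-form MLE for Pareto type I with density a * k**a / x**(a + 1), x >= k.

    Returns (scale k, shape a). Hypothetical helper, for illustration only.
    """
    n = len(data)
    scale = min(data)  # the MLE of the scale is the sample minimum
    log_sum = sum(math.log(x / scale) for x in data)
    shape = n / log_sum  # the MLE of the shape given that scale
    return scale, shape


def pareto_type1_loglik(data, scale, shape):
    """Log-likelihood of Pareto type I at the given parameters."""
    n = len(data)
    sum_log_x = sum(math.log(x) for x in data)
    return n * math.log(shape) + n * shape * math.log(scale) - (shape + 1) * sum_log_x


def aic(loglik, n_params):
    """Akaike information criterion; smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik


# Synthetic sample drawn by inverse transform (assumed parameters, not real data).
rng = random.Random(0)
true_scale, true_shape = 1.0, 2.5
sample = [true_scale * (1 - rng.random()) ** (-1 / true_shape) for _ in range(20000)]

scale_hat, shape_hat = fit_pareto_type1(sample)
loglik_hat = pareto_type1_loglik(sample, scale_hat, shape_hat)
aic_pareto = aic(loglik_hat, n_params=2)
```

Repeating the same log-likelihood and AIC computation for each alternative heavy tailed distribution would give one number per model to compare, with the smallest AIC preferred under this criterion.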

Author information

Authors and Affiliations

School of Mathematics, University of Manchester, Manchester, M13 9PL, UK
Yuanyuan Zhang & Saralees Nadarajah

Corresponding author

Correspondence to Saralees Nadarajah.

About this article

Cite this article

Zhang, Y., Nadarajah, S. Flexible Heavy Tailed Distributions for Big Data. Ann. Data. Sci. 4, 421–432 (2017). https://doi.org/10.1007/s40745-017-0113-4

Received: 14 April 2017
Revised: 02 May 2017
Accepted: 21 May 2017
Published: 10 June 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s40745-017-0113-4
