Flexible Heavy Tailed Distributions for Big Data

Annals of Data Science

Abstract

The Pareto type I distribution (also known as the power law distribution and Zipf’s law) appears to be the main distribution used to model heavy tailed phenomena in the big data literature. The Pareto type I distribution, being one of the oldest heavy tailed distributions, is not very flexible. Here, we show the flexibility of four other heavy tailed distributions in modeling four big data sets from social networks. The Pareto type I distribution is shown not to provide the best fit, or even an adequate fit, for any of the data sets.
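
As a rough illustration of what checking the adequacy of a Pareto type I fit can involve, the sketch below fits the distribution by closed-form maximum likelihood and computes a Kolmogorov-Smirnov distance between the data and the fitted model. It is not the authors' procedure: the function names `fit_pareto_type1`, `pareto_cdf` and `ks_statistic` are hypothetical, the sample is synthetic, and the parameterization assumed is the density f(x) = alpha * x_m^alpha / x^(alpha + 1) for x >= x_m.

```python
import math
import random


def fit_pareto_type1(data):
    """Closed-form maximum likelihood fit of a Pareto type I distribution.

    Returns (x_m, alpha): the estimated scale (sample minimum) and shape.
    Assumes all observations are strictly positive.
    """
    x_m = min(data)
    if x_m <= 0:
        raise ValueError("Pareto type I requires strictly positive data")
    log_sum = sum(math.log(x / x_m) for x in data)
    if log_sum == 0:
        raise ValueError("all observations are identical; shape is undefined")
    alpha = len(data) / log_sum
    return x_m, alpha


def pareto_cdf(x, x_m, alpha):
    """Pareto type I distribution function F(x) = 1 - (x_m / x)^alpha."""
    return 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha


def ks_statistic(data, x_m, alpha):
    """Largest gap between the empirical CDF and the fitted Pareto CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = pareto_cdf(x, x_m, alpha)
        # Compare the fitted CDF with the empirical CDF just after and
        # just before the jump at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    # Synthetic data only: a Pareto type I sample with scale 2 and shape 1.5.
    random.seed(0)
    sample = [2.0 * random.paretovariate(1.5) for _ in range(5000)]
    x_m_hat, alpha_hat = fit_pareto_type1(sample)
    print("scale:", x_m_hat, "shape:", alpha_hat)
    print("KS distance:", ks_statistic(sample, x_m_hat, alpha_hat))
```

A small distance suggests the fitted Pareto type I curve tracks the data closely, while a large one is a sign of the kind of inadequate fit the abstract reports. Because the parameters are estimated from the same sample, standard Kolmogorov-Smirnov critical values would not apply directly to this distance.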

Author information

Corresponding author

Correspondence to Saralees Nadarajah.

About this article

Cite this article

Zhang, Y., Nadarajah, S. Flexible Heavy Tailed Distributions for Big Data. Ann. Data. Sci. 4, 421–432 (2017). https://doi.org/10.1007/s40745-017-0113-4

  • DOI: https://doi.org/10.1007/s40745-017-0113-4
