Analysis of Big Data Using GLM



This chapter discusses the application of generalized linear models (GLMs) to big data using the divide and recombine (D&R) framework. For binary, count, normal, and multinomial outcome variables, the exponential family of distributions and the corresponding sufficient statistics for the parameters are shown to have great potential for analyzing big data when traditional statistical methods cannot be applied to the entire data set.
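As a rough illustration of the idea, the sketch below uses the simplest possible case: an intercept-only model for a binary outcome with the canonical logit link. It is not the chapter's own procedure. The function names `chunk_stats` and `recombine`, the simulated data, and the chunk size are all assumptions made for this example. Each subset is reduced to its sufficient statistics (the sum of the outcomes and the subset size), and only those small summaries are combined.

```python
import math
import random


def chunk_stats(y_chunk):
    """Sufficient statistics of one subset for an intercept-only
    Bernoulli model: (sum of outcomes, number of observations)."""
    return sum(y_chunk), len(y_chunk)


def recombine(stats):
    """Add the per-subset statistics and return the maximum likelihood
    estimate of the intercept on the logit scale."""
    total_y = sum(s for s, _ in stats)
    total_n = sum(n for _, n in stats)
    p_hat = total_y / total_n
    return math.log(p_hat / (1 - p_hat))


# Simulated binary outcomes (hypothetical data, success probability 0.3).
random.seed(0)
y = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]

# Divide: split the data into subsets that could be processed separately.
chunk_size = 12_500
chunks = [y[i:i + chunk_size] for i in range(0, len(y), chunk_size)]

# Each subset is summarized independently; the raw data are not needed again.
stats = [chunk_stats(c) for c in chunks]

# Recombine: the pooled statistics give the same estimate as the full data.
beta0_dr = recombine(stats)

p_full = sum(y) / len(y)
beta0_full = math.log(p_full / (1 - p_full))

print(beta0_dr, beta0_full)
```

In this special case the recombined estimate matches the full-data estimate because the sufficient statistics simply add across subsets. Models with covariates require more than this sketch shows.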



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh
  2. Institute of Statistical Research and Training, University of Dhaka, Dhaka, Bangladesh
