Chapter 3: Linear Regression Models: Diagnostics and Model-Building

  • Peter K. Dunn
  • Gordon K. Smyth
Part of the Springer Texts in Statistics book series (STS)


As the previous two chapters have demonstrated, building a linear regression model, or any regression model, is aided by exploratory plots of the data, by reflecting on the experimental design, and by considering the scientific relationships between the variables. This process should ensure that the model is broadly appropriate for the data. Once a candidate model has been fitted, however, specialist measures and plots can examine the model assumptions and diagnose possible problems in greater detail. The process of examining and identifying possible violations of model assumptions is called diagnostic analysis. This chapter describes the tools of diagnostic analysis for linear regression models, then discusses some courses of action that might alleviate the problems identified.

The assumptions of linear regression models are first reviewed (Sect. 3.2), then residuals, the main tools of diagnostic analysis, are defined (Sect. 3.3). A discussion follows of leverage, a measure of how far an observation lies from the average observation location (Sect. 3.4). The various diagnostic tools for checking the model assumptions are then introduced (Sect. 3.5), followed by techniques for identifying unusual and influential observations (Sect. 3.6). The terminology of residuals is summarised in Sect. 3.7. Techniques for fixing any weaknesses in the models are summarised in Sect. 3.8 and explained in greater detail in Sects. 3.9 to 3.13. Finally, the issue of collinearity is discussed (Sect. 3.14).
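To make the two central quantities above concrete, the following is a minimal sketch of how raw residuals, leverages and standardized residuals might be computed for a least-squares fit. It is an illustration only, not the chapter's own code: the use of Python with NumPy, the function name leverage_and_residuals and the small data set are all assumptions introduced here. The leverages are taken as the diagonal of the hat matrix H = X(X'X)^{-1}X', which depends only on the explanatory variables, and each raw residual is divided by s*sqrt(1 - h) to standardize it.

```python
import numpy as np


def leverage_and_residuals(X, y):
    """Least-squares fit returning leverages, raw and standardized residuals.

    Hypothetical helper for illustration; X must include any intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Thin QR decomposition X = QR, so the hat matrix is H = Q Q^T and
    # the leverages are the row sums of squares of Q.
    Q, R = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)

    # Least-squares coefficients solve R beta = Q^T y.
    beta = np.linalg.solve(R, Q.T @ y)

    raw = y - X @ beta                    # raw residuals
    s2 = raw @ raw / (n - p)              # unbiased estimate of the variance
    standardized = raw / np.sqrt(s2 * (1.0 - h))
    return h, raw, standardized


# Made-up data: the last x value sits far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])

h, raw, std = leverage_and_residuals(X, y)
print("leverages:", np.round(h, 3))
print("standardized residuals:", np.round(std, 3))
```

In this sketch the final observation has the largest leverage because its x value is furthest from the mean of x, whatever its response value happens to be. The leverages sum to the number of fitted coefficients, which gives a quick sanity check on the calculation.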



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Peter K. Dunn (1)
  • Gordon K. Smyth (2)
  1. Faculty of Science, Health, Education and Engineering, School of Health of Sport Science, University of the Sunshine Coast, Queensland, Australia
  2. Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
