Primary Steps in Analyzing Data: Tasks and Tools for a Systematic Data Exploration

  • Martin Zwanzig
  • Robert Schlicht
  • Nico Frischbier
  • Uta Berger
Part of the Ecological Studies book series (ECOLSTUD, volume 240)


Understanding the structure, basic properties, and relationships within a dataset is a fundamental prerequisite for an appropriate statistical analysis. Here, we highlight the major principles of data exploration and provide a roadmap, organized around key questions, for a systematic and reproducible analysis. Using an example dataset of throughfall measurements, we demonstrate how several techniques can be applied and evaluated to address common tasks of data analysis: understanding the structure of the dataset, detecting temporal and spatial dependence among observations, identifying outliers and influential observations, checking normality and homogeneity of model residuals, and exploring the relationships between variables. Finally, we discuss when data transformations can, and when they cannot, be used to overcome restrictions on the application of a statistical method. The electronic supplement provides the sample dataset and fully documented computer code, intended as a guideline for conducting an exploratory data analysis in the statistical software environment R.

Supplementary material

464883_1_En_7_MOESM1_ESM.r (56 kb)
DataExploration (R 55 kb)
464883_1_En_7_MOESM2_ESM.txt (779 kb)
Throughfall_dataset (TXT 779 kb)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Martin Zwanzig (1), corresponding author
  • Robert Schlicht (1)
  • Nico Frischbier (2)
  • Uta Berger (1)

  1. Institute of Forest Growth and Forest Computer Sciences, Technische Universität Dresden, Tharandt, Germany
  2. Forestry Research and Competence Centre, ThüringenForst, Gotha, Germany
