
Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets


Three large training sets were investigated to determine optimal sample sizes for diatom-based inference models. The sample sets represented (1) assemblages from Great Lakes coastlines, (2) phytoplankton from the pelagic Great Lakes and (3) surface sediment assemblages from Minnesota lakes. Diatom-based weighted-averaging models to infer nutrient concentrations were developed for each training set. Training subsets ranging in size from 10 samples to the maximum number of samples were created through random sample selection, and the performance of each resulting model was evaluated. For each model iteration, diatom-inferred (DI) nutrient data were related to stressor data (e.g., adjacent agricultural or urban development) to characterize the ability of each model to track human activities. The relationships between sample size and the model performance parameters (DI-stressor correlations and model r², error and bias) were used to determine the minimum sample size needed to optimize models for each region. Depending on the training set, at least 40–70 samples were needed to capture enough of the variation in diatom assemblages and environmental conditions that non-analog situations should be rare, so that a model applied to any sample assemblage from the region should give an unambiguous result. Caution is recommended when working with smaller training sets unless it is certain that the selected samples reflect the regional variability in diatom assemblages and environmental conditions.
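The subsampling procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic, the function names (`wa_fit`, `wa_predict`, `subsample_curve`, `make_synthetic`) are hypothetical, the weighted-averaging model omits any deshrinking step, and performance is measured as apparent (non-cross-validated) r² and root mean squared error on each random subset.

```python
import math
import random


def wa_fit(abund, env):
    """Taxon optima: abundance-weighted mean of the environmental variable."""
    optima = []
    for k in range(len(abund[0])):
        total = sum(row[k] for row in abund)
        if total > 0:
            optima.append(sum(row[k] * x for row, x in zip(abund, env)) / total)
        else:
            optima.append(None)  # taxon absent from this subset
    return optima


def wa_predict(abund, optima):
    """Inferred value per sample: abundance-weighted mean of taxon optima."""
    preds = []
    for row in abund:
        num = den = 0.0
        for y, u in zip(row, optima):
            if u is not None:
                num += y * u
                den += y
        preds.append(num / den if den > 0 else float("nan"))
    return preds


def r2_rmse(obs, pred):
    """Squared Pearson correlation and root mean squared error."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    r2 = sxy * sxy / (sxx * syy) if sxx > 0 and syy > 0 else float("nan")
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    return r2, rmse


def subsample_curve(abund, env, sizes, reps, seed=0):
    """Mean apparent r2 and RMSE over `reps` random subsets of each size."""
    rng = random.Random(seed)
    idx = list(range(len(env)))
    curve = {}
    for n in sizes:
        stats = []
        for _ in range(reps):
            pick = rng.sample(idx, n)  # random selection without replacement
            sub_a = [abund[i] for i in pick]
            sub_e = [env[i] for i in pick]
            stats.append(r2_rmse(sub_e, wa_predict(sub_a, wa_fit(sub_a, sub_e))))
        curve[n] = (sum(s[0] for s in stats) / reps,
                    sum(s[1] for s in stats) / reps)
    return curve


def make_synthetic(n_samples=80, n_taxa=25, seed=1):
    """Synthetic data (assumption): unimodal taxon responses on one gradient."""
    rng = random.Random(seed)
    env = [rng.uniform(0.5, 2.5) for _ in range(n_samples)]
    centres = [rng.uniform(0.0, 3.0) for _ in range(n_taxa)]
    abund = []
    for x in env:
        raw = [math.exp(-((x - c) ** 2) / (2 * 0.4 ** 2)) * rng.uniform(0.5, 1.5)
               for c in centres]
        total = sum(raw)
        abund.append([v / total for v in raw])  # relative abundances
    return abund, env


if __name__ == "__main__":
    abund, env = make_synthetic()
    for n, (r2, rmse) in subsample_curve(abund, env, range(10, 81, 10), 20).items():
        print(n, round(r2, 3), round(rmse, 3))
```

Because a linear deshrinking step would not change r², only the error values in this sketch would differ from those of a deshrunk model. A fuller analysis would also relate the inferred values to stressor data at each sample size, as the abstract describes.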






The Minnesota lake dataset has been progressively developed by Steve Heiskary and Mark Tomasek (Minnesota Pollution Control Agency), Dan Engstrom, Mark Edlund, Shawn Schottler and Joy Ramstack (St. Croix Watershed Research Station). Amy Kireta, Gerald Sgro, Norman Andresen and Michael Ferguson supported diatom assessments for GLEI samples. Michael Agbeti supported diatom assessments of the GLNPO phytoplankton samples. There are several people to thank for GLEI project management and field support, including Valerie Brady, Jerry Henneck, John Ameel, Gerald Niemi, John (Jack) Kelly, Russell Kreis and Jeffrey Johansen. This research was supported by grants to E. Reavie from the US Environmental Protection Agency under Cooperative Agreements EPA/R–8286750 (GLEI) and GL-00E23101 (GLNPO). This document has not been subjected to the EPA’s required peer and policy review and therefore does not necessarily reflect the view of the Agency, and no official endorsement should be inferred. This is contribution number 530 of the Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota Duluth.

Author information


Corresponding author

Correspondence to Euan D. Reavie.

Additional information

Handling Editor: Piet Spaak.


About this article

Cite this article

Reavie, E.D., Juggins, S. Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets. Aquat Ecol 45, 529–538 (2011).


  • Diatoms
  • Stressors
  • Training sets
  • Inference models
  • Sample size
  • Models