Cleaning, standardization, and assessment of the accuracy and consistency of the Yellowstone National Park dataset (log book)

  • Research Article
  • Earth Science Informatics

Abstract

Yellowstone National Park (YNP), Wyoming, USA, contains over 10,000 geothermal features, of which 2 to 5% are geysers. Yellowstone has about half of the world’s geysers, and the majority of YNP geysers are located in the Upper Geyser Basin. Beginning in 1970, details (time of eruption, height, duration, etc.) of the activity of about 25 geysers have been recorded in log books, later transcribed into an electronic dataset, and posted on the park’s website. The data were collected by park rangers, visitors, and geyser enthusiasts, among others, by direct observation, camera, electronic means, etc. The dataset contains a great deal of information that is relevant to scientists, educators, and the public. However, the use of the dataset is severely limited without cleaning and standardization. Given its size, the time span over which the data were collected, and the number of people involved in collecting them, it is inevitable that the dataset contains many inconsistencies. The dataset has been cleaned, standardized, reorganized in some parts, and converted to a spreadsheet, which makes it much better suited for computation and analysis. The reorganization consists of two steps: step one was to move text-type and extra information into a newly created column; step two was to reorder the information in a set of records so that each data entry is consistent with the column heading under which it should have been listed. The overall and monthly statistical summaries of the data show that interval and duration both follow bimodal normal distributions, height is normally distributed, and preplay displays a Rayleigh-type distribution. Comparison of the YNP dataset and the electronic dataset was not feasible for all geysers and all variables; however, where it is feasible, as is the case for interval data, the two datasets are nearly identical.
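
As an illustration of step one of the reorganization, the following Python sketch moves free text out of a numeric column and into a separate notes column. It is a minimal sketch under stated assumptions, not the procedure used in the article (the acknowledgments mention that part of the cleaning was done in Excel). The function names, the "notes" column name, the regular expression, and the sample values are all hypothetical.

```python
import re

# Assumed cell layout (hypothetical): an optional leading number such as
# "92" or "4.5", followed by any free text the observer added.
_NUMERIC_PREFIX = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.*)$")


def split_numeric_and_text(cell):
    """Return (value, note): the leading number as a float, plus trailing text.

    Cells with no leading number yield (None, stripped text);
    missing cells yield (None, "").
    """
    if cell is None:
        return None, ""
    match = _NUMERIC_PREFIX.match(cell)
    if match is None:
        return None, cell.strip()
    return float(match.group(1)), match.group(2).strip()


def clean_records(records, column, note_column="notes"):
    """Return new records with text moved from `column` into `note_column`.

    The input records are not modified.
    """
    cleaned = []
    for record in records:
        value, note = split_numeric_and_text(record.get(column))
        new_record = dict(record)
        new_record[column] = value
        existing = new_record.get(note_column, "")
        # Append the extracted text to any note that was already present.
        new_record[note_column] = "; ".join(p for p in (existing, note) if p)
        cleaned.append(new_record)
    return cleaned


# Made-up example rows, not taken from the YNP dataset.
rows = [
    {"geyser": "Old Faithful", "duration": "4.5 min (est.)"},
    {"geyser": "Old Faithful", "duration": "approx"},
]
cleaned_rows = clean_records(rows, "duration")
```

Keeping the removed text in its own column, rather than discarding it, preserves the observer's remarks while leaving the original column purely numeric for computation.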

Acknowledgments

I would like to thank Lynn Stephens, Marion Powell, Mary Beth Schwarz, and Don Might of the Geyser Observation and Study Association (GOSA) and Ralph Taylor for making the Old Faithful log book data and the electronic data, respectively, available to the public; Kieran O’Hara for introducing me to the YNP data; and two anonymous reviewers for their excellent comments and suggestions, which improved the manuscript tremendously. I acknowledge the financial support of the Peden fund from Elizabethtown Community and Technical College. Last but not least, I want to acknowledge the contribution of Mohammed Altayeb, a student at the Al Akhawayn University School of Business in Ifrane, Morocco, who used Excel to do part of the data cleaning during my 2015 summer visit to AUI.

Author information

Corresponding author

Correspondence to E. K. Esawi.

Additional information

Communicated by: H. A. Babaie

About this article

Cite this article

Esawi, E.K. Cleaning, standardization, and assessment of the accuracy and consistency of the Yellowstone National Park dataset (log book). Earth Sci Inform 9, 281–289 (2016). https://doi.org/10.1007/s12145-015-0248-9

  • DOI: https://doi.org/10.1007/s12145-015-0248-9
