Materia: A Data Quality Control Embedded Domain Specific Language in Python

  • Conference paper
Business Information Systems Workshops (BIS 2020)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 394)

Abstract

Current solutions for data quality control (QC) in the environmental sciences are locked within proprietary platforms or reliant on specialized software. This can pose a problem for data users who attempt to integrate QC into their existing workflows. To address this limitation, we developed an embedded domain specific language (EDSL), Materia, that provides functions, data structures, and a fluent syntax for defining and executing quality control tests on data. Materia enables developers to more easily integrate QC into complex data pipelines and makes QC more accessible for students and citizen scientists. We evaluate Materia in two ways: productivity and a quantitative performance analysis. Our productivity examples show how Materia can simplify complex descriptions of tests in Pandas and mirror natural language descriptions of common QC tests. We also demonstrate that Materia achieves satisfactory performance, with over 200,000 floating-point values processed in under three seconds.
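
To illustrate what a fluent syntax for a QC test can look like, the following is a minimal sketch of a chainable range test in plain Python. It is not Materia's API: the class `FluentTest`, its methods `greater_than`, `less_than`, and `run`, and the sample temperature values and bounds are all hypothetical names and data chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FluentTest:
    """Hypothetical fluent QC test builder; NOT Materia's actual API."""
    name: str
    checks: List[Callable[[float], bool]] = field(default_factory=list)

    def greater_than(self, lower: float) -> "FluentTest":
        # A value passes this check when it is strictly above `lower`.
        self.checks.append(lambda v: v > lower)
        return self  # returning self is what makes the calls chainable

    def less_than(self, upper: float) -> "FluentTest":
        # A value passes this check when it is strictly below `upper`.
        self.checks.append(lambda v: v < upper)
        return self

    def run(self, values: List[float]) -> List[bool]:
        # True means the value passed every check; False means it is flagged.
        return [all(check(v) for check in self.checks) for v in values]


# Assumed sample data and bounds, for illustration only.
temperatures = [12.5, -80.0, 21.3, 999.0]
range_test = FluentTest("air temperature range").greater_than(-50.0).less_than(60.0)
passed = range_test.run(temperatures)
flagged = [v for v, ok in zip(temperatures, passed) if not ok]
print(flagged)
```

The chained call reads close to the natural-language statement "air temperature is greater than -50 and less than 60", which is the kind of correspondence the abstract describes.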

Acknowledgements

We thank Michelle Strout for her invaluable input throughout the development of this project and for her editing of this paper. We also thank Kate Isaacs for her input in editing this paper and Chase Carthen for his input on the design of this language.

Author information

Corresponding author

Correspondence to Connor Scully-Allison.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Scully-Allison, C. (2020). Materia: A Data Quality Control Embedded Domain Specific Language in Python. In: Abramowicz, W., Klein, G. (eds) Business Information Systems Workshops. BIS 2020. Lecture Notes in Business Information Processing, vol 394. Springer, Cham. https://doi.org/10.1007/978-3-030-61146-0_23

  • DOI: https://doi.org/10.1007/978-3-030-61146-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61145-3

  • Online ISBN: 978-3-030-61146-0

  • eBook Packages: Computer Science, Computer Science (R0)
