DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets

  • Conference paper
Cooperative Information Systems (CoopIS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13591)

Abstract

In recent years, the volume of data to be analyzed has increased tremendously. However, purposeful data analyses on large-scale data require in-depth domain knowledge. A common approach to reducing data volume while preserving interactivity is sampling. However, when working on a sample, the semantic context across the entire dataset is lost, which impedes data preprocessing. In particular, data imputation transformations, which aim to fill in missing values for more accurate data analyses, suffer from this problem. To cope with this issue, we introduce DATA-IMP, a novel human-in-the-loop approach that enables data imputation transformations in an interactive manner while preserving scalability. We implemented a fully working prototype and conducted a comprehensive user study as well as a comparison with several non-interactive data imputation techniques. We show that our approach significantly outperforms state-of-the-art approaches in terms of accuracy, preserves user satisfaction, and enables domain experts to preprocess large-scale data interactively.
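
To make the sampling problem concrete, the following minimal sketch imputes missing values with a per-group mean, once using statistics from the full dataset and once using statistics from a sample only. It is not the DATA-IMP approach itself: the toy data, the helper names (`group_means`, `overall_mean`, `impute`), the deliberately contrived "every second row" sample, and the choice of mean imputation are all assumptions made for illustration.

```python
from statistics import mean


def group_means(rows):
    """Mean of the known values per group; rows are (group, value-or-None)."""
    known = {}
    for group, value in rows:
        if value is not None:
            known.setdefault(group, []).append(value)
    return {group: mean(values) for group, values in known.items()}


def overall_mean(rows):
    """Mean of all known values, ignoring the group."""
    return mean([value for _, value in rows if value is not None])


def impute(rows, reference):
    """Fill missing values in rows using statistics computed on reference.

    A group that never occurs in reference falls back to the overall mean.
    """
    means = group_means(reference)
    fallback = overall_mean(reference)
    return [
        (group, means.get(group, fallback) if value is None else value)
        for group, value in rows
    ]


# Hypothetical toy data: (group, measurement), None marks a missing value.
full = [
    ("A", 10), ("B", 100), ("A", 12), ("C", 50), ("A", None),
    ("B", 104), ("A", 14), ("B", None), ("A", 11), ("C", None),
]

# Contrived sample (every second row) that happens to contain only group A.
sample = full[::2]

from_full = impute(full, reference=full)
from_sample = impute(full, reference=sample)

for (group, original), (_, a), (_, b) in zip(full, from_full, from_sample):
    if original is None:
        print(f"group {group}: full-data fill = {a}, sample-based fill = {b}")
```

In this sketch the missing value in group B is filled from the group B mean when the full dataset is available, but from the group A values when only the sample is visible, because groups B and C never appear in the sample. This is one simple way in which context that exists only across the entire dataset can be lost.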

Notes

  1. London Crime Data: https://www.kaggle.com/LondonDataStore/london-crime

  2. OpenRefine: https://openrefine.org

  3. KNIME: https://www.knime.com

  4. RapidMiner: https://rapidminer.com

  5. Trifacta Wrangler: https://www.trifacta.com

  6. https://www.kaggle.com/LondonDataStore/london-crime

  7. https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university

  8. https://www.h2o.ai/

Corresponding author

Correspondence to Michael Behringer.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Behringer, M., Fritz, M., Schwarz, H., Mitschang, B. (2022). DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets. In: Sellami, M., Ceravolo, P., Reijers, H.A., Gaaloul, W., Panetto, H. (eds) Cooperative Information Systems. CoopIS 2022. Lecture Notes in Computer Science, vol 13591. Springer, Cham. https://doi.org/10.1007/978-3-031-17834-4_4

  • DOI: https://doi.org/10.1007/978-3-031-17834-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17833-7

  • Online ISBN: 978-3-031-17834-4

  • eBook Packages: Computer Science, Computer Science (R0)
