DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets

Behringer, Michael; Fritz, Manuel; Schwarz, Holger; Mitschang, Bernhard

doi:10.1007/978-3-031-17834-4_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13591)

Included in the conference series: International Conference on Cooperative Information Systems

Abstract

In recent years, the volume of data to be analyzed has increased tremendously. However, purposeful analyses of large-scale data require in-depth domain knowledge. A common approach to reducing data volume while preserving interactivity is the use of sampling algorithms. When only a sample is used, however, the semantic context of the entire dataset is lost, which impedes data preprocessing. In particular, data imputation transformations, which aim to fill empty values for more accurate data analyses, suffer from this problem. To cope with this issue, we introduce DATA-IMP, a novel human-in-the-loop approach that enables data imputation transformations in an interactive manner while preserving scalability. We implemented a fully working prototype and conducted a comprehensive user study as well as a comparison with several non-interactive data imputation techniques. We show that our approach significantly outperforms state-of-the-art approaches in accuracy, preserves user satisfaction, and enables domain experts to preprocess large-scale data interactively.
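
The sampling problem described in the abstract can be illustrated with a minimal sketch. It is not the DATA-IMP method, which the abstract does not detail. The records, the group-mean strategy, and the names `FULL`, `SAMPLE`, `group_means`, and `impute` are all assumptions made for illustration. The sketch derives a group-mean imputation rule once from a sample and once from the full data, then applies both rules to the full data.

```python
from statistics import mean

# Hypothetical toy records: (group, value); None marks a missing value.
FULL = [
    ("Camden", 10), ("Camden", 14), ("Camden", None),
    ("Sutton", 2), ("Sutton", 4), ("Sutton", None),
]

# Assumed sample: it happens to contain only rows of the first group.
SAMPLE = FULL[:3]


def group_means(rows):
    """Return the mean of the observed (non-missing) values per group."""
    groups = {}
    for key, value in rows:
        if value is not None:
            groups.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in groups.items()}


def impute(rows, means, fallback):
    """Fill each missing value with its group mean, or with fallback
    when the group was never seen while deriving the means."""
    return [
        (key, value if value is not None else means.get(key, fallback))
        for key, value in rows
    ]


# Global mean of the observed sample values, used for unseen groups.
fallback = mean(v for _, v in SAMPLE if v is not None)

# Rule derived from the sample versus rule derived from the full data.
from_sample = impute(FULL, group_means(SAMPLE), fallback)
from_full = impute(FULL, group_means(FULL), fallback)

print("sample-derived:", from_sample[-1])
print("full-derived:  ", from_full[-1])
```

Because the sample contains no "Sutton" rows, the sample-derived rule fills the missing "Sutton" value with the sample's global mean, whereas the rule derived from the full data uses the much lower mean of that group. This is one concrete way in which context lost through sampling can distort an imputation.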

Notes

1. London Crime Data: https://www.kaggle.com/LondonDataStore/london-crime.
2. OpenRefine: https://openrefine.org.
3. KNIME: https://www.knime.com.
4. RapidMiner: https://rapidminer.com.
5. Trifacta Wrangler: https://www.trifacta.com.
6. https://www.kaggle.com/LondonDataStore/london-crime.
7. https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university.
8. https://www.h2o.ai/.

Author information

Authors and Affiliations

Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstraße 38, 70569, Stuttgart, Germany
Michael Behringer, Manuel Fritz, Holger Schwarz & Bernhard Mitschang

Corresponding author

Correspondence to Michael Behringer.

Editor information

Editors and Affiliations

Telecom SudParis - Institut Polytechnique de Paris, Evry, France
Mohamed Sellami
University of Milan, Milan, Italy
Paolo Ceravolo
Utrecht University, Utrecht, The Netherlands
Hajo A. Reijers
Telecom SudParis - Institut Polytechnique de Paris, Evry, France
Walid Gaaloul
University of Lorraine, Vandoeuvre-les-Nancy, France
Hervé Panetto

About this paper

Cite this paper

Behringer, M., Fritz, M., Schwarz, H., Mitschang, B. (2022). DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets. In: Sellami, M., Ceravolo, P., Reijers, H.A., Gaaloul, W., Panetto, H. (eds) Cooperative Information Systems. CoopIS 2022. Lecture Notes in Computer Science, vol 13591. Springer, Cham. https://doi.org/10.1007/978-3-031-17834-4_4

DOI: https://doi.org/10.1007/978-3-031-17834-4_4
Published: 25 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17833-7
Online ISBN: 978-3-031-17834-4
eBook Packages: Computer Science, Computer Science (R0)
