Abstract
In recent years, the volume of data to be analyzed has increased tremendously. However, purposeful analyses of large-scale data require in-depth domain knowledge. A common approach to reducing data volume while preserving interactivity is to use sampling algorithms. However, when working on a sample, the semantic context of the entire dataset is lost, which impedes data preprocessing. In particular, data imputation transformations, which aim to fill in empty values for more accurate data analyses, suffer from this problem. To cope with this issue, we introduce DATA-IMP, a novel human-in-the-loop approach that enables data imputation transformations in an interactive manner while preserving scalability. We implemented a fully working prototype and conducted a comprehensive user study as well as a comparison with several non-interactive data imputation techniques. We show that our approach significantly outperforms state-of-the-art approaches in terms of accuracy, preserves user satisfaction, and enables domain experts to preprocess large-scale data in an interactive manner.
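The sampling problem described above can be illustrated with a minimal sketch. It is not the DATA-IMP approach itself: it assumes plain mean imputation as a stand-in technique, a synthetic skewed column, and hypothetical helper names (`observed_mean`, `mean_impute`). It only shows that a fill value derived from a small sample can differ from the one derived from the entire dataset.

```python
import random
import statistics


def observed_mean(values):
    """Mean over the non-missing (non-None) entries only."""
    return statistics.fmean([v for v in values if v is not None])


def mean_impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


# Synthetic, skewed column (assumption: stands in for a large dataset).
rng = random.Random(42)
column = [rng.expovariate(1 / 50) for _ in range(10_000)]

# Blank out 1,000 distinct positions to simulate missing values.
for i in rng.sample(range(len(column)), 1_000):
    column[i] = None

# A small random sample, as an interactive tool might display.
sample = rng.sample(column, 100)

# Fill values computed with and without the full-dataset context.
full_fill = observed_mean(column)
sample_fill = observed_mean(sample)

imputed = mean_impute(column, full_fill)

print(f"fill value from full data: {full_fill:.2f}")
print(f"fill value from sample:    {sample_fill:.2f}")
```

The gap between the two printed values depends on the sample size and on how skewed the data are. A transformation specified on the sample alone would carry that gap into every imputed cell.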
Notes
- 1. London Crime Data: https://www.kaggle.com/LondonDataStore/london-crime.
- 2. OpenRefine: https://openrefine.org.
- 3. KNIME: https://www.knime.com.
- 4. RapidMiner: https://rapidminer.com.
- 5. Trifacta Wrangler: https://www.trifacta.com.
- 6.
- 7.
- 8.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Behringer, M., Fritz, M., Schwarz, H., Mitschang, B. (2022). DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets. In: Sellami, M., Ceravolo, P., Reijers, H.A., Gaaloul, W., Panetto, H. (eds) Cooperative Information Systems. CoopIS 2022. Lecture Notes in Computer Science, vol 13591. Springer, Cham. https://doi.org/10.1007/978-3-031-17834-4_4
DOI: https://doi.org/10.1007/978-3-031-17834-4_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17833-7
Online ISBN: 978-3-031-17834-4
eBook Packages: Computer Science, Computer Science (R0)