Abstract
To help automate the important pre-processing step in machine learning and data mining, we introduce synth-a-sizer, a tool for semi-automatically wrangling spreadsheets into attribute-value format, so that they can be used by popular machine learning tools, only requiring the user to mark cells belonging to one single example. synth-a-sizer is based on inductive programming principles. We introduce synth-a-sizer’s transformations, search algorithm as well as a heuristic and distance measure for identifying types. We also report on a first experimental evaluation.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Data Wrangling Automation, IEEE International Conference on Data Mining (2016). http://users.dsic.upv.es/~flip/DWA2016/
Barowy, D.W., Gulwani, S., Hart, T., Zorn, B.: Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In: ACM SIGPLAN Notices, vol. 50, pp. 218–228. ACM (2015)
Berthold, M.R., et al.: Knime-the konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explor. Newsl. 11(1), 26–31 (2009)
Boullé, M.: Towards automatic feature construction for supervised classification. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8724, pp. 181–196. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44848-9_12
Dasu, T., Johnson, T.: Exploratory Data Mining and Data Cleaning, vol. 479. Wiley, New York (2003)
Dheeru, D., Karra Taniskidou, E.: UCI Machine Learning Repository (2017)
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems, pp. 2962–2970 (2015)
Gulwani, S., Polozov, O., Singh, R.: Program synthesis. Found. Trends® Program. Lang. 4(1–2), 1–119 (2017)
Guyon, I., et al.: A brief review of the ChaLearn AutoML challenge: any-time any-dataset learning without human intervention. In: Workshop on Automatic Machine Learning, pp. 21–30 (2016)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Jin, Z., Anderson, M.R., Cafarella, M., Jagadish, H.: Foofah: transforming data by example. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 683–698. ACM (2017)
Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363–3372. ACM (2011)
Polozov, O., Gulwani, S.: Flashmeta: a framework for inductive program synthesis. In: ACM SIGPLAN Notices, vol. 50, pp. 107–126. ACM (2015)
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855. ACM (2013)
Verbruggen, G., De Raedt, L.: Towards automated relational data wrangling. In: Proceedings of AutoML 2017 @ ECML-PKDD: Automatic Selection, Configuration and Composition of Machine Learning Algorithms, pp. 18–26 (2017)
Acknowledgement
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No [694980] SYNTH: Synthesising Inductive Data Models).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Verbruggen, G., De Raedt, L. (2018). Automatically Wrangling Spreadsheets into Machine Learning Data Formats. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science(), vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_30
Download citation
DOI: https://doi.org/10.1007/978-3-030-01768-2_30
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01767-5
Online ISBN: 978-3-030-01768-2
eBook Packages: Computer ScienceComputer Science (R0)