Automatically Wrangling Spreadsheets into Machine Learning Data Formats

  • Conference paper
Advances in Intelligent Data Analysis XVII (IDA 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11191)

Abstract

To help automate the important pre-processing step in machine learning and data mining, we introduce synth-a-sizer, a tool for semi-automatically wrangling spreadsheets into attribute-value format so that popular machine learning tools can use them. The user only has to mark the cells belonging to a single example. synth-a-sizer is based on inductive programming principles. We introduce synth-a-sizer’s transformations and search algorithm, as well as a heuristic and a distance measure for identifying types. We also report on a first experimental evaluation.
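To make the target "attribute-value format" concrete, here is a minimal Python sketch. It is not synth-a-sizer's method, which the abstract describes as a search over transformations driven by cells the user marks. The sketch hard-codes one hypothetical layout (each example stored as stacked attribute/value rows, with a blank row between examples), and the function name, layout, and sample data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# A spreadsheet is modelled as a list of rows; empty cells are None or "".
Grid = List[List[Optional[str]]]


def to_attribute_value(grid: Grid) -> List[Dict[str, str]]:
    """Hypothetical helper: turn stacked (attribute, value) rows into one
    record per example. A fully blank row ends the current example."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for row in grid:
        if all(cell in (None, "") for cell in row):
            # Blank separator row: close the example collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        attribute, value = row[0], row[1]
        current[attribute] = value
    if current:
        # Flush the last example when the sheet has no trailing blank row.
        records.append(current)
    return records


# Illustrative data only; not taken from the paper.
sheet: Grid = [
    ["name", "iris-1"],
    ["petal_length", "1.4"],
    ["species", "setosa"],
    [None, None],
    ["name", "iris-2"],
    ["petal_length", "4.7"],
    ["species", "versicolor"],
]

rows = to_attribute_value(sheet)
for record in rows:
    print(record)
```

Each resulting dictionary is one example with one value per attribute, which is the shape that attribute-value learners expect. A tool of the kind described in the abstract would have to discover such a transformation itself rather than have it written by hand.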


Notes

  1. http://www.openrefine.org

  2. https://archive.ics.uci.edu/ml/datasets/

Acknowledgement

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No [694980] SYNTH: Synthesising Inductive Data Models).

Author information

Corresponding author

Correspondence to Gust Verbruggen.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Verbruggen, G., De Raedt, L. (2018). Automatically Wrangling Spreadsheets into Machine Learning Data Formats. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science, vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_30

  • DOI: https://doi.org/10.1007/978-3-030-01768-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01767-5

  • Online ISBN: 978-3-030-01768-2

  • eBook Packages: Computer Science, Computer Science (R0)
