
Predictive spreadsheet autocompletion with constraints

  • Samuel Kolb
  • Stefano Teso
  • Anton Dries
  • Luc De Raedt

Abstract

Spreadsheets are arguably the most accessible data-analysis tool and are used by millions of people. Although they lie at the core of most business practices, working with spreadsheets can be error-prone, using formulas requires training and, crucially, spreadsheet users do not have access to state-of-the-art analysis techniques offered by machine learning. To tackle these issues, we introduce the novel task of predictive spreadsheet autocompletion, where the goal is to automatically predict the missing entries in a spreadsheet. This task is highly non-trivial: cells can hold heterogeneous data types and there may be unobserved relationships between their values, such as constraints or probabilistic dependencies. Critically, the exact prediction task itself is not given. We consider a simplified, yet non-trivial, setting and propose a principled probabilistic model to solve it. Our approach combines black-box predictive models specialized for different predictive tasks (e.g., classification, regression) with constraints and formulas detected by a constraint learner, and produces a maximally likely prediction for all target cells that is consistent with the constraints. Overall, our approach brings us one step closer to allowing end users to leverage machine learning in their workflows without writing a single line of code.
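The core idea of the abstract, combining per-cell predictive distributions with learned constraints and returning the maximally likely consistent completion, can be illustrated with a toy sketch. All names, data, and the brute-force enumeration below are illustrative assumptions, not the paper's actual algorithm, which the abstract leaves unspecified:

```python
import itertools

def autocomplete(distributions, constraint):
    """Return the maximally likely joint completion consistent with a constraint.

    distributions: dict mapping each target cell to a {value: probability} dict,
                   as produced by some black-box predictor (assumed independent).
    constraint:    function mapping a full assignment to True/False, standing in
                   for a formula or constraint found by a constraint learner.
    """
    best, best_p = None, -1.0
    cells = list(distributions)
    # Brute-force enumeration of joint completions (fine for a toy example).
    for values in itertools.product(*(distributions[c] for c in cells)):
        assignment = dict(zip(cells, values))
        if not constraint(assignment):
            continue  # discard completions that violate the learned constraint
        p = 1.0
        for c in cells:
            p *= distributions[c][assignment[c]]  # independence assumption
        if p > best_p:
            best, best_p = assignment, p
    return best, best_p

# Hypothetical example: known cells A2 = 3, B2 = 4; predictors propose
# candidate values for the missing cells C2 and D2.
dists = {
    "C2": {6: 0.5, 7: 0.4, 8: 0.1},
    "D2": {"low": 0.7, "high": 0.3},
}
# A learned formula-style constraint: C2 must equal A2 + B2.
pred, prob = autocomplete(dists, lambda a: a["C2"] == 3 + 4)
# The most probable C2 value (6) is inconsistent, so the method falls back
# to C2 = 7 with D2 = "low", the likeliest constraint-satisfying completion.
```

The point of the sketch is the interplay the abstract describes: the predictor alone would pick C2 = 6, but the constraint overrides it, and the final answer is the best completion that both models agree is admissible.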

Keywords

Spreadsheet autocompletion · Bayesian networks · Constraint learning · Machine learning

Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  • Samuel Kolb (1)
  • Stefano Teso (1), corresponding author
  • Anton Dries (1)
  • Luc De Raedt (1)

  1. KU Leuven, Leuven, Belgium
