KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10816)


A data lake is a loosely structured, large-scale collection of data that is usually fed with almost no requirement on data quality. This approach aims to eliminate any human effort before the data is actually exploited, but it only delays the problem, since preparing and querying a data lake is usually a hard task. We address this problem by introducing Kayak, a framework that helps data scientists define and optimize data preparation pipelines. Since in many cases approximate results, which can be computed rapidly, are informative enough, Kayak allows users to specify their needs in terms of accuracy over performance and produces previews of the outputs that satisfy such requirements. In this way, the pipeline is executed much faster and the process of data preparation is shortened. We discuss the design choices of Kayak, including execution strategies, optimization techniques, scheduling of operations, and metadata management. With a set of preliminary experiments, we show that the approach is effective and scales well with the number of datasets in the data lake.
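The trade-off between accuracy and speed described in the abstract can be illustrated with a small sketch. This is not Kayak's API or algorithm: the function name `preview_null_fraction` and the `tolerance` knob are hypothetical, and uniform random sampling is used here only as one simple way to obtain a fast, approximate preview of a profiling statistic.

```python
import random


def preview_null_fraction(values, tolerance, seed=0):
    """Estimate the fraction of None entries in `values`.

    `tolerance` (hypothetical knob, in [0, 1]) expresses how much accuracy
    the user is willing to give up: 0 scans every row (exact answer),
    larger values scan a smaller random sample (faster, rougher answer).
    Returns (estimated_fraction, rows_scanned).
    """
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must be in [0, 1]")
    n = len(values)
    if n == 0:
        return 0.0, 0
    # Assumption: sample size shrinks linearly with tolerance, minimum 1 row.
    sample_size = max(1, round(n * (1.0 - tolerance)))
    if sample_size >= n:
        sample = values  # exact: use the whole column
    else:
        sample = random.Random(seed).sample(values, sample_size)
    nulls = sum(1 for v in sample if v is None)
    return nulls / len(sample), len(sample)


# Toy column in which every fourth value is missing.
column = [None if i % 4 == 0 else i for i in range(1000)]
exact, rows_exact = preview_null_fraction(column, tolerance=0.0)
approx, rows_approx = preview_null_fraction(column, tolerance=0.9)
```

With `tolerance=0.0` the whole column is scanned and the result is exact; with `tolerance=0.9` only about a tenth of the rows are read, so the answer arrives sooner at the cost of sampling error.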


Data lake, Data preparation, Big data, Schema-on-read



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Collective[i], New York City, USA
  2. Università Roma Tre, Rome, Italy
