Similarity Aware Shuffling for the Distributed Execution of SQL Window Functions

  • Fábio CoelhoEmail author
  • Miguel Matos
  • José Pereira
  • Rui Oliveira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10320)


Window functions are extremely useful and have become increasingly popular, allowing ranking, cumulative sums and other analytic aggregations to be computed over a highly flexible and configurable sliding window. This powerful expressiveness comes naturally at the expense of heavy computational requirements which, so far, have been addressed through optimizations around centralized approaches by works both from the industry and academia. Distribution and parallelization has the potential to improve performance, but introduces several challenges associated with data distribution that may harm data locality. In this paper, we show how data similarity can be employed across partitions during the distributed execution of these operators to improve data co-locality between instances of a Distributed Query Engine and the associated data storage nodes. Our contribution can attain network gains in the average of 3 times and it is expected to scale as the number of instances increase. In the scenario with 8 nodes, we were to able attain bandwidth and time savings of 7.3 times and 2.61 times respectively.


Hash Function Window Function Computing Node Storage Node Query Execution 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The research leading to these results was part-funded by (1) the European Union’s Horizon 2020 - The EU Framework Programme for Research and Innovation 2014–2020, under grant agreement No. 732051; (1) Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020 is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF) and by (1) the ERDF – European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.


  1. 1.
    ANSI: Information technology - database languages - SQL multimedia and application packages. Technical report, ANSI (2003).
  2. 2.
    Apache: The apache derby project. Technical report, Apache Foundation (2016).
  3. 3.
    Brown, P.G., Hass, P.J.: BHUNT: automatic discovery of fuzzy algebraic constraints in relational data. In: Proceedings of the 29th International Conference on Very Large Data Bases, vol. 29, pp. 668–679. VLDB Endowment (2003)Google Scholar
  4. 4.
    Cao, Y., Chan, C.Y., Li, J., Tan, K.L.: Optimization of analytic window functions. Proc. VLDB Endowment 5(11), 1244–1255 (2012)CrossRefGoogle Scholar
  5. 5.
    Coelho, F., Pereira, J., Vilaça, R., Oliveira, R.: Holistic shuffler for the parallel processing of SQL window functions. In: Jelasity, M., Kalyvianaki, E. (eds.) DAIS 2016. LNCS, vol. 9687, pp. 75–81. Springer, Cham (2016). doi: 10.1007/978-3-319-39577-7_6 Google Scholar
  6. 6.
    Microsoft Corporation: Transact-SQL. Technical report, Microsoft Corporation (2013).
  7. 7.
    Oracle Corporation: SQL analysis and reporting. Technical report, Oracle Corporation (2015).
  8. 8.
    Cruz, F., Maia, F., Oliveira, R., Vilaça, R.: Workload-aware table splitting for NoSQL. In: Proceedings of the 29th Annual ACM Symposium on Applied Computing, SAC 2014, pp. 399–404. ACM, New York (2014).
  9. 9.
    Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. (TODS) 33(2), 6 (2008)CrossRefGoogle Scholar
  10. 10.
    Garofalakis, M., Keren, D., Samoladas, V.: Sketch-based geometric monitoring of distributed stream queries. Proc. VLDB Endowment 6(10), 937–948 (2013).
  11. 11.
    George, L.: HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O’Reilly Media, Inc., USA (2011)Google Scholar
  12. 12.
    Gonçalves, R.C., Pereira, J., Jiménez-Peris, R.: An RDMA middleware for asynchronous multi-stage shuffling in analytical processing. In: Jelasity, M., Kalyvianaki, E. (eds.) DAIS 2016. LNCS, vol. 9687, pp. 61–74. Springer, Cham (2016). doi: 10.1007/978-3-319-39577-7_5 Google Scholar
  13. 13.
    Hadoop Apache: Hadoop (2009)Google Scholar
  14. 14.
  15. 15.
    Ilyas, I.F., Markl, V., Haas, P., Brown, P., Aboulnaga, A.: CORDS: automatic discovery of correlations and soft functional dependencies. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 647–658. ACM (2004)Google Scholar
  16. 16.
    Jimenez-Peris, R., Patiño-Martinez, M., Magoutis, K., Bilas, A., Brondino, I.: Cumulonimbo: a highly-scalable transaction processing platform as a service. ERCIM News 89(null), 34–35 (2012)Google Scholar
  17. 17.
    Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Ching, C., Choi, A., Erickson, J., Grund, M., Hecht, D., Jacobs, M., et al.: Impala: a modern, open-source SQL engine for hadoop. In: CIDR, vol. 1, p. 9 (2015)Google Scholar
  18. 18.
    Leis, V., Kundhikanjana, K., Kemper, A., Neumann, T.: Efficient processing of window functions in analytical SQL queries. Proc. VLDB Endowment 8(10), 1058–1069 (2015).
  19. 19.
    Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endowment 9(12), 1005–1016 (2016).
  20. 20.
    Postgresql: Advanced features - window functions. Technical report, Postgresql (2015).
  21. 21.
  22. 22.
    Transaction Processing Performance Council: TPC Benchmark DS (2012).
  23. 23.
    Zuzarte, C., Pirahesh, H., Ma, W., Cheng, Q., Liu, L., Wong, K.: Winmagic: subquery elimination using window aggregation. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 652–656. ACM (2003)Google Scholar

Copyright information

© IFIP International Federation for Information Processing 2017

Authors and Affiliations

  • Fábio Coelho
    • 1
    Email author
  • Miguel Matos
    • 2
  • José Pereira
    • 1
  • Rui Oliveira
    • 1
  1. 1.INESC TECUniversidade do MinhoBragaPortugal
  2. 2.INESC-ID/ISTLisboaPortugal

Personalised recommendations