Knowledge and Information Systems, Volume 39, Issue 2, pp 463–489

Similarity measures for OLAP sessions

  • Julien Aligon
  • Matteo Golfarelli
  • Patrick Marcel
  • Stefano Rizzi
  • Elisa Turricchia
Regular Paper

Abstract

OLAP queries are not normally formulated in isolation, but in the form of sequences called OLAP sessions. Recognizing that two OLAP sessions are similar would be useful for several applications, such as query recommendation and personalization; however, the problem of measuring OLAP session similarity has not been studied so far. In this paper, we aim to fill this gap. First, we propose a set of similarity criteria derived from a user study conducted with a set of OLAP practitioners and researchers. Then, we propose a function for estimating the similarity between OLAP queries based on three components: the query group-by set, its selection predicate, and the measures required in output. To assess the similarity of OLAP sessions, we investigate the feasibility of extending four popular methods for measuring similarity, namely the Levenshtein distance, the Dice coefficient, the tf–idf weight, and the Smith–Waterman algorithm. Finally, we experimentally compare these four extensions and show that the Smith–Waterman extension is the one that best captures the users’ criteria for session similarity.
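
As a rough illustration of the two levels described above (query similarity, then session similarity), the following Python sketch may help. It is not the formulation used in the paper, which this abstract does not give. The `Query` class, the use of Jaccard overlap for each of the three components, the unweighted average, and the `threshold` and `gap` parameters are all assumptions made for the example. The session-level function is a simplified Smith–Waterman-style local alignment in which a pair of queries scores positively only when its similarity exceeds `threshold`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical OLAP query: three components, each modelled as a set."""
    group_by: frozenset   # levels in the group-by set
    predicate: frozenset  # selection conditions, simplified to atoms
    measures: frozenset   # measures returned by the query


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def query_similarity(q1, q2):
    """Unweighted mean of the three component similarities (an assumption)."""
    return (jaccard(q1.group_by, q2.group_by)
            + jaccard(q1.predicate, q2.predicate)
            + jaccard(q1.measures, q2.measures)) / 3


def session_similarity(s1, s2, threshold=0.5, gap=0.25):
    """Best local alignment score between two sequences of queries.

    Aligning two queries contributes (similarity - threshold), so only
    sufficiently similar pairs raise the score; skipping a query in
    either session costs `gap`. Scores never drop below zero, which is
    what makes the alignment local.
    """
    best = 0.0
    prev = [0.0] * (len(s2) + 1)
    for q1 in s1:
        curr = [0.0]
        for j, q2 in enumerate(s2, start=1):
            score = max(
                0.0,
                prev[j - 1] + query_similarity(q1, q2) - threshold,
                prev[j] - gap,       # skip a query of s1
                curr[j - 1] - gap,   # skip a query of s2
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Toy data: the names below are illustrative only.
q_a = Query(frozenset({"year", "state"}), frozenset({"year=2008"}), frozenset({"income"}))
q_b = Query(frozenset({"city"}), frozenset({"sex=F"}), frozenset({"population"}))
q_c = Query(frozenset({"year"}), frozenset({"year=2008"}), frozenset({"income"}))

same = session_similarity([q_a, q_c], [q_a, q_c])
different = session_similarity([q_a], [q_b])
```

In this sketch the score is unnormalized and grows with the length of the matched subsequence; dividing by the best score attainable for the shorter session would be one way to map it to [0, 1], again as an assumption rather than the paper's definition.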

Keywords

  • OLAP
  • Similarity measures
  • Query comparison
  • Sequence comparison

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Julien Aligon (1)
  • Matteo Golfarelli (2)
  • Patrick Marcel (1)
  • Stefano Rizzi (2)
  • Elisa Turricchia (2)

  1. Laboratoire d’Informatique, Université François Rabelais, Tours, France
  2. DISI, University of Bologna, Bologna, Italy