Similarity measures for OLAP sessions

  • Regular Paper
  • Knowledge and Information Systems

Abstract

OLAP queries are not normally formulated in isolation, but in the form of sequences called OLAP sessions. Recognizing that two OLAP sessions are similar would be useful for different applications, such as query recommendation and personalization; however, the problem of measuring OLAP session similarity has not been studied so far. In this paper, we aim at filling this gap. First, we propose a set of similarity criteria derived from a user study conducted with a set of OLAP practitioners and researchers. Then, we propose a function for estimating the similarity between OLAP queries based on three components: the query group-by set, its selection predicate, and the measures required in output. To assess the similarity of OLAP sessions, we investigate the feasibility of extending four popular methods for measuring similarity, namely the Levenshtein distance, the Dice coefficient, the tf–idf weight, and the Smith–Waterman algorithm. Finally, we experimentally compare these four extensions to show that the Smith–Waterman extension is the one that best captures the users’ criteria for session similarity.
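
The pipeline outlined in the abstract (a three-component query similarity plugged into a sequence comparison method) can be sketched as follows. This is only a rough illustration under stated assumptions, not the paper's formulation: the `Query` class, the use of Jaccard overlap for each component, the equal weights, and the `threshold` and `gap` parameters are all hypothetical stand-ins, since the paper's actual definitions are not part of this preview. The alignment itself follows the standard Smith–Waterman recurrence, with the query similarity shifted by a threshold so that dissimilar pairs lower the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical OLAP query: the three components named in the abstract."""
    group_by: frozenset
    selection: frozenset
    measures: frozenset


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for the paper's component similarities."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def query_similarity(q1, q2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three component similarities (weights are assumed)."""
    w_g, w_s, w_m = weights
    return (w_g * jaccard(q1.group_by, q2.group_by)
            + w_s * jaccard(q1.selection, q2.selection)
            + w_m * jaccard(q1.measures, q2.measures))


def local_alignment_score(s1, s2, sim=query_similarity, threshold=0.5, gap=0.25):
    """Best Smith-Waterman local alignment score between two sessions.

    A pair of queries contributes sim - threshold, so pairs below the
    threshold are penalized; skipping a query costs `gap`.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = H[i - 1][j - 1] + sim(s1[i - 1], s2[j - 1]) - threshold
            H[i][j] = max(0.0, match, H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best


# Illustrative sessions over made-up attribute names.
q_a = Query(frozenset({"year", "region"}), frozenset({"year=2005"}), frozenset({"income"}))
q_b = Query(frozenset({"year", "city"}), frozenset({"year=2005"}), frozenset({"income"}))
q_c = Query(frozenset({"occupation"}), frozenset({"sex=F"}), frozenset({"cost"}))

score = local_alignment_score([q_a, q_b], [q_c, q_a, q_b])
print(score)
```

Because the alignment is local, the unrelated leading query `q_c` in the second session does not reduce the score of the shared fragment `[q_a, q_b]`; a global measure such as the Levenshtein distance would charge for it.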

Notes

  1. Available at http://www.julien.aligon.fr/recherche/similarityform.aspx.

  2. http://cs.ulb.ac.be/conferences/ebiss2011/.

  3. Note that, while substrings are consecutive parts of a string, subsequences need not be.

  4. While this enables a simpler formalization for group-by sets (see Definition 4.2), it does not significantly affect the overall approach. Indeed, partially ordered hierarchies could easily be handled by extending Definition 5.4 to measure the distance between two group-by sets on the multidimensional lattice, as suggested by Golfarelli [16].

  5. In a relational implementation, a multidimensional schema can be translated into either a star or a snowflake schema. While the specific joins required to formulate the same query differ between these two cases, users are unaware of this difference because OLAP tools hide the underlying SQL and logical schemata, letting users reason about the multidimensional cube abstraction.

  6. In the formula, the three rows of the \(\min\) argument deal with deletions, insertions, and substitutions, respectively.
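
Notes 3 and 6 can be illustrated with a short sketch. The formula that note 6 refers to is not reproduced in this preview, so the code below uses the textbook Levenshtein recurrence with unit costs (an assumption), listing the three arguments of `min` in the order the note gives: deletion, insertion, substitution. The two helper functions, whose names are hypothetical, show the distinction drawn in note 3 between consecutive substrings and non-consecutive subsequences.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, assuming unit edit costs."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of x
                curr[j - 1] + 1,          # insertion of y
                prev[j - 1] + (x != y),   # substitution (free when x == y)
            ))
        prev = curr
    return prev[-1]


def is_substring(sub, seq):
    """True if `sub` occurs in `seq` as a run of consecutive elements."""
    n = len(sub)
    return any(list(seq[i:i + n]) == list(sub) for i in range(len(seq) - n + 1))


def is_subsequence(sub, seq):
    """True if `sub` occurs in `seq` in order, possibly with gaps."""
    it = iter(seq)
    return all(any(x == y for y in it) for x in sub)
```

Both helpers and `levenshtein` accept any sequence, so a session can be passed as a list of queries just as well as a string of characters.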

References

  1. Abiteboul S, Hull R, Vianu V (1995) Foundations of databases. Addison-Wesley, Reading

  2. Agrawal R, Rantzau R, Terzi E (2006) Context-sensitive ranking. In: Proceedings ACM SIGMOD international conference on management of data. Chicago, IL, pp 383–394

  3. Akbarnejad J, Chatzopoulou G, Eirinaki M, Koshy S, Mittal S, On D, Polyzotis N, Varman JSV (2010) SQL QueRIE recommendations. PVLDB 3(2):1597–1600

  4. Aligon J, Golfarelli M, Marcel P, Rizzi S, Turricchia E (2011) Mining preferences from OLAP query logs for proactive personalization. In: Proceedings ADBIS. Vienna, Austria, pp 84–97

  5. Aouiche K, Jouve P-E, Darmont J (2006) Clustering-based materialized view selection in data warehouses. In: Proceedings ADBIS. Thessaloniki, Greece, pp 81–95

  6. Baikousi E, Rogkakos G, Vassiliadis P (2011) Similarity measures for multidimensional data. In: Proceedings ICDE. Hannover, Germany, pp 171–182

  7. Brown PF, Pietra VJD, de Souza PV, Lai JC, Mercer RL (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479

  8. Bustos B, Skopal T (2011) Non-metric similarity search problems in very large collections. In: Proceedings ICDE. Hannover, Germany, pp 1362–1365

  9. Chatzopoulou G, Eirinaki M, Koshy S, Mittal S, Polyzotis N, Varman JSV (2011) The QueRIE system for personalized query recommendations. IEEE Data Eng Bull 34(2):55–60

  10. Chatzopoulou G, Eirinaki M, Polyzotis N (2009) Query recommendations for interactive database exploration. In: Proceedings SSDBM. New Orleans, LA, pp 3–18

  11. Cohen WW, Ravikumar PD, Fienberg SE (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings IJCAI-03 workshop on information integration on the web. Acapulco, Mexico, pp 73–78

  12. Drosou M, Pitoura E (2011) ReDRIVE: result-driven database exploration through recommendations. In: Proceedings CIKM. Glasgow, UK, pp 1547–1552

  13. Garcia-Molina H, Ullman JD, Widom JD (2008) Database systems: the complete book, 2nd edn. Prentice Hall, Englewood Cliffs

  14. Ghosh A, Parikh J, Sengar VS, Haritsa JR (2002) Plan selection based on query clustering. In: Proceedings VLDB. Hong Kong, China, pp 179–190

  15. Giacometti A, Marcel P, Negre E (2009) Recommending multidimensional queries. In: Proceedings DaWaK. Linz, Austria, pp 453–466

  16. Golfarelli M (2003) Handling large workloads by profiling and clustering. In: Proceedings DaWaK. Prague, Czech Republic, pp 212–223

  17. Golfarelli M, Rizzi S, Biondi P (2011) myOLAP: an approach to express and evaluate OLAP preferences. IEEE TKDE 23(7):1050–1064

  18. Grossman D, Frieder O (2004) Information retrieval: algorithms and heuristics. Springer, Berlin

  19. Gupta A, Mumick I (1999) Materialized views: techniques, implementations, and applications. MIT Press, Cambridge

  20. Khoussainova N, Kwon Y, Balazinska M, Suciu D (2010) SnipSuggest: context-aware autocompletion for SQL. PVLDB 4(1):22–33

  21. Khoussainova N, Kwon Y, Liao W-T, Balazinska M, Gatterbauer W, Suciu D (2011) Session-based browsing for more effective query reuse. In: Proceedings SSDBM. Portland, OR, pp 583–585

  22. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26(5):589–595

  23. Minnesota Population Center (2008) Integrated public use microdata series. http://www.ipums.org

  24. Monge AE, Elkan C (1997) An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings workshop on research issues on data mining and knowledge discovery

  25. Moreau E, Yvon F, Cappé O (2008) Robust similarity measures for named entities matching. In: Proceedings international conference on computational linguistics. Manchester, UK, pp 593–600

  26. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88

  27. Ögüdücü SG (2010) Web page recommendation models: theory and algorithms. In: Synthesis lectures on data management. Morgan & Claypool Publishers

  28. Ristad ES, Yianilos PN (1998) Learning string-edit distance. IEEE Trans Pattern Anal Mach Intell 20(5):522–532

  29. Sapia C (2000) PROMISE: predicting query behavior to enable predictive caching strategies for OLAP systems. In: Proceedings DaWaK. London, UK, pp 224–233

  30. Smith T, Waterman M (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

  31. Stefanidis K, Drosou M, Pitoura E (2009) “You May Also Like” results in relational databases. In: Proceedings international workshop on personalized access, profile management and context awareness: databases. Lyon, France

  32. Wagner R, Fischer M (1974) The string-to-string correction problem. J ACM 21(1):168–173

  33. Yang X, Procopiuc CM, Srivastava D (2009) Recommending join queries via query log analysis. In: Proceedings ICDE. Shanghai, China, pp 964–975

  34. Yao Q, An A, Huang X (2005) Finding and analyzing database user sessions. In: Proceedings DASFAA. Beijing, China, pp 851–862

Author information

Corresponding author

Correspondence to Stefano Rizzi.

About this article

Cite this article

Aligon, J., Golfarelli, M., Marcel, P. et al. Similarity measures for OLAP sessions. Knowl Inf Syst 39, 463–489 (2014). https://doi.org/10.1007/s10115-013-0614-1

  • DOI: https://doi.org/10.1007/s10115-013-0614-1
