“Here growes the wine Pucinum, now called Prosecho, much celebrated by Pliny.”
–Fynes Moryson, An Itinerary, 1617
Abstract
We present ProSecCo, an algorithm for the progressive mining of frequent sequences from large transactional datasets: it processes the dataset in blocks and, after analyzing each block, outputs a high-quality approximation of the collection of frequent sequences. ProSecCo can be used for interactive data exploration, as the intermediate results enable the user to make informed decisions as the computation proceeds. These intermediate results have strong probabilistic approximation guarantees, and the final output is the exact collection of frequent sequences. Our correctness analysis uses the Vapnik–Chervonenkis (VC) dimension, a key concept from statistical learning theory. Our experimental evaluation of ProSecCo on real and artificial datasets shows that it produces high-quality results almost immediately and that these results converge quickly. Its practical performance is even better than what the theoretical analysis guarantees, and ProSecCo can even be faster than existing state-of-the-art non-progressive algorithms. Additionally, our experimental results show that ProSecCo uses a constant amount of memory, orders of magnitude less than other standard, non-progressive sequential pattern mining algorithms.
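To make the block-wise workflow concrete, the following is a minimal sketch of progressive frequent sequence mining, not the ProSecCo implementation. It assumes that transactions are plain lists of single items, enumerates subsequences by brute force up to a hypothetical `max_len`, and takes the slack used for intermediate results from a caller-supplied placeholder function `slack`; the VC-dimension-based bound used in the analysis is not reproduced here. The sketch keeps a count for every subsequence it encounters, so it makes no attempt to match the memory behavior or the probabilistic guarantees described above. All function and parameter names are illustrative.

```python
from collections import Counter
from itertools import combinations


def subsequences(transaction, max_len):
    """Return the distinct non-empty subsequences of length <= max_len."""
    found = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(transaction)), k):
            found.add(tuple(transaction[i] for i in idx))
    return found


def progressive_frequent_sequences(dataset, block_size, theta, slack, max_len=3):
    """Yield (transactions_seen, {sequence: frequency}) after each block.

    `slack(seen, total)` is a hypothetical placeholder for the error allowed
    in an intermediate result; it is NOT the bound derived in the paper.
    """
    counts = Counter()
    seen = 0
    total = len(dataset)
    for start in range(0, total, block_size):
        block = dataset[start:start + block_size]
        seen += len(block)
        for transaction in block:
            # Each transaction supports a given subsequence at most once.
            counts.update(subsequences(transaction, max_len))
        # Intermediate results use a lowered threshold; the last one is exact.
        eps = 0.0 if seen == total else slack(seen, total)
        threshold = theta - eps / 2
        result = {
            seq: count / seen
            for seq, count in counts.items()
            if count / seen >= threshold
        }
        yield seen, result


# Toy dataset of four transactions, processed in blocks of two.
data = [["a", "b", "c"], ["a", "c"], ["a", "b"], ["b", "c"]]
for seen, result in progressive_frequent_sequences(
    data, block_size=2, theta=0.6, slack=lambda seen, total: 0.3, max_len=2
):
    print(seen, sorted(result.items()))
```

On this toy input the intermediate result after the first block is a superset of the final one, because the lowered threshold admits sequences that later turn out to be infrequent. That is the kind of early, slightly over-inclusive answer a progressive algorithm lets the user act on before the computation ends.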
Notes
The last block may have fewer than \(b\) transactions.
Some additional care is needed when handling the initial block. See Sect. 4.4.
The last block may contain fewer than \(b\) transactions. For ease of presentation, we assume that all blocks have size \(b\).
I.e., the \(i\)th intermediate result.
References
Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The AQUA approximate query answering system. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, ACM, New York, SIGMOD’99, pp 574–576. https://doi.org/10.1145/304182.304581
Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European conference on computer systems, ACM, pp 29–42
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the eleventh international conference on data engineering, IEEE, ICDE’95, pp 3–14
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22:207–216. https://doi.org/10.1145/170036.170072
Ayres J, Flannick J, Gehrke J, Yiu T (2002) Sequential PAttern mining using a bitmap representation. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, KDD’02. https://doi.org/10.1145/775047.775109
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In: NSDI, pp 313–328
Crotty A, Galakatos A, Zgraggen E, Binnig C, Kraska T (2015) Vizdom: interactive analytics through pen and touch. Proc VLDB Endow 8(12):2024–2027
Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Mining Knowl Discov 29(3):732–764. https://doi.org/10.1007/s10618-014-0362-1
Fournier-Viger P, Lin C, Gomariz A, Gueniche T, Soltani A, Deng Z, Lam HT (2016) The SPMF open-source data mining library version 2. In: Proceedings of 19th European conference on machine learning and principles and practice of knowledge discovery and data mining (Part III), ECML PKDD’16. http://www.philippe-fournier-viger.com/spmf/
Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data. ACM, New York, SIGMOD’97, pp 171–182. https://doi.org/10.1145/253260.253291
Hellerstein JM, Avnur R, Chou A, Hidber C, Olston C, Raman V, Roth T, Haas PJ (1999) Interactive data analysis: the control project. Computer 32(8):51–59
Jermaine C, Arumugam S, Pol A, Dobra A (2008) Scalable approximate query processing with the DBO engine. ACM Trans Database Syst 33:23:1–23:54. https://doi.org/10.1145/1412331.1412335
Kamat N, Jayachandran P, Tunga K, Nandi A (2014) Distributed and interactive cube exploration. In: 30th IEEE international conference on data engineering, IEEE, ICDE’14, pp 472–483
Klösgen W (1992) Problems for knowledge discovery in databases and their treatment in the statistics interpreter explora. Int J Intell Syst 7:649–673
Li Y, Long PM, Srinivasan A (2001) Improved bounds on the sample complexity of learning. J Comput Syst Sci 62(3):516–527
Liu Z, Heer J (2014) The effects of interactive latency on exploratory visual analysis. IEEE Trans Vis Comput Graph 20(12):2122–2131
Mendes LF, Ding B, Han J (2008) Stream sequential pattern mining with precise error bounds. In: Eighth IEEE international conference on data mining, IEEE, ICDM’08, pp 941–946
Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Cambridge
Olken F (1993) Random sampling from databases. Ph.D. thesis, University of California, Berkeley
Pansare N, Borkar VR, Jermaine C, Condie T (2011) Online aggregation for large MapReduce jobs. Proc VLDB Endow 4(11):1135–1145
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440
Pollard D (1984) Convergence of Stochastic Processes. Springer, Berlin
Raïssi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Seventh IEEE international conference on data mining, IEEE, ICDM’07, pp 631–636
Riondato M, Upfal E (2014) Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans Knowl Discov Data 8(4):20. https://doi.org/10.1145/2629586
Riondato M, Upfal E (2015) Mining frequent itemsets through progressive sampling with Rademacher averages. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, ACM, KDD’15, pp 1005–1014
Riondato M, Vandin F (2014) Finding the true frequent itemsets. In: Zaki MJ, Obradovic Z, Tan P, Banerjee A, Kamath C, Parthasarathy S (eds) Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, 2014, SIAM, pp 497–505. https://doi.org/10.1137/1.9781611973440.57
Riondato M, Vandin F (2018) MiSoSouP: mining interesting subgroups with sampling and pseudodimension. In: Proceedings of 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, KDD’18, pp 2130–2139
Servan-Schreiber S, Riondato M, Zgraggen E (2018) ProSecCo: progressive sequence mining with convergence guarantees. In: Proceedings of the 18th IEEE international conference on data mining, pp 417–426
Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, ICDT’96, pp 1–17
Toivonen H (1996) Sampling large databases for association rules. In: Proceedings of 22nd international conference very large data bases. Morgan Kaufmann Publishers Inc., San Francisco, VLDB’96, pp 134–145
Vapnik VN (1998) Statistical Learning Theory. Wiley, New York
Vapnik VN, Chervonenkis AJ (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob Appl 16(2):264–280. https://doi.org/10.1137/1116025
Wang J, Han J, Li C (2007) Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng 19(8):1042–1056
Zeng K, Agarwal S, Dave A, Armbrust M, Stoica I (2015) G-OLA: generalized on-line aggregation for interactive analysis on big data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, ACM, pp 913–918
Zeng K, Agarwal S, Stoica I (2016) IOLAP: managing uncertainty for efficient incremental OLAP. In: Proceedings of the 2016 international conference on management of data. ACM, SIGMOD’16, pp 1347–1361
Zgraggen E, Galakatos A, Crotty A, Fekete JD, Kraska T (2017) How progressive visualizations affect exploratory analysis. IEEE Trans Vis Comput Graph 23(8):1977–1987
Additional information
A preliminary version of this work appeared in the proceedings of IEEE ICDM’18 [28], where it was deemed the runner-up for the Best Student Paper Award.
Cite this article
Servan-Schreiber, S., Riondato, M. & Zgraggen, E. ProSecCo: progressive sequence mining with convergence guarantees. Knowl Inf Syst 62, 1313–1340 (2020). https://doi.org/10.1007/s10115-019-01393-8
DOI: https://doi.org/10.1007/s10115-019-01393-8