Abstract
In a tabular database, patterns whose frequency exceeds a given threshold are called frequent patterns. They are central to numerous data processes, and various efficient algorithms have recently been designed for mining them. Unfortunately, very little is known about the real difficulty of this mining, which is closely related to the number of such patterns. Worst-case analysis always yields an exponential number of frequent patterns, but experiments show that the algorithms become efficient for reasonable frequency thresholds. To explain this behaviour, we perform a probabilistic analysis of the number of frequent patterns. We first introduce a general model of random databases that encompasses all the previous classical models. In this model, the rows of the database are seen as independent words generated by the same probabilistic source (i.e., a random process that emits symbols). Under natural conditions on the source, the average number of frequent patterns is studied for various frequency thresholds. Note that the source may be nonexplicit, since the conditions deal only with the words. We then exhibit a large class of sources, the class of dynamical sources, which is proven to satisfy our general conditions. This shows that our results hold in a quite general context of random databases.
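The quantity studied above can be illustrated with a minimal sketch, written under assumptions that are not taken from the chapter itself. It uses the simplest memoryless (Bernoulli) source, where every entry of the table is 1 independently with probability `p`. It treats a pattern as a non-empty set of columns and calls it frequent when at least `min_support` rows contain a 1 in all of those columns. The function names and parameters are hypothetical, and the brute-force enumeration is only meant for a small number of columns.

```python
import random
from itertools import combinations
from math import comb


def random_database(n_rows, n_cols, p, rng):
    """Rows are independent binary words; each symbol is 1 with probability p
    (assumed Bernoulli source, chosen only for illustration)."""
    return [[1 if rng.random() < p else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def count_frequent_patterns(db, min_support):
    """Brute-force count of non-empty column sets with support >= min_support."""
    n_cols = len(db[0])
    count = 0
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            # Support = number of rows that have a 1 in every chosen column.
            support = sum(all(row[c] for c in cols) for row in db)
            if support >= min_support:
                count += 1
    return count


def expected_frequent_patterns(n_rows, n_cols, p, min_support):
    """Exact mean for the assumed Bernoulli model: the support of a pattern
    with k columns follows a Binomial(n_rows, p**k) distribution."""
    total = 0.0
    for k in range(1, n_cols + 1):
        q = p ** k
        tail = sum(comb(n_rows, j) * q ** j * (1 - q) ** (n_rows - j)
                   for j in range(min_support, n_rows + 1))
        total += comb(n_cols, k) * tail
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    n_rows, n_cols, p, min_support = 50, 8, 0.5, 10
    trials = [count_frequent_patterns(random_database(n_rows, n_cols, p, rng),
                                      min_support)
              for _ in range(200)]
    print("empirical mean:", sum(trials) / len(trials))
    print("exact mean    :", expected_frequent_patterns(n_rows, n_cols, p, min_support))
```

Running the script compares an empirical average over simulated databases with the exact mean for this toy model. Varying `min_support` relative to `n_rows` gives a rough feel for how strongly the number of frequent patterns depends on the threshold.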
Copyright information
© 2010 Birkhäuser Boston
About this chapter
Cite this chapter
Lhote, L. (2010). Number of Frequent Patterns in Random Databases. In: Skiadas, C. (eds) Advances in Data Analysis. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4799-5_4
DOI: https://doi.org/10.1007/978-0-8176-4799-5_4
Publisher Name: Birkhäuser Boston
Print ISBN: 978-0-8176-4798-8
Online ISBN: 978-0-8176-4799-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)