Abstract
Mining changes, differences, or other comparative patterns from a pair of datasets is an interesting problem. This paper focuses on mining one type of comparative pattern, called emerging patterns (EPs): patterns whose support increases from one dataset to the other by a large ratio. The number of EPs is sometimes huge. To give the mining results a good structure and to reduce their size, we use borders to describe large collections of EPs concisely and losslessly. Such a border consists of only the minimal (under set inclusion) and the maximal EPs in the collection. We also present an algorithm that efficiently computes the borders of some desired EPs by manipulating only the input borders. Our experience with many datasets from the UCI Repository and with recent cancer diagnosis datasets demonstrates that both the EP pattern type and our algorithm are useful for building accurate classifiers and for mining multifactor interactions, for example, minimal gene groups potentially responsible for the development of cancer.
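To make the notions of support ratio and border concrete, the following is a minimal illustrative sketch, not the border-manipulating algorithm the paper presents. It enumerates every itemset by brute force, which is feasible only for toy data. The function names, the threshold name `rho`, the toy datasets, and the zero-support convention (ratio 0 when both supports are 0, infinity when only the first is 0) are assumptions made for this illustration.

```python
from itertools import combinations
from math import inf


def support(itemset, dataset):
    """Fraction of transactions in `dataset` containing every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def growth_rate(itemset, d1, d2):
    """Ratio of the support in d2 to the support in d1.

    Assumed convention: 0 if both supports are 0, infinity if only
    the support in d1 is 0.
    """
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else inf
    return s2 / s1


def emerging_patterns(d1, d2, rho):
    """Brute-force list of all itemsets whose growth rate is at least rho."""
    items = sorted(set().union(*d1, *d2))
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if growth_rate(x, d1, d2) >= rho:
                found.append(x)
    return found


def border(patterns):
    """Return (minimal, maximal) members of `patterns` under set inclusion."""
    pats = set(patterns)
    left = {p for p in pats if not any(q < p for q in pats)}
    right = {p for p in pats if not any(p < q for q in pats)}
    return left, right


# Hypothetical toy datasets; each transaction is a frozenset of items.
d1 = [frozenset("ab"), frozenset("a"), frozenset("bc"), frozenset("c")]
d2 = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]

eps = emerging_patterns(d1, d2, rho=2)
left, right = border(eps)
print("EPs:", sorted("".join(sorted(p)) for p in eps))
print("minimal:", sorted("".join(sorted(p)) for p in left))
print("maximal:", sorted("".join(sorted(p)) for p in right))
```

On these toy datasets with `rho = 2`, the EPs are {a, b}, {a, c}, and {a, b, c}, so the border keeps {a, b} and {a, c} as the minimal patterns and {a, b, c} as the maximal one. A pair of bounds like this describes exactly the itemsets lying between some minimal and some maximal element, so it recovers the collection without loss only when the collection contains every such intermediate set, as it does here.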
Cite this article
Dong, G., Li, J. Mining border descriptions of emerging patterns from dataset pairs. Knowl Inf Syst 8, 178–202 (2005). https://doi.org/10.1007/s10115-004-0178-1