
Mining border descriptions of emerging patterns from dataset pairs

Knowledge and Information Systems

Abstract

Mining changes, differences, or other comparative patterns from a pair of datasets is an interesting problem. This paper focuses on one type of comparative pattern, emerging patterns (EPs): patterns whose support increases from one dataset to the other by a large ratio. The number of EPs is sometimes huge. To give the mining results a good structure and to reduce their size, we use borders to describe large collections of EPs concisely and losslessly. Such a border consists of only the minimal (under set inclusion) and the maximal EPs in the collection. We also present an algorithm that efficiently computes the borders of some desired EPs by manipulating the input borders only. Our experience with many datasets from the UCI Repository and with recent cancer diagnosis datasets demonstrates that both the EP pattern type and our algorithm are useful for building accurate classifiers and for mining multifactor interactions, for example, minimal gene groups potentially responsible for the development of cancer.
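The two notions in the abstract, a support ratio between datasets and a border made of minimal and maximal patterns, can be illustrated with a small sketch. It is not the paper's algorithm: it enumerates every itemset by brute force, whereas the paper computes borders by manipulating input borders only. The toy datasets `d1` and `d2`, all function names, and the choice to illustrate with jumping emerging patterns (zero support in the first dataset, nonzero in the second) are assumptions made for this example. That special case is used because its collection is closed under "in-between" itemsets, so the border reconstructs it exactly.

```python
from itertools import combinations

def support(itemset, dataset):
    """Fraction of transactions in `dataset` containing every item of `itemset`."""
    return sum(itemset <= t for t in dataset) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Support ratio from d1 to d2; infinite if the itemset never occurs in d1."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1

def all_itemsets(items):
    """Every non-empty subset of `items` (exponential; for toy data only)."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)

def border(collection):
    """Minimal and maximal members of `collection` under set inclusion."""
    minimal = {x for x in collection if not any(y < x for y in collection)}
    maximal = {x for x in collection if not any(x < y for y in collection)}
    return minimal, maximal

def expand(minimal, maximal, items):
    """All itemsets lying between a minimal and a maximal member."""
    return {y for y in all_itemsets(items)
            if any(lo <= y for lo in minimal) and any(y <= hi for hi in maximal)}

# Hypothetical toy datasets: each transaction is a set of single-letter items.
d1 = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
d2 = [frozenset("abc"), frozenset("abd"), frozenset("cd")]
items = set().union(*d1, *d2)

# Jumping emerging patterns from d1 to d2: absent in d1, present in d2.
jeps = {x for x in all_itemsets(items)
        if support(x, d1) == 0 and support(x, d2) > 0}

minimal, maximal = border(jeps)

# Lossless: the border alone is enough to regenerate the whole collection.
assert expand(minimal, maximal, items) == jeps

print("growth rate of {a, b}:", growth_rate(frozenset("ab"), d1, d2))
print("minimal:", sorted("".join(sorted(x)) for x in minimal))
print("maximal:", sorted("".join(sorted(x)) for x in maximal))
```

On data this small the border is barely smaller than the collection it describes; the saving matters when long maximal patterns imply exponentially many itemsets between the two bounds.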



Author information


Corresponding author

Correspondence to Guozhu Dong.


About this article

Cite this article

Dong, G., Li, J. Mining border descriptions of emerging patterns from dataset pairs. Knowl Inf Syst 8, 178–202 (2005). https://doi.org/10.1007/s10115-004-0178-1


  • DOI: https://doi.org/10.1007/s10115-004-0178-1
