Skip to main content

Part of the book series: Texts in Computer Science ((TCS))

  • 8680 Accesses

Abstract

This chapter introduces a variety of methods that are useful to get an overview of the data, which includes a summary of the whole database as well as the identification of areas that exceptionally deviate from the remainder. They provide answers to questions such as: Does it naturally subdivide into groups? How do attributes depend on each other? Are there certain conditions leading to exceptions from the average behaviour? The chapter provides an overview of clustering methods (hierarchical clustering, k-Means, density-based clustering), association analysis, self-organizing maps and deviation analysis. The definition and choice of distance or similarity measures, which is required by almost every technique to compare different cases in the database, is also tackled.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 49.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 64.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    Isotropic means that the density is symmetric around the mean, that is, the cluster shapes are roughly hyperspheres. Using the Mahalanobis distance instead would allow for ellipsoidal shape, but we restrict ourselves to the isotropic case here.

  2. 2.

    The node are usually called neurons, since self-organizing maps are a special form of neural network.

  3. 3.

    As n is relatively large, we approximate the binomial distribution by a normal distribution. We use the z-test for reasons of simplicity—typically p 0 is not known but has to be estimated from the sample, and then Student’s t-test is used.

  4. 4.

    We will discuss the use of other types of distance metrics in KNIME later.

References

  1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: Proc. 1999 ACM SIGMOD Int. Conf. on Management of Data, pp. 61–72. ACM Press, New York (1999)

    Chapter  Google Scholar 

  2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on very Large Databases (VLDB 1994, Santiago de Chile), pp. 487–499. Morgan Kaufmann, San Mateo (1994)

    Google Scholar 

  3. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI Press/MIT Press, Cambridge (1996)

    Google Scholar 

  4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11, 5–33 (2005)

    Article  MathSciNet  Google Scholar 

  5. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: ordering points to identify the clustering structure. In: ICMD, pp. 49–60, Philadelphia (1999)

    Google Scholar 

  6. Atzmueller, M., Puppe, F.: Sd-map: a fast algorithm for exhaustive subgroup discovery. In: Proc. Int. Conf. Knowledge Discovery in Databases (PKDD). Lecture Notes in Computer Science, vol. 4213. Springer, Berlin (2006)

    Google Scholar 

  7. Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P., Kröger, P.: Subspace selection for clustering high-dimensional data. In: Proc. IEEE Int. Conf. on Data Mining, pp. 11–18. IEEE Press, Piscataway (2003)

    Google Scholar 

  8. Bayardo, R., Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Implementations (FIMI 2004, Brighton, UK), CEUR Workshop Proceedings 126, Aachen, Germany (2004). http://www.ceur-ws.org/Vol-126/

  9. Bellman, R.: Adaptive Control Processes. Princeton University Press, Princeton (1961)

    MATH  Google Scholar 

  10. Böttcher, M., Spott, M., Nauck, D.: Detecting temporally redundant association rules. In: Proc. 4th Int. Conf. on Machine Learning and Applications (ICMLA 2005, Los Angeles, CA), pp. 397–403. IEEE Press, Piscataway (2005)

    Google Scholar 

  11. Böttcher, M., Spott, M., Nauck, D.: A framework for discovering and analyzing changing customer segments. In: Advances in Data Mining—Theoretical Aspects and Applications. Lecture Notes in Computer Science, vol. 4597, pp. 255–268. Springer, Berlin (2007)

    Chapter  Google Scholar 

  12. Borgelt, C., Berthold, M.R.: Mining molecular fragments: finding relevant substructures of molecules. In: Proc. IEEE Int. Conf. on Data Mining (ICDM 2002, Maebashi, Japan), pp. 51–58. IEEE Press, Piscataway (2002)

    Google Scholar 

  13. Borgelt, C.: On canonical forms for frequent graph mining. In: Proc. 3rd Int. Workshop on Mining Graphs, Trees and Sequences (MGTS’05, Porto, Portugal), pp. 1–12. ECML/PKDD 2005 Organization Committee, Porto (2005)

    Google Scholar 

  14. Borgelt, C., Wang, X.: SaM: a split and merge algorithm for fuzzy frequent item set mining (to appear)

    Google Scholar 

  15. Branko, K., Lavrac, N.: Apriori-sd: adapting association rule learning to subgroup discovery. Appl. Artif. Intell. 20(7), 543–583 (2006)

    Article  Google Scholar 

  16. Cheng, Y., Fayyad, U., Bradley, P.S.: Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proc. 7th Int. Conf. on Knowledge Discovery and Data Mining (KDD’01, San Francisco, CA), pp. 194–203. ACM Press, New York (2001)

    Google Scholar 

  17. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Trans. Intell. Syst. 15(2), 32–41 (2000)

    Article  Google Scholar 

  18. Davé, R.N.: Characterization and detection of noise in clustering. Pattern Recognit. Lett. 12, 657–664 (1991)

    Article  Google Scholar 

  19. Ding, C., He, X.: Cluster merging and splitting in hierarchical clustering algorithms. In: Proc. IEEE Int. Conference on Data Mining, p. 139. IEEE Press, Piscataway (2002)

    Google Scholar 

  20. Dunn, J.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4, 95–104 (1974)

    Article  MathSciNet  Google Scholar 

  21. Ester, M., Kriegel, H.-P., Sander, J., Xiaowei, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD 96, Portland, Oregon), pp. 226–231. AAAI Press, Menlo Park (1996)

    Google Scholar 

  22. Everitt, B.S., Landau, S., Leese, M.: Cluster Analysis. Wiley, Chichester (2001)

    MATH  Google Scholar 

  23. Finn, P.W., Muggleton, S., Page, D., Srinivasan, A.: Pharmacore discovery using the inductive logic programming system PROGOL. Mach. Learn. 30(2–3), 241–270 (1998)

    Article  Google Scholar 

  24. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: methodology and application. J. Artif. Intell. Res. 17, 501–527 (2007)

    Google Scholar 

  25. Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA), CEUR Workshop Proceedings 90, Aachen, Germany (2003). http://www.ceur-ws.org/Vol-90/

  26. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)

    Article  Google Scholar 

  27. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2–3), 107–145 (2001)

    Article  MATH  Google Scholar 

  28. Han, J., Pei, H., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. Conf. on the Management of Data (SIGMOD’00, Dallas, TX), pp. 1–12. ACM Press, New York (2000)

    Google Scholar 

  29. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia satabases with noise. In: Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 224–228. AAAI Press, Menlo Park (1998)

    Google Scholar 

  30. Hinneburg, A., Keim, D.A.: Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proc. 25th Int. Conf. on Very Large Databases, pp. 506–517. Morgan Kaufmann, San Mateo (1999)

    Google Scholar 

  31. Höppner, F.: Speeding up Fuzzy C-means: using a hierarchical data organisation to control the precision of membership calculation. Fuzzy Sets Syst. 128(3), 365–378 (2002)

    Article  MATH  Google Scholar 

  32. Höppner, F., Klawonn, F.: A contribution to convergence theory of fuzzy C-means and derivatives. IEEE Trans. Fuzzy Syst. 11(5), 682–694 (2003)

    Article  Google Scholar 

  33. Höppner, F., Klawonn, F., Kruse, R., Runkler, T.A.: Fuzzy Cluster Analysis. Wiley, Chichester (1999)

    MATH  Google Scholar 

  34. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM 2003, Melbourne, FL), pp. 549–552. IEEE Press, Piscataway (2003)

    Google Scholar 

  35. Kaski, S., Oja, E., Oja, E.: Kohonen Maps. Elsevier, Amsterdam (1999)

    MATH  Google Scholar 

  36. Klösgen, W.: Efficient discovery of interesting statements in databases. J. Intell. Inf. Syst. 4, 53–69 (1995)

    Article  Google Scholar 

  37. Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge (1996). Chap. 10

    Google Scholar 

  38. Kohonen, T.: The self-organizing map. Proc. IEEE 78, 1464–1480 (1990)

    Article  Google Scholar 

  39. Kramer, S., de Raedt, L., Helma, C.: Molecular feature mining in HIV data. In: Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2001, San Francisco, CA), pp. 136–143. ACM Press, New York (2001)

    Google Scholar 

  40. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proc. 1st IEEE Int. Conf. on Data Mining (ICDM 2001, San Jose, CA), pp. 313–320. IEEE Press, Piscataway (2001)

    Chapter  Google Scholar 

  41. Leman, D., Feelders, A., Knobbe, A.: Exceptional model mining. In: Proc. Europ. Conf. Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 5212, pp. 1–16. Springer, Berlin (2008)

    Chapter  Google Scholar 

  42. Nijssen, S., Kok, J.N.: A quickstart in frequent structure mining can make a difference. In: Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD2004, Seattle, WA), pp. 647–652. ACM Press, New York (2004)

    Google Scholar 

  43. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl. 6(1), 90–105 (2004)

    Article  Google Scholar 

  44. Pei, J., Tung, A.K.H., Han, J.: Fault-tolerant frequent pattern mining: problems and challenges. In: Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMK’01, Santa Babara, CA). ACM Press, New York (2001)

    Google Scholar 

  45. Ritter, H., Martinez, T., Schulten, K.: Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading (1992)

    MATH  Google Scholar 

  46. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

    Article  MATH  Google Scholar 

  47. Scheffer, T., Wrobel, S.: Finding the most interesting patterns in a database quickly by using sequential sampling. J. Mach. Learn. Res. 3, 833–862 (2003)

    MATH  MathSciNet  Google Scholar 

  48. Scholz, M.: Sampling-based sequential subgroup mining. In: Proc. 11th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 265–274. AAAI Press, Menlo Park (2005)

    Google Scholar 

  49. Smyth, P., Goodman, R.M.: An information theoretic approach to rule induction from databases. IEEE Trans. Knowl. Discov. Eng. 4(4), 301–316 (1992)

    Article  Google Scholar 

  50. Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E., Willighagen, E.: The chemistry development kit (CDK): an open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 43(2), 493–500 (2003)

    Article  Google Scholar 

  51. Vesanto, J.: SOM-based data visualization methods. Intell. Data Anal. 3(2), 111–126 (1999)

    Article  MATH  Google Scholar 

  52. Wang, X., Borgelt, C., Kruse, R.: Mining fuzzy frequent item sets. In: Proc. 11th Int. Fuzzy Systems Association World Congress (IFSA’05, Beijing, China), pp. 528–533. Tsinghua University Press/Springer, Beijing/Heidelberg (2005)

    Google Scholar 

  53. Webb, G.I., Zhang, S.: k-Optimal-rule-discovery. Data Min. Knowl. Discov. 10(1), 39–79 (2005)

    Article  MathSciNet  Google Scholar 

  54. Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)

    Article  Google Scholar 

  55. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Proc. 1st Europ. Symp. on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science, vol. 1263, pp. 78–87. Springer, London (1997)

    Chapter  Google Scholar 

  56. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proc. 2nd IEEE Int. Conf. on Data Mining (ICDM 2003, Maebashi, Japan), pp. 721–724. IEEE Press, Piscataway (2002)

    Google Scholar 

  57. Yan, X., Han, J.: Close-graph: mining closed frequent graph patterns. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003, Washington, DC), pp. 286–295. ACM Press, New York (2003)

    Google Scholar 

  58. Xie, X.L., Beni, G.A.: Validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 3(8), 841–846 (1991)

    Article  Google Scholar 

  59. Zaki, M., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of association rules. In: Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD’97, Newport Beach, CA), pp. 283–296. AAAI Press, Menlo Park (1997)

    Google Scholar 

  60. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)

    Article  Google Scholar 

  61. Zhao, Y., Karypis, G., Fayyad, U.: Hierarchical clustering algorithms for document datasets. Data Min. Knowl. Discov. 10, 141–168 (2005)

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Michael R. Berthold .

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag London Limited

About this chapter

Cite this chapter

Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. (2010). Finding Patterns. In: Guide to Intelligent Data Analysis. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84882-260-3_7

Download citation

  • DOI: https://doi.org/10.1007/978-1-84882-260-3_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-259-7

  • Online ISBN: 978-1-84882-260-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics