HSM: Heterogeneous Subspace Mining in High Dimensional Data

  • Emmanuel Müller
  • Ira Assent
  • Thomas Seidl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5566)

Abstract

Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the “curse of dimensionality” pose additional challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes.

In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogeneity of the attributes. We extend an indexing structure for continuous attributes such that HSM indexing adapts to different attribute types. In our experiments we show that HSM efficiently mines patterns in arbitrary subspaces of heterogeneous high dimensional data.
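The idea that a pattern can be invisible across all attributes yet clear in a subset can be illustrated with a small sketch. This is not the HSM pattern model or algorithm, which the abstract does not specify. It only assumes, for illustration, that two objects match on a categorical attribute if the values are equal and on a continuous attribute if they differ by at most a hypothetical threshold `eps`. The attribute names, the data, and the helper functions `in_neighborhood` and `support` are invented for this example.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]  # attribute name -> float (continuous) or str (categorical)


def in_neighborhood(a: Record, b: Record, subspace: Sequence[str], eps: float) -> bool:
    """Return True if b matches a on every attribute of the given subspace.

    Illustrative assumption, not taken from the paper: categorical (str)
    values must be equal, continuous values may differ by at most eps.
    """
    for attr in subspace:
        x, y = a[attr], b[attr]
        if isinstance(x, str) or isinstance(y, str):
            if x != y:
                return False
        elif abs(x - y) > eps:
            return False
    return True


def support(data: List[Record], center: Record, subspace: Sequence[str], eps: float) -> int:
    """Count the objects that fall into the mixed-type neighborhood of center."""
    return sum(in_neighborhood(center, r, subspace, eps) for r in data)


# Toy heterogeneous data: two continuous and two categorical attributes.
data = [
    {"age": 0.30, "income": 0.10, "city": "A", "job": "x"},
    {"age": 0.32, "income": 0.90, "city": "A", "job": "y"},
    {"age": 0.29, "income": 0.50, "city": "A", "job": "z"},
    {"age": 0.80, "income": 0.12, "city": "B", "job": "x"},
]

full_space = ["age", "income", "city", "job"]
subspace = ["age", "city"]

support_full = support(data, data[0], full_space, eps=0.05)
support_sub = support(data, data[0], subspace, eps=0.05)
print(support_full, support_sub)
```

In the full space the first object matches only itself, while in the subspace of `age` and `city` three objects agree, which is the kind of locally relevant attribute subset the abstract refers to.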

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Emmanuel Müller (1)
  • Ira Assent (2)
  • Thomas Seidl (1)

  1. Data Management and Exploration Group, RWTH Aachen University, Germany
  2. Department of Computer Science, Aalborg University, Denmark
