SubClass: Classification of Multidimensional Noisy Data Using Subspace Clusters

  • Ira Assent
  • Ralph Krieger
  • Petra Welter
  • Jörg Herbers
  • Thomas Seidl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5012)


Classification has been widely studied and successfully employed in various application domains. In multidimensional noisy settings, however, classification accuracy may be unsatisfactory. Locally irrelevant attributes often occlude class-relevant information. A global reduction to relevant attributes is often infeasible, as relevance of attributes is not necessarily a globally uniform property. In a current project with an airport scheduling software company, locally varying attributes in the data indicate whether flights will be on time, delayed or ahead of schedule. To detect locally relevant information, we propose combining classification with subspace clustering (SubClass). Subspace clustering aims at detecting clusters in arbitrary subspaces of the attributes. It has proved to work well in multidimensional and noisy domains. However, it does not utilize class label information and thus does not necessarily provide appropriate groupings for classification. We propose incorporating class label information into subspace search. As a result, we obtain locally relevant attribute combinations for classification. We present the SubClass classifier that successfully exploits classifying subspace cluster information. Experiments on both synthetic and real-world datasets demonstrate that classification accuracy is clearly improved for noisy multidimensional settings.
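The idea of using class labels to judge attribute combinations can be illustrated with a minimal sketch. This is not the SubClass algorithm; it is an illustration under stated assumptions: attribute values are scaled to [0, 1], clusters are approximated by cells of an equal-width grid, and the purity of a cell is measured by the Shannon entropy of its class labels. The function names `class_entropy` and `score_subspaces`, the parameters `max_dim`, `bins` and `min_support`, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def class_entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def score_subspaces(points, labels, max_dim=2, bins=2, min_support=2):
    """Score each attribute subset by the label entropy of its dense grid cells.

    Assumption: every attribute value lies in [0, 1]. The score of a subspace
    is the support-weighted mean entropy over all cells holding at least
    `min_support` points; a lower score means purer cells.
    """
    n_attrs = len(points[0])
    scores = {}
    for dim in range(1, max_dim + 1):
        for subspace in combinations(range(n_attrs), dim):
            # Assign each point to a grid cell using only the subspace attributes.
            cells = defaultdict(list)
            for point, label in zip(points, labels):
                key = tuple(min(int(point[a] * bins), bins - 1) for a in subspace)
                cells[key].append(label)
            # Keep only cells dense enough to count as a (crude) cluster.
            dense = [ls for ls in cells.values() if len(ls) >= min_support]
            if not dense:
                continue
            covered = sum(len(ls) for ls in dense)
            scores[subspace] = sum(len(ls) * class_entropy(ls) for ls in dense) / covered
    return scores


# Toy data (invented): attribute 0 separates the classes, attribute 1 is noise.
points = [(0.1, 0.2), (0.2, 0.9), (0.3, 0.6), (0.8, 0.1), (0.9, 0.7), (0.7, 0.4)]
labels = ["on_time", "on_time", "on_time", "delayed", "delayed", "delayed"]

scores = score_subspaces(points, labels)
for subspace, score in sorted(scores.items(), key=lambda item: item[1]):
    print(subspace, round(score, 3))
```

On this toy data the subspace containing only attribute 0 yields pure cells, while the subspace containing only attribute 1 mixes both classes in every cell. A real subspace clustering method would replace the fixed grid with a proper cluster definition and would prune the exponential search over attribute subsets, which this sketch enumerates exhaustively.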





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ira Assent (1)
  • Ralph Krieger (1)
  • Petra Welter (1)
  • Jörg Herbers (2)
  • Thomas Seidl (1)

  1. Data Management and Exploration Group, RWTH Aachen University, Aachen, Germany
  2. INFORM GmbH, Aachen, Germany
