A Clustering Framework Based on Adaptive Space Mapping and Rescaling

  • Yiling Zeng
  • Hongbo Xu
  • Jiafeng Guo
  • Yu Wang
  • Shuo Bai
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5839)


Traditional clustering algorithms often suffer from the model misfit problem when the distribution of real data does not fit the model assumptions. To address this problem, we propose a novel clustering framework based on adaptive space mapping and rescaling, referred to as the M-R framework. The basic idea of our approach is to adjust the data representation so that the data distribution better fits the model assumptions. Specifically, documents are first mapped into a low-dimensional space with respect to the cluster centers, so that the distribution statistics of each cluster can be analyzed on the corresponding dimension. Using these statistics, a rescaling operation is then applied to regularize the data distribution according to the model assumptions. These two steps are performed iteratively alongside the clustering algorithm to continually improve clustering performance. As an example, we apply the M-R framework to the most widely used clustering algorithm, k-means. Experiments on well-known datasets show that the M-R framework achieves performance comparable to state-of-the-art methods.
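
The abstract does not give the concrete mapping or rescaling formulas, so the following is only a rough sketch of how a map-then-rescale loop around k-means could look, not the authors' implementation. It assumes that the mapping sends each point to its vector of Euclidean distances to the k cluster centers, and that the rescaling divides dimension j by the root-mean-square distance of cluster j's provisional members to center j. The function names (`map_to_centers`, `rescale`, `mr_kmeans`) and the toy data are hypothetical.

```python
import numpy as np


def map_to_centers(X, centers):
    """Mapping step (assumed form): coordinate j of a point is its
    Euclidean distance to cluster center j, giving a k-dimensional space."""
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)


def rescale(Z, labels, eps=1e-12):
    """Rescaling step (assumed form): divide dimension j by the RMS
    distance of cluster j's current members to center j."""
    scale = np.ones(Z.shape[1])
    for j in range(Z.shape[1]):
        member_dist = Z[labels == j, j]
        if member_dist.size > 0:  # empty clusters keep scale 1
            scale[j] = max(np.sqrt(np.mean(member_dist ** 2)), eps)
    return Z / scale


def mr_kmeans(X, init_centers, n_iter=20):
    """k-means with a map-and-rescale step inside every iteration."""
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        Z = map_to_centers(X, centers)
        provisional = Z.argmin(axis=1)            # plain k-means assignment
        new_labels = rescale(Z, provisional).argmin(axis=1)
        for j in range(len(centers)):
            members = X[new_labels == j]
            if len(members) > 0:                  # keep old center if empty
                centers[j] = members.mean(axis=0)
        if np.array_equal(new_labels, labels):    # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Toy data: a tight cluster near (0, 0) and a wider one near (10, 0).
X = np.array([[0, 0], [0.2, 0], [0, 0.2], [-0.2, 0], [0, -0.2],
              [10, 0], [12, 0], [10, 2], [8, 0], [10, -2]], dtype=float)
labels, centers = mr_kmeans(X, init_centers=X[[0, 5]])
print(labels, centers)
```

Dividing each distance by its cluster's own spread lets a wide cluster claim points that plain k-means would hand to a tighter neighbor. The same mechanism can let one cluster absorb the others under a poor initialization, which is why this sketch takes explicit initial centers.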


Keywords: Document Clustering, Space Mapping, Data Representation




Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Yiling Zeng (1)
  • Hongbo Xu (1)
  • Jiafeng Guo (1)
  • Yu Wang (1)
  • Shuo Bai (1, 2)
  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Shanghai Stock Exchange, Shanghai, China
