Machine Learning, Volume 42, Issue 1–2, pp 143–175

Concept Decompositions for Large Sparse Text Data Using Clustering

  • Inderjit S. Dhillon
  • Dharmendra S. Modha


Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.
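The clustering step summarized above can be illustrated with a minimal sketch. It assumes small dense Python lists rather than the large sparse matrices the paper targets, and the function names, the random initialization, the iteration cap, and the toy documents are all illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def normalize(v):
    """Return v scaled to unit Euclidean norm (a zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, k, max_iters=50, seed=0):
    """Cluster document vectors by cosine similarity.

    Returns (labels, concepts), where concepts[j] is the centroid of
    cluster j normalized to unit Euclidean norm (its "concept vector").
    """
    rng = random.Random(seed)
    unit_docs = [normalize(d) for d in docs]
    # Assumed initialization: k distinct documents chosen at random.
    concepts = [list(d) for d in rng.sample(unit_docs, k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each document joins its most similar concept vector.
        new_labels = [
            max(range(k), key=lambda j: dot(d, concepts[j])) for d in unit_docs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the concept vectors are too
        labels = new_labels
        # Update step: concept vector = normalized sum of the cluster members.
        for j in range(k):
            members = [d for d, lab in zip(unit_docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                concepts[j] = normalize([sum(col) for col in zip(*members)])
    return labels, concepts


# Toy word-count vectors: the first three use words 0-1, the last three words 2-3.
toy_docs = [
    [3, 1, 0, 0], [2, 1, 0, 0], [4, 1, 0, 0],
    [0, 0, 1, 2], [0, 0, 2, 5], [0, 0, 1, 3],
]
labels, concepts = spherical_kmeans(toy_docs, k=2)
print(labels)
print(concepts)
```

Because every document is first scaled to unit norm, the dot product in the assignment step is the cosine of the angle between a document and a concept vector, so the sketch groups documents by direction rather than by length.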

Keywords

concept vectors, fractals, high-dimensional data, information retrieval, k-means algorithm, least-squares, principal angles, principal component analysis, self-similarity, singular value decomposition, sparsity, vector space models, text mining



Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Inderjit S. Dhillon (1)
  • Dharmendra S. Modha (2)
  1. Department of Computer Science, University of Texas, Austin, USA
  2. IBM Almaden Research Center, San Jose, USA
