Vector Space Models for Search and Cluster Mining

  • Mei Kobayashi
  • Masaki Aono

This chapter reviews search and cluster mining algorithms based on vector space modeling (VSM). The first part of the review considers two methods for addressing the problems of polysemy and synonymy in very large data sets: latent semantic indexing (LSI) and principal component analysis (PCA). The second part focuses on methods for finding minor clusters. Until recently, the study of minor clusters has been relatively neglected, even though they may represent rare but significant types of events or special types of customers. A new algorithm for finding minor clusters is introduced. It addresses some difficult issues in database analysis, such as accommodating cluster overlap, automatically labeling clusters based on their document contents, and giving the user control over the trade-off between speed of computation and quality of results. Implementation studies with news articles from the Reuters and Los Angeles Times TREC datasets show the effectiveness of the algorithm compared to previous methods.
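The synonymy problem mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the toy term-document matrix, the term list, and the helper names `cosine_to_columns` and `lsi_scores` are all assumptions made for illustration, and LSI is approximated here by a plain truncated SVD computed with NumPy.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)


def cosine_to_columns(q, M):
    """Cosine similarity between vector q and each column of M."""
    norms = np.linalg.norm(q) * np.linalg.norm(M, axis=0)
    # Guard against division by zero for all-zero vectors.
    return (q @ M) / np.where(norms > 0, norms, 1.0)


def lsi_scores(A, q, k):
    """Score documents against q in a rank-k latent space (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]  # basis of the k leading left singular vectors
    # Project the query and every document onto that basis, then compare.
    return cosine_to_columns(Uk.T @ q, Uk.T @ A)


# Query containing only the term "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0

raw_scores = cosine_to_columns(q, A)      # plain vector space model
latent_scores = lsi_scores(A, q, k=2)     # LSI-style scoring with k = 2

print("raw VSM:", np.round(raw_scores, 3))
print("LSI k=2:", np.round(latent_scores, 3))
```

In this toy setting the query "car" shares no term with the second document, which mentions only "automobile" and "engine", so its raw cosine score is zero. In the rank-2 latent space the two vehicle documents project onto the same direction, so the second document is retrieved as well, while the flower documents stay unrelated to the query. The choice k = 2 is specific to this example; choosing the number of dimensions for real collections is a separate question.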


Keywords: Vector Space Modeling, Document Cluster, Latent Semantic Indexing, Document Vector, Minor Cluster





Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Mei Kobayashi (1)
  • Masaki Aono (2)
  1. IBM Research, Tokyo Research Laboratory, Yamato-shi, Kanagawa-ken, Japan
  2. Department of Information and Computer Sciences, C-511, Toyohashi University of Technology, Tempaku-cho, Toyohashi-shi, Aichi, Japan
