Data Mining and Knowledge Discovery, Volume 2, Issue 4, pp 325–344

Principal Direction Divisive Partitioning

  • Daniel Boley
Article

Abstract

We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high-dimensional Euclidean space (i.e., one in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative: it operates by repeatedly splitting clusters into smaller clusters. The documents are assembled into a matrix that is very sparse, and it is this sparsity that permits the algorithm to be very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
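
The abstract names the ingredients (documents as real vectors, repeated splitting of clusters) but not the splitting rule itself. The following is a minimal sketch of one way such a divisive scheme can work, under assumptions that are not stated in the abstract: each split separates a cluster by the sign of the projection of its mean-centered vectors onto their leading principal direction, and the next cluster to split is the one with the largest scatter. All function names (`center`, `scatter`, `principal_direction`, `split`, `pddp`) are hypothetical, and the dense Python lists stand in for the sparse document matrix.

```python
import math


def center(docs):
    """Return the rows of docs with their centroid subtracted."""
    n, d = len(docs), len(docs[0])
    mean = [sum(row[j] for row in docs) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in docs]


def scatter(docs):
    """Total squared distance of the rows from their centroid."""
    return sum(x * x for row in center(docs) for x in row)


def principal_direction(centered, iters=200):
    """Approximate the leading principal direction by power iteration.

    Only products with the centered matrix C and its transpose are used,
    so the covariance matrix is never formed explicitly.
    """
    n, d = len(centered), len(centered[0])
    v = [float(j + 1) for j in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]  # C v
        w = [sum(proj[i] * centered[i][j] for i in range(n))
             for j in range(d)]                                        # C^T (C v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def split(docs, idx):
    """Split the cluster idx by the sign of each projection (assumed rule)."""
    centered = center([docs[i] for i in idx])
    v = principal_direction(centered)
    left, right = [], []
    for i, row in zip(idx, centered):
        p = sum(a * b for a, b in zip(row, v))
        (left if p <= 0 else right).append(i)
    return left, right


def pddp(docs, k):
    """Divide docs into at most k clusters, returned as lists of row indices."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < k:
        # Assumed selection rule: split the cluster with the largest scatter.
        target = max(clusters, key=lambda c: scatter([docs[i] for i in c]))
        if scatter([docs[i] for i in target]) == 0.0:
            break  # every remaining cluster is a single point or duplicates
        left, right = split(docs, target)
        if not left or not right:
            break  # the projection failed to separate the cluster
        clusters.remove(target)
        clusters += [left, right]
    return clusters


# Usage: two well-separated groups of toy "documents" in three dimensions.
docs = [[5, 0, 0], [6, 0, 1], [5, 1, 0], [0, 5, 5], [0, 6, 5], [1, 5, 6]]
clusters = pddp(docs, 2)
```

Because the power iteration touches the data only through matrix-vector products, the same structure would carry over to a sparse matrix representation, where each product costs time proportional to the number of nonzero entries.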

Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Daniel Boley
    • Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
