Improving Quality of Search Results Clustering with Approximate Matrix Factorisations

  • Stanislaw Osinski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering algorithms: Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). For our experiments we use the standard merge-then-cluster approach based on the Open Directory Project web catalogue as a source of human-clustered document summaries.
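As a rough illustration of one of the four factorisations named above (K-Means/Concept Decomposition), the sketch below runs spherical k-means over a tiny term-document matrix and reads a candidate cluster label off the heaviest terms of each base ("concept") vector. It is not the paper's implementation. The snippets, the whitespace tokenisation, the raw term-count weighting, the seeding from the first k snippets and all function names are assumptions made for the example. The least-squares step that turns the concept vectors into a full approximate factorisation is omitted.

```python
import math
from collections import Counter

# Hypothetical search-result snippets for the ambiguous query "jaguar".
SNIPPETS = [
    "jaguar car dealer",
    "jaguar cat habitat",
    "jaguar car price",
    "jaguar cat jungle",
]


def unit(vec):
    """Scale a vector to unit Euclidean length (zero vectors come back unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def concept_vectors(snippets, k, iterations=10):
    """Spherical k-means. Returns (vocabulary, concept vectors, assignment).

    Assumes k <= len(snippets).
    """
    vocab = sorted({w for s in snippets for w in s.lower().split()})
    column = {w: i for i, w in enumerate(vocab)}
    docs = []
    for s in snippets:
        vec = [0.0] * len(vocab)
        for word, count in Counter(s.lower().split()).items():
            vec[column[word]] = float(count)
        docs.append(unit(vec))
    # Assumption: seed with the first k snippets so the run is deterministic.
    concepts = [list(d) for d in docs[:k]]
    assign = [0] * len(docs)
    for _ in range(iterations):
        # Assign each snippet to the concept vector with the highest cosine.
        assign = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        # Recompute each concept vector as the normalised sum of its members.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                concepts[j] = unit([sum(col) for col in zip(*members)])
    return vocab, concepts, assign


def top_terms(vocab, concept, n=2):
    """Heaviest n terms of a concept vector, ties broken alphabetically."""
    ranked = sorted(zip(vocab, concept), key=lambda p: (-p[1], p[0]))
    return [word for word, _ in ranked[:n]]


vocab, concepts, assign = concept_vectors(SNIPPETS, k=2)
print(assign)
for j, concept in enumerate(concepts):
    print(j, top_terms(vocab, concept))
```

On this toy input the two concept vectors separate the "car" snippets from the "cat" snippets. A term shared by every snippet, such as the query word itself, carries weight in both vectors, which is one reason label quality needs separate evaluation.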


Keywords: Base Vector; Singular Value Decomposition; Outlier Detection; Cluster Label; Topic Coverage
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Stanislaw Osinski (1)
  1. Poznan Supercomputing and Networking Center, Poznan, Poland
