Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering

  • Conference paper
Large-Scale Knowledge Resources. Construction and Application (LKR 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4938)

Abstract

In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as dimensionality reduction methods and investigate their effectiveness in document clustering on real-world document sets. For clustering documents, we use a method based on the multinomial mixture, which is known as an efficient framework for text mining. Clustering results are evaluated by the F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each article as the ground truth when assessing clustering results. Our experiment shows that dimensionality reduction via LDA or pLSI yields document clusters of almost the same quality as those obtained from the original feature vectors; the vector dimension can therefore be reduced without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment reveals no meaningful difference between LDA and pLSI. This result suggests that LDA does not necessarily supersede pLSI, at least as a dimensionality reduction method for document clustering.
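
The pipeline outlined in the abstract (reduce bag-of-words vectors to topic proportions with LDA or pLSI, cluster the reduced vectors, and score the clusters against ground-truth categories with the F-measure) can be illustrated with the short Python sketch below. This is a hedged illustration rather than the authors' implementation: scikit-learn's LDA stands in for the paper's topic-model inference, k-means stands in for the multinomial-mixture clustering actually used in the paper, and the tiny corpus and its category labels are invented for illustration.

  # Minimal sketch of the abstract's pipeline, under stated assumptions:
  # scikit-learn's LDA replaces the paper's own topic-model inference, and
  # k-means replaces the multinomial-mixture clustering used in the paper.
  # The documents and category labels below are invented for illustration.
  import numpy as np
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation
  from sklearn.cluster import KMeans

  docs = [
      "stocks fall as markets react to the rate hike",
      "the central bank raises interest rates again",
      "team wins the championship after an overtime thriller",
      "star striker scores twice in the cup final",
  ]
  categories = ["economy", "economy", "sports", "sports"]  # ground-truth labels

  # 1. Original high-dimensional feature vectors (bag of words).
  X = CountVectorizer().fit_transform(docs)

  # 2. Dimensionality reduction: LDA maps each document to a K-dimensional
  #    topic distribution (pLSI or random projection would be swapped in here).
  theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

  # 3. Cluster the reduced vectors (k-means here only to keep the sketch short).
  clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(theta)

  # 4. Clustering F-measure: for each category, take the best-matching cluster's
  #    harmonic mean of precision and recall, weighted by category size.
  def clustering_f_measure(labels, assignments):
      labels, assignments = np.asarray(labels), np.asarray(assignments)
      total = 0.0
      for cat in np.unique(labels):
          in_cat = labels == cat
          best_f = 0.0
          for c in np.unique(assignments):
              in_clu = assignments == c
              tp = np.sum(in_cat & in_clu)
              if tp == 0:
                  continue
              prec, rec = tp / in_clu.sum(), tp / in_cat.sum()
              best_f = max(best_f, 2 * prec * rec / (prec + rec))
          total += in_cat.mean() * best_f
      return total

  print("F-measure:", clustering_f_measure(categories, clusters))

Replacing step 2 with a random projection of the bag-of-words matrix gives the baseline the abstract compares against.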

Editor information

Takenobu Tokunaga, Antonio Ortega

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Masada, T., Kiyasu, S., Miyahara, S. (2008). Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering. In: Tokunaga, T., Ortega, A. (eds) Large-Scale Knowledge Resources. Construction and Application. LKR 2008. Lecture Notes in Computer Science (LNAI), vol 4938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78159-2_2

  • DOI: https://doi.org/10.1007/978-3-540-78159-2_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78158-5

  • Online ISBN: 978-3-540-78159-2

  • eBook Packages: Computer Science, Computer Science (R0)
