Abstract
In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as a dimensionality reduction method and investigate their effectiveness in document clustering on real-world document sets. For document clustering, we use a method based on the multinomial mixture, which is known as an efficient framework for text mining. Clustering results are evaluated by the F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each Web article as the ground truth for evaluating clustering results. Our experiment shows that dimensionality reduction via LDA and pLSI yields document clusters of almost the same quality as those obtained with the original feature vectors. Therefore, we can reduce the vector dimension without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment reveals no meaningful difference between LDA and pLSI. This result suggests that LDA does not supersede pLSI, at least for dimensionality reduction in document clustering.
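The pipeline the abstract describes can be sketched in a few lines: reduce each document's term-count vector to a low-dimensional topic-proportion vector via LDA, then score a clustering with the F-measure. The sketch below uses scikit-learn's LDA implementation on a synthetic count matrix; the matrix, the number of topics, and the helper `f_measure` are illustrative assumptions, not the authors' actual setup (which used a multinomial-mixture clusterer on Japanese and Korean Web articles).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term count matrix: 20 documents over a 100-term vocabulary.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 100))

# Dimensionality reduction: map each document to a 5-dimensional
# topic-proportion vector (5 topics is an arbitrary choice here).
lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(X)  # shape (20, 5); each row sums to 1

# F-measure as used for evaluation: the harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)
```

Any clustering algorithm can then be run on the rows of `theta` instead of the raw 100-dimensional count vectors; the paper's finding is that this reduction leaves cluster quality, as measured by `f_measure`, essentially unchanged.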
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Masada, T., Kiyasu, S., Miyahara, S. (2008). Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering. In: Tokunaga, T., Ortega, A. (eds) Large-Scale Knowledge Resources. Construction and Application. LKR 2008. Lecture Notes in Computer Science(), vol 4938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78159-2_2
Print ISBN: 978-3-540-78158-5
Online ISBN: 978-3-540-78159-2