Abstract
In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as a dimensionality reduction method and investigate their effectiveness in document clustering on real-world document sets. For document clustering, we use a method based on the multinomial mixture, which is known as an efficient framework for text mining. Clustering results are evaluated by the F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each Web article as the ground truth for evaluating clustering results. Our experiment shows that dimensionality reduction via LDA and pLSI yields document clusters of almost the same quality as those obtained with the original feature vectors. Therefore, we can reduce the vector dimension without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment reveals no meaningful difference between LDA and pLSI. This result suggests that LDA does not supersede pLSI, at least for dimensionality reduction in document clustering.
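The pipeline the abstract describes can be sketched in a few lines: reduce each document's term-count vector to a low-dimensional topic-proportion vector via LDA, then score a clustering with the F-measure. The sketch below uses scikit-learn's LDA implementation on a synthetic count matrix; the matrix, the number of topics, and the helper `f_measure` are illustrative assumptions, not the authors' actual setup (which used a multinomial-mixture clusterer on Japanese and Korean Web articles).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term count matrix: 20 documents over a 100-term vocabulary.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 100))

# Dimensionality reduction: map each document to a 5-dimensional
# topic-proportion vector (5 topics is an arbitrary choice here).
lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(X)  # shape (20, 5); each row sums to 1

# F-measure as used for evaluation: the harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)
```

Any clustering algorithm can then be run on the rows of `theta` instead of the raw 100-dimensional count vectors; the paper's finding is that this reduction leaves cluster quality, as measured by `f_measure`, essentially unchanged.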
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Masada, T., Kiyasu, S., Miyahara, S. (2008). Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering. In: Tokunaga, T., Ortega, A. (eds) Large-Scale Knowledge Resources. Construction and Application. LKR 2008. Lecture Notes in Computer Science(), vol 4938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78159-2_2
Print ISBN: 978-3-540-78158-5
Online ISBN: 978-3-540-78159-2