Abstract
PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique that has been widely applied in text mining to discover the underlying topics embedded in a data corpus. However, because corpora keep growing and changing, topics must be discovered dynamically and large datasets must be processed incrementally. Moreover, PLSA models have difficulty performing inference on new documents. To overcome these problems, we propose a novel weighted incremental PLSA algorithm, called WIPLSA, that dynamically discovers topics and incrementally learns topics from new documents. Experiments show that WIPLSA captures the dynamic topics hidden in a dynamically updated data corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves better perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in document categorization.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 91546122, 61602438, 61573335, 61473273, 61473274, 61363058), the National High-tech R&D Program of China (863 Program) (No. 2014AA015105), the National Science and Technology Support Program (No. 2014BAK02B07), the National Major R&D Program of the Beijing Municipal Science & Technology Commission (No. Z161100002616032), and the Guangdong Provincial Science and Technology Plan Projects (No. 2015B010109005).
Cite this article
Li, N., Luo, W., Yang, K. et al. Self-organizing weighted incremental probabilistic latent semantic analysis. Int. J. Mach. Learn. & Cyber. 9, 1987–1998 (2018). https://doi.org/10.1007/s13042-017-0681-9
DOI: https://doi.org/10.1007/s13042-017-0681-9