Abstract
Existing text classification methods do not achieve high accuracy on Chinese texts, because the basic unit of Chinese text is not the individual character (hanzi) but the phrase, and Chinese text has no natural delimiter to separate phrases. The problem is worse for large collections of Chinese Web texts, which often lack sufficient context because most of them are short, irregular and sparse. This paper proposes a new classification method for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of a Chinese text are extracted as features, in place of hanzi, using BaiduBaike. Second, pLSA is applied to mine the latent aspects of these apparent semantics. Third, the degree of relevance of a document to a category is calculated from the apparent semantics and latent aspects. Finally, the category of the document is determined by this degree of relevance. The proposed method handles short Chinese Web texts well with only a small amount of training data. Our experiments show that the method is promising and that it outperforms pLSA, SVM, KNN and CRF when training data are scarce and the text is irregular.
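To make the second step more concrete, the following is a minimal sketch of standard pLSA fitted by EM (Hofmann's formulation), not the authors' implementation. It assumes the apparent semantics have already been extracted and counted per document; the function name `plsa`, the toy `counts` matrix and all parameters are hypothetical. The sketch stops at the aspect mixtures P(z|d), since the abstract does not specify how ASLA combines them with the apparent semantics into a degree of relevance.

```python
import math
import random


def plsa(counts, n_aspects, n_iter=30, seed=0):
    """Fit pLSA by EM on a document-by-feature count matrix (list of lists).

    Returns (p_z_d, p_w_z, log_lik): P(z|d) for each document, P(w|z) for
    each latent aspect, and the log-likelihood measured at every iteration.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0:  # fall back to uniform for an empty row
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random, strictly positive initialisation of both distributions.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_aspects)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(n_aspects)]

    log_lik = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_aspects for _ in range(n_docs)]
        new_w_z = [[0.0] * n_terms for _ in range(n_aspects)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_terms):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_aspects)]
                norm = sum(joint)
                ll += n_dw * math.log(norm)
                for z in range(n_aspects):
                    resp = n_dw * joint[z] / norm
                    # Accumulate expected counts used by the M-step.
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        log_lik.append(ll)
    return p_z_d, p_w_z, log_lik


# Toy input (made up): 4 short documents over 4 apparent-semantic features.
counts = [[4, 2, 0, 0], [4, 2, 0, 0], [0, 0, 3, 3], [0, 0, 3, 3]]
p_z_d, p_w_z, log_lik = plsa(counts, n_aspects=2)
for d, mixture in enumerate(p_z_d):
    print(d, [round(p, 3) for p in mixture])
```

EM only guarantees that the log-likelihood does not decrease between iterations; it can settle in a local optimum, so in practice several random restarts (different `seed` values) are usually compared.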
Acknowledgments
Supported by the Grant of the National Science Foundation of China (No. 61175121); the Grant of the National Science Foundation of Fujian Province (No. 201-3J06014); the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQNYX108); the Fundamental Research Funds for the Central Universities (No. JBZR-1217); the Natural Science Foundation of Fujian Province, China (No. 2012J05117).
Cite this article
Chen, YW., Wang, JL., Cai, YQ. et al. A method for Chinese text classification based on apparent semantics and latent aspects. J Ambient Intell Human Comput 6, 473–480 (2015). https://doi.org/10.1007/s12652-015-0257-z
DOI: https://doi.org/10.1007/s12652-015-0257-z