A method for Chinese text classification based on apparent semantics and latent aspects

Chen, Ye-Wang; Wang, Jiong-Liang; Cai, Yi-Qiao; Du, Ji-Xiang

doi:10.1007/s12652-015-0257-z

Original Research
Published: 12 February 2015

Volume 6, pages 473–480, (2015)
Journal of Ambient Intelligence and Humanized Computing

431 Accesses
17 Citations
Abstract

Existing text classification methods fail to achieve high accuracy on Chinese texts because the basic unit of Chinese text is the phrase rather than the individual hanzi (character), and Chinese text has no natural delimiter to separate phrases. The problem is worse for large volumes of Chinese Web text, which is often short, irregular, and sparse and therefore lacks sufficient context. This paper proposes a new classification method for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of a Chinese text are extracted with BaiduBaike and used as features in place of hanzis. Second, pLSA is applied to mine the latent aspects of these apparent semantics. Third, the degree of relevance of a document to a category is calculated from the apparent semantics and latent aspects. Finally, the category of the document is determined by this degree of relevance. The proposed method handles short Chinese Web texts well with only a small amount of training data. Our experiments show that the method is promising and that it outperforms pLSA, SVM, KNN, and CRF when training data are insufficient and the text is irregular.
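
To make the second step more concrete, the following is a minimal sketch of generic pLSA fitted by EM, assuming each document has already been reduced to counts over a fixed list of features. It is not the authors' implementation: the names `plsa`, `normalize`, `counts`, and `n_aspects` are hypothetical, the toy count matrix is invented for illustration, and the BaiduBaike feature extraction and the relevance-degree calculation are not specified in the abstract, so they are not reproduced here.

```python
import random


def normalize(row):
    """Scale non-negative numbers so they sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Fit generic pLSA by EM on a document-by-feature count matrix.

    counts[d][w] is how often feature w occurs in document d.
    Returns (p_z_d, p_w_z): P(aspect | document) and P(feature | aspect).
    """
    rng = random.Random(seed)
    n_docs, n_feats = len(counts), len(counts[0])

    # Random positive initialisation; every row is a probability distribution.
    p_z_d = [normalize([rng.random() + 0.01 for _ in range(n_aspects)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.01 for _ in range(n_feats)])
             for _ in range(n_aspects)]

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_aspects for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_feats for _ in range(n_aspects)]
        for d in range(n_docs):
            for w in range(n_feats):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_aspects)])
                # Accumulate expected counts for the M-step.
                for z in range(n_aspects):
                    acc_z_d[d][z] += n * post[z]
                    acc_w_z[z][w] += n * post[z]
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


# Toy data (invented): 4 documents over 4 hypothetical features.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_z_d, p_w_z = plsa(counts, n_aspects=2)
for row in p_z_d:
    print([round(p, 3) for p in row])
```

Each row of `p_z_d` is a document's mixture over latent aspects. A classifier along the lines described above would presumably combine such mixtures with the apparent-semantics features to score each category, but how the paper does so is not stated in the abstract.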

Acknowledgments

Supported by the Grant of the National Science Foundation of China (No. 61175121); the Grant of the National Science Foundation of Fujian Province (No. 201-3J06014); the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQNYX108); the Fundamental Research Funds for the Central Universities (No. JBZR-1217); the Natural Science Foundation of Fujian Province, China (No. 2012J05117).

Author information

Authors and Affiliations

College of Computer Science and Technology, Huaqiao University, Xiamen, China
Ye-Wang Chen, Jiong-Liang Wang, Yi-Qiao Cai & Ji-Xiang Du

Corresponding author

Correspondence to Ji-Xiang Du.

About this article

Cite this article

Chen, YW., Wang, JL., Cai, YQ. et al. A method for Chinese text classification based on apparent semantics and latent aspects. J Ambient Intell Human Comput 6, 473–480 (2015). https://doi.org/10.1007/s12652-015-0257-z

Received: 24 September 2014
Accepted: 22 January 2015
Published: 12 February 2015
Issue Date: August 2015
DOI: https://doi.org/10.1007/s12652-015-0257-z
