CCODM: conditional co-occurrence degree matrix document representation method
Document representation is a key problem in document analysis and processing tasks such as document classification, clustering, and information retrieval. Especially for unstructured text data, the choice of document representation method affects the performance of the subsequent algorithms in both applications and research. In this paper, we propose a novel document representation method, the conditional co-occurrence degree matrix document representation method (CCODM), which is based on word co-occurrence. CCODM considers not only the co-occurrence of terms but also their conditional dependencies in a specific context, so more of the structural and semantic information in the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that CCODM achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method, and the document embedding method.
Keywords: Document representation · Word co-occurrence · Conditional co-occurrence degree matrix · Classification · Feature selection
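As background for the abstract above: CCODM builds on the general word co-occurrence matrix representation, in which a document is characterized by how often pairs of terms appear near each other. The paper's exact conditional co-occurrence degree construction is not reproduced here; the following is only a minimal sketch of the plain sliding-window co-occurrence counting that CCODM extends, with an illustrative window size chosen by us.

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """Count symmetric word co-occurrences within a sliding window.

    Each pair of tokens appearing at most `window` positions apart
    contributes one count; pairs are stored in sorted order so that
    (a, b) and (b, a) are the same entry.
    """
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((w, tokens[j])))
            counts[pair] += 1
    return dict(counts)

# Toy document (hypothetical example, not from the paper's corpora).
doc = "data mining finds patterns in data".split()
m = cooccurrence_matrix(doc, window=2)
```

A representation of this kind keeps pairwise structural information that a bag-of-words vector discards; CCODM goes further by weighting pairs according to their conditional dependence in context.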
This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of the State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. We are also very grateful to Dr. Deqing Wang (Wang et al. 2016b) for providing us with the code of RP-GSO and to Dr. Xiangzhu Meng for guiding us through the doc2vec experiments. We would like to thank the anonymous reviewers for their constructive comments on this paper.
Compliance with ethical standards
Conflict of interest
Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent was obtained from all individual participants included in the study.
- Bernotas M, Laurutis R (2007) The peculiarities of the text document representation, using ontology and tagging-based clustering technique. J Inf Technol Control 36(2):217–220
- Grün B, Hornik K (2017) topicmodels: an R package for fitting topic models. Version 0.2-6. doi: 10.18637/jss.v040.i13
- Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022, http://dl.acm.org/citation.cfm?id=944919.944937
- Jin L, Gong W, Fu W, Wu H (2015) A text classifier of English movie reviews based on information gain. In: The 3rd international conference on applied computing and information technology/2nd international conference on computational science and intelligence, pp 454–457. doi: 10.1109/ACIT-CSI.2015.86
- Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
- Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
- Liaw A, Wiener M (2015) Package 'randomForest'. Breiman and Cutler's random forests for classification and regression. Version 4.6-12. https://www.stat.berkeley.edu/~breiman/RandomForests/
- Liu Q, Zhang H, Yu H, Cheng X (2004) Chinese lexical analysis using cascaded hidden Markov model. J Comput Res Dev 41(8):1421–1429
- Liu Z, Yu W, Deng Y, Bian Z (2010) A feature selection method for document clustering based on part-of-speech and word co-occurrence. In: 2010 Seventh international conference on fuzzy systems and knowledge discovery, vol 5, pp 2331–2334. doi: 10.1109/FSKD.2010.5569827
- Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359
- Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
- Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space, pp 1–12. arXiv preprint arXiv:1301.3781
- Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 427–436, http://arxiv.org/abs/1412.1897
- Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, pp 45–50
- Tang G, Xia Y, Sun J, Zhang M, Zheng TF (2015) Statistical word sense aware topic models. Soft Comput 19(1):13–27
- Wu Z, Zhu H, Li G, Cui Z, Huang H, Li J, Chen E, Xu G (2017) An efficient Wikipedia semantic matching approach to text document classification. Inf Sci 393:15–28. doi: 10.1016/j.ins.2017.02.009
- Xu H, Zhang F, Wang W (2015) Implicit feature identification in Chinese reviews using explicit topic mining model. Knowl Based Syst 76:166–175. doi: 10.1016/j.knosys.2014.12.012
- Yang Y, Pedersen J (1997) A comparative study on feature selection in text categorization. In: Proceedings of fourteenth international conference on machine learning (ICML), vol 4, pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
- Zheng Y, Han W, Zhu C (2014) A novel feature selection method based on category distribution and phrase attributes. In: International conference on trustworthy computing and services (ISCTCS), Berlin, Heidelberg, pp 25–32. doi: 10.1007/978-3-662-47401-3_4