# CCODM: conditional co-occurrence degree matrix document representation method

## Abstract

Document representation is a key problem in document analysis and processing tasks such as document classification, clustering and information retrieval. For unstructured text data in particular, the choice of document representation method affects the performance of the downstream algorithms used in applications and research. In this paper, we propose a novel document representation method, the conditional co-occurrence degree matrix document representation method (CCODM), which is based on word co-occurrence. CCODM considers not only the co-occurrence of terms but also the conditional dependencies of terms in a specific context, so that more of the useful structural and semantic information in the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that CCODM achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method and the document embedding method.

## Keywords

Document representation, Word co-occurrence, Conditional co-occurrence degree matrix, Classification, Feature selection

## Notes

### Acknowledgements

This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. We are also very grateful to Dr. Deqing Wang (Wang et al. 2016b) for providing the code of RP-GSO and to Dr. Xiangzhu Meng for guiding our experiments on doc2vec. We would like to thank the anonymous reviewers for their constructive comments on this paper.

### Compliance with ethical standards

### Conflict of interest

Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.

### Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

### Informed consent

Informed consent was obtained from all individual participants included in the study.

## References

- Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768. doi: 10.1016/j.eswa.2011.09.160
- Benabdeslem K, Elghazel H, Hindawi M (2016) Ensemble constrained Laplacian score for efficient and robust semi-supervised feature selection. Knowl Inf Syst 49(3):1161–1185. doi: 10.1007/s10115-015-0901-0
- Bengio Y, Courville A, Vincent P (2014) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. doi: 10.1109/TPAMI.2013.50
- Bengio Y, Schwenk H, Sencal J, Morin F, Gauvain J (2003) Neural probabilistic language models. J Mach Learn Res 3(6):1137–1155, doi: 10.1162/153244303322533223, http://dl.acm.org/citation.cfm?id=944919.944966
- Bernotas M, Laurutis R (2007) The peculiarities of the text document representation, using ontology and tagging-based clustering technique. J Inf Technol Control 36(2):217–220
- Bettina G, Kurt H (2017) Topicmodels: an R package for fitting topic models. Version 0.2-6. doi: 10.18637/jss.v040.i13
- Bhushan S, Danti A (2017) Classification of text documents based on score level fusion approach. Pattern Recognit Lett 94:118–126. doi: 10.1016/j.patrec.2017.05.003
- Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022, http://dl.acm.org/citation.cfm?id=944919.944937
- Boulares M, Jemni M (2016) Learning sign language machine translation based on elastic net regularization and latent semantic analysis. Artif Intell Rev 46(2):145–166. doi: 10.1007/s10462-016-9460-3
- Bullinaria J, Levy J (2012) Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behav Res Methods 44(3):890–907. doi: 10.3758/s13428-011-0183-8
- Cambria E, Gastaldo P, Bisio F, Zunino R (2015) An ELM-based model for affective analogical reasoning. Neurocomputing 149:443–455. doi: 10.1016/j.neucom.2014.01.064
- Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941. doi: 10.1109/TKDE.2014.2313872
- Du Y, Liu W, Lv X, Peng G (2015) An improved focused crawler based on semantic similarity vector space model. Appl Soft Comput 36:392–407. doi: 10.1016/j.asoc.2015.07.026
- Farahat A, Kamel M (2011) Statistical semantics for enhancing document clustering. Knowl Inf Syst 28(2):365–393. doi: 10.1007/s10115-010-0367-z
- Franco-Salvador M, Gupta P, Rosso P, Banchs R (2016) Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowl Based Syst 111:87–99. doi: 10.1016/j.knosys.2016.08.004
- Hsu C, Huang W (2016) Integrated dimensionality reduction technique for mixed-type data involving categorical values. Appl Soft Comput 43:199–209. doi: 10.1016/j.asoc.2016.02.015
- Huang H, Kuo Y (2010) Cross-lingual document representation and semantic similarity measure: a fuzzy set and rough set based approach. IEEE Trans Fuzzy Syst 18(6):1098–1111. doi: 10.1142/S0218001411008890
- Ibrahim O, Landa-Silva D (2016) Term frequency with average term occurrences for textual information retrieval. Soft Comput 20(8):3045–3061. doi: 10.1007/s00500-015-1935-7
- Jin L, Gong W, Fu W, Wu H (2015) A text classifier of English movie reviews based on information gain. In: The 3rd international conference on applied computing and information technology/2nd international conference on computational science and intelligence, pp 454–457. doi: 10.1109/ACIT-CSI.2015.86
- Johnson-Laird P, Oatley K (1989) The language of emotions: an analysis of a semantic field. Cogn Emot 3(3):81–123. doi: 10.1080/02699938908408075
- Keikha M, Khonsari A, Oroumchian F (2009) Rich document representation and classification: an analysis. Knowl Based Syst 22(1):67–71. doi: 10.1016/j.knosys.2008.06.002
- Lau R, Xia Y, Ye Y (2014) A probabilistic generative model for mining cybercriminal networks from online social media. IEEE Comput Intell Mag 9(1):31–43. doi: 10.1109/MCI.2013.2291689
- Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
- Li J, Li J, Fu X, Masud M, Huang J (2016) Learning distributed word representation with multi-contextual mixed embedding. Knowl Based Syst 106:220–230. doi: 10.1016/j.knosys.2016.05.045
- Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
- Liaw A, Wiener M (2015) Package ’randomForest’. Breiman and Cutlers random forests for classification and regression. Version 4.6-12. https://www.stat.berkeley.edu/~breiman/RandomForests/
- Liu Q, Zhang H, Yu H, Cheng X (2004) Chinese lexical analysis using cascaded hidden Markov model. J Comput Res Dev 41(8):1421–1429
- Liu Z, Yu W, Deng Y, Bian Z (2010) A feature selection method for document clustering based on part-of-speech and word co-occurrence. In: 2010 Seventh international conference on fuzzy systems and knowledge discovery, vol 5, pp 2331–2334. doi: 10.1109/FSKD.2010.5569827
- Lopez-Gazpio I, Maritxalar M, Gonzalez-Agirre A, Rigau G, Uria L, Agirre E (2017) Interpretable semantic textual similarity: finding and explaining differences between sentences. Knowl Based Syst 119:186–199. doi: 10.1016/j.knosys.2016.12.013
- Lu Y, Mei Q, Zhai C (2011) Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf Retr J 14(2):178–203. doi: 10.1007/s10791-010-9141-9
- Lu M, Zhao X, Zhang L, Li F (2016) Semi-supervised concept factorization for document clustering. Inf Sci 331:86–98. doi: 10.1016/j.ins.2015.10.038
- Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359
- Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
- Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space, pp 1–12. arXiv preprint arXiv:1301.3781
- Neubig G, Watanabe T (2016) Optimization for statistical machine translation: a survey. Comput Linguist 42(1):1–54. doi: 10.1162/COLI_a_00241
- Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 427–436, http://arxiv.org/abs/1412.1897
- Pessiot J, Kim Y, Amini M, Gallinari P (2010) Improving document clustering in a learned concept space. Inf Process Manag 46(2):180–192. doi: 10.1016/j.ipm.2009.09.007
- Phan X, Nguyen C, Le D, Nguyen L, Horiguchi S, Ha Q (2011) A hidden topic-based framework toward building applications with short web documents. IEEE Trans Knowl Data Eng 23(7):961–976. doi: 10.1109/TKDE.2010.27
- Radim Ř, Petr S (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, pp 45–50
- Ravi D, Bober M, Farinella G, Guarnera M, Battiato S (2016) Semantic segmentation of images exploiting DCT based features and random forest. Pattern Recognit 52:260–273. doi: 10.1016/j.patcog.2015.10.021
- Ren F, Sohrab M (2013) Class-indexing-based term weighting for automatic text classification. Inf Sci 236:109–125. doi: 10.1016/j.ins.2013.02.029
- Rule A, Cointet J, Bearman P (2015) Lexical shifts, substantive changes, and continuity in State of the Union discourse. Proc Natl Acad Sci USA 112(35):10,837–10,844. doi: 10.1073/pnas.1512221112
- Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi: 10.1145/361219.361220
- Tang G, Xia Y, Sun J, Zhang M, Zheng TF (2015) Statistical word sense aware topic models. Soft Comput 19(1):13–27
- Trovati M, Bessis N (2016) An influence assessment method based on co-occurrence for topologically reduced big data sets. Soft Comput 20(5):2021–2030. doi: 10.1007/s00500-015-1621-9
- Vila M, Bardera A, Feixas M, Sbert M (2011) Tsallis mutual information for document classification. Entropy 13(9):1694–1707. doi: 10.3390/e13091694
- Wang H (2015) Study on the application of feature selection for big text data using expected cross entropy. J Inf Comput Sci 12(18):6835–6843. doi: 10.12733/jics20150077
- Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature selection approach based on term frequency for text categorization. Pattern Recognit Lett 45(11):1–10. doi: 10.1016/j.patrec.2014.02.013
- Wang D, Shen H, Truong Y (2016a) Efficient dimension reduction for high-dimensional matrix-valued data. Neurocomputing 190:25–34. doi: 10.1016/j.neucom.2015.12.096
- Wang D, Zhang H, Liu R, Liu X, Wang J (2016b) Unsupervised feature selection through Gram–Schmidt orthogonalization—a word co-occurrence perspective. Neurocomputing 173(P3):845–854. doi: 10.1016/j.neucom.2015.08.038
- Wu Z, Zhu H, Li G, Cui Z, Huang H, Li J, Chen E, Xu G (2017) An efficient Wikipedia semantic matching approach to text document classification. Inf Sci 393:15–28. doi: 10.1016/j.ins.2017.02.009
- Xiao Q, Song R (2017) Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput 21(1):255–265. doi: 10.1007/s00500-016-2059-4
- Xu H, Zhang F, Wang W (2015) Implicit feature identification in Chinese reviews using explicit topic mining model. Knowl Based Syst 76:166–175. doi: 10.1016/j.knosys.2014.12.012
- Yan H, Yang J (2014) Joint Laplacian feature weights learning. Pattern Recognit 47(3):1425–1432. doi: 10.1016/j.patcog.2013.09.038
- Yang Y, Pedersen J (1997) A comparative study on feature selection in text categorization. In: Proceedings of fourteenth international conference on machine learning (ICML), vol 4, pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
- Zheng Y, Han W, Zhu C (2014) A novel feature selection method based on category distribution and phrase attributes. In: International conference on trustworthy computing and services (ISCTCS), Berlin, Heidelberg, pp 25–32. doi: 10.1007/978-3-662-47401-3_4
- Zhou Q, Zhou H, Li T (2016) Cost-sensitive feature selection using random forest: selecting low-cost subsets of informative features. Knowl Based Syst 95:1–11. doi: 10.1016/j.knosys.2015.11.010