Soft Computing, Volume 22, Issue 18, pp 6047–6065

Wikipedia-based hybrid document representation for textual news classification

  • Marcos Antonio Mouriño-García
  • Roberto Pérez-Rodríguez
  • Luis Anido-Rifón
  • Manuel Vilares-Ferro


The sheer number of news items published every day makes automating their classification a worthwhile task. The common approach is to represent each news item by the frequencies of the words it contains and to train a classifier with supervised learning algorithms. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text should lead to higher performance in automatic classification tasks. In practice, however, the BoW representation has proven remarkably strong for classifying news items, with several studies reporting that it performs above different 'flavours' of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis (ESA), and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more "concept-friendly" Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence.
Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
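The hybrid representation described above can be illustrated with a minimal sketch: a word-frequency feature space is concatenated with a concept feature space, and a classifier is trained on the combined vectors. This is not the authors' implementation (their WikiBoC pipeline extracts concepts from text with Wikipedia Miner); here the per-document concept lists and the example documents are hypothetical stand-ins for that extraction step, and a linear SVM plays the role of the supervised learner.

```python
# Hybrid BoW + bag-of-concepts sketch (assumed pipeline, not the paper's code).
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy documents with pre-computed concept annotations (hypothetical stand-in
# for the Wikipedia-based concept extraction step described in the paper).
docs = ["stocks fall as oil prices rise", "team wins the championship final"]
concepts = [["Stock market", "Petroleum"], ["Championship", "Sport"]]
labels = ["economy", "sport"]

# Bag-of-words features: term frequencies weighted by TF-IDF.
word_vec = TfidfVectorizer()
X_words = word_vec.fit_transform(docs)

# Bag-of-concepts features: a callable analyzer treats each concept as a
# single token, so multiword concepts such as "Stock market" stay intact.
concept_vec = TfidfVectorizer(analyzer=lambda cs: cs)
X_concepts = concept_vec.fit_transform(concepts)

# Hybrid representation: concatenate the two feature spaces column-wise.
X_hybrid = sp.hstack([X_words, X_concepts])

# Train a linear classifier on the enriched vectors.
clf = LinearSVC().fit(X_hybrid, labels)
```

The design choice worth noting is that enrichment happens purely at the feature level: words and concepts live in disjoint columns of one sparse matrix, so any vector-space classifier can consume the hybrid representation unchanged.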


Keywords: News classification · Bag of concepts · Bag of words · Hybrid model · Wikipedia Miner · Document representation



Work supported by the European Regional Development Fund (ERDF) and the Galician Regional Government under the agreement for funding the Atlantic Research Centre for Information and Communication Technologies (AtlantTIC), and by the projects R2014/034 (RedPlir) and R2014/029 (TELGalicia).

Compliance with ethical standards

Conflict of interest

All authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Telematics Engineering, Campus Lagoas-Marcosende, University of Vigo, Vigo, Spain
  2. Department of Computer Science, University of Vigo, Ourense, Spain
