World Wide Web, Volume 19, Issue 5, pp 887–920

Graph vs. bag representation models for the topic classification of web documents

  • George Papadakis
  • George Giannakopoulos
  • Georgios Paliouras

Abstract

Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
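The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the n-gram size, the co-occurrence window, the averaging used to merge document graphs into a class graph, and the containment-style similarity are all assumptions chosen for illustration, and the function names (`ngram_graph`, `merge`, `containment`, `features`) are hypothetical. The point it shows is the one made above: each document is described by one similarity value per class, so the feature count depends on the number of classes rather than the vocabulary size.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Character n-gram graph as a mapping from undirected edge to weight.

    Nodes are character n-grams; two n-grams are linked when they start
    within `window` positions of each other. The weight counts how often
    that happens (n and window are illustrative assumptions).
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((g, h)))] += 1
    return edges


def merge(graphs):
    """Class graph: average each edge weight over the document graphs.

    Plain averaging is an assumption made here for simplicity.
    """
    total = Counter()
    for g in graphs:
        total.update(g)
    return {edge: w / len(graphs) for edge, w in total.items()}


def containment(doc_graph, class_graph):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)


def features(text, class_graphs):
    """One similarity value per class, in sorted class-name order."""
    g = ngram_graph(text)
    return [containment(g, class_graphs[c]) for c in sorted(class_graphs)]
```

Under these assumptions, a feature vector produced by `features` could be passed to any standard classifier; the toy similarity used here ignores edge weights, which a fuller treatment would take into account.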

Keywords

Text classification, N-gram graphs, Web document types

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • George Papadakis (1)
  • George Giannakopoulos (2)
  • Georgios Paliouras (2)
  1. Department of Informatics and Telecommunications, University of Athens, Athens, Greece
  2. National Center for Scientific Research “Demokritos”, Patriarchou Grigoriou 27, Attica, Greece
