Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification

Abstract

Latent semantic indexing (LSI) is known to take advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". These findings inspired a previously introduced Bayesian classification framework, Higher-Order Naive Bayes (HONB), which explicitly makes use of these higher-order relations. In this paper, we present a novel semantic smoothing method for the Naive Bayes algorithm named Higher-Order Smoothing (HOS). HOS builds on a graph-based data representation similar to that of HONB, which allows the semantics in higher-order paths to be exploited. HOS takes the concept one step further by also exploiting the relationships between instances of different classes. As a result, we move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. This approach improves parameter estimation when labeled data is insufficient. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
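One way to picture the higher-order paths mentioned in the abstract is the minimal Python sketch below. It is not the authors' HOS implementation: the toy documents and the helper names `cooccurrence` and `second_order_pairs` are assumptions made purely for illustration. The sketch treats two terms as second-order related when each co-occurs with a common intermediate term, and the two co-occurrences happen in different documents.

```python
from itertools import combinations


def cooccurrence(docs):
    """Map each unordered term pair to the set of document ids containing both terms."""
    pairs = {}
    for doc_id, terms in enumerate(docs):
        for a, b in combinations(sorted(set(terms)), 2):
            pairs.setdefault((a, b), set()).add(doc_id)
    return pairs


def second_order_pairs(docs):
    """Return term pairs connected through an intermediate term via two distinct documents."""
    first = cooccurrence(docs)

    # Build an adjacency view: term -> {neighbouring term -> documents they share}.
    neighbors = {}
    for (a, b), ids in first.items():
        neighbors.setdefault(a, {})[b] = ids
        neighbors.setdefault(b, {})[a] = ids

    result = set()
    for linked in neighbors.values():
        for a, b in combinations(sorted(linked), 2):
            # The two links of the path must come from different documents.
            if any(d1 != d2 for d1 in linked[a] for d2 in linked[b]):
                result.add((a, b))
    return result


# Hypothetical toy corpus: "bayes" and "classification" never share a document,
# but both co-occur with "smoothing".
docs = [["bayes", "smoothing"], ["smoothing", "classification"]]
first = cooccurrence(docs)
second = second_order_pairs(docs)
print(second)
```

In this toy corpus the pair ("bayes", "classification") has no first-order co-occurrence, yet it is reachable through "smoothing". A smoothing method could use evidence of this kind when direct counts are sparse, though how HOS weights such paths is not specified here.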

Author information

Corresponding author

Correspondence to Murat Can Ganiz.

Additional information

This work was supported in part by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 111E239. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TÜBİTAK.

A preliminary version of this paper was published in the Proceedings of ICDM 2012.

About this article

Cite this article

Poyraz, M., Kilimci, Z.H. & Ganiz, M.C. Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification. J. Comput. Sci. Technol. 29, 376–391 (2014). https://doi.org/10.1007/s11390-014-1437-6

  • DOI: https://doi.org/10.1007/s11390-014-1437-6

Keywords

  • Naive Bayes
  • semantic smoothing
  • higher-order Naive Bayes
  • higher-order smoothing
  • text classification