Non parametric statistical models for on-line text classification

  • Regular Article
  • Journal: Advances in Data Analysis and Classification

Abstract

Social media, such as blogs and on-line forums, contain a huge amount of information that is typically unorganized and fragmented. An increasingly important problem is the classification of on-line texts in order to detect possible anomalies. For example, on-line texts expressing consumer opinions can be very valuable and profitable for companies, but they can also cause serious damage if they are negative or fake. In this contribution we present a novel statistical methodology, rooted in classical text classification, to address such issues. Several classifiers have been proposed in the literature, among them support vector machines and naive Bayes classifiers. These approaches are not effective when classifying texts written by an unknown author. To address this, we propose a new method based on the combination of classification trees with non parametric approaches, such as the Kruskal–Wallis and Brunner–Dette–Munk tests. The main application of our proposal is the ability to classify an author as a new one, who is potentially trustworthy, or as an old one, who is potentially fake.
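
As a rough illustration of the kind of non parametric test named in the abstract, the sketch below computes the standard Kruskal–Wallis H statistic (midranks for ties, usual tie correction) for one word's relative frequency across texts grouped by author. It is not the authors' procedure: the function names and the toy frequencies are hypothetical, and the way the paper combines such tests with classification trees is not reproduced here.

```python
from itertools import chain


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction.

    `groups` is a list of samples, one per group (here: one per author).
    """
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = midranks(pooled)

    # Sum of (rank sum)^2 / group size over all groups.
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1.0 - ties / (n ** 3 - n)
    # If every observation is identical the statistic is undefined (0/0);
    # returning 0.0 in that case is a convention chosen for this sketch.
    return h / correction if correction > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relative frequencies of one word in texts by three authors.
    freq_by_author = [
        [0.010, 0.012, 0.011, 0.013],  # author A
        [0.020, 0.019, 0.023],         # author B
        [0.011, 0.010, 0.014, 0.012],  # author C
    ]
    print("H =", kruskal_wallis_h(freq_by_author))
```

A large H relative to a chi-square distribution with (number of groups − 1) degrees of freedom suggests that the word's frequency differs across authors, which is one plausible way such a test could be used to screen discriminating words.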

Notes

  1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

Acknowledgments

The author thanks the European Union for funding within the MUSING project (FP6/027097). This paper is the result of close collaboration between the authors; however, it was written by Paola Cerchiello under the supervision of Professor Paolo Giudici.

Author information

Corresponding author

Correspondence to Paola Cerchiello.

About this article

Cite this article

Cerchiello, P., Giudici, P. Non parametric statistical models for on-line text classification. Adv Data Anal Classif 6, 277–288 (2012). https://doi.org/10.1007/s11634-012-0122-2

  • DOI: https://doi.org/10.1007/s11634-012-0122-2
