Abstract
Social media, such as blogs and on-line forums, contain a huge amount of information that is typically unorganized and fragmented. An increasingly important issue is the classification of on-line texts in order to detect possible anomalies. For example, on-line texts expressing consumer opinions can be very valuable and profitable for companies, but they can also cause serious damage if they are negative or fake. In this contribution we present a novel statistical methodology, rooted in classical text classification, to address such issues. Several classifiers have been proposed in the literature, among them support vector machines and naive Bayes classifiers. These approaches are not effective when the problem is to classify texts belonging to an unknown author. To this end, we propose a new method based on the combination of classification trees with non parametric approaches, such as the Kruskal–Wallis and Brunner–Dette–Munk tests. The main application of our proposal is the ability to classify an author as a new one, who is potentially trustworthy, or as an old one, who is potentially fake.
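As a rough illustration of the kind of non parametric screening the abstract refers to, the following sketch computes the Kruskal–Wallis H statistic for term frequencies grouped by author and ranks the terms by it. This is not the authors' procedure: the function names, the term-frequency data and the idea of ranking terms by H are assumptions made only for this example, and the combination with classification trees and the Brunner–Dette–Munk test is not shown.

```python
from itertools import chain


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, with the usual tie correction."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = midranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


# Hypothetical data: occurrences of each term in three posts by each of
# three known authors (one inner list per author).
term_counts = {
    "great": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "the": [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
}

# Terms whose usage differs most across authors get the largest H.
h_by_term = {term: kruskal_wallis_h(g) for term, g in term_counts.items()}
ranked_terms = sorted(h_by_term, key=h_by_term.get, reverse=True)
print(ranked_terms, h_by_term)
```

In this toy data the term "great" separates the three authors completely, so it receives a much larger H than "the", whose counts are spread evenly across authors. A term with a large H would be a natural candidate splitting variable, under the assumption that author-discriminating terms are what a tree should split on.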
Acknowledgments
The author thanks the European Union for funding within the MUSING project (FP6/027097). This paper is the result of close collaboration between the authors; however, it was written by Paola Cerchiello under the supervision of Professor Paolo Giudici.
Cite this article
Cerchiello, P., Giudici, P. Non parametric statistical models for on-line text classification. Adv Data Anal Classif 6, 277–288 (2012). https://doi.org/10.1007/s11634-012-0122-2
DOI: https://doi.org/10.1007/s11634-012-0122-2
Keywords
- Non parametric statistical models
- Kruskal–Wallis test
- Brunner–Dette–Munk test
- Text analysis
- Opinion spam detection