Non parametric statistical models for on-line text classification

Cerchiello, Paola; Giudici, Paolo

doi:10.1007/s11634-012-0122-2

Regular Article
Published: 13 October 2012

Volume 6, pages 277–288 (2012)

Advances in Data Analysis and Classification

Abstract

Social media, such as blogs and on-line forums, contain a huge amount of information that is typically unorganized and fragmented. An increasingly important issue is the classification of on-line texts in order to detect possible anomalies. For example, on-line texts representing consumer opinions can be very valuable and profitable for companies, but they can also cause serious damage if they are negative or fake. In this contribution we present a novel statistical methodology, rooted in classical text classification, to address such issues. Several classifiers have been proposed in the literature, among them support vector machines and naive Bayes classifiers. These approaches are not effective when the task is to classify texts written by an unknown author. To this end, we propose a new method based on the combination of classification trees with non parametric approaches, such as the Kruskal–Wallis and Brunner–Dette–Munk tests. The main application of our proposal is the ability to classify an author as a new one, who is potentially trustworthy, or as an old one, who is potentially fake.
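The abstract names the Kruskal–Wallis test but does not spell out how it enters the method. As a rough illustration only, the sketch below computes the standard Kruskal–Wallis H statistic (with mid-ranks and the usual tie correction) for one feature measured across several groups of texts. The function names, the idea of using per-text word frequencies as the feature, and the sample numbers are all assumptions made for this example, not the authors' procedure.

```python
from itertools import chain


def midranks(values):
    """Return 1-based ranks of `values`, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = midranks(pooled)

    # Sum of (rank sum)^2 / group size, walking the pooled ranks group by group.
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


# Hypothetical data: relative frequency (%) of one word in texts by three authors.
author_a = [0.8, 1.1, 0.9, 1.3]
author_b = [2.1, 1.9, 2.4, 2.2]
author_c = [1.0, 1.2, 0.7, 1.4]

h = kruskal_wallis_h([author_a, author_b, author_c])
# With 3 groups, H is compared with a chi-square distribution on 2 degrees of
# freedom; 5.991 is its 5% critical value (a large-sample approximation).
print("H =", round(h, 3), "| differs across authors at 5%:", h > 5.991)
```

A feature whose distribution differs strongly across known authors yields a large H, which is one plausible way such a test could rank candidate features before a classification tree is grown.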

Notes

http://www.daviddlewis.com/resources/testcollections/reuters21578/.

References

Andrews FC (1954) Asymptotic behavior of some rank tests for analysis of variance. Ann Math Stat 25(4):724–736
Baker LD, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of SIGIR-98, 21st ACM international conference on research and development in information retrieval (Melbourne), pp 96–103
Benzécri J (1973) L’analyse des données. Dunod, Paris
Boullé M (2009) Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach. Adv Data Anal Classif 3(1):39–61
Breiman L, Friedman JH, Olshen R, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Brunner E, Dette H, Munk A (1997) Box-type approximations in nonparametric factorial designs. J Am Statist Assoc 92:1494–1502
Cerchiello P (2011) Statistical models to measure corporate reputation. J Appl Quant Method 6(4):58–71
Conover WJ (1971) Practical nonparametric statistics. Wiley, New York
Dagan I, Karov Y, Roth D (1997) Mistake driven learning in text categorization. In: Proceedings of EMNLP-97, second conference on empirical methods in natural language processing, Providence, pp 55–63
Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1306
Frame S, Jammalamadaka S (2007) Generalized mixture models, semi-supervised learning, and unknown class inference. Adv Data Anal Classif 1(1):23–38
Greenacre M (2007) Correspondence analysis in practice, 2nd edn. Chapman and Hall/CRC, London
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(3):1157–1182
Jindal N, Liu B (2008) Opinion spam and analysis. In: Proceedings WSDM-08, USA. doi:10.1145/1341531.1341560
Jindal N, Liu B, Lim EP (2010) Finding unusual review patterns using unexpected rules. In: Proceedings ACM-10, Canada. doi:10.1145/1871437.1871669
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, Germany, pp 137–142
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions 2, 2nd edn. Wiley, New York
Kass GV (1980) An exploratory technique for investigating large quantities of categorical data. Appl Stat 29(2):119–127
Kim YH, Hahn SY, Zhang BT (2000) Text filtering by boosting naive Bayes classifiers. In: Proceedings of SIGIR-00, Greece. doi:10.1145/345508.345572
Le Thi H, Le H, Nguyen V, Pham Dinh T (2008) A DC programming approach for feature selection in support vector machines learning. Adv Data Anal Classif 2(3):259–278
Najork M (2009) Web spam detection. In: Encyclopedia of database systems. Springer, Berlin
Rust SW, Fligner MA (1984) A modification of the Kruskal–Wallis statistic for the generalized Behrens–Fisher problem. Commun Stat Theor Meth 13(16):2013–2027
Siegel S, Castellan NJ Jr (1988) Nonparametric statistics for the behavioral sciences, 2nd edn. McGraw-Hill, London
Stoppiglia H, Dreyfus G, Dubois R, Oussar Y (2003) Ranking a random feature for variable and feature selection. J Mach Learn Res 3:1399–1414
Wilcox RR (2005) Introduction to robust estimation and hypothesis testing, 2nd edn. Elsevier Academic Press, Burlington

Acknowledgments

The author thanks the European Union for funding within the MUSING project (FP6/027097). This paper is the result of close collaboration between the authors; however, it was written by Paola Cerchiello under the supervision of Professor Paolo Giudici.

Author information

Authors and Affiliations

University of Pavia, Pavia, Italy
Paola Cerchiello & Paolo Giudici

Corresponding author

Correspondence to Paola Cerchiello.

About this article

Cite this article

Cerchiello, P., Giudici, P. Non parametric statistical models for on-line text classification. Adv Data Anal Classif 6, 277–288 (2012). https://doi.org/10.1007/s11634-012-0122-2

Received: 29 December 2011
Revised: 28 June 2012
Accepted: 01 August 2012
Published: 13 October 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s11634-012-0122-2
