Filtering artificial texts with statistical machine learning techniques

Lavergne, Thomas; Urvoy, Tanguy; Yvon, François

doi:10.1007/s10579-009-9113-0

Published: 16 January 2010

Language Resources and Evaluation
Volume 45, pages 25–43 (2011)

Thomas Lavergne¹,², Tanguy Urvoy¹ & François Yvon³

Abstract

Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated for the purpose of nourishing fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using automatically generated texts renders such sites harder to detect than using copies of existing pages. In this paper, we present three methods aimed at distinguishing natural texts from artificially generated ones: the first method uses basic lexicometric features, the second uses standard language models, and the third is based on a relative entropy measure that captures short-range dependencies between words. Our experiments show that lexicometric features and language models are effective at detecting most generated texts, but fail to detect texts that are generated with high-order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, allows us to detect these “hard” text generators with a high degree of accuracy.
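
To make the second method more concrete, the following is a minimal sketch of how a language model can score a text by perplexity. It is not the authors' implementation: the bigram order, the add-one smoothing, and the function names `train_bigram_model` and `perplexity` are assumptions chosen for brevity, and the abstract does not say which model or toolkit settings the experiments used.

```python
import math
from collections import Counter


def train_bigram_model(tokens):
    """Collect the counts needed by an add-one smoothed bigram model (illustrative)."""
    context_counts = Counter(tokens[:-1])             # times each word appears as a left context
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # times each adjacent word pair appears
    vocab_size = len(set(tokens)) + 1                 # +1 reserves probability mass for unseen words
    return context_counts, bigram_counts, vocab_size


def perplexity(tokens, model):
    """Perplexity of a token sequence under the smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a bigram")
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one (Laplace) estimate of P(word | prev); Counter returns 0 for unseen keys.
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tokens) - 1))
```

Under such a scheme, a text whose perplexity is unusually high relative to natural text becomes a candidate for filtering. As the abstract notes, this kind of score fails on texts generated with high-order Markov models, which is what motivates the relative entropy approach.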

Notes

The “Really Simple Site Generator Modified” (RSSGM) is a good example of a freely available web scraping tool that combines text patchworks with Markovian random text generators.
See the web site http://rss2spam.com.
A hapax is a type that occurs only once in a given text.
Note that this word is not necessarily the same as \(\operatorname*{argmax}_{v} P(v \mid h)\).
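
The definition of a hapax given in the notes above can be illustrated with a short sketch. The helper names `hapaxes` and `hapax_ratio` are hypothetical, the whitespace tokenization in the usage line is a simplification, and the ratio shown is only one plausible lexicometric feature, not necessarily the one used in the paper.

```python
from collections import Counter


def hapaxes(tokens):
    """Return the sorted list of types that occur exactly once in the text."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c == 1)


def hapax_ratio(tokens):
    """Share of distinct types that are hapaxes (0.0 for an empty text)."""
    n_types = len(set(tokens))
    if n_types == 0:
        return 0.0
    return len(hapaxes(tokens)) / n_types
```

For example, `hapaxes("the cat sat on the mat".split())` keeps every word except "the", which occurs twice.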

Author information

Authors and Affiliations

Orange Labs, Lannion, France
Thomas Lavergne & Tanguy Urvoy
Telecom ParisTech, Paris, France
Thomas Lavergne
Univ Paris Sud 11 & LIMSI/CNRS, Orsay cedex, France
François Yvon

Corresponding author

Correspondence to François Yvon.

Additional information

Work supported by the MADSPAM 2.0 ANR project.

About this article

Cite this article

Lavergne, T., Urvoy, T. & Yvon, F. Filtering artificial texts with statistical machine learning techniques. Lang Resources & Evaluation 45, 25–43 (2011). https://doi.org/10.1007/s10579-009-9113-0

Issue Date: March 2011
DOI: https://doi.org/10.1007/s10579-009-9113-0
