Filtering artificial texts with statistical machine learning techniques

Language Resources and Evaluation

Abstract

Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated to feed fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using automatically generated texts renders such sites harder to detect than using copies of existing pages. In this paper, we present three methods for distinguishing natural texts from artificially generated ones: the first method uses basic lexicometric features, the second uses standard language models, and the third is based on a relative entropy measure that captures short-range dependencies between words. Our experiments show that lexicometric features and language models are effective at detecting most generated texts, but fail to detect texts generated with high-order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, allows us to detect these “hard” text generators with a high degree of accuracy.

Notes

  1. The “Really Simple Site Generator Modified” (RSSGM) is a good example of a freely available web scraping tool that combines text patchworks with Markovian random text generators.

  2. See web site http://rss2spam.com.

  3. A hapax is a type which occurs only once in a given text.

  4. Note that this word is not necessarily the same as \(\operatorname*{argmax}_{v} P(v \mid h)\).
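
As an illustration of note 3, the proportion of hapaxes in a text can be computed in a few lines. The sketch below is a minimal Python example, not the implementation used in the paper: the function name `hapax_ratio`, the regex tokenizer, the lowercasing, and the normalization by the number of types are all assumptions made for illustration.

```python
import re
from collections import Counter


def hapax_ratio(text: str) -> float:
    """Return the fraction of word types that occur exactly once in `text`.

    Hypothetical helper: tokenization (lowercased runs of word characters)
    and normalization by the number of types are illustrative choices.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)  # type -> number of occurrences
    if not counts:
        return 0.0  # empty text: no types, so no hapaxes
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)
```

For the text "the cat saw the dog", three of the four types (cat, saw, dog) occur only once, so the function returns 0.75.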

Author information

Corresponding author

Correspondence to François Yvon.

Additional information

Work supported by the MADSPAM 2.0 ANR project.

About this article

Cite this article

Lavergne, T., Urvoy, T. & Yvon, F. Filtering artificial texts with statistical machine learning techniques. Lang Resources & Evaluation 45, 25–43 (2011). https://doi.org/10.1007/s10579-009-9113-0

  • DOI: https://doi.org/10.1007/s10579-009-9113-0