Detecting Computer-Generated Text Using Fluency and Noise Features

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 781)

Abstract

Computer-generated text plays a pivotal role in various applications, but its quality is much lower than that of human-written text. The use of artificially generated “machine text” can thus negatively affect practical applications such as website generation and text corpora collection, so a method for distinguishing computer-generated from human-generated text is needed. Previous methods extract fluency features from a limited internal corpus and use them to identify the generated text. We have extended this approach to also estimate fluency using an enormous external corpus. We have also developed a method for extracting and distinguishing the noise characteristically created by a person or a machine. For example, people frequently use spoken noise words (2morrow, wanna, etc.) and misspelled ones (comin, hapy, etc.), while machines frequently generate incorrect expressions (such as untranslated phrases). A method combining these fluency and noise features was evaluated using 1000 original English messages and 1000 artificial English ones translated from Spanish. The results show that this combined method had the highest accuracy (80.35%) and the lowest equal error rate (19.44%), outperforming a state-of-the-art method that uses a syntactic parser. Moreover, experiments using texts in other languages produced similar results, demonstrating that our proposed method works consistently across various languages.
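
As a rough illustration of the noise features described above, the following minimal sketch counts spoken noise words, likely misspellings, and otherwise unknown tokens in a message. It is not the authors' implementation: the toy lexicons `VOCABULARY` and `SPOKEN_WORDS`, the function names, the edit-distance threshold, and the use of "unknown token" as a stand-in for untranslated words are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicons; a real system would load much larger resources.
VOCABULARY = {"i", "am", "coming", "happy", "tomorrow", "want",
              "to", "go", "the", "house", "is", "big"}
SPOKEN_WORDS = {"2morrow", "wanna", "gonna"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def noise_features(message: str, max_distance: int = 1) -> dict:
    """Return the fraction of tokens falling into each assumed noise class."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    counts = {"spoken": 0, "misspelled": 0, "unknown": 0}
    for token in tokens:
        if token in VOCABULARY:
            continue  # well-formed word, not noise
        if token in SPOKEN_WORDS:
            counts["spoken"] += 1
        elif any(edit_distance(token, w) <= max_distance for w in VOCABULARY):
            counts["misspelled"] += 1  # close to a known word
        else:
            counts["unknown"] += 1  # e.g. a word left untranslated
    total = max(len(tokens), 1)
    return {name: value / total for name, value in counts.items()}


if __name__ == "__main__":
    print(noise_features("i wanna go 2morrow"))
    print(noise_features("i am comin hapy"))
    print(noise_features("la casa grande"))
```

In this sketch, the three ratios could be appended to fluency features and passed to any standard classifier. A threshold of one edit keeps short foreign words from being mistaken for misspellings, though a real lexicon would need a more careful rule.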

Keywords

Computer-generated text, Fluency feature, Noise features, Spoken word, Misspelled word, Untranslated word

Notes

Acknowledgments

This work was supported by JSPS KAKENHI Grants (JP16H06302 and JP15H01686).

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. National Institute of Informatics, Tokyo, Japan
  2. Graduate University for Advanced Studies, Hayama, Japan
