Recursive Style Breach Detection with Multifaceted Ensemble Learning

  • Conference paper
Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2018)

Abstract

We present a supervised approach for style change detection, which aims at predicting whether there are changes in the style in a given text document, as well as at finding the exact positions where such changes occur. In particular, we combine a TF.IDF representation of the document with features specifically engineered for the task, and we make predictions via an ensemble of diverse classifiers including SVM, Random Forest, AdaBoost, MLP, and LightGBM. Whenever the model detects that style change is present, we apply it recursively, looking to find the specific positions of the change. Our approach powered the winning system for the PAN@CLEF 2018 task on Style Change Detection.
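The recursive step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three toy detectors, the majority vote, and the halving strategy in the hypothetical `find_breaches` function are all assumptions made for illustration, standing in for the trained ensemble (SVM, Random Forest, AdaBoost, MLP, LightGBM) and for whatever splitting scheme the paper actually uses.

```python
from typing import Callable, List, Sequence

# A detector answers one question: does this run of segments contain a style change?
Detector = Callable[[Sequence[str]], bool]


def _differs(values) -> bool:
    """True if the values are not all identical."""
    return len(set(values)) > 1


# Toy stand-ins for trained classifiers (assumption: real models would use
# TF.IDF and engineered stylistic features instead of these heuristics).
def punct_detector(segments: Sequence[str]) -> bool:
    return _differs(s.rstrip().endswith("!") for s in segments)


def case_detector(segments: Sequence[str]) -> bool:
    return _differs(s == s.lower() for s in segments)


def word_length_detector(segments: Sequence[str], threshold: float = 3.0) -> bool:
    means = [sum(len(w) for w in s.split()) / max(len(s.split()), 1) for s in segments]
    return max(means) - min(means) > threshold


def majority_vote(detectors: List[Detector]) -> Detector:
    """Combine binary detectors by simple majority; a tie counts as no change."""
    def combined(segments: Sequence[str]) -> bool:
        votes = sum(1 for d in detectors if d(segments))
        return votes * 2 > len(detectors)
    return combined


def find_breaches(segments: Sequence[str], has_change: Detector, offset: int = 0) -> List[int]:
    """Return each index i where a change is detected between segments i-1 and i.

    If the detector fires on the whole span, the span is halved and the
    detector is re-applied to each half and to the pair straddling the split.
    """
    if len(segments) < 2 or not has_change(segments):
        return []
    if len(segments) == 2:
        return [offset + 1]
    mid = len(segments) // 2
    left = find_breaches(segments[:mid], has_change, offset)
    boundary = [offset + mid] if has_change(segments[mid - 1:mid + 1]) else []
    right = find_breaches(segments[mid:], has_change, offset + mid)
    return left + boundary + right


ensemble = majority_vote([punct_detector, case_detector, word_length_detector])
doc = [
    "the cat sat on the mat.",
    "a dog ran to the park.",
    "Extraordinary circumstances necessitate deliberation!",
    "Considerable evidence substantiates everything!",
]
print(find_breaches(doc, ensemble))
```

Here the document is assumed to be pre-split into sentences or paragraphs, and a breach position is reported as the index of the first segment written in the new style. Halving keeps the number of detector calls low when most spans contain no change, at the cost of relying on the detector being accurate on short spans.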

D. Kopev, D. Zlatkova, K. Mitov and A. Atanasov—Equal Contribution.

Notes

  1. http://pan.webis.de/clef18/pan18-web/author-identification.html

  2. http://pan.webis.de/clef17/pan17-web/author-identification.html

  3. http://semanticsimilarity.files.wordpress.com/2013/08/jim-oshea-fwlist-277.pdf

  4. http://www.sequencepublishing.com/1/academic.html

  5. http://www.edu.uwo.ca/faculty-profiles/docs/other/webb/essential-word-list.pdf

  6. http://norvig.com/google-books-common-words.txt

  7. http://github.com/shivam5992/textstat

  8. http://pan.webis.de/clef17/pan17-web/author-identification.html

  9. In this dataset, style change also means switch of authorship.

Acknowledgements

This work was supported by the Bulgarian National Scientific Fund within the project no. DN 12/9, and by the Scientific Fund of the Sofia University within project no. 80-10-162/25.04.2018.

Author information

Corresponding author

Correspondence to Dimitrina Zlatkova.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Kopev, D. et al. (2018). Recursive Style Breach Detection with Multifaceted Ensemble Learning. In: Agre, G., van Genabith, J., Declerck, T. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2018. Lecture Notes in Computer Science, vol 11089. Springer, Cham. https://doi.org/10.1007/978-3-319-99344-7_12

  • DOI: https://doi.org/10.1007/978-3-319-99344-7_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99343-0

  • Online ISBN: 978-3-319-99344-7

  • eBook Packages: Computer Science, Computer Science (R0)
