Recursive Style Breach Detection with Multifaceted Ensemble Learning

Kopev, Daniel; Zlatkova, Dimitrina; Mitov, Kristiyan; Atanasov, Atanas; Hardalov, Momchil; Koychev, Ivan; Nakov, Preslav

doi:10.1007/978-3-319-99344-7_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11089)

Included in the following conference series:

International Conference on Artificial Intelligence: Methodology, Systems, and Applications

Abstract

We present a supervised approach for style change detection, which aims at predicting whether there are changes in the style in a given text document, as well as at finding the exact positions where such changes occur. In particular, we combine a TF.IDF representation of the document with features specifically engineered for the task, and we make predictions via an ensemble of diverse classifiers including SVM, Random Forest, AdaBoost, MLP, and LightGBM. Whenever the model detects that style change is present, we apply it recursively, looking to find the specific positions of the change. Our approach powered the winning system for the PAN@CLEF 2018 task on Style Change Detection.
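The recursive localization step can be illustrated with a minimal sketch. It assumes details the abstract does not give: the document is already split into sentences, the recursion halves the segment at its midpoint, and the ensemble decides by simple majority vote. The names `majority_vote`, `locate_changes`, and `toy_detector` are hypothetical, and the toy detector (which treats a mix of upper-case and lower-case sentences as a style change) merely stands in for the trained classifiers.

```python
from typing import Callable, List, Sequence

# A detector takes a segment (list of sentences) and says whether it
# contains a style change. In the paper this role is played by trained
# classifiers; here it is any callable (an assumption for illustration).
Detector = Callable[[Sequence[str]], bool]


def majority_vote(detectors: Sequence[Detector], segment: Sequence[str]) -> bool:
    """Return True when more than half of the detectors flag a change."""
    votes = sum(1 for detect in detectors if detect(segment))
    return 2 * votes > len(detectors)


def locate_changes(
    sentences: Sequence[str], has_change: Detector, offset: int = 0
) -> List[int]:
    """Return indices i where a change is estimated between sentence i-1 and i.

    Midpoint halving is an assumed splitting strategy, not the paper's.
    """
    n = len(sentences)
    if n < 2 or not has_change(sentences):
        return []
    if n == 2:
        return [offset + 1]
    mid = n // 2
    left = locate_changes(sentences[:mid], has_change, offset)
    right = locate_changes(sentences[mid:], has_change, offset + mid)
    if not left and not right:
        # Change seen in the whole segment but in neither half:
        # attribute it to the split point itself.
        return [offset + mid]
    return left + right


def toy_detector(segment: Sequence[str]) -> bool:
    """Hypothetical stand-in: a mix of upper- and lower-case sentences."""
    return len({s.isupper() for s in segment}) > 1


doc = ["plain one.", "plain two.", "SHOUTED THREE.", "SHOUTED FOUR.", "SHOUTED FIVE."]
ensemble = lambda seg: majority_vote([toy_detector, toy_detector, lambda s: False], seg)
print(locate_changes(doc, ensemble))  # [2]
```

One limitation of this particular sketch: a change that falls exactly on a split point is missed whenever one of the two halves also contains a change, so a practical system would need a finer-grained search around the boundaries.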

D. Kopev, D. Zlatkova, K. Mitov, and A. Atanasov contributed equally.

Notes

1. http://pan.webis.de/clef18/pan18-web/author-identification.html
2. http://pan.webis.de/clef17/pan17-web/author-identification.html
3. http://semanticsimilarity.files.wordpress.com/2013/08/jim-oshea-fwlist-277.pdf
4. http://www.sequencepublishing.com/1/academic.html
5. http://www.edu.uwo.ca/faculty-profiles/docs/other/webb/essential-word-list.pdf
6. http://norvig.com/google-books-common-words.txt
7. http://github.com/shivam5992/textstat
8. http://pan.webis.de/clef17/pan17-web/author-identification.html
9. In this dataset, style change also means switch of authorship.

Acknowledgements

This work was supported by the Bulgarian National Scientific Fund within project no. DN 12/9, and by the Scientific Fund of Sofia University within project no. 80-10-162/25.04.2018.

Author information

Authors and Affiliations

FMI, Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria
Daniel Kopev, Dimitrina Zlatkova, Kristiyan Mitov, Atanas Atanasov, Momchil Hardalov & Ivan Koychev
Qatar Computing Research Institute, HBKU, Doha, Qatar
Preslav Nakov

Corresponding author

Correspondence to Dimitrina Zlatkova.

Editor information

Editors and Affiliations

Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
Gennady Agre
Universität des Saarlandes, Saarbrücken, Germany
Josef van Genabith
DFKI GmbH, Saarbrücken, Germany
Thierry Declerck

About this paper

Cite this paper

Kopev, D. et al. (2018). Recursive Style Breach Detection with Multifaceted Ensemble Learning. In: Agre, G., van Genabith, J., Declerck, T. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2018. Lecture Notes in Computer Science, vol 11089. Springer, Cham. https://doi.org/10.1007/978-3-319-99344-7_12

DOI: https://doi.org/10.1007/978-3-319-99344-7_12
Published: 29 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99343-0
Online ISBN: 978-3-319-99344-7
eBook Packages: Computer Science, Computer Science (R0)
