A Comparative Study of Language Models for Book and Author Recognition

Uzuner, Özlem; Katz, Boris

doi:10.1007/11562214_84


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing


Abstract

Linguistic information can help improve evaluation of similarity between documents; however, the kind of linguistic information to be used depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and accurately identify individual books more than 76% of the time. In comparison, baseline features, e.g., tfidf-weighted keywords, function words, etc., give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task; syntactic structures vary even among the works of the same author whereas features such as function words are distributed more similarly among the works of an author and can more effectively capture authorship.
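To make the function-word baseline mentioned above concrete, here is a minimal sketch of how two documents might be compared by their function-word frequencies. It is an illustration under stated assumptions, not the authors' method: the ten-word `FUNCTION_WORDS` list, the regular-expression tokenizer, and the use of cosine similarity are all hypothetical choices, since the abstract does not specify the feature set or the classifier.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list for illustration only; the paper's
# actual list is not given in the abstract.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the relative frequency of each word in FUNCTION_WORDS within text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc_a = "The cat sat in the garden, and it watched the birds."
    doc_b = "A dog ran to the gate, but it stopped in the yard."
    sim = cosine_similarity(function_word_profile(doc_a), function_word_profile(doc_b))
    print(f"function-word similarity: {sim:.3f}")
```

A profile built this way ignores topic words entirely, which is consistent with the abstract's observation that function words are distributed similarly across the works of one author; a syntactic-structure feature set would need a parser and is not sketched here.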



Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, 02139
Özlem Uzuner & Boris Katz


Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong


About this paper

Cite this paper

Uzuner, Ö., Katz, B. (2005). A Comparative Study of Language Models for Book and Author Recognition. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_84


DOI: https://doi.org/10.1007/11562214_84
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science (R0)
