A Comparative Study of Language Models for Book and Author Recognition

  • Conference paper
Natural Language Processing – IJCNLP 2005 (IJCNLP 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Abstract

Linguistic information can help improve the evaluation of similarity between documents; however, the kind of linguistic information to use depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and correctly identify individual books more than 76% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task. Syntactic structures vary even among the works of the same author, whereas features such as function words are distributed more similarly among the works of an author and can capture authorship more effectively.
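
As a rough illustration of the function-word baseline mentioned in the abstract, the sketch below turns a text into a vector of relative function-word frequencies and compares two such vectors by cosine similarity. It is not the authors' implementation: the word list `FUNCTION_WORDS`, the tokenizer, the helper names `function_word_profile` and `cosine`, and the choice of cosine similarity are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, very small function-word list chosen for illustration only;
# the list used in the paper is not given in this abstract.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the relative frequency of each function word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, two texts whose function-word profiles have a cosine close to 1 use function words in similar proportions, which is the kind of signal the abstract reports as more stable across the works of one author than syntactic structure distributions.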

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Uzuner, Ö., Katz, B. (2005). A Comparative Study of Language Models for Book and Author Recognition. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_84

  • DOI: https://doi.org/10.1007/11562214_84

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29172-5

  • Online ISBN: 978-3-540-31724-1

  • eBook Packages: Computer Science, Computer Science (R0)
