On sentence length distribution as an authorship attribute

  • Conference paper
Information Science and Applications

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 339)

Abstract

Understanding what makes a written text sound like the work of its author has been an unsolved problem for hundreds of years. Attributes of authorship are often lumped together in an attempt to identify an unknown author, while the practice of investigating a single attribute in isolation, by eliminating the effect of all the others, has received little attention. One of the debated attributes is the size of the text segments into which authors group words. Texts consist of these segments (sentences), which vary in length, and the lengths are distributed in ways that are assumed to be characteristic of the author. By comparing the statistics of paired text samples, we can show that differences in the statistics do indicate a difference in the authorship of the texts. However, certain choices of metrics and units easily lead to random and meaningless results.
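
As a rough illustration of what comparing sentence length statistics of paired samples can look like, the following Python sketch measures sentence lengths in two texts and computes the largest gap between their empirical distributions (the two-sample Kolmogorov-Smirnov statistic). The abstract does not state which metric, unit, or sentence splitter the paper uses, so everything here is an assumption: the naive split on ".", "!" and "?", the `unit` parameter, and the function names `sentence_lengths` and `ks_statistic` are hypothetical and chosen only for illustration.

```python
import re
from bisect import bisect_right


def sentence_lengths(text, unit="words"):
    """Return the length of each sentence in `text`.

    Assumption: sentences end at '.', '!' or '?'. Real texts need a
    more careful splitter (abbreviations, quotations, and so on).
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if unit == "words":
        return [len(s.split()) for s in sentences]
    # Any other unit value falls back to counting characters.
    return [len(s) for s in sentences]


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Two toy samples; real comparisons need far more sentences per sample.
sample_a = "Short one. Another short one. Tiny."
sample_b = "This sentence is considerably longer than the others here. So is this one, more or less."
distance = ks_statistic(sentence_lengths(sample_a), sentence_lengths(sample_b))
```

Switching `unit` from words to characters changes the length values and can change the resulting distance, which is one simple way to see how the choice of unit influences the outcome.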

Author information

Corresponding author

Correspondence to Miro Lehtonen.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lehtonen, M. (2015). On sentence length distribution as an authorship attribute. In: Kim, K. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46578-3_96

  • DOI: https://doi.org/10.1007/978-3-662-46578-3_96

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46577-6

  • Online ISBN: 978-3-662-46578-3

  • eBook Packages: Engineering (R0)
