Volume 115, Issue 2, pp 1071–1085

Writing styles in different scientific disciplines: a data science approach

  • Amnah Alluqmani
  • Lior Shamir


We quantified several elements that reflect the writing styles of scientific papers in four related disciplines: physics, astrophysics, mathematics, and computer science. Text descriptors such as the use of punctuation characters, upper case letters, quotations, and other descriptors that are not based on the words used in the papers were extracted from each document. Based on these features alone, an automatic classifier was able to identify the discipline of a paper with accuracy much higher than mere chance, showing that disciplines can be differentiated by their writing styles without directly using their content as reflected by the common words used in the papers. The study showed statistically significant differences between the disciplines in features such as the use of acronyms, sentence length, word length, and more. Our findings also show changes in writing styles in specific disciplines over time. For instance, mathematicians and computer scientists began to use fewer acronyms starting in 2006, and there is a dramatic decrease in the average number of punctuation characters in mathematics papers. These observations suggest that even in closely related disciplines there are differences in scientific communication expressed through writing styles, demonstrating the existence of a “signature” writing style developed in each discipline. These findings should also be taken into account when a multidisciplinary group of collaborators assigns writing duties on a joint scientific manuscript.
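As an illustration only, the kind of content-independent descriptors mentioned above (punctuation, upper case letters, quotations, acronyms, word length, sentence length) could be computed along the following lines. This is a minimal sketch and not the authors' implementation: the function name `style_features`, the exact feature definitions, the naive sentence splitter, and the rule that any all-uppercase word of two or more letters counts as an acronym are all assumptions made for the example.

```python
import re
import string


def style_features(text: str) -> dict:
    """Compute a few style descriptors that ignore which words are used.

    Hypothetical helper for illustration; the feature set is an assumption.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    # (Assumption: a real pipeline would likely use a proper sentence tokenizer.)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of ASCII letters; digits and symbols are ignored.
    words = re.findall(r"[A-Za-z]+", text)

    # Guard against division by zero on empty input.
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    punctuation = sum(ch in string.punctuation for ch in text)
    upper = sum(ch.isupper() for ch in text)
    quotes = sum(ch in "\"“”" for ch in text)
    # Assumption: an all-uppercase word of length >= 2 is treated as an acronym.
    acronyms = sum(len(w) >= 2 and w.isupper() for w in words)

    return {
        "punctuation_per_char": punctuation / n_chars,
        "uppercase_per_char": upper / n_chars,
        "quote_marks": quotes,
        "acronyms_per_word": acronyms / n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / n_sentences,
    }


if __name__ == "__main__":
    sample = 'The CPU is fast. NASA said "go" today!'
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind, one per paper, could then be passed to any standard classifier to test whether the discipline is predictable from style alone.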


Scientific writing, Scientific communication, Text analysis, Data science



The study was funded in part by NSF Grant IIS-1546079 and HHMI Grant 52008705. We would like to thank the anonymous reviewers for their insightful comments, which helped to improve the paper.



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2018

Authors and Affiliations

  1. Lawrence Technological University, Southfield, USA
