Writing styles in different scientific disciplines: a data science approach
We quantified several elements that reflect the writing styles of scientific papers in four related disciplines: physics, astrophysics, mathematics, and computer science. Text descriptors such as the use of punctuation characters, upper case letters, and quotations, along with other descriptors that are not based on the words used in the papers, were extracted from each document. Based on these features alone, an automatic classifier was able to identify the discipline of a paper with accuracy much higher than mere chance. This shows that disciplines can be differentiated by their writing styles, without directly using their content as reflected by the common words used in the papers. The study showed statistically significant differences between the disciplines in features such as the use of acronyms, sentence length, word length, and more. Our findings also show changes in writing styles in specific disciplines over time. For instance, mathematicians and computer scientists began to use fewer acronyms starting in 2006, and there is a dramatic decrease in the average number of punctuation characters in mathematics papers. These observations suggest that even closely related disciplines differ in how scientific communication is expressed through writing style, demonstrating the existence of a “signature” writing style developed in each discipline. These findings should also be taken into account when a multidisciplinary group of collaborators assigns writing duties on a joint scientific manuscript.
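As an illustration of what content-independent descriptors of this kind might look like, the following is a minimal Python sketch. It is not the feature set or code used in the study: the function name `style_features`, the feature names, the naive sentence splitter, and the definition of an acronym as an all-uppercase word of two or more letters are all assumptions made for this example.

```python
import re
import string


def style_features(text: str) -> dict:
    """Compute a few style descriptors that ignore which words are used.

    Hypothetical helper: the feature definitions are assumptions,
    not the descriptors defined in the paper.
    """
    # Words are runs of ASCII letters; digits and symbols are ignored.
    words = re.findall(r"[A-Za-z]+", text)
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]

    # Guard against division by zero on empty input.
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    return {
        "punct_per_char": sum(c in string.punctuation for c in text) / n_chars,
        "upper_per_char": sum(c.isupper() for c in text) / n_chars,
        # Straight and curly double quotation marks.
        "quotes_per_char": sum(c in "\"\u201c\u201d" for c in text) / n_chars,
        # Assumed acronym rule: all-uppercase word with at least two letters.
        "acronyms_per_word": sum(len(w) >= 2 and w.isupper() for w in words) / n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / n_sentences,
    }


if __name__ == "__main__":
    sample = 'The NSF funded it. We used "LSTM" models!'
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.4f}")
```

Feature vectors of this form, computed per document, could then be passed to any standard classifier to test whether the discipline is predictable from style alone.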
Keywords: Scientific writing, Scientific communication, Text analysis, Data science
The study was funded in part by NSF Grant IIS-1546079 and HHMI Grant 52008705. We would like to thank the anonymous reviewers for their insightful comments, which helped to improve the paper.