Skip to main content
Log in

Studying the difference between natural and programming language corpora

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

Code corpora, as observed in large software systems, are now known to be far more repetitive and predictable than natural language corpora. But why? Does the difference simply arise from the syntactic limitations of programming languages? Or does it arise from the differences in authoring decisions made by the writers of these natural and programming language texts? We conjecture that the differences are not entirely due to syntax, but also from the fact that reading and writing code is un-natural for humans, and requires substantial mental effort; so, people prefer to write code in ways that are familiar to both reader and writer. To support this argument, we present results from two sets of studies: 1) a first set aimed at attenuating the effects of syntax, and 2) a second, aimed at measuring repetitiveness of text written in other settings (e.g. second language, technical/specialized jargon), which are also effortful to write. We find that this repetition in source code is not entirely the result of grammar constraints, and thus some repetition must result from human choice. While the evidence we find of similar repetitive behavior in technical and learner corpora does not conclusively show that such language is used by humans to mitigate difficulty, it is consistent with that theory. This discovery of “non-syntactic” repetitive behaviour is actionable, and can be leveraged for statistically significant improvements on the code suggestion task. We discuss this finding, and other future implications on practice, and for research.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others

Notes

  1. Precise details on the datasets and language models will be presented later their respective sections.

  2. While this category is very rarely updated, there could be unusual and significant changes in the language – for instance a new preposition or conjunction in English.

  3. Indeed, it is not clear how to even design such an experiment.

  4. Bidirectional LSTMs can make use of context both before and after a token.

  5. A good explanation of the details of LSTM cell structure can be found at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

  6. https://help.github.com/articles/about-stars/

  7. One exception for Java is the Eclipse project, which was not hosted on GitHub, but is selected for significance within the Java community.

  8. http://pygments.org/

  9. http://commoncrawl.org

  10. https://www.ets.org/toeic

  11. https://www.gutenberg.org/

  12. http://uscode.house.gov/download/download.shtml

  13. We add , proc, forall, mdo, family, data, and type.

  14. We add __ENCODING__, __END__, __FILE__, and __LINE__.

  15. We add recur, set!, moniter-enter, moniter-exit, throw, try, catch, finally, and /, along with some operators Pygments had classified as Token.Names.

  16. In particular, we called these primitives types open category to be consistent with how other programming languages like Haskell treat their types.

  17. Additionally, in our experience, tweaking the boundaries of these categories results only in slight changes in repetition.

  18. (e.g. ADVP-TMP reflects that the adverbial phrase serves a temporal function).

  19. https://help.eclipse.org/neon/index.jsp?topic=%2Forg.eclipse.jdt.doc.isv%2Freference%2Fapi%2Forg%2Feclipse%2Fjdt%2Fcore%2Fdom%2FASTParser.html

  20. Indeed, when running LSTMs over just the nonterminals, we see that the Java grammar is more predictable than the English grammar.

  21. The size of the entropy difference between English and Code open category words is less, though still larger than between all tokens

  22. In the simplified parse tree in the case of English.

  23. We note that an independent study on commit message entropy and build failure found similar ranges (Santos and Hindle 2016)

  24. (Allamanis et al. 2017) have extensively surveyed such applications.

  25. See Tobe et al., PNAS, 144(22), May 2017.

  26. This framework can be found at https://github.com/SLP-team/SLP-Core.

References

  • Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. (2016) Tensorflow: a system for large-scale machine learning. In: OSDI, vol 16, pp 265–283

  • Allamanis M, Sutton C (2013) Mining source code repositories at massive scale using language modeling. In: 2013 10th IEEE working conference on mining software repositories (MSR), pp 207–216. https://doi.org/10.1109/MSR.2013.6624029

  • Allamanis M, Sutton C (2014) Mining idioms from source code. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, NY, USA, FSE 2014, pp 472–483. https://doi.org/10.1145/2635868.2635901

  • Allamanis M, Barr ET, Devanbu P, Sutton C (2017) A survey of machine learning for big code and naturalness. arXiv:170906182

  • Andor D, Alberti C, Weiss D, Severyn A, Presta A, Ganchev K, Petrov S, Collins M (2016) Globally normalized transition-based neural networks. arXiv:160306042

  • Bachmann A, Bernstein A (2009) Software process data quality and characteristics: a historical view on open and closed source projects. In: Proceedings of the joint international and annual ERCIM workshops on principles of software evolution (IWPSE) and software evolution (Evol) workshops. ACM, New York, NY, USA, IWPSE-Evol ’09, pp 119–128. https://doi.org/10.1145/1595808.1595830

  • Baxter G, Frean M, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: ACM Sigplan notices. ACM vol 41, pp 397–412

  • Bird S (2006) Nltk: the natural language toolkit. In: Proceedings of the COLING/ACL on interactive presentation sessions. Association for Computational Linguistics, pp 69–72

  • Bradley DC (1978) Computational distinctions of vocabulary type. PhD thesis, Massachusetts Institute of Technology

  • Bright W (2017) Social factors in language change. In: The handbook of sociolinguistics. Wiley-Blackwell, chap 5, pp 81–91. https://doi.org/10.1002/9781405166256.ch5

  • Busjahn T, Bednarik R, Begel A, Crosby M, Paterson JH, Schulte C, Sharif B, Tamm S (2015) Eye movements in code reading: relaxing the linear order. In: 2015 IEEE 23rd international conference on program comprehension (ICPC). IEEE, pp 255–265

  • Ferrer i Cancho R, Solé R V (2001) Two regimes in the frequency of words and the origins of complex lexicons: Zipf’s law revisited? Journal of Quantitative Linguistics 8(3):165–173

    Article  Google Scholar 

  • Carlstrom B, Price N (2013) Gachon learner corpus

  • Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005

  • Chen SF, Goodman J (1998) An empirical study of smoothing techniques for language modeling. In: Harvard computer science group technical report TR-10-98

  • Chomsky N (2002) An interview on minimalism. In: Belletti A, Rizzi L (eds) On nature and language, chap 4. Cambridge University Press, Cambridge, pp 92–161. https://doi.org/10.1017/CBO9780511613876.005

  • Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10):687–708

    Article  Google Scholar 

  • Conklin K, Schmitt N (2008) Formulaic sequences: are they processed more quickly than nonformulaic language by native and nonnative speakers? Appl Linguis 29(1):72–89. https://doi.org/10.1093/applin/amm022

    Article  Google Scholar 

  • Danet B (1980) Language in the legal process. Law Soc Rev 14(3):445–564. http://www.jstor.org/stable/3053192

    Article  Google Scholar 

  • De Cock S (2000) Repetitive phrasal chunkiness and advanced efl speech and writing. Lang Comput 33: 51–68

    Google Scholar 

  • De Marneffe MC, Manning CD (2008) The stanford typed dependencies representation. In: Coling 2008: proceedings of the workshop on cross-framework and cross-domain parser evaluation. Association for Computational Linguistics, pp 1–8

  • De Marneffe MC, MacCartney B, Manning CD, et al. (2006) Generating typed dependency parses from phrase structure parses. In: Proceedings of LREC, vol 6. Genoa Italy, pp 449–454

  • Demberg V, Keller F (2008) Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition 109(2):193–210. https://doi.org/10.1016/j.cognition.2008.07.008. http://www.sciencedirect.com/science/article/pii/S0010027708001741

    Article  Google Scholar 

  • Dig D, Johnson R (2005) The role of refactorings in api evolution. In: Null. IEEE, pp 389–398

  • Field A (2009) Discovering statistics using SPSS. Sage Publications

  • Frank S (2013) Uncertainty reduction as a measure of cognitive load in sentence comprehension. Top Cogn Sci 5(3):475–494

    Article  Google Scholar 

  • Gerlach M, Altmann EG (2013) Stochastic model for the vocabulary growth in natural languages. Phys Rev X 3(2):021006

    Google Scholar 

  • Ginter F, Hajič J, Luotolahti J, Straka M, Zeman D (2017) CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. http://hdl.handle.net/11234/1-1989 LINDAT/CLARIN digital library at the institute of formal and applied linguistics (ÚFAL) faculty of mathematics and physics, Charles University

  • Gopstein D, Iannacone J, Yan Y, Delong LA, Zhuang Y, Yeh MKC, Cappos J (2017) Understanding misunderstandings in source code. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering. ACM, pp 129–139

  • Gopstein D, Zhou HH, Frankl P, Cappos J (2018) Prevalence of confusing code in software projects: atoms of confusion in the wild. In: Proceedings of the 15th international conference on mining software repositories. ACM, pp 281–291. https://doi.org/10.1145/3196398.3196432

  • Gotti M (2011) Investigating specialized discourse. Peter Lang

  • Gousios G, Spinellis D (2012) GHTorrent: Github’s data from a firehose. In: MSR. IEEE, pp 12–21

  • Hale J (2003) The information conveyed by words in sentences. J Psycholinguist Res 32(2):101–123

    Article  Google Scholar 

  • Harker SD, Eason KD, Dobson JE (1993) The change and evolution of requirements as a challenge to the practice of software engineering. In: Proceedings of IEEE international symposium on requirements engineering, 1993. IEEE, pp 266–272

  • Hathhorn C, Ellison C, Roşu G (2015) Defining the undefinedness of c. In: ACM SIGPLAN notices, vol 50. ACM, pp 336–345

  • Hayes JH, Dekhtyar A, Sundaram SK (2005) Improving after-the-fact tracing and mapping: supporting software quality predictions. IEEE Soft 22(6):30–37

    Article  Google Scholar 

  • Heafield K (2011) Kenlm: faster and smaller language model queries. In: Proceedings of the sixth workshop on statistical machine translation, association for computational linguistics, pp 187–197

  • Hellendoorn VJ, Devanbu P (2017) Are deep neural networks the best choice for modeling source code?. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering, ser. ESEC/FSE, pp 763–773

  • Hellendoorn VJ, Devanbu PT, Bacchelli A (2015) Will they like this?: evaluating code contributions with language models. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, Piscataway, NJ, USA, MSR ’15, pp 157–167 . http://dl.acm.org/citation.cfm?id=2820518.2820539

  • Hindle A, Barr ET, Su Z, Gabel M, Devanbu P (2012) On the naturalness of software. In: Proceedings of the 34th international conference on software engineering. IEEE Press, Piscataway, NJ, USA, ICSE ’12, pp 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

    Article  Google Scholar 

  • Hoffmann L (1984) Seven roads to lsp. Fachsprache 6:1–2

    Google Scholar 

  • Hothorn T, Hornik K, van de Wiel MA, Zeileis A (2006) A lego system for conditional inference. Am Stat 60(3):257–263

    Article  MathSciNet  Google Scholar 

  • Jbara A, Feitelson DG (2017) How programmers read regular code: a controlled experiment using eye tracking. Empir Softw Eng 22(3):1440–1477

    Article  Google Scholar 

  • Kavaler D, Sirovica S, Hellendoorn V, Aranovich R, Filkov V (2017) Perceived language complexity in github issue discussions and their effect on issue resolution. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering. IEEE Press, Piscataway, NJ, USA, ASE 2017, pp 72–83 . http://dl.acm.org/citation.cfm?id=3155562.3155576

  • Khanh Dam H, Tran T, Pham T (2016) A deep language model for software code. arXiv:160802715

  • Kim M, Cai D, Kim S (2011) An empirical investigation into the role of api-level refactorings during software evolution. In: Proceedings of the 33rd international conference on software engineering. ACM, pp 151–160

  • Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: ICASSP-95., 1995 international conference on acoustics, speech, and signal processing, 1995, vol 1. IEEE, pp 181–184

  • Knuth DE (1984) Literate programming. Comput J 27(2):97–111

    Article  MATH  Google Scholar 

  • Kučera H, Francis WN (1967) Computational analysis of present-day American English. Dartmouth Publishing Group

  • Lehman MM (1980) Programs, life cycles, and laws of software evolution. Proc IEEE 68(9):1060–1076. https://doi.org/10.1109/PROC.1980.11805

    Article  Google Scholar 

  • Lehman MM (1996) Laws of software evolution revisited. In: European workshop on software process technology. Springer, pp 108–124

  • Levy R (2008) Expectation-based syntactic comprehension. Cognition 106 (3):1126–1177. https://doi.org/10.1016/j.cognition.2007.05.006, http://www.sciencedirect.com/science/article/pii/S0010027707001436

    Article  Google Scholar 

  • Liu H, Sun C, Su Z, Jiang Y, Gu M, Sun J (2017) Stochastic optimization of program obfuscation. In: Proceedings of the 39th international conference on software engineering. IEEE Press, pp 221– 231

  • Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodolog (TOSEM) 18(1):2

    Google Scholar 

  • Mandelbrot B (1953) An informational theory of the statistical structure of language. Commun Theory 84:486–502

    Google Scholar 

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

    MATH  Google Scholar 

  • Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of english: the penn treebank. Comput Linguist 19(2):313–330. http://dl.acm.org/citation.cfm?id=972470.972475

    Google Scholar 

  • Michael L (2014) Social dimensions of language change. In: The Routledge handbook of historical linguistics, Routledge, chap 22. https://doi.org/10.4324/9781315794013.ch22, https://www.routledgehandbooks.com/doi/10.4324/9781315794013.ch22

  • Mikolov T, Karafiát M, Burget L, Cernockỳ J, Khudanpur S (2010) Recurrent neural network based language model. In: Interspeech, vol 2, p 3

  • Mitzenmacher M (2004) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2):226–251

    Article  MathSciNet  MATH  Google Scholar 

  • Norvig P (2009) Natural language corpus data. In: Segaran T, Hammerbacher J (eds) Beautiful data: the stories behind elegant data solutions, chap 14. O’Reilly Media, pp 219–242

  • Paquot M, Granger S (2012) Formulaic language in learner corpora. Annual Review of Applied Linguistics 32:130–149. https://search.proquest.com/docview/1289774805?accountid=14505, copyright - Copyright Cambridge University Press 2012; Last updated - 2015-05-30

    Article  Google Scholar 

  • Petersen AM, Tenenbaum JN, Havlin S, Stanley HE, Perc M (2012) Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci Rep 2:943

    Article  Google Scholar 

  • Petrov S (2016) Announcing syntaxnet: the world’s most accurate parser goes open source. Google Research Blog

  • Petrov S, Barrett L, Thibaux R, Klein D (2006) Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, association for computational linguistics, pp 433–440

  • Piantadosi ST (2014) Zipf’s word frequency law in natural language: a critical review and future directions. Psychon Bull Rev 21(5):1112–1130

    Article  Google Scholar 

  • Piantadosi ST, Tily H, Gibson E (2012) The communicative function of ambiguity in language. Cognition 122(3):280–291

    Article  Google Scholar 

  • Pierret D, Poshyvanyk D (2009) An empirical exploration of regularities in open-source software lexicons. In: IEEE 17th international conference on program comprehension, 2009. ICPC’09. IEEE, pp 228–232

  • Salager F (1983) Compound nominal phrases in scientific-technical literature: proportion and rationale. ERIC

  • Salvador A, Hynes N, Aytar Y, Marin J, Ofli F, Weber I, Torralba A (2017) Learning cross-modal embeddings for cooking recipes and food images. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  • Santos EA, Hindle A (2016) Judging a commit by its cover: correlating commit message entropy with build status on travis-ci. In: Proceedings of the 13th international conference on mining software repositories. ACM, New York, NY, USA, MSR ’16, pp 504–507. https://doi.org/10.1145/2901739.2903493

  • Scalabrino S, Bavota G, Vendome C, Linares-Vásquez M, Poshyvanyk D, Oliveto R (2017) Automatically assessing code understandability: how far are we?. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering. IEEE Press, pp 417–427

  • Schmitt N, Carter R (2004) Formulaic sequences in action. Formulaic Sequences: Acquisition, Processing and Use: 1–22

  • Shannon CE (1948) A mathematical theory of communication, part i, part ii. Bell Syst Tech J 27:623–656

    Article  MATH  Google Scholar 

  • Shannon CE (1951) Prediction and entropy of printed english. Bell Labs Tech J 30(1):50–64

    Article  MATH  Google Scholar 

  • Siegmund J, Kästner C, Apel S, Parnin C, Bethmann A, Leich T, Saake G, Brechmann A (2014) Understanding understanding source code with functional magnetic resonance imaging. In: Proceedings of the 36th international conference on software engineering. ACM, pp 378–389

  • Stefik A, Ladner R (2017) The quorum programming language. In: Proceedings of the 2017 ACM SIGCSE technical symposium on computer science education. ACM, pp 641–641

  • Stefik A, Siebert S (2013) An empirical investigation into programming language syntax. ACM Trans Comput Educ (TOCE) 13(4):19

    Google Scholar 

  • Sundaram SK, Hayes JH, Dekhtyar A (2005) Baselines in requirements tracing. In: ACM SIGSOFT Software engineering notes, ACM, vol 30, pp 1–6

  • Sundermeyer M, Schlüter R, Ney H (2012) Lstm neural networks for language modeling. In: Thirteenth annual conference of the international speech communication association

  • Trockman A, Cates K, Mozina M, Nguyen T, Kästner C, Vasilescu B (2018) “Automatically assessing code understandability” reanalyzed: combined metrics matter. In: International conference on mining software repositories. ACM, pp 314–318. https://doi.org/10.1145/3196398.3196441

  • Tsay J, Dabbish L, Herbsleb J (2014) Influence of social and technical factors for evaluating contribution in github. In: Proceedings of the 36th international conference on software engineering. ACM, pp 356–366

  • Tu Z, Su Z, Devanbu P (2014) On the localness of software. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, NY, USA, FSE 2014, pp 269–280. https://doi.org/10.1145/2635868.2635875

  • Varantola K (1986) Special language and general language: linguistic and didactic aspects. Unesco Alsed-LSP Newsletter (1977-2000) 9(2)

  • Vinyals O, Kaiser Ł, Koo T, Petrov S, Sutskever I, Hinton G (2015) Grammar as a foreign language. In: Advances in neural information processing systems, pp 2773–2781

  • Wasow T, Perfors A, Beaver D (2005) The puzzle of ambiguity. Morphology and the web of grammar: essays in memory of Steven G Lapointe: 265–282

  • Weintrop D, Wilensky U (2015) Using commutative assessments to compare conceptual understanding in blocks-based and text-based programs. In: Proceedings of the eleventh annual international conference on international computing education research, ICER ’15. ACM, New York, NY, USA, pp 101–110. https://doi.org/10.1145/2787622.2787721

  • White M, Vendome C, Linares-Vásquez M, Poshyvanyk D (2015) Toward deep learning software repositories. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 334–345

  • Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer, Berlin. http://ggplot2.org

    Book  MATH  Google Scholar 

  • Xue X (2015) Ten thousand english compositions of chinese learners (the teccl corpus), version 1.1

  • Zhang H (2008) Exploring regularity in source code: software science and zipf’s law. In: WCRE’08. 15th working conference on reverse engineering, 2008. IEEE, pp 101–110

  • Zhi J, Garousi-Yusifoğlu V, Sun B, Garousi G, Shahnewaz S, Ruhe G (2015) Cost, benefits and quality of software development documentation: a systematic mapping. J Syst Softw 99:175–198

    Article  Google Scholar 

  • Zipf G (1949) Human behavior and the principle of least effort. Addison-Wesley, Cambody Mus Am Arch and Ethnol(Harvard Univ) Papers 19:1–125

Download references

Acknowledgements

We would like to thank Professors C. Sutton, Z. Su, V. Filkov, and R. Aranovich, along with the UC Davis DECAL and NLP Reading groups for comments and feedback on this research. We also would like to especially thank V. Hellendoorn for his feedback and input on our experiment between parse trees in Java and English. We also acknowledge support from NSF Grant #1414172, Exploiting the Naturalness of Software. Finally we are grateful to the reviewers and editors of this journal for their thoughtful comments, which were very helpful in improving the work presented in this paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Casey Casalnuovo.

Additional information

Communicated by: Audris Mockus

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Casalnuovo, C., Sagae, K. & Devanbu, P. Studying the difference between natural and programming language corpora. Empir Software Eng 24, 1823–1868 (2019). https://doi.org/10.1007/s10664-018-9669-7

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10664-018-9669-7

Keywords

Navigation