
Natural Language Insights from Code Reviews that Missed a Vulnerability

A Large Scale Study of Chromium

  • Conference paper
  • In: Engineering Secure Software and Systems (ESSoS 2017)


Engineering secure software is challenging. Software development organizations leverage a host of processes and tools to help developers prevent vulnerabilities in software. Code reviewing is one such approach, and it has been instrumental in improving the overall quality of software systems. In a typical code review, developers critique a proposed change to uncover potential vulnerabilities. Despite developers' best efforts, some vulnerabilities inevitably slip through the reviews. In this study, we characterized linguistic features (inquisitiveness, sentiment, and syntactic complexity) of the conversations between developers in a code review to identify factors that could explain developers missing a vulnerability. We used natural language processing to collect these linguistic features from 3,994,976 messages in 788,437 code reviews from the Chromium project, and we collected 1,462 Chromium vulnerabilities to empirically analyze the features. We found that code reviews with lower inquisitiveness, higher (more positive) sentiment, and lower syntactic complexity were more likely to miss a vulnerability. We then used a Naïve Bayes classifier to assess whether the words (or lemmas) in a code review could differentiate reviews that are likely to miss vulnerabilities. The classifier used a subset of all lemmas (over 2 million) as features, with their corresponding TF-IDF scores as values. The average precision, recall, and F-measure of the classifier were 14%, 73%, and 23%, respectively. We believe that our linguistic characterization will help developers identify problematic code reviews before they result in a vulnerability being missed.
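The classification setup the abstract describes (lemma features weighted by TF-IDF, fed to a Naïve Bayes classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the review messages, labels, and whitespace tokenization below are invented toy data, whereas the study used lemmas drawn from 3,994,976 Chromium review messages.

```python
# Minimal sketch: TF-IDF weighted word features + multinomial Naive Bayes
# with Laplace smoothing. Toy data only; label 1 = review that missed a
# vulnerability (hypothetical labels for illustration).
import math
from collections import Counter

messages = [
    "lgtm looks good to me nice cleanup",
    "why does this buffer size change what about overflow",
    "great work ship it",
    "is this input validated before the copy",
]
labels = [1, 0, 1, 0]

docs = [m.split() for m in messages]          # the paper used lemmas instead
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))  # document frequency per word

def tfidf(doc):
    """TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

# Accumulate per-class TF-IDF mass for each word.
class_totals = {0: Counter(), 1: Counter()}
for d, y in zip(docs, labels):
    for w, v in tfidf(d).items():
        class_totals[y][w] += v

def predict(doc):
    """Return the class with the highest smoothed log-likelihood."""
    scores = {}
    for y in (0, 1):
        total = sum(class_totals[y].values()) + len(vocab)
        score = math.log(labels.count(y) / n_docs)  # class prior
        for w in doc:
            score += math.log((class_totals[y][w] + 1) / total)
        scores[y] = score
    return max(scores, key=scores.get)

# A low-inquisitiveness, positive-sentiment probe: the kind of review the
# study found more likely to miss a vulnerability.
print(predict("looks good nice".split()))
```

On this toy data, the positive non-inquisitive probe lands in the "missed" class, echoing the direction of the paper's finding; the real classifier was trained on a far larger lemma vocabulary with TF-IDF values.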





Author information

Corresponding author

Correspondence to Nuthan Munaiah.


A Comparing Distribution of Inquisitiveness, Sentiment and Complexity Metrics


The comparison of the distribution of inquisitiveness, sentiment and complexity metrics for neutral and missed vulnerability code reviews is shown in Fig. 3.

Fig. 3. Comparing the distribution of inquisitiveness, sentiment and complexity metrics for neutral and missed vulnerability code reviews


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Munaiah, N. et al. (2017). Natural Language Insights from Code Reviews that Missed a Vulnerability. In: Bodden, E., Payer, M., Athanasopoulos, E. (eds) Engineering Secure Software and Systems. ESSoS 2017. Lecture Notes in Computer Science, vol. 10379. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62104-3

  • Online ISBN: 978-3-319-62105-0

  • eBook Packages: Computer Science (R0)
