Abstract
Recent and past studies indicate that the source code lexicon plays an important role in program comprehension. Developers often compose source code identifiers with abbreviated words and acronyms, and do not always use consistent mechanisms and explicit separators when creating identifiers. Such choices and inconsistencies impede the work of developers who must understand identifiers by decomposing them into their component terms and mapping them onto dictionary, application, or domain words. When software documentation is scarce, outdated, or simply not available, developers must therefore use the available contextual information to understand the source code. This paper investigates how developers split and expand source code identifiers and, specifically, the extent to which different kinds of contextual information could support such a task. In particular, we consider (i) an internal context consisting of the content of the functions and source code files in which the identifiers are located, and (ii) an external context involving external documentation. We conducted a family of two experiments with 63 participants, including bachelor's, master's, and Ph.D. students and post-docs. We randomly sampled a set of 50 identifiers from a corpus of open source C programs and asked participants to split and expand them with or without the internal and external contexts. We report evidence on the usefulness of contextual information for identifier splitting and acronym/abbreviation expansion. We observe that source code files are more helpful than function source code alone, and that application-level contextual information does not help any further. The availability of external sources of information helps only in some circumstances. In some cases, participants expanded acronyms better than abbreviations, although in most cases both exhibit the same level of accuracy.
Finally, the results indicate that knowledge of English has a significant effect on identifier splitting and expansion. The results confirm the conjecture that contextual information is useful in program comprehension, including when developers split and expand identifiers to understand them. We hypothesize that integrating identifier splitting and expansion tools with IDEs could help to improve developers’ productivity.
Notes
Significant p-values are highlighted in bold face here and in all other tables.
Acknowledgements
Special thanks to all the participants, as this work would not be possible without your time. Many thanks also to all the reviewers for their thorough and well-considered reviews.
Additional information
Communicated by: Martin Robillard
Appendix
A Detailed Study Settings
Table 20 reports the characteristics of the 34 applications from which the 50 identifiers used in our study were sampled.
Table 21 reports the expansions of all the identifiers used in the experiments. The column Separator indicates whether underscore or Camel Case separators are used. The columns Abbr., Acro. and Plain report the number of abbreviations, acronyms and plain English words composing each identifier.
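To make the columns of Table 21 concrete, the following minimal sketch shows what splitting and expanding an identifier amounts to mechanically. It is an illustration only, not the procedure the participants followed and not one of the tools evaluated in the literature. The function names (`separator_style`, `split_identifier`, `expand`), the example identifiers, and the small `expansions` dictionary are all hypothetical and are not taken from the study's sample.

```python
import re


def separator_style(identifier):
    """Classify the explicit separator used, in the spirit of the Separator column."""
    if "_" in identifier:
        return "underscore"
    if re.search(r"[a-z][A-Z]", identifier):
        return "camel case"
    return "none"


def split_identifier(identifier):
    """Split on underscores first, then on camel case boundaries inside each chunk."""
    terms = []
    for chunk in identifier.split("_"):
        # An uppercase run not followed by a lowercase letter is kept whole
        # (e.g. "HTTP"); otherwise an optional capital plus lowercase letters
        # forms one term; digit runs are separate terms.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return terms


def expand(terms, expansions):
    """Map each term to an expansion if one is known, else keep it in lowercase."""
    return [expansions.get(term.lower(), term.lower()) for term in terms]


# Hypothetical abbreviation dictionary, assumed for illustration only.
expansions = {"buf": "buffer", "len": "length", "http": "hypertext transfer protocol"}

print(separator_style("buf_len"), expand(split_identifier("buf_len"), expansions))
print(separator_style("getHTTPResponse"), split_identifier("getHTTPResponse"))
```

An identifier with no explicit separator, such as a hypothetical `buflen`, comes back from this sketch as a single term. Cases like that are where a human reader, or a more sophisticated tool, has to rely on contextual information.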
B Detailed Results
This Appendix reports figures detailing results presented and discussed in Section 3. Specifically, Figs. 7 and 8 show boxplots of Precision and Recall for the different levels of context, respectively.
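As a reading aid for the Precision and Recall boxplots, the sketch below computes both measures for one participant answer against a reference expansion. It assumes the standard set-based information retrieval definitions; the exact definitions used in the study are given in the body of the paper and may differ in detail. The function name `precision_recall` and the example terms are hypothetical.

```python
def precision_recall(answer_terms, oracle_terms):
    """Set-based precision and recall of a proposed expansion against a reference."""
    answer = set(answer_terms)
    oracle = set(oracle_terms)
    correct = answer & oracle
    # Guard against empty sets so the function never divides by zero.
    precision = len(correct) / len(answer) if answer else 0.0
    recall = len(correct) / len(oracle) if oracle else 0.0
    return precision, recall


# Hypothetical answer: one of the two proposed terms matches the reference.
precision, recall = precision_recall(["buffer", "line"], ["buffer", "length"])
print(precision, recall)
```

Under these assumptions, an answer that proposes fewer terms than the reference can reach full precision while losing recall, which is why the two measures are reported separately.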
Cite this article
Guerrouj, L., Di Penta, M., Guéhéneuc, YG. et al. An experimental investigation on the effects of context on source code identifiers splitting and expansion. Empir Software Eng 19, 1706–1753 (2014). https://doi.org/10.1007/s10664-013-9260-1
DOI: https://doi.org/10.1007/s10664-013-9260-1