An experimental investigation on the effects of context on source code identifiers splitting and expansion

Guerrouj, Latifa; Di Penta, Massimiliano; Guéhéneuc, Yann-Gaël; Antoniol, Giuliano

doi:10.1007/s10664-013-9260-1

Published: 03 July 2013

Volume 19, pages 1706–1753, (2014)

Empirical Software Engineering

Abstract

Recent and past studies indicate that the source code lexicon plays an important role in program comprehension. Developers often compose source code identifiers from abbreviated words and acronyms, and do not always use consistent mechanisms and explicit separators when creating identifiers. Such choices and inconsistencies impede the work of developers who must understand identifiers by decomposing them into their component terms and mapping them onto dictionary, application, or domain words. When software documentation is scarce, outdated, or simply not available, developers must therefore use the available contextual information to understand the source code. This paper investigates how developers split and expand source code identifiers and, specifically, the extent to which different kinds of contextual information support such a task. In particular, we consider (i) an internal context consisting of the content of the functions and source code files in which the identifiers are located, and (ii) an external context involving external documentation. We conducted a family of two experiments with 63 participants, including bachelor, master, and Ph.D. students, and post-docs. We randomly sampled a set of 50 identifiers from a corpus of open source C programs and asked participants to split and expand them with or without the internal and external contexts available. We report evidence on the usefulness of contextual information for identifier splitting and acronym/abbreviation expansion. We observe that source code files are more helpful than function source code alone, and that application-level contextual information does not help any further. The availability of external sources of information helps only in some circumstances. Also, in some cases, participants expanded acronyms better than abbreviations, although in most cases both exhibit the same level of accuracy. Finally, the results indicate that knowledge of English has a significant effect on identifier splitting/expansion. The obtained results confirm the conjecture that contextual information is useful in program comprehension, including when developers split and expand identifiers to understand them. We hypothesize that integrating identifier splitting and expansion tools with IDEs could help improve developers’ productivity.
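To make the task concrete, the following is a minimal Python sketch of identifier splitting and expansion. It is not the procedure followed by the participants or any tool evaluated in the paper: the function names, the regular expression, and the small `EXPANSIONS` dictionary are illustrative assumptions. The sketch splits only on explicit separators (underscores and case changes), so it cannot handle same-case identifiers, which is precisely where contextual information becomes necessary.

```python
import re

# Hypothetical abbreviation dictionary; the entries are illustrative only
# and do not come from the study's oracle.
EXPANSIONS = {"cnt": "count", "buf": "buffer", "ptr": "pointer", "img": "image"}


def split_identifier(identifier):
    """Split an identifier on underscores and camelCase transitions."""
    terms = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Uppercase runs (acronyms), capitalized or lowercase words, digit runs.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [t.lower() for t in terms]


def expand_terms(terms, dictionary=EXPANSIONS):
    """Replace each known abbreviation by its expansion; keep other terms."""
    return [dictionary.get(t, t) for t in terms]


if __name__ == "__main__":
    for name in ("imgBufPtr", "read_cnt", "HTTPServer"):
        print(name, "->", expand_terms(split_identifier(name)))
```

A dictionary lookup of this kind returns a single fixed expansion per abbreviation. Choosing among several plausible expansions is the part of the task for which the study examines function-level, file-level, and external context.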

Notes

http://www.gnu.org
http://www.linux.org
http://www.freebsd.org
http://www.acronymfinder.com
Significant p-values are highlighted in bold face here and in all other tables.

References

Anquetil N, Lethbridge T (1998) Assessing the relevance of identifier names in a legacy software system. In: Proceedings of CASCON, pp 213–222
Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28:970–983
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley
Baker RD (1995) Modern permutation test software. In: Edgington EG (ed) Randomization tests. Marcel Dekker
Basili V, Caldiera G, Rombach DH (1994) The goal question metric paradigm. In: Encyclopedia of software engineering. John Wiley and Sons
Binkley D, Davis M, Lawrie D, Morrell C (2009) To camelcase or under_score. In: The 17th IEEE international conference on program comprehension, ICPC 2009. Vancouver, British Columbia, Canada, May 17–19, 2009. IEEE Computer Society, pp 158–167
Binkley D, Davis M, Lawrie D, Maletic JI, Morrell C, Sharif B (2013) The impact of identifier style on effort and comprehension. Empir Softw Eng 18(2):219–276
Caprile B, Tonella P (1999) Nomen est omen: analyzing the language of function identifiers. In: Proc. of the working conference on reverse engineering (WCRE). Atlanta, Georgia, USA, pp 112–122
Caprile B, Tonella P (2000) Restructuring program identifier names. In: Proc. of the International Conference on Software Maintenance (ICSM), pp 97–107
Deißenböck F, Pizka M (2005) Concise and consistent naming. In: Proc. of the International Workshop on Program Comprehension (IWPC)
Dit B, Guerrouj L, Poshyvanyk D, Antoniol G (2011) Can better identifier splitting techniques help feature location? In: Proc. of the International Conference on Program Comprehension (ICPC). Kingston, pp 11–20
Enslen E, Hill E, Pollock LL, Vijay-Shanker K (2009) Mining source code to automatically split identifiers for software analysis. In: Proceedings of the 6th international working conference on mining software repositories, MSR 2009. Vancouver, BC, Canada, May 16–17, pp 71–80
Grissom RJ, Kim JJ (2005) Effect sizes for research: a broad practical approach, 2nd edn. Lawrence Erlbaum Associates
Guerrouj L, Di Penta M, Antoniol G, Guéhéneuc YG (2013) TIDIER: an identifier splitting approach using speech recognition techniques. J Softw Evol Process 25(6):569–661
Holm A (1979) A simple sequentially rejective Bonferroni test procedure. Scand J Stat 6:65–70
Kersten M, Murphy GC (2006) Using task context to improve programmer productivity. In: SIGSOFT ’06/FSE-14: proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering. ACM Press, Portland, Oregon, pp 1–11
Lawrie D, Binkley D (2011) Expanding identifiers to normalize source code vocabulary. In: Proc. of the International Conference on Software Maintenance (ICSM), pp 113–122
Lawrie D, Feild H, Binkley D (2006a) Syntactic identifier conciseness and consistency. In: 6th IEEE international workshop on source code analysis and manipulation. Philadelphia, Pennsylvania, USA, pp 139–148
Lawrie D, Morrell C, Feild H, Binkley D (2006b) What’s in a name? A study of identifiers. In: Proceedings of 14th IEEE international conference on program comprehension. IEEE CS Press, Athens, pp 3–12
Lawrie D, Morrell C, Feild H, Binkley D (2007) Effective identifier names for comprehension and memory. Innov Syst Softw Eng 3(4):303–318
Lawrie DJ, Binkley D, Morrell C (2010) Normalizing source code vocabulary. In: Proc. of the Working Conference on Reverse Engineering (WCRE), pp 112–122
Liu D, Marcus A, Poshyvanyk D, Rajlich V (2007) Feature location via information retrieval based filtering of a single scenario execution trace. In: Proceedings of the 22nd IEEE/ACM international conference on automated software engineering. ACM, New York, NY, pp 234–243
Madani N, Guerrouj L, Di Penta M, Guéhéneuc Y-G, Antoniol G (2010) Recognizing words from source code identifiers using speech recognition techniques. In: Proceedings of the conference on software maintenance and reengineering. IEEE, pp 69–78
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proc. of 23rd international conference on software engineering. Toronto, pp 103–112
Eaddy M, Aho AV, Antoniol G, Guéhéneuc Y-G (2008) Cerberus: tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. In: ICPC ’08: Proceedings of the 16th IEEE international conference on program comprehension. IEEE Computer Society, Washington, DC, pp 53–62
Marcus A, Maletic JI, Sergeyev A (2005) Recovery of traceability links between software documentation and source code. Int J Softw Eng Knowl Eng 15(5):811–836
Merlo E, McAdam I, De Mori R (2003) Feed-forward and recurrent neural networks for source code informal information analysis. J Softw Maint 15(4):205–244
Ney H (1984) The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Trans Acoust Speech Signal Process 32(2):263–271
Poshyvanyk D, Guéhéneuc Y-G, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
R Core Team (2012) R: a language and environment for statistical computing. Vienna, Austria. ISBN 3-900051-07-0
Ricca F, Di Penta M, Torchiano M, Tonella P, Ceccato M (2010) How developers’ experience and ability influence web application comprehension tasks supported by uml stereotypes: a series of four experiments. IEEE Trans Softw Eng 36(1):96–118
Robillard MP, Coelho W, Murphy GC (2004) How effective developers investigate source code: an exploratory study. IEEE Trans Softw Eng 30(12):889–903
Sharif B, Maletic JI (2010) An eye tracking study on camelcase and under_score identifier styles. In: Proceedings of the international conference on program comprehension, pp 196–205
Sheskin DJ (2007) Handbook of parametric and nonparametric statistical procedures, 4th edn. Chapman & Hall
Sillito J, Murphy GC, De Volder K (2008) Asking and answering questions during a programming change task. IEEE Trans Softw Eng 34:434–451
Soloway E, Bonar J, Ehrlich K (1983) Cognitive strategies and looping constructs: an empirical study. Commun ACM 26(11):853–860
Storey MAD (1998) A cognitive framework for describing and evaluating software exploration tools. PhD thesis Simon Fraser University
Takang A, Grubb PA, Macredie RD (1996) The effects of comments and identifier names on program comprehensibility: an experiential study. J Program Lang 4(3):143–167
von Mayrhauser A, Vans AM (1995) Program comprehension during software maintenance and evolution. IEEE Comput 28(8):44–55
Wohlin C, Runeson P, Host M, Ohlsson MC, Regnell B, Wesslen A (2000) Experimentation in software engineering—an introduction. Kluwer Academic Publishers

Acknowledgements

Special thanks to all the participants, as this work would not have been possible without your time. Many thanks also to all the reviewers for their thorough and well-considered reviews.

Author information

Authors and Affiliations

SOCCER Lab., DGIGL, École Polytechnique de Montréal, QC, Canada
Latifa Guerrouj, Yann-Gaël Guéhéneuc & Giuliano Antoniol
University of Sannio, Sannio, Italy
Massimiliano Di Penta

Corresponding author

Correspondence to Latifa Guerrouj.

Additional information

Communicated by: Martin Robillard

Appendices

Appendix

A Detailed Study Settings

Table 20 reports the characteristics of the 34 applications from which the 50 identifiers used in our study were sampled.

Table 20 Applications from which the 50 identifiers were sampled

Table 21 reports the expansions of all the identifiers used in the experiments. The column Separator indicates whether underscore or Camel Case separators are used. The columns Abbr., Acro. and Plain report the number of abbreviations, acronyms and plain English words composing each identifier.

Table 21 Splitting/expansion oracle and kinds of terms composing identifiers

B Detailed Results

This Appendix reports figures detailing results presented and discussed in Section 3. Specifically, Figs. 7 and 8 show boxplots of Precision and Recall for the different levels of context, respectively.
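As an illustration of how Precision and Recall might be computed for a single identifier, the sketch below compares the terms given by a participant with the oracle terms for that identifier. The set-based definition and the name `precision_recall` are assumptions made for this example; the exact measures are defined in the body of the paper, which is not reproduced here.

```python
def precision_recall(answer_terms, oracle_terms):
    """Return (precision, recall) of an answer against the oracle terms.

    Assumed definition: precision is the fraction of answered terms that
    appear in the oracle; recall is the fraction of oracle terms that
    appear in the answer. Terms are compared as sets.
    """
    answer, oracle = set(answer_terms), set(oracle_terms)
    correct = len(answer & oracle)
    precision = correct / len(answer) if answer else 0.0
    recall = correct / len(oracle) if oracle else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical oracle and answer, not taken from Table 21.
    oracle = ["image", "buffer", "pointer"]
    answer = ["image", "buffer"]
    print(precision_recall(answer, oracle))
```

Under this assumed definition, an answer that omits one term keeps full precision but loses recall, whereas a wrong expansion lowers both.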

About this article

Cite this article

Guerrouj, L., Di Penta, M., Guéhéneuc, YG. et al. An experimental investigation on the effects of context on source code identifiers splitting and expansion. Empir Software Eng 19, 1706–1753 (2014). https://doi.org/10.1007/s10664-013-9260-1

Issue Date: December 2014
