Empirical Software Engineering, Volume 19, Issue 6, pp 1754–1780

An empirical study of identifier splitting techniques

  • Emily Hill
  • David Binkley
  • Dawn Lawrie
  • Lori Pollock
  • K. Vijay-Shanker

Abstract

Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.
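To make the splitting problem concrete, the following is a minimal sketch of a purely convention-based splitter, assuming only camel-case and separator cues. It is not one of the techniques evaluated in the study; the function name `split_identifier` and the regular expression are illustrative assumptions.

```python
import re

# Hypothetical, convention-only splitter (an illustration, not a technique
# from the study). The pattern matches either:
#   - a run of capitals not followed by a lowercase letter (e.g. "XML"), or
#   - an optional capital followed by lowercase letters (e.g. "Parser", "get").
# Underscores, digits, and other characters are never matched, so they act
# as separators.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

def split_identifier(identifier):
    """Split an identifier into lowercase words using naming conventions only."""
    return [word.lower() for word in _WORD.findall(identifier)]

if __name__ == "__main__":
    for name in ["getUserName", "XMLParser", "max_buffer_size",
                 "username", "GPSstate"]:
        print(name, "->", split_identifier(name))
```

The last two inputs show where conventions alone fall short: `username` carries no case cue and stays as a single token, and the mixed-case `GPSstate` is split as `gp` and `sstate` rather than `gps` and `state`. Cases like these are where splitters need information beyond naming conventions.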

Keywords

  • Software engineering tools
  • Program comprehension
  • Identifier names
  • Source code text analysis

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Emily Hill (1)
  • David Binkley (2)
  • Dawn Lawrie (2)
  • Lori Pollock (3)
  • K. Vijay-Shanker (3)

  1. Department of Computer Science, Montclair State University, Montclair, USA
  2. Department of Computer Science, Loyola University Maryland, Baltimore, USA
  3. Department of Computer and Information Sciences, University of Delaware, Newark, USA
