Automatic identifier inconsistency detection using code dictionary

Kim, Suntae; Kim, Dongsun

doi:10.1007/s10664-015-9369-5

Published: 07 March 2015

Volume 21, pages 565–604, (2016)

Empirical Software Engineering

Suntae Kim¹ &
Dongsun Kim²

Abstract

Inconsistent identifiers make it difficult for developers to understand source code. In particular, large software systems written by several developers can be vulnerable to identifier inconsistency. Unfortunately, it is not easy to detect inconsistent identifiers that are already used in source code. Although several techniques have been proposed to address this issue, many of them can produce false alarms because they do not accept domain words and idiom identifiers that are widely used in programming practice. This paper proposes an approach to detecting inconsistent identifiers based on a custom code dictionary. It first automatically builds a Code Dictionary from the existing API documents of popular Java projects by using a Natural Language Processing (NLP) parser. This dictionary records domain words with their dominant part-of-speech (POS) and idiom identifiers. This set of domain words and idioms can improve detection accuracy by reducing false alarms. The approach then takes a target program and detects its inconsistent identifiers by leveraging the Code Dictionary. We provide CodeAmigo, a GUI-based tool that supports our approach. We evaluated our approach on seven Java-based open-source and proprietary projects. The results of the evaluation show that the approach can detect inconsistent identifiers with 85.4 % precision and 83.59 % recall. In addition, we interviewed developers who used our approach, and the interview confirmed that inconsistent identifiers frequently and inevitably occur in most software projects. The interviewees also stated that our approach can help detect inconsistent identifiers that would have been missed through manual detection.
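
The detection step summarized in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `split_identifier` and `find_pos_inconsistencies`, the toy dictionary contents, and the rule that a word is flagged when its observed POS differs from the dictionary's dominant POS. The observed POS tags are passed in by the caller, standing in for the output of an NLP parser. This is not CodeAmigo's implementation.

```python
import re

# Hypothetical code dictionary (illustrative contents only):
# domain word -> dominant POS, plus a set of idiom identifiers.
DOMAIN_WORD_POS = {"request": "noun", "abort": "verb", "index": "noun"}
IDIOMS = {"i", "j", "e", "args"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    # Acronym runs (HTTP), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def find_pos_inconsistencies(identifier, observed_pos):
    """Return (word, observed POS, dominant POS) for each disagreeing word.

    observed_pos holds one tag per word of the identifier; in this sketch
    the caller supplies it in place of a real POS tagger.
    """
    if identifier in IDIOMS:
        # Idiom identifiers are accepted as-is, which avoids false alarms.
        return []
    findings = []
    for word, pos in zip(split_identifier(identifier), observed_pos):
        dominant = DOMAIN_WORD_POS.get(word)
        if dominant is not None and dominant != pos:
            findings.append((word, pos, dominant))
    return findings


if __name__ == "__main__":
    # "abort" is tagged as a noun here, but the toy dictionary says verb.
    print(find_pos_inconsistencies("abortRequest", ["noun", "noun"]))
```

Words that do not appear in the toy dictionary are ignored rather than flagged, which is one possible design choice; a fuller checker would need a policy for unknown words.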

Notes

http://goo.gl/p6Gzmd and http://goo.gl/7cCV8n
http://www.dlib.vt.edu/projects/MarianJava/edu/vt/marian/server/status.java
https://github.com/tangmatt/word-scramble/blob/master/system/Status.java
Note that an identifier can include multiple inconsistencies. The total number of unique identifiers containing at least one inconsistency is 1,952.
https://sites.google.com/site/detectinginconsistency/
Apache Directory Project: https://issues.apache.org/jira/browse/DIRSERVER-1140
Apache Commons Math: https://issues.apache.org/jira/browse/MATH-707
Synonyms Definition: http://en.wikipedia.org/wiki/Synonym
Oxford Dictionary, http://www.oxforddictionaries.com/
Collins Cobuild Dictionary: http://www.collinsdictionary.com/dictionary/english
Dictionary.com: http://dictionary.reference.com/
To define this map, any English dictionary can be used. In this paper, we used WordNet (2014) as described in Section 3.2.2.
https://bugs.eclipse.org/bugs/show_bug.cgi?id=369942
https://github.com/Chassis/memcache/issues/2
https://bugs.eclipse.org/bugs/show_bug.cgi?id=108384
https://github.com/scrom/Experiments/issues/32
https://github.com/scrom/Experiments/commit/04dfbf7818626f9818379eb20e4c87e755407687
https://github.com/morrisonlevi/Ardent/issues/17
Although there has been some research on POS tagging of source code elements (Abebe and Tonella 2010; Binkley et al. 2011; Gupta et al. 2013), the resulting tools are either not publicly available or themselves rely on a natural language parser such as Minipar (2014) or the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al. 2003). In this paper, we adopted the Stanford Parser (2014) because it is highly accurate for parsing natural language sentences and broadly used in NLP. In addition, it is publicly available, well documented, and stable.
This threshold was determined in the preliminary study.
https://sites.google.com/site/detectinginconsistency/
The Stanford Parser: A statistical parser (2014) has 86 % parsing precision for a sentence consisting of 40 English words.
https://issues.apache.org/jira/browse/HBASE-584
Oxford Dictionary, http://www.oxforddictionaries.com/
Collins Cobuild Dictionary: http://www.collinsdictionary.com/dictionary/english
SCOWL: http://wordlist.aspell.net/
Lexicon BadSmell Wiki: http://selab.fbk.edu/LexiconBadSmellWiki

References

Deißenböck F, Pizka M (2005) Concise and Consistent Naming. In: Proceedings of International Workshop on Program Comprehension (IWPC), St. Louis, pp 261–282
Lawrie D, Feild H, Binkley D (2006) Syntactic Identifier Conciseness and Consistency. In: Proceedings of IEEE International Workshop on Source Code Analysis and Manipulation (SCAM), Philadelphia, Pennsylvania, pp 139–148
Martin RC (2008) Clean Code: A Handbook of Agile Software Craftsmanship, 1st edn. Prentice Hall
Higo Y, Kusumoto S (2012) How Often Do Unintended Inconsistencies Happen?-Deriving Modification Pattern and Detecting Overlooked Code Fragments-. In: Proceedings of the 28th international conference on software maintenance, Trento, pp 222–231
Abebe SL, Haiduc S, Tonella P, Marcus A (2008) Lexicon Bad Smells in Software. In: Proceedings of working conference on reverse engineering, Antwerp, Belgium, pp 95–99
Hughes E (2004) Checking Spelling in Source Code. IEEE Software, ACM SIGPLAN Not 39(12):32–38
Delorey DP, Knutson CD, Davies M (2009) Mining Programming Language Vocabularies from Source Code. In: Proceedings of the 21st conference of the psychology of programming group (PPIG), London
Lawrie D, Binkley D, Morrell C (2010) Normalizing Source Code Vocabulary. In: Proceedings of the 17th working conference on reverse engineering, Boston, pp 3–12
Abebe SL, Tonella P (2010) Natural Language Parsing of Program Element Names for Concept Extraction. In: Proceedings of international conference on program comprehension (ICPC), Minho, pp 156–159
Falleri J, Lafourcade M, Nebut C, Prince V, Dao M (2010) Automatic Extraction of a WordNet-like Identifier Network from Software. In: Proceedings of international conference on program comprehension (ICPC), Minho, pp 4–13
Abebe S, Tonella P (2013) Automated identifier completion and replacement. In: Proceedings of the european conference on software maintenance and reengineering (CSMR), Genova, pp 263–272
Host EW, Ostvold BM (2009) Debugging Method Names, Proceedings of the 23rd European Conference on Object-Oriented Programming. Lect. Notes Comput. Sci 5653(1):294–317
Lee S, Kim S, Kim J, Park S (2012) Detecting Inconsistent Names of Source Code Using NLP. Computer Applications for Database, Education, and Ubiquitous Computing Communications in Computer and Information Science 352(1):111–115
Code Conventions for the Java Programming Language: Why Have Code Conventions. Sun Microsystems (1999). http://www.oracle.com/technetwork/java/index-135089.html
Lawrie D, Feild H, Binkley D (2007) Quantifying identifier quality: an analysis of trends. Empir Softw Eng 12(4):359–388
Madani N, Guerrouj L, Penta MD, Gueheneuc Y, Antoniol G (2010) Recognizing Words from Source Code Identifiers using Speech Recognition Techniques. In: Proceedings of 14th european conference on software maintenance and reengineering (CSMR), Madrid, pp 68–77
Goodliffe P (2006) Code Craft: The Practice of Writing Excellent Code. No Starch Press
WordNet: A lexical database for English Home page (2014). http://wordnet.princeton.edu/
Haber RN, Schindler RM (1981) Errors in proofreading: Evidence of Syntactic Control of Letter Processing. J Exp Psychol Hum Percept Perform 7(1):573–579
Monk AF, Hulme C (1983) Errors in proofreading: Evidence for the Use of Word Shape in Word Recognition. Mem Cogn 11(1):16–23
Caprile B, Tonella P (1999) Nomen Est Omen: Analyzing the Language of Function Identifiers. In: Proceedings of working conference on reverse engineering, Atlanta, pp 112–122
The Stanford Parser: A statistical parser Home page (2014). http://nlp.stanford.edu/software/lex-parser.shtml
Apache OpenNLP Homepage (2014). http://opennlp.apache.org/
Binkley D, Hearn M, Lawrie D (2011) Improving Identifier Informativeness using Part of Speech Information. In: Proceedings of the 8th working conference on mining software repositories, New York, pp 203–206
Gupta S, Malik S, Pollock L, Vijay-Shanker K (2013) Part-of-Speech Tagging of Program Identifiers for Improved Text-Based Software Engineering Tools. In: Proceedings of 21st international conference on program comprehension (ICPC), San Francisco, pp 3–12
MINIPAR Homepage (2014). http://webdocs.cs.ualberta.ca/lindek/minipar.htm
Toutanova K, Klein D, Manning C, Singer Y (2003) Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of HLT-NAACL, pp 252–259
The Penn Treebank Project (2013). http://www.cis.upenn.edu/treebank/
Budanitsky A, Hirst G (2006) Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Comput Linguis 32(1):13–47
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Doklady 10(8):707–710
Frakes WB, Baeza-Yates R (1992) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ
Apache Lucene Homepage (2013). http://lucene.apache.org/core/
Apache Ant Homepage (2013). http://ant.apache.org/
Apache JMeter Homepage (2013). http://jmeter.apache.org/
JUnit Homepage (2013). http://www.junit.org/
JHotDraw 7 Homepage (2013). http://www.randelshofer.ch/oop/jhotdraw/
Sweet Home 3D Homepage (2013). http://sourceforge.net/projects/sweethome3d
Klein D, Manning CD (2003) Accurate Unlexicalized Parsing. In: Proceedings of the meeting of the association for computational linguistics, Sapporo, pp 423–430
Code Amigo Validation WebPage (2014). http://54.250.194.210/
Powers DM (2011) Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. J Mach Learn Technol 1(1):37–63
Eclipse-CS Check Style Homepage (2013). http://eclipse-cs.sourceforge.net/
Find Bugs in Java Programs Homepage (2013). http://findbugs.sourceforge.net/
Bloch J (2001) Effective Java Programming Language Guide. Sun Microsystems
Bloch J (2008) Effective Java (2nd Edition), 2nd edn. Addison-Wesley
Arnaoudova V, Penta MD, Antoniol G, Gueheneuc Y (2013) A New Family of Software Anti-Patterns: Linguistic Anti-Patterns. In: Proceedings of the european conference on software maintenance and reengineering (CSMR), Genova, pp 187–196

Acknowledgments

This paper was supported by research funds of Chonbuk National University in 2014. This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014M3C4A7030505).

Author information

Authors and Affiliations

Department of Software Engineering, Chonbuk National University, 567 Baekje-daero, Deokjin-gu, Jeollabuk-do, 561-756, Jeonju-si, Republic of Korea
Suntae Kim
Computer Science and Communications Research Unit, Faculty of Science, Technology and Communication, and Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, 4 rue Alphonse Weicker, L-2721, Luxembourg-Ville, Luxembourg
Dongsun Kim

Corresponding author

Correspondence to Dongsun Kim.

Additional information

Communicated by: Giulio Antoniol

Appendix A: List of Domain Word POSes and Idioms

Table 13 Domain words with the dominant POS information extracted from the API documents of projects with the parameters $T_{WO} = 100$ and $T_{PR} = 0.8$ (a marked word was evaluated as invalid in the preliminary study; the precision is computed as 176/191 = 0.921)
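
The two parameters in the Table 13 caption suggest a simple filtering step, sketched below. The caption does not define the symbols, so the sketch assumes that $T_{WO}$ is a minimum number of occurrences of a word and $T_{PR}$ is a minimum share of those occurrences carrying the most frequent POS; the function name `dominant_pos_entries`, the use of `>=` for both comparisons, and the sample data are also hypothetical.

```python
from collections import Counter


def dominant_pos_entries(tagged_words, min_occurrences=100, min_ratio=0.8):
    """Map each sufficiently frequent word to its dominant POS.

    tagged_words is an iterable of (word, pos) pairs, for example the
    output of a POS tagger run over API documents. The defaults mirror
    the caption's values under the assumed reading of the parameters.
    """
    counts = {}
    for word, pos in tagged_words:
        counts.setdefault(word, Counter())[pos] += 1

    entries = {}
    for word, pos_counts in counts.items():
        total = sum(pos_counts.values())
        pos, top = pos_counts.most_common(1)[0]
        # Keep the word only if it is frequent enough and one POS dominates.
        if total >= min_occurrences and top / total >= min_ratio:
            entries[word] = pos
    return entries


if __name__ == "__main__":
    # Made-up sample: "index" is mostly a noun, "set" is split, "rare" is too rare.
    sample = (
        [("index", "noun")] * 90 + [("index", "verb")] * 10
        + [("set", "verb")] * 60 + [("set", "noun")] * 40
        + [("rare", "noun")] * 5
    )
    print(dominant_pos_entries(sample))
    # Precision reported in the caption: 176 valid words out of 191 extracted.
    print(round(176 / 191, 3))
```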

Table 14 Idiom identifiers extracted from the API document of projects listed in Table 1, where $T(FO_{fmw}) = 2$, $T(FO_{cls}) = 2$, $T(FO_{att}) = 2$, and $T(FO_{met}) = 10$

About this article

Cite this article

Kim, S., Kim, D. Automatic identifier inconsistency detection using code dictionary. Empir Software Eng 21, 565–604 (2016). https://doi.org/10.1007/s10664-015-9369-5

Issue Date: April 2016
