Coherence of comments and method implementations: a dataset and an empirical investigation

Corazza, Anna; Maggio, Valerio; Scanniello, Giuseppe

doi:10.1007/s11219-016-9347-1

Published: 07 November 2016

Volume 26, pages 751–777, (2018)
Software Quality Journal

Anna Corazza¹,
Valerio Maggio² &
Giuseppe Scanniello³

Abstract

In this paper, we present the results of a manual assessment of the coherence between the comments and the implementations of 3636 methods in three open source software applications implemented in Java (for one of these applications, we considered two subsequent versions). The results of this assessment have been collected in a dataset that we made publicly available on the Web. The dataset was created following a protocol that is detailed in this paper. We present that protocol to let researchers evaluate the quality of our dataset and to ease its possible future extension. Another contribution of this paper is a preliminary investigation of the effectiveness of a Vector Space Model (VSM) with the tf-idf schema in discriminating between coherent and non-coherent methods. We observed that lexical similarity alone is not sufficient for this distinction, while encouraging results were obtained by applying a Support Vector Machine (SVM) classifier to the whole vector space.
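To make the lexical-similarity idea in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that comments and method bodies have already been tokenized and normalized, uses one common smoothed tf-idf weighting, and compares each comment with its method body by cosine similarity. The function names, the toy tokens, and the two example pairs are all hypothetical; the SVM classification step is not shown.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Map each tokenized document to a sparse tf-idf vector (term -> weight)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # Raw term frequency times a smoothed idf (one common variant, assumed here).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical (comment tokens, method body tokens) pairs.
pairs = {
    "coherent": (
        ["returns", "the", "number", "of", "series", "in", "the", "dataset"],
        ["return", "dataset", "series", "size"],
    ),
    "non_coherent": (
        ["draws", "the", "chart", "legend"],
        ["return", "dataset", "item", "count"],
    ),
}

# Fit the weighting on every comment and body, then compare each pair.
labels = list(pairs)
corpus = [doc for label in labels for doc in pairs[label]]
vectors = tf_idf_vectors(corpus)

similarities = {
    label: cosine_similarity(vectors[2 * i], vectors[2 * i + 1])
    for i, label in enumerate(labels)
}

for label, value in similarities.items():
    print(f"{label}: {value:.3f}")
```

In this toy case the coherent pair shares terms and the non-coherent pair does not, so the similarity separates them. The abstract reports that on real methods such a similarity score alone was not sufficient, which is why a classifier over the whole vector space was also investigated.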

Notes

¹ The comment right before or after the definition of a method, a class, an abstract class, and so on.
² www2.unibas.it/gscanniello/coherence/
³ In our case, an annotator is a person who annotates software by associating coherence information with methods.
⁴ http://sphinx-doc.org
⁵ agile.csc.ncsu.edu/SEMaterials/tutorials/coffee_maker/
⁶ www.jfree.org/jfreechart/
⁷ www.jhotdraw.org/
⁸ https://goo.gl/5oEys8
⁹ http://www.jfree.org/jfreechart/api/javadoc/org/jfree/data/general/AbstractSeriesDataset.html#getSeriesCount
¹⁰ Such an approach is usually referred to as macroaveraging (Manning et al. 2008).
¹¹ https://github.com/leriomaggio/code-coherence-analysis

References

Antoniol, G., Canfora, G., Casazza, G., & De Lucia, A. (2000). Information retrieval models for recovering traceability links between code and documentation. In Proceedings of the international conference on software maintenance (pp. 40–51): IEEE Computer Society.
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
Binkley, D., Lawrie, D., Pollock, L., Hill, E., & Vijay-Shanker, K. (2013). A dataset for evaluating identifier splitters, IEEE Computer Society.
Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics), Springer-Verlag New York, Inc., Secaucus.
Campbell, I., & Yiming, Y. (2011). Learning with support vector machines, Morgan and Claypool.
Caprile, B., & Tonella, P. (2000). Restructuring program identifier names. In Proceedings of international conference on software maintenance (pp. 97–107): IEEE Computer Society.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.
Corazza, A., Di Martino, S., & Maggio, V. (2012). LINSEN: an efficient approach to split identifiers and expand abbreviations. In Proceedings of international conference on software maintenance (pp. 233–242): IEEE Computer Society.
Corazza, A., Di Martino, S., Maggio, V., & Scanniello, G. (2011). Investigating the use of lexical information for software system clustering. In Proceedings of European conference on software maintenance and reengineering (pp. 35–44): IEEE Computer Society.
Corazza, A., Maggio, V., & Scanniello, G. (2015). On the coherence between comments and implementations in source code. In Proceedings of EUROMICRO conference on software engineering and advanced applications (pp. 76–83): IEEE Computer Society.
de Souza, S. C. B., Anquetil, N., & de Oliveira, K. M. (2005). A study of the documentation essential to software maintenance. In Proceedings of the international conference on design of communication: documenting & designing for pervasive information (pp. 68–75): ACM.
DeLine, R., Khella, A., Czerwinski, M., & Robertson, G. (2005). Towards understanding programs through wear-based filtering. In Proceedings of the 2005 ACM symposium on Software visualization, SoftVis ’05 (pp. 183–192): ACM.
Dit, B., Revelle, M., Gethers, M., & Poshyvanyk, D. (2013). Feature location in source code: a taxonomy and survey. Journal of Software: Evolution and Process, 25(1), 53–95.
Fluri, B., Wursch, M., & Gall, H. (2007). Do code and comments co-evolve? on the relation between source code and comment changes. In Proceedings of the working conference on reverse engineering (pp. 70–79): IEEE Computer Society.
Fowler, M. (1999). Refactoring: improving the design of existing code. Boston: Addison-Wesley Longman Publishing Co., Inc.
Freund, R. J., & Wilson, W. J. (2003). Statistical methods, 2nd edn. Academic Press.
Jiang, Z. M., & Hassan, A. E. (2006). Examining the evolution of code comments in PostgreSQL. In Diehl, S., Gall, H., & Hassan, A. E. (Eds.) Proceedings of mining software repositories (pp. 179–180): ACM.
Keyes, J. (2002). Software engineering handbook: Taylor & Francis.
Kuhn, A., Ducasse, S., & Gîrba, T. (2007). Semantic clustering identifying topics in source code. Information & Software Technology, 49(3), 230–243.
LaToza, T. D., Venolia, G., & DeLine, R. (2006). Maintaining mental models: a study of developer work habits. In Proceedings of the 28th international conference on software engineering, ICSE ’06 (pp. 492–501): ACM.
Lawrie, D., Binkley, D., & Morrell, C. (2010). Normalizing source code vocabulary. In Proceedings of working conference on reverse engineering (pp. 3–12): IEEE Computer Society.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.
McMillan, C., Grechanik, M., Poshyvanyk, D., Fu, C., & Xie, Q. (2012). Exemplar: a source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering, 38(5), 1069–1087.
Robillard, M. P., Coelho, W., & Murphy, G. C. (2004). How effective developers investigate source code: an exploratory study. IEEE Transactions on Software Engineering, 30(12), 889–903.
Roehm, T., Tiarks, R., Koschke, R., & Maalej, W. (2012). How do professional developers comprehend software?. In Proceedings of the 2012 international conference on software engineering, ICSE 2012 (pp. 255–265). Piscataway, NJ, USA: IEEE Press.
Salviulo, F., & Scanniello, G. (2014). Dealing with identifiers and comments in source code comprehension and maintenance: Results from an ethnographically-informed study with students and professionals. In Proceedings of International Conference on Evaluation and Assessment in Software Engineering (pp. 423–432): ACM Press.
Scanniello, G., Marcus, A., & Pascale, D. (2015). Link analysis algorithms for static concept location: an empirical assessment. Empirical Software Engineering, 20(6), 1666–1720.
Singer, J., Lethbridge, T., Vinson, N., & Anquetil, N. (1997). An examination of software engineering work practices. In Proceedings of the conference of the centre for advanced studies on collaborative research (p. 21): IBM Press.
Soloway, E., & Ehrlich, K. (1984). Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, 10(5), 595–609.
Steidl, D., Hummel, B., & Jürgens, E. (2013). Quality analysis of source code comments. In Proceedings of international conference on program comprehension (pp. 83–92): IEEE Computer Society.
Tan, L., Yuan, D., Krishna, G., & Zhou, Y. (2007). iComment: Bugs or bad comments? ACM.
Tan, S. H., Marinov, D., Tan, L., & Leavens, G. T. (2012). @tcomment: Testing javadoc comments to detect comment-code inconsistencies. In Proceedings of international conference on software testing (pp. 260–269): IEEE Computer Society.
Van Der Maaten, L. (2014). Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research, 15(1), 3221–3245.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M., Regnell, B., & Wesslén, A. (2012). Experimentation in software engineering. Computer science: Springer.

Acknowledgment

We would like to thank the annotators of our dataset and the reviewers for their precious and constructive comments and suggestions.

Author information

Authors and Affiliations

Department of Electrical Engineering and Information Technologies, University of Naples “Federico II”, Naples, Italy
Anna Corazza
Fondazione Bruno Kessler, Trento, Italy
Valerio Maggio
Department of Mathematics, Information Technology, and Economics, University of Basilicata, Potenza, Italy
Giuseppe Scanniello

Corresponding author

Correspondence to Anna Corazza.

About this article

Cite this article

Corazza, A., Maggio, V. & Scanniello, G. Coherence of comments and method implementations: a dataset and an empirical investigation. Software Qual J 26, 751–777 (2018). https://doi.org/10.1007/s11219-016-9347-1

Published: 07 November 2016
Issue Date: June 2018
DOI: https://doi.org/10.1007/s11219-016-9347-1
