
Code coverage differences of Java bytecode and source code instrumentation tools

Software Quality Journal

Abstract

Many software testing fields, like white-box testing, test case generation, test prioritization, and fault localization, depend on code coverage measurement. When coverage is used as an overall completeness measure, minor inaccuracies in the data reported by a tool do not matter much; however, in certain situations they can lead to serious confusion. For example, a code element that is falsely reported as covered can introduce false confidence in the test. This work investigates code coverage measurement issues for the Java programming language. For Java, the prevalent approach to code coverage measurement is bytecode instrumentation, due to its various benefits over source code instrumentation. As we have experienced, bytecode instrumentation-based code coverage tools produce different results than source code instrumentation-based ones in terms of which items they report as covered. We report on an empirical study to compare the code coverage results provided by tools using the different instrumentation types for Java coverage measurement on the method level. In particular, we want to find out how inaccurate a bytecode instrumentation approach is compared to a source code instrumentation method. The differences are systematically investigated both in quantitative terms (how much the outputs differ) and in qualitative terms (what causes the differences). In addition, the impact on test prioritization and test suite reduction—a possible application of coverage measurement—is investigated in more detail as well.



Notes

  1. http://eclemma.org/jacoco/.

  2. https://www.atlassian.com/software/clover/.

  3. http://www.sonarqube.org/.

  4. http://cobertura.github.io/cobertura/.

  5. http://www.semdesigns.com/Products/TestCoverage.

  6. http://emma.sourceforge.net/.

  7. http://soda.sed.hu.

  8. http://junit.org/.

  9. https://github.com/.

  10. https://maven.apache.org/.

  11. http://www.semdesigns.com/Products/TestCoverage/JavaTestCoverage.html.

  12. https://wiki.openjdk.java.net/display/CodeTools/jcov/.

  13. https://www.eclipse.org/.

  14. https://netbeans.org/.

  15. https://www.jetbrains.com/idea/.

  16. https://jenkins.io/.

  17. https://gradle.org/.

  18. https://www.eclipse.org/collections/.

  19. http://projects.spring.io/spring-framework/.

  20. http://checkstyle.sourceforge.net/.

  21. https://github.com/sed-szeged/soda-jacoco-maven-plugin.

  22. https://www.sourcemeter.com/.

  23. There was one exception to this: in Cobertura, we disabled the feature of skipping the analysis of generated source code, as this was not implemented in the other two tools.


Acknowledgements

This work was partially supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and by the EU-funded Hungarian national grant GINOP-2.3.2-15-2016-00037 titled “Internet of Living Things.” We are grateful to Semantic Designs, Inc. for providing a free research license for their tool.

Author information

Corresponding author

Correspondence to Ferenc Horváth.

Appendices

Appendix A: Comparison results of source code-based tools

In this appendix, we present additional results of comparing the two tools employing source code instrumentation, Clover and Test Coverage.

As these two tools generate instrumented source code, it is possible to compare their instrumentation algorithms at a high level by investigating the instrumented code itself. To do so, we manually checked the instrumented sources and investigated the probe points, i.e., the locations where extra code was injected into the original source code. We found that the two tools identified and instrumented exactly the same source code elements. The main technical difference between the two tools is that while Test Coverage uses boolean vectors to store coverage data, Clover has a complex mechanism for calculating which part of the production code is exercised (this enables per-test coverage measurement as well). Thus, in general, Test Coverage inserts less extra code into the original source. Another difference is the handling of code that does not have the conventional form of a Java method but is included in the bytecode as a special method (e.g., static initialization code of the class or anonymous methods). Both tools recognize and instrument these parts of the source code, but Clover reports them as methods, while Test Coverage includes them in the class coverage only.
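To make the last point concrete, the following minimal Java example (constructed by us, not taken from the studied systems) shows two such constructs: the static initialization block is compiled into the synthetic <clinit> method, and the anonymous Comparator becomes a separate class (e.g., SpecialMethods$1) whose compare() method exists only at the bytecode level.

// Minimal illustration (our example, not from the studied systems): source
// constructs with no conventional Java method form that still appear as
// methods in the bytecode.
import java.util.Arrays;
import java.util.Comparator;

public class SpecialMethods {

    static final int[] LOOKUP = new int[16];

    // Static initialization block: compiled into the synthetic <clinit> method.
    static {
        for (int i = 0; i < LOOKUP.length; i++) {
            LOOKUP[i] = i * i;
        }
    }

    public static void main(String[] args) {
        Integer[] values = {3, 1, 2};
        // Anonymous class: compiled into a separate class file (SpecialMethods$1)
        // whose compare() method only exists in the bytecode view.
        Arrays.sort(values, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return b - a;
            }
        });
        System.out.println(Arrays.toString(values) + " " + LOOKUP[3]);
    }
}

Depending on the tool, the coverage of such compiler-generated methods may be reported as separate method-level items or folded into the coverage of the enclosing class, as described above.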

In the next step, we calculated the raw overall coverages for our subject systems with these two tools in order to see how much their results differ. Table 14 shows the associated results. Columns 2 and 3 show the overall coverage ratios as produced by the tools, while the last column includes the percentage difference. The numbers in the last row represent the averages of the absolute differences.

Table 14 Comparison of the overall coverages computed by the source code tools (Clover and Test Coverage)

We can observe that the above-mentioned differences between the tools cause only small differences in the overall coverage results. Note that we used Cloverglob, the Clover value measured globally (including cross-submodule coverage for submodule-based systems), because this is how Test Coverage works as well. Unfortunately, we were not able to produce per-test coverage values using Test Coverage, as this would have required the individual execution of the test cases and, consequently, large-scale modifications in the build environments. Hence, we could not perform such a comparison of the tools.

Appendix B: Comparison results of bytecode-based tools

In this appendix, we present additional results of comparing the three tools employing bytecode instrumentation: JaCoCo, JCov, and Cobertura.

We calculated the raw overall coverages for our subject systems with these tools in order to see how much their results differ. Table 15 shows the associated results. Columns 2–4 show the overall coverage ratios as produced by the tools “out of the box” (i.e., without any modifications made to them;Footnote 23 only the necessary parameters have been set). The last three columns show the pairwise percentage differences, while the numbers in the last row represent the averages of the absolute differences.

Table 15 Comparison of the overall coverages computed by the bytecode tools (Cobertura, JaCoCo, and JCov)
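As a small illustration of how the last three columns and the last row of such a table can be derived, the following sketch computes the pairwise differences and the average of their absolute values for one tool pair; the coverage percentages used here are placeholders, not values from Table 15.

// Sketch with placeholder coverage percentages (three subject systems, two tools);
// these are not the values reported in Table 15.
public class CoverageDiff {
    public static void main(String[] args) {
        double[] cobertura = {61.2, 48.7, 70.1};
        double[] jacoco    = {60.4, 49.5, 68.9};

        double sumAbs = 0.0;
        for (int i = 0; i < cobertura.length; i++) {
            double diff = cobertura[i] - jacoco[i];   // pairwise difference per subject
            sumAbs += Math.abs(diff);
            System.out.printf("subject %d: %+.1f%n", i + 1, diff);
        }
        // Last row of the table: average of the absolute differences.
        System.out.printf("average |difference| = %.2f%n", sumAbs / cobertura.length);
    }
}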

Quantitatively, the differences between these tools were at most 4%, and the behavior of the tools varied across the subjects. We also performed a deeper, but still quantitative, analysis: we compared the per-test coverage results of the tools. Unfortunately, we were not able to produce such results using Cobertura, so we compared JaCoCo and JCov in this respect.

In Fig. 7, we present the differences in test case coverage vectors. For each test case, we use a coverage vector in which each element corresponds to a single code element. We compared such vector pairs for JaCoCo and JCov for each test case using the Hamming distance measure and normalized the result by the length of the vectors. Figure 7 shows the corresponding data in the form of histograms.

Fig. 7 Relative Hamming distances of test case vectors (JaCoCo vs. JCov)
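The following sketch shows how such a normalized distance can be computed; it is our reconstruction of the measure described above, using toy data, not the authors' actual tooling.

import java.util.BitSet;

// Our reconstruction of the comparison described above (toy data, not the
// authors' tooling): each bit of a per-test vector marks whether the
// corresponding code element (method) is covered by that test case.
public class CoverageVectorDistance {

    // Fraction of positions at which the two vectors disagree, normalized by
    // the number of code elements the vectors are defined over.
    static double normalizedHamming(BitSet a, BitSet b, int codeElements) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);                                  // set bits mark disagreements
        return (double) diff.cardinality() / codeElements;
    }

    public static void main(String[] args) {
        int codeElements = 1000;                      // hypothetical number of methods
        BitSet jacoco = new BitSet(codeElements);
        BitSet jcov = new BitSet(codeElements);
        jacoco.set(0, 600);                           // toy data: methods 0..599 covered
        jcov.set(0, 590);                             // the other tool misses 10 of them
        System.out.println(normalizedHamming(jacoco, jcov, codeElements)); // prints 0.01
    }
}

The same computation applies to the per-method (code element) vectors discussed below, where each vector belongs to one code element and its bits correspond to the test cases.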

For the first four subject programs, the data show that most of the vector pairs are identical, and the difference is less than 1% for the others. For mapdb and netty, very few vector pairs match exactly, but most of them are still close to each other. In the case of oryx and orientdb, about half of the vector pairs match exactly and the difference in the majority of the cases is less than 1%, but there are differences as high as 5% or even 14%. After manual investigation, most of these high differences were found to be tolerable outliers.

Similarly to the previous set of experiments, we performed a per-method comparison of the coverages. Here, the vectors were assigned to code elements, and each vector element corresponded to a test case. Figure 8 shows the differences in these vectors for JaCoCo and JCov in the form of histograms.

Fig. 8 Relative Hamming distances of code element vectors (JaCoCo vs. JCov)

We got the same results as in the case of the test case vectors for the first four subject programs: most of the vector pairs are identical, and the rest differ by at most 1%. The last four programs are similar to each other: a smaller portion of the vector pairs match exactly, but there are some higher differences as well. These high differences later turned out to be explainable outliers.

To find the cause of the differences (especially the high Hamming distances) observed in these experiments, we also investigated the detailed coverage reports of the three tools manually. Our conclusion was that the differences were mostly due to the slightly different handling of compiler-generated methods and nested classes in the bytecode (such as the methods generated for nested classes). Since the overall quantitative differences were at most 4%, they concerned mostly generated methods (which are less important for code coverage analysis), and the high individual Hamming distances could also be traced back to these methods, we concluded that one representative tool of the three should be sufficient for further experiments.


Cite this article

Horváth, F., Gergely, T., Beszédes, Á. et al. Code coverage differences of Java bytecode and source code instrumentation tools. Software Qual J 27, 79–123 (2019). https://doi.org/10.1007/s11219-017-9389-z
