
Code coverage differences of Java bytecode and source code instrumentation tools

Software Quality Journal

Abstract

Many software testing fields, like white-box testing, test case generation, test prioritization, and fault localization, depend on code coverage measurement. When coverage is used as an overall completeness measure, minor inaccuracies in the data reported by a tool do not matter much; however, in certain situations they can lead to serious confusion. For example, a code element that is falsely reported as covered can introduce false confidence in the test. This work investigates code coverage measurement issues for the Java programming language. For Java, the prevalent approach to code coverage measurement is bytecode instrumentation, due to its various benefits over source code instrumentation. As we have experienced, bytecode instrumentation-based code coverage tools produce different results than source code instrumentation-based ones in terms of which items they report as covered. We report on an empirical study to compare the code coverage results provided by tools using the different instrumentation types for Java coverage measurement on the method level. In particular, we want to find out how inaccurate a bytecode instrumentation approach is compared to a source code instrumentation method. The differences are systematically investigated both in quantitative terms (how much the outputs differ) and in qualitative terms (what causes the differences). In addition, the impact on test prioritization and test suite reduction—a possible application of coverage measurement—is investigated in more detail as well.



Notes

  1. http://eclemma.org/jacoco/.

  2. https://www.atlassian.com/software/clover/.

  3. http://www.sonarqube.org/.

  4. http://cobertura.github.io/cobertura/.

  5. http://www.semdesigns.com/Products/TestCoverage.

  6. http://emma.sourceforge.net/.

  7. http://soda.sed.hu.

  8. http://junit.org/.

  9. https://github.com/.

  10. https://maven.apache.org/.

  11. http://www.semdesigns.com/Products/TestCoverage/JavaTestCoverage.html.

  12. https://wiki.openjdk.java.net/display/CodeTools/jcov/.

  13. https://www.eclipse.org/.

  14. https://netbeans.org/.

  15. https://www.jetbrains.com/idea/.

  16. https://jenkins.io/.

  17. https://gradle.org/.

  18. https://www.eclipse.org/collections/.

  19. http://projects.spring.io/spring-framework/.

  20. http://checkstyle.sourceforge.net/.

  21. https://github.com/sed-szeged/soda-jacoco-maven-plugin.

  22. https://www.sourcemeter.com/.

  23. There was one exception to this: in Cobertura, we disabled the feature of skipping the analysis of generated source code, as this was not implemented in the other two tools.


Acknowledgements

This work was partially supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and by the EU-funded Hungarian national grant GINOP-2.3.2-15-2016-00037 titled “Internet of Living Things.” We are grateful to Semantic Designs, Inc. for providing a free research license for their tool.

Author information

Corresponding author

Correspondence to Ferenc Horváth.

Appendices

Appendix A: Comparison results of source code-based tools

In this appendix, we present additional results of comparing the two tools employing source code instrumentation, Clover and Test Coverage.

As these two tools generate instrumented source code, it is possible to compare their instrumentation algorithms at a high level by investigating the instrumented code itself. To do so, we manually checked the instrumented sources and investigated the probe points, i.e., the locations where extra code was injected into the original source code. We found that the two tools identified and instrumented exactly the same source code elements. The main technical difference between the two tools is that while Test Coverage uses boolean vectors to store coverage data, Clover has a complex mechanism for calculating which part of the production code is exercised (this enables per-test coverage measurement as well). Thus, in general, Test Coverage inserts less extra code into the original source. Another difference is the handling of code that does not have the conventional form of a Java method but is included in the bytecode as a special method (e.g., static initialization code of the class or anonymous methods). Both tools recognize and instrument these parts of the source code, but Clover reports them as methods, while Test Coverage includes them in the class coverage only.
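To make the last point concrete, the following minimal Java example (constructed by us, not taken from the studied systems) shows two such constructs: the static initialization block is compiled into the synthetic <clinit> method, and the anonymous Comparator becomes a separate class (e.g., SpecialMethods$1) whose compare() method exists only at the bytecode level.

// Minimal illustration (our example, not from the studied systems): source
// constructs with no conventional Java method form that still appear as
// methods in the bytecode.
import java.util.Arrays;
import java.util.Comparator;

public class SpecialMethods {

    static final int[] LOOKUP = new int[16];

    // Static initialization block: compiled into the synthetic <clinit> method.
    static {
        for (int i = 0; i < LOOKUP.length; i++) {
            LOOKUP[i] = i * i;
        }
    }

    public static void main(String[] args) {
        Integer[] values = {3, 1, 2};
        // Anonymous class: compiled into a separate class file (SpecialMethods$1)
        // whose compare() method only exists in the bytecode view.
        Arrays.sort(values, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return b - a;
            }
        });
        System.out.println(Arrays.toString(values) + " " + LOOKUP[3]);
    }
}

Depending on the tool, the coverage of such compiler-generated methods may be reported as separate method-level items or folded into the coverage of the enclosing class, as described above.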

In the next step, we calculated the raw overall coverages for our subject systems with these two tools in order to see how much their results differ. Table 14 shows the associated results. Columns 2 and 3 show the overall coverage ratios as produced by the tools, while the last column includes the percentage difference. The numbers in the last row represent the averages of the absolute differences.

Table 14 Comparison of the overall coverages computed by the source code tools (Clover and Test Coverage)

We can observe that the above-mentioned differences between the tools cause only small differences in the overall coverage results. Note that we used Cloverglob, the Clover value measured globally (including cross-submodule coverage for submodule-based systems), because this is how Test Coverage works as well. Unfortunately, we were not able to produce per-test coverage values using Test Coverage, as this would have required the individual execution of the test cases and, consequently, large-scale modifications in the build environments. Hence, we could not perform such a comparison of the tools.

Appendix B: Comparison results of bytecode-based tools

In this appendix, we present additional results of comparing the three tools employing bytecode instrumentation: JaCoCo, JCov, and Cobertura.

We calculated the raw overall coverages for our subject systems with these tools in order to see how much their results differ. Table 15 shows the associated results. Columns 2–4 show the overall coverage ratios as produced by the tools “out of the box” (i.e., without any modifications made to them;Footnote 23 only the necessary parameters have been set). The last three columns show the pairwise percentage differences, while the numbers in the last row represent the averages of the absolute differences.

Table 15 Comparison of the overall coverages computed by the bytecode tools (Cobertura, JaCoCo, and JCov)
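As a small illustration of how the last three columns and the last row of such a table can be derived, the following sketch computes the pairwise differences and the average of their absolute values for one tool pair; the coverage percentages used here are placeholders, not values from Table 15.

// Sketch with placeholder coverage percentages (three subject systems, two tools);
// these are not the values reported in Table 15.
public class CoverageDiff {
    public static void main(String[] args) {
        double[] cobertura = {61.2, 48.7, 70.1};
        double[] jacoco    = {60.4, 49.5, 68.9};

        double sumAbs = 0.0;
        for (int i = 0; i < cobertura.length; i++) {
            double diff = cobertura[i] - jacoco[i];   // pairwise difference per subject
            sumAbs += Math.abs(diff);
            System.out.printf("subject %d: %+.1f%n", i + 1, diff);
        }
        // Last row of the table: average of the absolute differences.
        System.out.printf("average |difference| = %.2f%n", sumAbs / cobertura.length);
    }
}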

Quantitatively, the differences between these tools were at most 4%, and the behavior of the tools varied across the subjects. We also performed a deeper, but still quantitative, analysis: we compared the per-test coverage results of the tools. Unfortunately, we were not able to produce such results using Cobertura, so we compared JaCoCo and JCov in this respect.

In Fig. 7, we present the differences in test case coverage vectors. For each test case, we use a coverage vector in which each element corresponds to a single code element. We compared such vector pairs for JaCoCo and JCov for each test case using the Hamming distance measure and normalized the result by the length of the vectors. Figure 7 shows the corresponding data in the form of histograms.

Fig. 7 Relative Hamming distances of test case vectors (JaCoCo vs. JCov)
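The following sketch shows how such a normalized distance can be computed; it is our reconstruction of the measure described above, using toy data, not the authors' actual tooling.

import java.util.BitSet;

// Our reconstruction of the comparison described above (toy data, not the
// authors' tooling): each bit of a per-test vector marks whether the
// corresponding code element (method) is covered by that test case.
public class CoverageVectorDistance {

    // Fraction of positions at which the two vectors disagree, normalized by
    // the number of code elements the vectors are defined over.
    static double normalizedHamming(BitSet a, BitSet b, int codeElements) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);                                  // set bits mark disagreements
        return (double) diff.cardinality() / codeElements;
    }

    public static void main(String[] args) {
        int codeElements = 1000;                      // hypothetical number of methods
        BitSet jacoco = new BitSet(codeElements);
        BitSet jcov = new BitSet(codeElements);
        jacoco.set(0, 600);                           // toy data: methods 0..599 covered
        jcov.set(0, 590);                             // the other tool misses 10 of them
        System.out.println(normalizedHamming(jacoco, jcov, codeElements)); // prints 0.01
    }
}

The same computation applies to the per-method (code element) vectors discussed below, where each vector belongs to one code element and its bits correspond to the test cases.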

For the first four subject programs, the data show that most of the vector pairs are identical, and the difference is less than 1% for the others. For mapdb and netty, very few vector pairs match exactly, but most of them are still close to each other. In the case of oryx and orientdb, about half of the vector pairs match exactly and the difference in the majority of the cases is less than 1%, but there are differences as high as 5% or even 14%. After manual investigation, most of these high differences were found to be tolerable outliers.

Similarly to the previous set of experiments, we performed a per-method comparison of the coverages. Here, the vectors were assigned to code elements, and each vector element corresponded to a test case. Figure 8 shows the differences in these vectors for JaCoCo and JCov in the form of histograms.

Fig. 8 Relative Hamming distances of code element vectors (JaCoCo vs. JCov)

We got the same results as in the case of the test case vectors for the first four subject programs: most of the vector pairs are identical, and the rest differ by at most 1%. The last four programs are similar to each other: a smaller portion of the vector pairs match exactly, but there are some higher differences as well. These high differences later turned out to be explainable outliers.

To find the cause of the differences (especially the high Hamming distances) observed in these experiments, we also investigated the detailed coverage reports of the three tools manually. Our conclusion was that the differences were mostly due to the slightly different handling of compiler-generated methods and nested classes in the bytecode (such as the methods generated for nested classes). Since the overall quantitative differences were at most 4%, they concerned mostly generated methods (which are less important for code coverage analysis), and the high individual Hamming distances could also be traced back to these methods, we concluded that one representative tool of the three should be sufficient for further experiments.


Cite this article

Horváth, F., Gergely, T., Beszédes, Á. et al. Code coverage differences of Java bytecode and source code instrumentation tools. Software Qual J 27, 79–123 (2019). https://doi.org/10.1007/s11219-017-9389-z
