Quality assessment of crowdsourced test cases

Abstract

Various software-engineering problems have been solved by crowdsourcing. In many projects, the software outsourcing process is streamlined on cloud-based platforms. Among software-engineering tasks, test-case development is particularly suitable for crowdsourcing, because a large number of test cases can be generated at little monetary cost. However, the test cases harvested from crowdsourcing vary widely in quality. Owing to their large volume, distinguishing the high-quality tests by traditional techniques is computationally expensive. Crowdsourced testing would therefore benefit from an efficient mechanism that distinguishes test cases by quality. This paper introduces an automated approach, TCQA, that evaluates the quality of test cases based on the onsite coding history. Quality assessment by TCQA proceeds through three steps: (1) modeling the code history as a time series, (2) extracting multiple relevant features from the time series, and (3) building a model that classifies the test cases by quality. Step (3) is accomplished by feature-based machine-learning techniques. By leveraging the onsite coding history, TCQA can assess test-case quality without performing expensive source-code analysis or executing the test cases. Using data from nine test-development tasks involving more than 400 participants, we evaluated TCQA from multiple perspectives. TCQA assessed the quality of the test cases with higher precision, greater speed, and lower overhead than conventional test-case quality-assessment techniques. Moreover, TCQA yielded real-time insights on test-case quality before the assessment was finished.
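The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the snapshot format, the choice of source length as the series value, the four features, and the nearest-centroid classifier standing in for the feature-based learner are all assumptions made for illustration, and every function name is hypothetical.

```python
from statistics import mean, pstdev


def to_series(snapshots):
    """Step 1: model a code history as a time series.

    Assumption: each snapshot is (timestamp_seconds, source_text) and the
    series value is the source length in characters.
    """
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    return [len(text) for _, text in ordered]


def extract_features(series):
    """Step 2: summarize a non-empty series with a few illustrative features."""
    deltas = [later - earlier for earlier, later in zip(series, series[1:])]
    return [
        float(mean(series)),                       # average code size
        float(pstdev(series)),                     # variability of code size
        float(series[-1] - series[0]),             # net growth over the session
        float(sum(1 for d in deltas if d < 0)),    # number of shrinking edits
    ]


def fit_centroids(feature_rows, labels):
    """Step 3 (stand-in classifier): one mean feature vector per label."""
    grouped = {}
    for row, label in zip(feature_rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))
```

In this sketch a new test case would be scored by chaining the steps, for example `predict(centroids, extract_features(to_series(snapshots)))`. Nothing here reads or runs the test code itself, which mirrors the abstract's point that the assessment relies on the coding history alone.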

Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2018YFB1403400) and the National Natural Science Foundation of China (Grant Nos. 61690201, 61772014).

Author information

Corresponding authors

Correspondence to Yang Feng, Yi Wang or Chunrong Fang.

About this article

Cite this article

Zhao, Y., Feng, Y., Wang, Y. et al. Quality assessment of crowdsourced test cases. Sci. China Inf. Sci. 63, 190102 (2020). https://doi.org/10.1007/s11432-019-2859-8

  • DOI: https://doi.org/10.1007/s11432-019-2859-8

Keywords

  • crowdsourcing
  • onsite programming
  • test quality
  • programming behavior