Empirical Software Engineering, Volume 15, Issue 2, pp 147–165

Fault-prone module detection using large-scale text features based on spam filtering

  • Hideaki Hata
  • Osamu Mizuno (corresponding author)
  • Tohru Kikuno
Article

Abstract

This paper proposes an approach to fault-prone module detection that uses large-scale text features and is inspired by spam filtering. The occurrence count of each text feature in the source code of a module is used as training data for the detection models. In this paper, we prepared a naive Bayes classifier and a logistic regression model as detection models. To show the effectiveness of our approach, we conducted experiments on five open source projects and compared our models with models built from a well-known metrics set; our models achieved better detection results. The results imply that large-scale text features are useful for constructing practical detection models and that measuring sophisticated metrics is not always necessary for detecting fault-prone modules.
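
As a rough illustration of the approach, the following Python sketch treats each module's source text as a document, counts token occurrences, and trains a naive Bayes classifier and a logistic regression model. This is a minimal sketch under stated assumptions: the paper itself uses the CRM114 classifier rather than scikit-learn, and the module contents and labels below are placeholder data.

    # Minimal sketch of text-feature-based fault-prone module detection.
    # NOTE: the paper uses CRM114; scikit-learn is an assumption made for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    # Placeholder training data: raw source text of modules and fault-proneness labels.
    modules = [
        "public int add(int a, int b) { return a + b; }",
        "if (buf == null) { buf = new byte[size]; }",
    ]
    labels = [0, 1]  # 0 = not fault-prone, 1 = fault-prone (illustrative labels)

    # Count every whitespace-delimited token; no hand-crafted metrics are computed.
    vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
    X = vectorizer.fit_transform(modules)

    nb = MultinomialNB().fit(X, labels)
    lr = LogisticRegression(max_iter=1000).fit(X, labels)

    # Classify an unseen module with both models.
    X_new = vectorizer.transform(["for (int i = 0; i < n; i++) { sum += a[i]; }"])
    print(nb.predict(X_new), lr.predict(X_new))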

Keywords

Fault-prone module · Large-scale · Text feature · Spam filtering · Text mining · Software repository

Notes

Acknowledgements

The authors would like to thank the three anonymous reviewers and the editor for their insightful and useful suggestions and comments. We also thank the developers of the CRM114 classifier; without CRM114, this work could not have been conducted. We thank Tatsuya Miyake, Yoshiki Higo, and Katsuro Inoue, who implemented a software metrics measurement tool. Finally, the authors also wish to thank the developers of Eclipse, who have made the Eclipse repository available for research.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Graduate School of Information Science and Technology, Kyoto Institute of Technology, Kyoto, Japan
  2. Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
