Journal of Computer Science and Technology, Volume 22, Issue 3, pp.387–396

Improving Software Quality Prediction by Noise Filtering Techniques

Regular Paper

Abstract

The accuracy of machine learners is affected by the quality of the data they are induced on. In this paper, the quality of the training dataset is improved by removing instances detected as noisy by the Partitioning Filter. The fit dataset is first split into subsets, and different base learners are induced on each split. The predictions are combined such that an instance is identified as noisy if it is misclassified by a certain number of base learners. Two versions of the Partitioning Filter are used: the Multiple-Partitioning Filter and the Iterative-Partitioning Filter. The number of instances removed by the filters is tuned through the filter's voting scheme and the number of iterations. The primary aim of this study is to compare the predictive performance of the final models built on the filtered and unfiltered training datasets. A case study of software measurement data from a high-assurance software project is performed. It is shown that the predictive performance of models built on the filtered fit datasets and evaluated on a noisy test dataset is generally better than that of models built on the noisy (unfiltered) fit dataset. However, predictive performance based on certain aggressive filters is affected by the presence of noise in the evaluation dataset.
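The voting step described in the abstract — split the fit data into partitions, induce one base learner per partition, and flag an instance as noisy when enough learners misclassify it — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the use of scikit-learn decision trees as base learners, and the `min_votes` threshold (which moves the scheme between majority and consensus voting) are all assumptions for the example.

```python
# Hypothetical sketch of the Partitioning Filter's voting step.
# Base learner choice (decision trees) and parameter names are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def partitioning_filter(X, y, n_partitions=3, min_votes=2, seed=0):
    """Return a boolean mask marking instances flagged as noisy.

    The fit data is randomly split into `n_partitions` subsets and a
    base learner is induced on each subset. Every instance collects one
    vote per learner that misclassifies it; instances with at least
    `min_votes` votes are flagged. Raising `min_votes` toward
    `n_partitions` gives conservative (consensus) filtering, lowering it
    gives more aggressive (majority-style) filtering.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    partitions = np.array_split(idx, n_partitions)

    votes = np.zeros(len(y), dtype=int)
    for part in partitions:
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(X[part], y[part])
        # Each base learner votes against every instance it misclassifies.
        votes += (clf.predict(X) != y).astype(int)
    return votes >= min_votes
```

Filtering the fit dataset then amounts to retraining the final model on `X[~mask], y[~mask]`. An iterative variant, in the spirit of the Iterative-Partitioning Filter, would repeat this procedure on the surviving instances until no (or few) new instances are flagged.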

Keywords

noise filtering; data quality; software quality classification; expected cost of misclassification; voting expert



Copyright information

© Science Press, Beijing, China and Springer Science + Business Media, LLC, USA 2007

Authors and Affiliations

  1. Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA
