Estimating the number of remaining links in traceability recovery

Empirical Software Engineering

Abstract

Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious and error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, they provide the requirements analyst with a ranked list of candidate links, usually very long, that must be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL), which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a recovery approach based on Natural Language Processing (NLP) techniques. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, covering traceability links among different kinds of software artifacts, including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique; and (iii) univariate estimation models outperform multivariate ones.
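To make the idea of estimating remaining positive links more concrete, the following is a minimal sketch and not the ENRL implementation. It assumes a hypothetical setup that the abstract does not specify: the analyst has already inspected the top of the ranked list, a univariate logistic model is fitted on the similarity scores of those inspected links, and the estimate is the sum of predicted probabilities over the uninspected links. The function names, the choice of logistic regression, and the training hyperparameters are all illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(scores, labels, epochs=2000, lr=0.5):
    """Fit p(link | score) = sigmoid(w * score + b) by batch gradient descent.

    Hypothetical stand-in for an ML classifier with a single predictor
    (the similarity score), i.e. a univariate model.
    """
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(w * s + b)
            grad_w += (p - y) * s
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def estimate_remaining_links(ranked_scores, inspected_labels):
    """Estimate how many true links remain below the inspected prefix.

    ranked_scores: similarity scores of all candidate links, best first.
    inspected_labels: 1 (true link) or 0 (false positive) for the first
        len(inspected_labels) candidates, as judged by the analyst.
    """
    k = len(inspected_labels)
    w, b = fit_logistic(ranked_scores[:k], inspected_labels)
    # Expected number of positives = sum of per-link probabilities.
    return sum(sigmoid(w * s + b) for s in ranked_scores[k:])


if __name__ == "__main__":
    scores = [0.95 - 0.05 * i for i in range(19)]  # 0.95 down to 0.05
    labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]        # analyst's first 10 judgments
    print(estimate_remaining_links(scores, labels))
```

An analyst could compare such an estimate against the cost of inspecting the rest of the list when deciding whether to stop; how well any such estimate works in practice depends on the model and on the NLP technique that produced the ranking.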

Notes

  1. http://www.finmeccanica.com/en/home

  2. http://www.cs.wm.edu/semeru/tefse2011/Challenge.htm

  3. http://www.ing.unisannio.it/mdipenta/estimating-links.tgz

  4. https://github.com/apicchiani/estimationTool

  5. For LSA, the integer after the technique name (e.g., LSA 100) indicates the number of LSA concepts.

  6. http://www.falessi.com/EMSE-Traceability2016.zip

  7. This is a common problem in IR-based traceability link recovery (De Lucia et al. 2011).

References

  • Abadi A, Nisenson M, Simionovici Y (2008) A traceability technique for specifications. In: The 16th IEEE international conference on program comprehension, ICPC 2008, Amsterdam, The Netherlands, June 10–13, 2008. IEEE CS, pp 103–112

  • Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46(3):175–185

  • Antoniol G, Canfora G, Casazza G, De Lucia A (2000) Identifying the starting impact set of a maintenance request: a case study. In: European conference on software maintenance and reengineering, CSMR, pp 227–230

  • Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983

  • Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE 2010, Cape Town, South Africa, 1–8 May 2010, pp 95–104

  • Athanasiadis I (2007) The fuzzy lattice reasoning (FLR) classifier for mining environmental data. In: Kaburlasos V, Ritter G (eds) Computational intelligence based on lattice theory, studies in computational intelligence, vol 67. Springer, Berlin, Heidelberg, pp 175–193. doi:10.1007/978-3-540-72687-6_9

  • Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley

  • Bai CG, Cai KY, Hu QP, Ng SH (2008) On the trend of remaining software defect estimation. IEEE Trans Syst Man Cybern Part A Syst Humans 38(5):1129–1142. doi:10.1109/TSMCA.2008.2001071

  • Baker RD (1995) Modern permutation test software. In: Edgington E (ed) Randomization tests. Marcel Dekker

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. doi:10.1162/jmlr.2003.3.4-5.993

  • Borg M, Runeson P, Ardö A (2014) Recovering from a decade: a systematic mapping of information retrieval approaches to software traceability. Empir Softw Eng 19(6):1565–1616. doi:10.1007/s10664-013-9255-y

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  • Briand LC, Emam KE, Freimut BG, Laitenberger O (2000) A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Trans Softw Eng 26(6):518–540

  • Briand LC, Falessi D, Nejati S, Sabetzadeh M, Yue T (2014) Traceability and SysML design slices to support safety inspections: a controlled experiment. ACM Trans Softw Eng Methodol 23(1):9:1–9:43. doi:10.1145/2559978

  • Cai K (1998) On estimating the number of defects remaining in software. J Syst Softw 40(2):93–114. doi:10.1016/S0164-1212(97)00003-4

  • Capobianco G, De Lucia A, Oliveto R, Panichella A, Panichella S (2009) On the role of the nouns in IR-based traceability recovery. In: The 17th IEEE international conference on program comprehension, ICPC 2009, Vancouver, British Columbia, Canada, May 17–19, 2009. IEEE CS, pp 148–157

  • Chen T, Sahinoglu M, von Mayrhauser A, Hajjar A, Anderson C (1999) How much testing is enough? Applying stopping rules to behavioral model testing. In: 4th IEEE international symposium on high-assurance systems engineering, 1999. Proceedings, pp 249–256. doi:10.1109/HASE.1999.809500

  • Cleland-Huang J, Settimi R, Duan C, Zou X (2005) Utilizing supporting evidence to improve dynamic requirements traceability. In: 13th IEEE international conference on requirements engineering (RE 2005), 29 August - 2 September 2005, Paris, France. IEEE CS, pp 135–144

  • Colwell DJ, Gillett JR (1982) 66.49 Spearman versus Kendall. Math Gaz 66(438):307–309

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience

  • Cuddeback D, Dekhtyar A, Huffman Hayes J, Holden J, Kong W-K (2011) Towards overcoming human analyst fallibility in the requirements tracing process. In: Proceedings of the 33rd international conference on software engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, May 21–28, 2011. ACM, pp 860–863

  • Czauderna A, Cleland-Huang J, Cinar M, Berenbach B (2012) Just-in-time traceability for mechatronics systems. In: IEEE second workshop on requirements engineering for systems, services and systems-of-systems (RES4), 2012, pp 1–9. doi:10.1109/RES4.2012.6347691

  • Dag JN, Regnell B, Carlshamre P, Andersson M, Karlsson J (2002) A feasibility study of automated natural language requirements analysis in market-driven development. Requir Eng 7(1):20–33

  • De Lucia A, Oliveto R, Sgueglia P (2006) Incremental approach and user feedbacks: a silver bullet for traceability recovery. In: 22nd IEEE international conference on software maintenance (ICSM 2006), 24–27 September 2006, Philadelphia, Pennsylvania, USA. IEEE Computer Society, pp 299–309

  • De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4)

  • De Lucia A, Oliveto R, Tortora G (2009) Assessing IR-based traceability recovery tools through controlled experiments. Empir Softw Eng 14(1):57–92

  • De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2011) Improving IR-based traceability recovery using smoothing filters. In: The 19th IEEE international conference on program comprehension, ICPC 2011, Kingston, ON, Canada, June 22–24, 2011. IEEE Computer Society, pp 21–30

  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  • Dekhtyar A, Dekhtyar O, Holden J, Hayes JH, Cuddeback D, Kong W-K (2011) On human analyst performance in assisted requirements tracing: statistical analysis. In: RE 2011, 19th IEEE international requirements engineering conference, Trento, Italy, August 29 2011–September 2, 2011. IEEE, pp 111–120

  • Duan C, Cleland-Huang J (2007) Clustering support for automated tracing. In: 22nd IEEE/ACM international conference on automated software engineering (ASE 2007), November 5–9, 2007, Atlanta, Georgia, USA. ACM, pp 244–253

  • Falessi D, Reichel A (2015) Towards an open-source tool for measuring and visualizing the interest of technical debt. In: IEEE 7th international workshop on managing technical debt (MTD), 2015, pp 1–8. doi:10.1109/MTD.2015.7332618

  • Falessi D, Briand LC, Cantone G (2009) The impact of automated support for linking equivalent requirements based on similarity measures. Tech. rep., Simula Research Laboratory Technical Report 2009-08

  • Falessi D, Cantone G, Canfora G (2011) Empirical principles and an industrial case study in retrieving equivalent requirements via natural language processing techniques. IEEE Trans Softw Eng 39(1):18–44

  • Falessi D, Shaw MA, Mullen K (2014) Achieving and maintaining CMMI maturity level 5 in a small organization. IEEE Softw 31(5):80–86. doi:10.1109/MS.2014.17

  • Fellbaum C (1998) WordNet: an electronic lexical database. The MIT Press

  • Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  • Freund Y, Mason L (1999) The alternating decision tree learning algorithm. In: Machine learning: proceedings of the sixteenth international conference. Morgan Kaufmann, pp 124–133

  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28(2):337–407

  • Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2011) On integrating orthogonal information retrieval methods to improve traceability recovery. In: IEEE 27th international conference on software maintenance, ICSM 2011, Williamsburg, VA, USA, September 25–30, 2011. IEEE, pp 133–142

  • Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: the study of methods. IEEE Trans Softw Eng 32(1):4–19

  • Huffman Hayes J, Dekhtyar A, Osborne J (2003) Improving requirements tracing via information retrieval. In: 11th IEEE international conference on requirements engineering (RE 2003), 8–12 September 2003, Monterey Bay, CA, USA. IEEE CS, p 138

  • Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11, ACM, New York, NY, USA, pp 481–490. doi:10.1145/1985793.1985859

  • Krishnan S, Strasburg C, Lutz RR, Goseva-Popstojanova K, Dorman KS (2013) Predicting failure-proneness in an evolving software product line. Inf Softw Technol 55(8):1479–1495. doi:10.1016/j.infsof.2012.11.008

  • Lindvall M, Sandahl K (1996) Practical implications of traceability. Softw Pract Exper 26(10):1161–1180. doi:10.1002/(SICI)1097-024X(199610)26:10<1161::AID-SPE58>3.3.CO;2-O

  • Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J (2013) Improving trace accuracy through data-driven configuration and composition of tracing features. In: Joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE’13, Saint Petersburg, Russian Federation, August 18–26, 2013. ACM, pp 378–388

  • Lormans M, van Deursen A (2006) Can LSI help reconstructing requirements traceability in design and test? In: 10th European conference on software maintenance and reengineering (CSMR 2006), 22–24 March 2006, Bari, Italy. IEEE Computer Society, pp 47–56

  • Lormans M, van Deursen A, Groß H (2008) An industrial case study in reconstructing requirements views. Empir Softw Eng 13(6):727–760. doi:10.1007/s10664-008-9078-4

  • Malaiya YK, Denton J (1998) Estimating the number of residual defects [in software]. In: High-assurance systems engineering symposium, 1998. Proceedings. Third IEEE international, pp 98–105. doi:10.1109/HASE.1998.731600

  • Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of the 25th international conference on software engineering, May 3-10, 2003, Portland, Oregon, USA. IEEE CS, pp 125–137

  • Mirakhorli M, Cleland-Huang J (2011) Tracing architectural concerns in high assurance systems (NIER track). In: Proceedings of the 33rd international conference on software engineering, ICSE ’11, ACM, New York, NY, USA, pp 908–911. doi:10.1145/1985793.1985942

  • Mirakhorli M, Shin Y, Cleland-Huang J, Cinar M (2012) A tactic-centric approach for automating traceability of quality concerns. In: 2012 34th international conference on software engineering (ICSE), pp 639–649. doi:10.1109/ICSE.2012.6227153

  • Myers JL, Well AD (2003) Research design and statistical analysis. Lawrence Erlbaum Associates, New Jersey

  • Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering, ICSE ’06, ACM, New York, NY, USA, pp 452–461. doi:10.1145/1134285.1134349

  • Nam J, Kim S (2015) CLAMI: defect prediction on unlabeled datasets. In: Proceedings of the 30th IEEE/ACM international conference on automated software engineering (ASE 2015)

  • Okutan A, Yildiz OT (2014) Software defect prediction using Bayesian networks. Empir Softw Eng 19(1):154–181. doi:10.1007/s10664-012-9218-8

  • Otis D, Burnham K, White G, Anderson D (1978) Statistical inference from capture data on closed animal populations. Wildl Monogr 62(135)

  • Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. In: 35th international conference on software engineering, ICSE ’13, San Francisco, CA, USA, May 18–26, 2013. IEEE/ACM, pp 522–531

  • Petersson H, Thelin T, Runeson P, Wohlin C (2004) Capture-recapture in software inspections after 10 years research–theory, evaluation and application. J Syst Softw 72(2):249–264

  • Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

  • Rahman F, Posnett D, Herraiz I, Devanbu P (2013) Sample size vs. bias in defect prediction. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering, ESEC/FSE 2013, ACM, New York, NY, USA, pp 147–157. doi:10.1145/2491411.2491418

  • Russell SJ, Norvig P (2003) Artificial intelligence: a modern approach, 2nd edn. Pearson Education

  • Settimi R, Cleland-Huang J, Khadra OB, Mody J, Lukasik W, DePalma C (2004) Supporting software evolution through dynamically retrieving traces to UML artifacts. In: 7th international workshop on principles of software evolution (IWPSE 2004), 6–7 September 2004, Kyoto, Japan. IEEE Computer Society, pp 49–54

  • Stone M (1974) Cross-validatory choice and assessment of statistical predictions (with discussion). J R Stat Soc Ser B 36:111–147

  • Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell

  • Yadla S, Huffman Hayes J, Dekhtyar A (2005) Tracing requirements to defect reports: an application of information retrieval techniques. ISSE 1(2):116–124

  • Zou X, Settimi R, Cleland-Huang J (2007) Term-based enhancement factors for improving automated requirement trace retrieval. In: ACM international symposium on grand challenges of traceability

  • Zou X, Settimi R, Cleland-Huang J (2010) Improving automated requirements trace retrieval: a study of term-based enhancement methods. Empir Softw Eng 15(2):119–146

Author information

Corresponding author

Correspondence to Davide Falessi.

Additional information

Communicated by: Patrick Mäder, Rocco Oliveto and Andrian Marcus

About this article

Cite this article

Falessi, D., Di Penta, M., Canfora, G. et al. Estimating the number of remaining links in traceability recovery. Empir Software Eng 22, 996–1027 (2017). https://doi.org/10.1007/s10664-016-9460-6

  • DOI: https://doi.org/10.1007/s10664-016-9460-6
