Natural language processing in mining unstructured data from software repositories: a review

Sādhanā

Abstract

With the growing popularity of open-source platforms, software data is readily available from open-source tools such as GitHub, CVS and SVN. More than 80 percent of the data in these repositories is unstructured. Mining it gives project managers, developers and businesses useful insights. Most of the software artefacts in these repositories are written in natural language, which makes natural language processing (NLP) an important part of mining them for useful results. This paper reviews the application of NLP techniques in the field of Mining Software Repositories (MSR), focusing mainly on sentiment analysis, summarization, traceability, norms mining and mobile analytics. It presents the major NLP work in this area by surveying research papers published from 2000 to 2018. The paper first describes the major artefacts in software repositories to which NLP techniques have been applied. It then presents some popular open-source NLP tools used to perform NLP tasks, and briefly discusses the state of NLP research in the MSR field. Finally, it lists the open challenges, gives pointers for future work in this field, and concludes.

Author information

Correspondence to Som Gupta.

Cite this article

Gupta, S., Gupta, S.K. Natural language processing in mining unstructured data from software repositories: a review. Sādhanā 44, 244 (2019). https://doi.org/10.1007/s12046-019-1223-9
