Natural language processing in mining unstructured data from software repositories: a review

Gupta, Som; Gupta, S K

doi:10.1007/s12046-019-1223-9

Published: 30 November 2019

Sādhanā, volume 44, article number 244 (2019)

Abstract

With the growing popularity of open-source platforms, software data is readily available from repositories and tools such as GitHub, CVS and SVN. More than 80 percent of the data in them is unstructured. Mining these repositories helps project managers, developers and businesses gain useful insights. Most of the software artefacts in these repositories are written in natural language, which makes natural language processing (NLP) an important part of the mining process. This paper reviews the application of NLP techniques in the field of Mining Software Repositories (MSR), focusing mainly on sentiment analysis, summarization, traceability, norms mining and mobile analytics. It surveys research papers published from 2000 to 2018 to present the major NLP work in this area. The paper first describes the major artefacts in software repositories to which NLP techniques have been applied. It then presents popular open-source NLP tools that have been used to perform NLP tasks, and briefly discusses the state of NLP research in MSR. Finally, it lists the open challenges, offers pointers for future work in this field, and concludes.

Author information

Authors and Affiliations

Department of Computer Science Engineering, AKTU, Lucknow, India
Som Gupta
Department of Computer Science Engineering, BIET, Jhansi, India
S K Gupta

Corresponding author

Correspondence to Som Gupta.

About this article

Cite this article

Gupta, S., Gupta, S.K. Natural language processing in mining unstructured data from software repositories: a review. Sādhanā 44, 244 (2019). https://doi.org/10.1007/s12046-019-1223-9

Received: 16 June 2017
Revised: 15 March 2019
Accepted: 07 September 2019
Published: 30 November 2019
DOI: https://doi.org/10.1007/s12046-019-1223-9
