Abstract
With the growing popularity of open-source platforms, software data is readily available from platforms and version control tools such as GitHub, CVS and SVN. More than 80 percent of the data in them is unstructured. Mining these repositories gives project managers, developers and businesses useful insights. Most of the software artefacts in these repositories are written in natural language, which makes natural language processing (NLP) an important part of mining them for useful results. This paper reviews the application of NLP techniques in the field of Mining Software Repositories (MSR), focusing mainly on sentiment analysis, summarization, traceability, norms mining and mobile analytics. It presents the major NLP works in this area by surveying research papers published from 2000 to 2018. The paper first describes the major artefacts in software repositories to which NLP techniques have been applied. Next, it presents some popular open-source NLP tools that have been used to perform NLP tasks. It then briefly discusses the state of NLP research in the MSR field. Finally, it lists the challenges in this field of research, along with pointers for future work, and concludes.
Cite this article
Gupta, S., Gupta, S.K. Natural language processing in mining unstructured data from software repositories: a review. Sādhanā 44, 244 (2019). https://doi.org/10.1007/s12046-019-1223-9
DOI: https://doi.org/10.1007/s12046-019-1223-9