SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

  • Agus Sulistya
  • Gede Artha Azriadi Prana
  • Abhishek Sharma
  • David Lo
  • Christoph Treude
Article

Abstract

Software developers draw on various sources of knowledge, such as forums, question-and-answer sites, and social media platforms, to help them with their tasks. Extracting software-related knowledge from these platforms involves many challenges. In this paper, we propose an approach that improves the effectiveness of knowledge extraction tasks by performing cross-platform analysis. Our approach is based on transfer representation learning and word embedding: it leverages information extracted from a source platform that contains rich domain-related content, and uses that information to solve tasks on another platform (the target platform) with less domain-related content. We first build a word embedding model as a representation learned from the source platform, and then use the model to improve the performance of knowledge extraction tasks on the target platform. We experiment with Software Engineering Stack Exchange and Stack Overflow as source platforms, and with two target platforms, Twitter and YouTube. Our experiments show that our approach improves on the performance of existing work for the tasks of identifying software-related tweets and helpful YouTube comments.
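The transfer idea described above (learn word representations on a content-rich source platform, then reuse them to build features for short texts on a target platform) can be illustrated with a minimal sketch. This is not the authors' implementation: the co-occurrence vectors below are a toy stand-in for a trained word embedding model, and the function names, the tiny corpus, and the example tweets are all hypothetical.

```python
import math


def train_cooccurrence_embeddings(source_docs, window=2):
    """Toy stand-in for a word embedding model (an assumption, not word2vec):
    each word is represented by its co-occurrence counts with nearby words
    in the source-platform corpus."""
    vocab = sorted({tok for doc in source_docs for tok in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for doc in source_docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][index[tokens[j]]] += 1.0
    return vectors


def embed_text(text, vectors):
    """Represent a target-platform text as the average of the source-learned
    vectors of its known words; returns None if no word is in the vocabulary."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical source-platform posts (stand-ins for Stack Overflow content).
source_docs = [
    "how to fix null pointer exception in java",
    "java exception handling best practice",
    "python list comprehension performance",
]
vectors = train_cooccurrence_embeddings(source_docs)

# Hypothetical target-platform texts (stand-ins for tweets).
software_tweet = embed_text("java exception again", vectors)
off_topic_tweet = embed_text("lovely sunny weekend", vectors)
print(software_tweet is not None, off_topic_tweet is None)
```

In a realistic setting, the averaged vector would be passed as features to a classifier trained on labelled target-platform data, and the embedding model would be trained on a far larger source corpus.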

Keywords

Word embedding, Transfer representation learning, Software engineering

Notes

Acknowledgments

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its International Research Centres in Singapore Funding Initiative.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. PT Telkom Indonesia, Jakarta, Indonesia
  2. Singapore Management University, Singapore, Singapore
  3. University of Adelaide, Adelaide, Australia