
Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata

Empirical Software Engineering

Abstract

Understanding users’ needs is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful to requirements engineering (RE). Previous studies have created machine learning classifiers for parsing this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to unseen data in various forms. This study evaluates machine learning classifiers’ performance on feedback for two common classification tasks (classifying bug reports and feature requests). Using seven datasets from prior research studies, we investigate the performance of classifiers when evaluated on feedback from different apps than those contained in the training set and when evaluated on completely different datasets (coming from different feedback channels and/or labelled by different researchers). We also measure the difference in performance of using channel-specific metadata as a feature in classification. We find that using metadata as features in classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We also demonstrate that classification performance is similar on feedback from unseen apps compared to seen apps in the majority of cases tested. However, the classifiers evaluated do not perform well on unseen datasets. We show that multi-dataset training or zero-shot classification approaches can somewhat mitigate this performance decrease. We discuss the implications of these results on developing user feedback classification models to analyse and extract software requirements.
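The zero-shot mitigation mentioned in the abstract can be illustrated with a minimal sketch. This is not the exact configuration evaluated in the paper; it only shows how an off-the-shelf entailment-based zero-shot classifier (here the commonly used facebook/bart-large-mnli checkpoint, an assumption rather than the paper's choice) can assign a feedback item to the bug report and feature request classes without any labelled training data from the target dataset.

```python
from transformers import pipeline

# Minimal zero-shot sketch (not the exact setup from the study):
# an entailment-based model scores each candidate label against the feedback text.
# "facebook/bart-large-mnli" is a commonly used checkpoint for this pipeline
# and is an assumption here, not necessarily the model used in the paper.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

feedback = "The app crashes every time I try to upload a photo."
labels = ["bug report", "feature request", "other"]

result = classifier(feedback, candidate_labels=labels)
# result["labels"] is sorted by score, so the first entry is the predicted class.
print(result["labels"][0], round(result["scores"][0], 3))
```

Because no labelled training feedback is required, this style of classifier is one way to approach datasets that differ from anything seen during training, which is the scenario in which the fine-tuned classifiers degrade most.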


Notes

  1. https://monkeylearn.com/

  2. https://doi.org/10.5281/zenodo.5733504

  3. This dataset contains feedback labelled as “Error”. Although classification results for this class are not reported in the original paper, we use it as our bug report class.

  4. The replication package contains two datasets, corresponding to research questions 1 and 3 of that study. The latter is a pre-filtered set of feedback (filtered to contain only requirements-relevant feedback) used to measure clustering performance rather than classification. Although no classification metrics are therefore reported for this RQ3 dataset (Dataset E), we still use it for training and testing models.

  5. https://huggingface.co/docs/tokenizers

  6. https://huggingface.co/transformers

  7. https://huggingface.co/models?pipeline_tag=zero-shot-classification

  8. https://huggingface.co/Peterard/distilbert_bug_classifier

  9. https://huggingface.co/Peterard/distilbert_feature_classifier (a usage sketch of these two released classifiers follows these notes)

  10. https://doi.org/10.5281/zenodo.5733504
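As a usage sketch for the classifiers released with this study (notes 8 and 9), the snippet below loads them through the standard Hugging Face text-classification pipeline. That they load this way, and the exact label names they return, are assumptions to be checked against the model cards; the example feedback text is invented for illustration.

```python
from transformers import pipeline

# Load the bug and feature classifiers released with this study (notes 8 and 9).
# Assumes both models work with the standard text-classification pipeline;
# consult the model cards for the exact label names they emit.
bug_clf = pipeline("text-classification", model="Peterard/distilbert_bug_classifier")
feature_clf = pipeline("text-classification", model="Peterard/distilbert_feature_classifier")

feedback = "Please add a dark mode; also, the login screen freezes on startup."

# Each call returns a list like [{"label": ..., "score": ...}].
print(bug_clf(feedback))
print(feature_clf(feedback))
```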


Funding

This work was conducted as part of the lead author’s (Peter Devine’s) PhD study, which is funded by the Faculty of Engineering at The University of Auckland.

Author information


Corresponding author

Correspondence to Peter Devine.

Ethics declarations

Consent

Matters of consent are not applicable to this work, as no human participants were involved.

Conflict of Interest

None of the listed authors has a conflict of interest to declare in relation to this work.

Competing Interests

One of the authors of this paper (Kelly Blincoe) is on the editorial boards of the IEEE Transactions on Software Engineering, the Empirical Software Engineering journal, and the Journal of Systems and Software.

Additional information

Communicated by: Apostolos Ampatzoglou, Gemma Catolino, Daniel Feitosa, Valentina Lenarduzzi

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Devine, P., Koh, Y.S. & Blincoe, K. Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata. Empir Software Eng 28, 26 (2023). https://doi.org/10.1007/s10664-022-10254-y

