1 Introduction

With an ever-increasing number of mobile users, hundreds of thousands of mobile applications (apps) have been developed for Android users. The Google Play Store is currently the biggest online mobile app store, where more than 2.6 million free and paid applications were accessible as of 10 January 2022 [5]. On the Google Play Store, users can choose a mobile app for their personal use from different categories and download it to their mobile devices. Mobile app usage has gained popularity over the last few years due to easy-to-use features, simplicity, and mobility. People use these apps for a large variety of daily tasks, from health stats and sports to business transactions. People share their experiences of using particular apps and provide feedback in the form of reviews and ratings. Application stores such as the Google Play Store and Apple App Store give application developers a unique consumer feedback system in the form of application reviews, which can be used for different purposes [22]. Reviews often contain suggestions to enhance features, complaints, requests for more customization options, and innovative ideas. However, analyzing these reviews has several associated challenges. Firstly, app stores include a large number of reviews, which require great effort and a substantial amount of time to examine. Secondly, the quality of reviews varies widely, from helpful and innovative advice to offensive comments. Thirdly, a review may describe different app features positively, negatively, or neutrally, so sentiment analysis is required to separate these opinions.

Star rating is a useful numerical value assigned to mobile apps by the users. Since the rating represents the average of all the ratings granted by the app users and combines the negative, neutral, and positive features in the comments, it reduces the scope of user feedback. Also, the star rating is not always accurate and does not truly represent a mobile app, as a user may not be satisfied with an app and still award it four stars [6]. Recently, a lot of research has been done to tackle the problem of review anomalies and to rank online reviews based on sentiment analysis [39, 50]. Existing approaches to sentiment analysis using app reviews are lacking in several aspects. First, the data selected for experiments often has a small number of samples, which limits the scope of the study. Second, existing studies predominantly utilize reviews related to a single app or a few apps, so mobile app diversity is not covered. Third, with a limited amount of data, the performance evaluation of the models is not exhaustive. This study overcomes these limitations and performs comprehensive experiments in this regard. Keeping in mind the scope and potential of app reviews over app ratings, this study leverages mobile app reviews from the Google Play Store and performs sentiment analysis.

This study proposes a framework to perform sentiment analysis and makes the following contributions:

  • A framework is designed for analyzing the sentiments of users’ reviews of mobile apps on the Google Play Store. The framework utilizes several machine learning models and investigates their performance for the task at hand with term frequency-inverse document frequency (TF-IDF) features.

  • A large dataset is built that contains the app reviews and comprises 251661 reviews in total for eight different categories of mobile apps. For each category, different apps are selected for review collection. To resolve the problem of dataset imbalance, a similar number of reviews are collected for the dataset.

  • Performance of machine learning models is investigated, including logistic regression (LR), random forest (RF), multinomial Naive Bayes (MNB), K nearest neighbor (KNN), and support vector machine (SVM), with regard to accuracy, precision, recall, and F1 score. The influence of preprocessing is also analyzed, where stats features such as the number of words, characters, and special characters are also used.

The remainder of this paper is organized into four sections. The literature review is presented in Section 2 followed by the data collection process and proposed methodology in Section 3. Results are presented in Section 4 while Section 5 gives the conclusion.

2 Literature review

With the introduction of mobile apps for Android and iOS, diverse use of such apps has been observed, from common daily tasks to business financial transactions. Positive ratings of such apps greatly help the developers, who strive to provide more functionality for these apps. Reviews contain suggestions for improvements, criticism of existing problems, and potential solutions to these problems, so analyzing the reviews can provide important information. Sentiment analysis has gained considerable attention during the past few years and has been utilized to mine opinions from social media text. Similar to other domains, sentiment analysis has been utilized for analyzing mobile app reviews.

Several machine learning algorithms are used to predict the rating of an app on the Google Play Store in [6, 33], which conclude that machine learning algorithms like linear regression elegant tree provide the best rating prediction results. The authors of [36] investigated how Twitter provides complementary data to drive the creation of mobile apps. A total of 30793 apps were analyzed over six months to see whether there were any links between the number of tweets and app reviews. Using a machine learning classifier, topic modeling, and subsequent crowdsourcing, the study extracted 22.4% more feature requests and 12.89% more problem reports from Twitter [36].

By grouping fine-grained features into more relevant features, a refined accuracy for review classification is obtained in [51]. The authors compared the results of peer analysis to 7 apps they downloaded from the Apple and Google app stores. In the proposed method, the app developer can look into users’ reviews regarding a specific feature and filter out unrelated reviews. The method provides a 91% accuracy score and a 73% recall rate. The authors of [34] analyze reviews of two Android apps. The study uses reviews of two recent apps, one from the Brain and Puzzle category and the second from the Personalization category. SAS 12.1 is used for performing sentiment analysis on the collected 600 reviews. Performance comparison reveals that the rule-based model is more precise and effective for sentiment analysis. The study [14] asserts that, because apps and their reviews are diverse and models are domain or subject dependent, a single model is not suitable for sentiment analysis. As a result, different types of classification models are combined to overcome this limitation and benefit from each other’s merits, resulting in improved sentiment classification performance.

The authors of [16] combined app reviews and ratings to study the relationship between user-reported cases and app ratings. In addition, important factors with a high influence on apps’ perceived quality are studied. The focus is placed on topics that are highly correlated with app rating. The study indicates that users’ feedback on apps’ bugs has a direct relation to lower app ratings. The study [54] identifies key features that are associated with high app ratings and presents a novel approach using app descriptions and user reviews for finding such key features. For this purpose, natural language processing and machine learning approaches have been leveraged. Results indicate that an average F1 of 78.13% can be achieved with the proposed approach.

Twitter provides complementary information to support mobile app development. By analyzing a total of 30793 apps over six weeks, the authors found strong correlations between the number of reviews and tweets for most apps in [30]. Moreover, through applying machine learning classifiers, topic modeling, and subsequent crowd-sourcing, the author successfully mined 22.4% of additional feature requests and 12.89% additional bug reports from Twitter. The author also found that 52.1% of all feature requests and bug reports were discussed in both tweets and reviews. In addition to finding common and unique information from Twitter and the app store, sentiment and content analyses are also performed for 70 randomly selected apps. From this, the author found that tweets provided more critical and objective views on apps than reviews from the app store.

The author introduces a scalable system to help analyze and predict Android apps’ compliance with privacy requirements in [60]. The proposed system is not only intended for regulators and privacy activists but also meant to assist app publishers and app store owners in their internal assessments of privacy requirement compliance. The authors performed exploratory data analysis on the data collected from the Google Play Store data for feature relationship in [29]. The purpose is to dive deeper into discovering relationships of specific features such as how the number of words in an app name, for instance, affects installs, to use them to find out which apps are more likely to succeed. Using these extracted features and the recent sentiment of users the authors predicted the success of an app soon after it is launched into the Google Play Store.

The authors in [24] show that 23.3% of users request further features for the apps, i.e., comments through which users either suggest new features for an app or express preferences for the re-design of already existing features of an app. One of the challenges app developers face when trying to make use of such feedback is the massive number of available reviews. Through this work, the authors provided a process by designing a mobile app review analyzer (MARA), a prototype for automatic retrieval of mobile app feature requests from online reviews.

In [21], the authors aim to measure the extent of the sentiment analysis results given by customers to Go-Jek through the comments. Customers’ opinions are taken to get positive, negative, or neutral comments. Go-Jek is one of the most popular providers of online transportation services in Indonesia that has now grown to become the on-demand mobile platform and the leading application that provides a full range of services ranging from transportation, logistics, payments, and food delivery services, and various other services.

The author examines reviews at the update level to better understand how users perceive bad updates in [22]. The study focuses on the top 250 bad updates (i.e., updates with the highest increase in the percentage of negative reviews relative to the prior updates of the app) from 26726 updates of 2526 top free-to-download apps in Google Play Store. The authors find that feature removal and user interface issues have the highest increase in the percentage of negative reviews. Bad updates with crashes and functional issues are the most likely to be fixed by a later update.

The author applies a conjoint study approach in [38]. The author conducts the research to quantify the monetary value that users place on their friends’ personal information. Utilizing the scenario of social app adoption, the authors further investigate the impact of the comprehensiveness of shared profile information on valuation and varying the data collection context, i.e., friends’ information is not relevant, or is relevant to app functionality.

3 Proposed methodology

This section describes the methodology used to acquire the dataset, its visualization, the preprocessing techniques applied for dataset selection, feature extraction to implement classification techniques, and the proposed methodology.

Figure 1 shows the methodology adopted for this study. Firstly, the Google Play Store data is scraped using regular expressions, and the raw data of star ratings and comments is acquired for apps from different categories. Secondly, the stats features are computed on the raw data, i.e., the number of words, characters, special characters, uppercase words, etc. Thirdly, the data is preprocessed by removing frequent words, rare words, and stop words, converting the text to lowercase, and correcting misspelled words. Fourthly, the data is labeled using TextBlob, where the sentiments are assigned based on sentiment polarity. Fifthly, the TF-IDF features are extracted from the preprocessed data to train the machine learning models used in this study. In the final stage, the features are used for training, testing, and validation with LR, MNB, RF, KNN, and SVM. The results are analyzed based on F1 score, recall, precision, and accuracy.

Fig. 1

The flow of Google Play Store applications reviews classification

3.1 Data collection

This study collects a large dataset comprising 234453 user reviews, scraped from the Google Play Store. The scraping process is depicted in Fig. 2. The ‘requests’ library is used to extract the data from the Google Play Store. ‘requests’ is a Python package that allows users to submit HTTP requests. The data were extracted with labeling such as header files, multipart files, and parameter files using Python libraries. This labeling helped to identify the category of the application. The number of mobile applications followed the category. The reviews and rating data are then labeled through sentiment analysis (as positive, negative, or neutral). The data is further classified by using machine-learning algorithms.

Fig. 2

The scraping process of Google Play Store applications reviews

A regular expression (re) is a character sequence that defines a search pattern and helps match or find other strings or groups of strings in text [37]. We used Beautiful Soup for data scraping [10]. This is a Python library used to extract information from HTML and XML documents. The review data is divided into eight categories depending on their usability for different tasks, including ‘action’, ‘casual’, ‘communication’, etc. Figure 3 shows the name and percentage of each category in the dataset. Each category further has reviews and ratings for different apps.

Fig. 3

Percentage of each category in the dataset

Details of the scraped data, grouped into general categories with the number of consumer reviews for each category, are given in Table 1. A total of 24000 reviews are gathered for the ‘Action’ category with 4001 reviews each for six mobile apps including ‘Bush Rush’, ‘Metal Soldiers’, ‘Real Gangster Crime’, ‘Talking Tom Gold Run’, etc. The complete details of each category and its apps are provided in Table 1.

Table 1 Detailed dataset information regarding categories and apps

3.2 Stats feature using reviews

The raw user review data is extracted from the Google Play Store for further processing. The stats features count the number of words, characters, stop words, special characters, numeric values, and uppercase words, as well as the average word length. The objective of using the stats features for the collected reviews is to analyze the patterns for positive, negative, and neutral reviews. The following steps are performed for the stats features. Figure 4 shows the stats features.

Fig. 4

Stats feature of Google Play store applications reviews

3.2.1 Number of words

Initially, the number of words is counted for each user review. It is observed that negative reviews have fewer words than positive reviews.

3.2.2 Number of characters

The number of characters is counted for each review. This is done by calculating the length of each review. Similar to the number of words, negative reviews tend to have fewer characters.

3.2.3 Average word length

The average word length is calculated for each collected review and serves as an additional descriptive feature. The lengths of all words in a review are summed, and this sum is divided by the number of words in the review to obtain the average word length.

3.2.4 Number of stop words

Counting stop words such as ‘and’, ‘a’, ‘the’, ‘is’, and ‘are’ can provide extra information. The stop words are counted using the Natural Language Toolkit (NLTK) library, which is commonly used for text processing alongside machine learning algorithms [12].

3.2.5 Number of special characters

There are different special characters used in users’ reviews. The number of hashtags is calculated and extracted. Moreover, some more information is extracted from the reviews. Hashtags always appear at the beginning of a word, however, other special characters may appear in the middle, or other places in a sentence like ‘—’, question mark, ‘@’, etc.

3.2.6 Number of numeric values

The number of numeric values is counted, since users often use digits to shorten reviews; for example, ‘4’ is often used instead of ‘for’. Although numeric values are often removed from the text, using them could provide additional information.

3.2.7 Number of uppercase words

Users often express strong feelings in capital letters; anger, for example, is often written in UPPERCASE words. So, extracting such information may be helpful to analyze the sentiments.
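The stats features of Sections 3.2.1 to 3.2.7 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stats_features`, the tiny `STOP_WORDS` set (standing in for the NLTK stop-word list), and the whitespace tokenization are all assumptions made for illustration.

```python
import string

# Hypothetical stop-word list; the study uses the NLTK English list instead.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "to", "of", "for"}

def stats_features(review: str) -> dict:
    """Compute simple count-based features for one review (whitespace tokens)."""
    words = review.split()
    return {
        "n_words": len(words),
        "n_chars": len(review),  # length of the review, spaces included
        # sum of word lengths divided by the number of words
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_stop_words": sum(w.lower() in STOP_WORDS for w in words),
        "n_hashtags": sum(w.startswith("#") for w in words),
        "n_special_chars": sum(ch in string.punctuation for ch in review),
        "n_numerics": sum(w.isdigit() for w in words),
        # words written entirely in capitals, ignoring single letters
        "n_upper_words": sum(w.isupper() and len(w) > 1 for w in words),
    }

features = stats_features("This app is GREAT 4 me #love")
```

Each value is a per-review count, so the features can be compared across positive, negative, and neutral reviews or appended to other feature vectors.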

3.3 Preprocessing steps

When machine learning models are used for sentiment analysis, the data needs to be cleaned before training, as this helps to increase the training accuracy and improves the performance of the models [43]. Performance improves because noise and unnecessary or redundant data that do not contribute to predicting the target class are removed [26]. Figure 5 shows the steps followed in the preprocessing.

Fig. 5

Preprocessing steps followed in this study

3.3.1 Lower case

In the first step, the dataset is transformed into lowercase. This avoids multiple copies of the same word: without this step, the terms ‘Excellent’ and ‘excellent’ are treated as different words when determining the word count.

3.3.2 Removal of stop words

Removal of stop words is essential in Natural Language Processing (NLP) tasks and text analysis [2]. This process can be carried out routinely using a pre-defined library or a list of stop words.

3.3.3 Common word removal

In this step, commonly used words are removed from the reviews. The 10 most frequently occurring words are identified and removed.

3.3.4 Rare words removal

In this step, the rare words from the user reviews are removed. Because such words are scarce, noise dominates the association between them and other words. Unusual words are replaced with a more general word to increase the word counts.

3.3.5 Spelling correction

Many real-world NLP applications rely on misspelling detection and correction modules to function properly [28]. The TextBlob library is used for spelling correction, since this step reduces the number of variants of the same word.

3.3.6 Tokenization

The process of breaking text into a list of tokens that are used as whole words or part of words called subwords is known as tokenization [4]. Tokenization is used for the distribution of the user reviews in the sequence of words or sentences, which are transformed into a blob and then into a string of words by using the ‘textblob’ library.

3.3.7 Stemming

Stemming is a typical need of NLP, as well as a preprocessing step in text mining applications [52]. By using a simple rule-based approach, the words are transformed into their root form, like removing ‘ing’, ‘ly’, ‘s’, and so on. Porter Stemmer is used from the NLTK library for this process.

3.3.8 Lemmatization

Lemmatization is the technique of deriving the dictionary form of a word (e.g. swim) from one of its inflected forms (e.g. swims, swimming, swam, swum) [28]. Rather than stripping suffixes, lemmatization converts a word to its root word. As a result, lemmatization is the preferred choice. Lemmatization is required before part-of-speech (PoS) tagging for morphological analysis and for eliminating inflections by returning the base form of the word without its endings [18].
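A few of the preprocessing steps above (lowercasing, stop-word removal, frequent-word removal, and rare-word removal) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: the `preprocess` function, its `n_common` and `min_count` parameters, and the small `STOP_WORDS` set are hypothetical, and spelling correction, stemming, and lemmatization (done with TextBlob and NLTK in the study) are left out.

```python
from collections import Counter

# Hypothetical stop-word list standing in for a library-provided one.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "this", "it"}

def preprocess(reviews, n_common=10, min_count=2):
    """Lowercase, drop stop words, then drop the most frequent and the rare words."""
    tokenized = [[w for w in r.lower().split() if w not in STOP_WORDS] for r in reviews]
    counts = Counter(w for tokens in tokenized for w in tokens)
    common = {w for w, _ in counts.most_common(n_common)}    # top-n frequent words
    rare = {w for w, c in counts.items() if c < min_count}   # words seen too rarely
    drop = common | rare
    return [[w for w in tokens if w not in drop] for tokens in tokenized]

reviews = ["This app is good", "Good app and fast", "App crashes fast", "good good"]
# n_common=1 only because the toy corpus is tiny; the study removes the top 10.
cleaned = preprocess(reviews, n_common=1)
```

In this toy corpus, ‘good’ is the single most frequent word and ‘crashes’ occurs only once, so both are dropped.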

3.4 Dataset annotation

The sentiment is assigned to a review according to its weight, using threshold values [39]. The TextBlob library of Python is used for this purpose [1]. TextBlob is a Python module for handling textual data. It provides a straightforward application programming interface (API) for language processing tasks such as noun phrase extraction, tagging, sentiment analysis, and more. TextBlob can be used for classification purposes for large training data sets with many dimensions. Figure 6 shows the distribution of sentiments for the collected dataset. It can be seen that 58% of the reviews are positive and 25% are neutral, while only 16% are negative.

Fig. 6

Sentiments ratio of Google Play store applications reviews
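The annotation step can be sketched as a mapping from a polarity score to a label. TextBlob reports a polarity in the range [-1, 1] through `TextBlob(text).sentiment.polarity`; the sketch below takes that score as an input so that it runs without the library. The function name `label_sentiment` and the threshold of 0.0 are assumptions, since the threshold values used in the study are not stated.

```python
def label_sentiment(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    The default threshold of 0.0 is an assumption made for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Hypothetical polarity scores for three reviews
labels = [label_sentiment(p) for p in (0.6, 0.0, -0.3)]
```

A wider neutral band can be obtained by raising `threshold`, which would shift the class proportions.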

3.5 Feature extraction

The reviews are organized and cleaned using the preprocessing steps and can now be used for feature extraction. Figure 7 shows the process of feature extraction for sentiment analysis of app reviews. TF-IDF is used to extract features for the words in the reviews dataset. TF-IDF is one of the most commonly used features for text analysis. Each term in a document is given a weight depending on its TF and IDF [44].

Fig. 7

Feature extraction for sentiment analysis

TF measures how frequently a term t occurs in a document and is computed as follows [56]

$$ TF(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}} $$
(1)

IDF penalizes terms that occur in many documents and assigns higher weights to terms that are rare in a given corpus. It is calculated as [56]

$$ IDF(t) = \log \frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}} $$
(2)

In the end, TF-IDF can be obtained by multiplying TF and IDF.
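Equations (1) and (2) and their product can be checked on a toy corpus with a short sketch. The function names and the corpus are hypothetical, and the sketch assumes the queried term occurs in at least one document. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF by default, so their values differ from this plain form.

```python
import math

def tf(term, doc):
    """Eq. (1): occurrences of `term` divided by the number of terms in `doc`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Eq. (2): log of (number of documents / documents containing `term`)."""
    n_containing = sum(term in doc for doc in corpus)  # assumed to be non-zero
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    """Product of TF and IDF."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical tokenized reviews
corpus = [["good", "app"], ["bad", "app"], ["good", "good", "game"]]
score_app = tf_idf("app", corpus[0], corpus)    # "app" occurs in 2 of 3 documents
score_game = tf_idf("game", corpus[2], corpus)  # "game" occurs in 1 of 3 documents
```

The rarer term ‘game’ receives the higher weight even though its term frequency is lower, which is the effect of the IDF factor.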

3.6 Machine learning models for reviews classification

Various supervised machine-learning algorithms, namely LR, RF, MNB, SVM, and KNN, are used in this research. The algorithms are tuned for better performance using different hyperparameters, and a list of all parameters is provided in Table 2.

Table 2 Parameter information of supervised machine learning algorithms

3.6.1 Logistic regression

LR is preferred here because it shows good performance for binary classification and text categorization tasks [20]. Contrary to linear regression, which produces continuous numerical values, LR applies the logistic sigmoid function to its output to yield a probability value that can then be mapped to two or more discrete classes. For our dataset, the random_state=0 and multi_class=‘ovr’ options are used. The logistic function is given by [41]

$$ Y(v) = \frac{L}{1+ e^{-n(v-v_{0})}} $$
(3)

where e is the base of the natural logarithm (also known as Euler’s number), v0 is the value of v at the sigmoid’s midpoint, L is the maximum value of the curve, and n is the steepness of the curve.
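The logistic function of Eq. (3) can be evaluated directly. The sketch below is only an illustration of the formula; the default values L = 1, n = 1, and v0 = 0 (the standard sigmoid) are assumptions.

```python
import math

def logistic(v, L=1.0, n=1.0, v0=0.0):
    """Eq. (3): L / (1 + exp(-n * (v - v0)))."""
    return L / (1.0 + math.exp(-n * (v - v0)))

midpoint = logistic(0.0)  # value at the sigmoid midpoint v = v0
high = logistic(10.0)     # approaches L for large v
low = logistic(-10.0)     # approaches 0 for very negative v
```

With L = 1 the output lies between 0 and 1, which is why it can be read as a class probability.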

LR has been used for a variety of tasks including clinical studies [57], finance applications [19], meteorological and ecological data [3], tomography [47] and many other [7, 50].

3.6.2 Random forest

RF is a machine-learning algorithm used for classification. The bagging approach is used in this model to train several decision trees using different bootstrap samples [17]. RF’s basic premise is that a small decision tree with few attributes is computationally cheap to build. If many small, weak decision trees can be built in parallel, they can be combined by averaging or majority voting to form a single, strong learner. RF is frequently found to be among the most accurate learning algorithms. For this dataset, the random_state=1, n_jobs=-1, n_estimators=1000, and max_features=4 parameters are used. The RF prediction is given by [45]

$$ P = \mathrm{mode}\{T_{1}(y), T_{2}(y), \ldots, T_{m}(y)\} $$
(4)

where P is the final prediction obtained by majority voting, and T1(y), T2(y), ..., Tm(y) are the predictions of the m decision trees participating in the prediction procedure.
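The majority vote of Eq. (4) amounts to taking the mode of the per-tree predictions. The sketch below shows only this aggregation step with hypothetical tree outputs; it does not train any trees.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Eq. (4): return the most frequent label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for one review
prediction = majority_vote(["positive", "negative", "positive", "neutral", "positive"])
```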

RF has been used for several applications related to text analysis like sentiment analysis and text classification [26, 53], COVID-19 pandemic [58], e-commerce [35], and prediction tasks [59].

3.6.3 Multinomial Naive Bayes

MNB is a probabilistic classifier that relies on Bayes’ theorem and assumes that the features are conditionally independent given the class. One of the benefits of this algorithm is that it needs only a small amount of training data to estimate the parameters for prediction. Because the features are assumed independent, only the variance of each feature needs to be determined rather than the full covariance matrix. For a given document d and a class c, the conditional probability is P(c|d). This probability can be determined using Bayes’ theorem, according to the following equation [40].

$$ P(c|d) = \frac{P(d|c)\,P(c)}{P(d)} $$
(5)

where P(c|d) is the posterior probability, P(d|c) is the likelihood, P(c) is the class prior probability, and P(d) is the predictor prior probability.
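Equation (5) can be worked through numerically for a two-class case. All probabilities below are hypothetical values chosen for illustration, and P(d) is obtained by summing P(d|c)P(c) over the classes.

```python
def posterior(likelihood, prior, evidence):
    """Eq. (5): P(c|d) = P(d|c) * P(c) / P(d)."""
    return likelihood * prior / evidence

# Hypothetical class priors P(c) and likelihoods P(d|c) for one document d
priors = {"positive": 0.6, "negative": 0.4}
likelihoods = {"positive": 0.02, "negative": 0.01}

# P(d) as the sum over classes of P(d|c) * P(c)
evidence = sum(likelihoods[c] * priors[c] for c in priors)
posteriors = {c: posterior(likelihoods[c], priors[c], evidence) for c in priors}
predicted = max(posteriors, key=posteriors.get)
```

Because P(d) is the same for every class, the predicted class can also be found by comparing P(d|c)P(c) alone.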

Applications of MNB include text classification [42], apps review [11], sentiment classification [32], and so on.

3.6.4 K nearest neighbor

KNN is a non-parametric supervised learning approach that assigns data points to a category using the training set. It stores all training instances and categorizes new cases based on their similarity to them. For our dataset, the n_neighbors=3 option is used, with the class probability estimated as

$$ Pr(Y = j | X = x_{0}) = \frac{1}{k} \sum\limits_{i\in N_{0}} I(y_{i} = j) $$
(6)

where N0 is the set of the k nearest observations to x0, and I(yi = j) is an indicator variable that evaluates to 1 if an observation (xi, yi) in N0 belongs to class j, and 0 otherwise.
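Equation (6) can be sketched as counting the class labels among the k nearest training points. The Euclidean distance and the two-dimensional toy points are assumptions for illustration; the distance metric used in the study is not stated here.

```python
import math
from collections import Counter

def knn_class_probabilities(x0, points, labels, k=3):
    """Eq. (6): fraction of the k nearest training points that fall in each class."""
    # Rank training points by Euclidean distance to x0 (assumed metric)
    order = sorted(range(len(points)), key=lambda i: math.dist(x0, points[i]))
    neighbours = [labels[i] for i in order[:k]]
    return {c: n / k for c, n in Counter(neighbours).items()}

# Hypothetical 2-D feature vectors and their labels
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (0.2, 0.1)]
labels = ["neg", "neg", "pos", "pos", "neg"]
probs = knn_class_probabilities((0.8, 0.9), points, labels)
```

The query point is assigned to the class with the largest estimated probability, here the class holding two of the three nearest neighbours.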

KNN shows better performance for applications where the feature space is small as it uses all the samples for the training. Despite its simplicity, KNN has been used for several applications like market prediction [27], medical imaging [9], signal classification [25], and COVID-19 prediction [8, 55].

3.6.5 Support vector machine

SVM is a non-parametric binary linear classifier that relies on a collection of mathematical functions [46]. SVM divides the data into classes by drawing a line or a hyperplane [44]. The SVM algorithm interprets each review in vectorized form as a data point in space. The fundamental concept behind this approach is to create a model specified by the hyperplane w, which is used to separate the complete vectorized data. The hyperplane is regarded as ideal when it divides the data with the maximum distance between samples of different classes. For our dataset, the kernel=‘linear’, probability=True, and C=1 parameters are used.

$$ f(x)= sgn(w^{T} x + b) $$
(7)

where w = Σ_i a_i y_i x_i, the coefficients a_i are zero for all cases except the support vectors (those lying closest to the separating hyperplane), b is the bias term, and y_i ∈ {1, -1} are the labels.
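The decision rule of Eq. (7) is the sign of a linear score. The sketch below applies it with hypothetical weights and bias; it does not perform the training that would produce w and b.

```python
def svm_predict(w, b, x):
    """Eq. (7): sign of w^T x + b, returned as a label in {+1, -1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical trained weights and bias
w, b = [2.0, -1.0], -0.5
pred_a = svm_predict(w, b, [1.0, 0.5])  # score = 2.0 - 0.5 - 0.5 = 1.0
pred_b = svm_predict(w, b, [0.0, 1.0])  # score = -1.0 - 0.5 = -1.5
```

Since the rule is binary, a three-class problem such as positive, negative, and neutral reviews needs a multi-class scheme built on top of it, for example one classifier per class.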

SVM has been widely used in multifarious domains including bioinformatics [13], text analysis [48], hydrology [15], computational biology [31], financial forecasting [49] and disease prediction [23].

4 Results and discussion

The performance of various algorithms has been examined for the collected dataset. Experiments are performed on a Jupyter notebook and Python is used to implement the machine learning models. Performance is determined with respect to F1 score, precision, recall, and accuracy.
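The four evaluation measures can be computed from true and predicted labels as in the sketch below. It treats one class as the positive class; how the scores are averaged across the three sentiment classes in the study is not stated, so the function and its toy inputs are assumptions for illustration only.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for six reviews
y_true = ["pos", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]
metrics = classification_metrics(y_true, y_pred, positive="pos")
```

Here two of the three reviews predicted as ‘pos’ are correct (precision) and two of the three true ‘pos’ reviews are found (recall), so F1 equals both.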

4.1 Accuracy of models

Table 3 shows the results of machine learning models regarding accuracy. It can be observed that accuracy varies both with the model and across app categories. For example, SVM performs best for all app categories, and its highest accuracy of 0.95 is obtained for the ‘Photography’ app category. KNN shows the lowest accuracy of 0.65 with the ‘Health and Fitness’ category, as shown in Fig. 8. MNB also has poor performance with a 0.71 accuracy score for the ‘Sports’ and ‘Health and Fitness’ categories. The performance of models varies because the nature of the words used in reviews differs between categories.

Table 3 Accuracy results of machine learning algorithms
Fig. 8

Graphical visualization of accuracy results

Despite variations in the performance of machine learning models, SVM shows consistently better performance for all app categories with an accuracy score higher than 0.92. LR and RF also show better performance with accuracy above 0.90, except for RF accuracy of 0.89 for the ‘Health and Fitness’ category.

4.2 F1 score of machine learning models

The F1 score ranges from 0 to 1 and is regarded as a more important performance evaluation metric than accuracy, precision, or recall alone. It considers both precision and recall and provides a more balanced analysis of a model, especially when the dataset is imbalanced. Table 4 shows the results for the F1 score of all machine learning models used in this study. Results indicate that the highest F1 score of 0.93 is obtained by SVM for the ‘Sports’ and ‘Racing’ app categories. The lowest F1 score is from KNN with 0.54 for the ‘Health and Fitness’ category, similar to its performance regarding accuracy. Figure 9 shows the visual illustration of the F1 score for all models. It indicates that on average the performance of most models is good, except for KNN and MNB, which show substantially poorer performance. The performance of KNN is often poor on large datasets, which is the case for the current study as the dataset contains more than 200000 reviews. KNN shows better performance with small datasets.

Table 4 F1 score results of machine learning algorithms
Fig. 9

Graphical visualization of F1 score results

4.3 Recall results of machine learning models

The recall is a measure of completeness that indicates the proportion of true positive occurrences of a class identified as such. Table 5 provides the results of each machine learning model for each app’s category regarding review classification. Similar to its performance for accuracy and F1 score, SVM shows superior performance regarding recall score for the review classification with the highest 0.92 scores for ‘Sports’ and ‘Racing’ categories. Figure 10 further illustrates the poor performance of KNN and MNB regarding recall with the lowest recall of 0.50 and 0.55 for MNB and KNN, respectively for the same category.

Table 5 Recall results of machine learning algorithms
Fig. 10

Graphical visualization of recall results

4.4 Precision of reviews classification

In contrast to recall, precision refers to the proportion of instances predicted as a class that actually belong to that class. Figure 11 shows the graphical illustration of the results of machine learning models regarding precision, while detailed results for every category are provided in Table 6. Results corroborate that SVM outperforms all machine learning models regarding precision for app review classification, with the highest precision of 0.95 for the ‘Racing’ category. The precision score of SVM is higher than 0.91 for all categories except ‘Health and Fitness’, which has a 0.89 precision score.

Fig. 11

Graphical illustration of precision results

Table 6 Precision results of machine learning algorithms

The performance of SVM is followed by RF with superb results for every category. Actually, on average, the performance of RF is better than SVM except for the fact that the highest precision belongs to SVM. The lowest precision of 0.64 is achieved by the KNN for the ‘Health and Fitness’ category.

4.5 Discussions

This study analyzes the performance of several machine learning models on Google Play Store app reviews. Review analysis has been an important research area for the past few years due to the wide use of mobile apps for tasks such as sports, finance, entertainment, and fitness. Users are interested in the mobile apps that best serve their needs without interruptions and bugs. From this perspective, classifying users’ reviews of different apps can guide users in selecting an appropriate app. However, such reviews are diverse, contain noise and unwanted characters, and vary in length, which makes their analysis a challenging task. This study leverages several well-known machine learning models for app review classification and investigates their performance. It thus helps researchers select the best available model when dealing with app review classification. For this purpose, a large dataset containing 251661 app reviews is scraped, which can help researchers study such reviews. In addition, various dimensions of the reviews are explored regarding the performance of machine learning models. For example, statistical features such as the number of words and the average word length tend to yield better accuracy. This can guide researchers to explore the metadata of app reviews and perform further experiments. Similarly, the computational complexity of the models can be targeted, for example by clustering the reviews into groups and analyzing the influence of clustering on computational time and accuracy.
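A minimal sketch of how such statistical features might be extracted from a single review is given below. The function name, the returned keys, and the whitespace tokenization are assumptions made for illustration; the exact feature definitions used in the experiments may differ.

```python
def stats_features(review):
    """Simple statistical features of a review (illustrative definitions).

    Words are split on whitespace, so attached punctuation counts
    toward word length in this simplified version.
    """
    words = review.split()
    num_words = len(words)
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    # Count fully upper-case words longer than one character, e.g. "GREAT".
    num_capital_words = sum(1 for w in words if len(w) > 1 and w.isupper())
    return {
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "num_capital_words": num_capital_words,
    }


features = stats_features("GREAT app but crashes often")
print(features)
```

Features of this kind could be appended to the TF-IDF vector of each review before training, which is one plausible way to combine them with the text representation.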

5 Conclusion and future work

With an increasing number of mobile phones, the development and deployment of mobile apps have become a potential market where users select and use apps to perform a large variety of tasks. The choice of an app is greatly influenced by the users’ reviews posted for that app, and developers also use reviews to refine apps with improved and new features. Classifying such reviews into positive and negative would greatly help new users select appropriate apps. However, existing studies do not perform extensive experiments due to the lack of a large dataset. This study overcomes this limitation and scrapes a large dataset comprising 251661 reviews for eight different app categories and fifty-nine apps. An approximately equal number of reviews is gathered for each category to avoid the problem of an imbalanced dataset. The dataset is used for experiments with several machine learning models, namely LR, RF, MNB, KNN, and SVM, following preprocessing, statistical feature collection, and TF-IDF. Results indicate that SVM shows superior classification results, obtaining a higher than 0.93 score for accuracy, precision, recall, and F1 score. Results suggest that machine learning models trained on cleaned reviews can yield highly accurate results. Moreover, using statistical features such as the number of words, average word length, and use of capital words for training the models yields better performance. The performance of the models varies across app categories, as the nature of reviews is different for each category; however, the performance of SVM, LR, and RF proves to be more consistent. The study considers an equal number of samples for the app review categories, so the impact of dataset imbalance is not studied. The collected dataset is large and provides the opportunity to perform analysis using other machine learning models on a large variety of reviews across app categories and apps. We intend to enlarge the dataset further by incorporating more app categories. Creating clusters and examining the link between the apps’ reviews and ratings is also under consideration. In future studies, we intend to include metadata in the classification process to analyze its impact on the performance of machine learning models.