Determining the best feature combination through text and probabilistic feature analysis for GPT-2-based mobile app review detection

Lee, Seung-Cheol; Lee, Dong-Gun; Seo, Yeong-Seok

doi:10.1007/s10489-023-05201-3

Determining the best feature combination through text and probabilistic feature analysis for GPT-2-based mobile app review detection

Open access
Published: 26 December 2023

Volume 54, pages 1219–1246, (2024)
Cite this article

Download PDF

You have full access to this open access article

Applied Intelligence Aims and scope Submit manuscript

Determining the best feature combination through text and probabilistic feature analysis for GPT-2-based mobile app review detection

Download PDF

1320 Accesses
1 Citation
Explore all metrics

Abstract

Mobile apps, used by many people worldwide, have become an essential part of life. Before using a mobile app, users judge the reliability of apps according to their reviews. Therefore, app reviews are essential components of management for companies. Unfortunately, some fake reviewers write negative reviews for competing apps. Moreover, artificial intelligence (AI)-based macro bot programs that generate app reviews have emerged and can create large numbers of reviews with malicious purposes in a short time. One notable AI technology that can generate such reviews is Generative Pre-trained Transformer-2 (GPT-2). The reviews generated by GPT-2 use human-like grammar; therefore, it is difficult to detect them with only text mining techniques, which use tools like part-of-speech (POS) tagging and sentiment scores. Thus, probability-based sampling techniques in GPT-2 must be used. In this study, we identified features to detect reviews generated by GPT-2 and determined the optimal feature combination for improving detection performance. To achieve this, based on the analysis results, we built a training dataset to find the best feature combination for detecting the generated reviews. Various machine learning models were then trained and evaluated using this dataset. As a result, the model that used both text mining and probability-based sampling techniques detected generated reviews more effectively than the model that used only text mining techniques. This model achieved a top classification accuracy of 90% and a macro F1 of 0.90. We expect the results of this study to help app developers maintain a more stable mobile app ecosystem.

Graphical abstract

An Approach of Extracting Feature Requests from App Reviews

AndroParse - An Android Feature Extraction Framework and Dataset

RGF-Bot: A Novel Feature Selection Method to Identify Malicious Bot Accounts on Social Networking Sites Using Machine Learning

Article 03 November 2023

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

With the widespread adoption of smartphones, the utilization of the Internet in mobile environments has surged. Applications installed on smartphones are commonly referred to as “mobile apps”, which are predominantly used for information exchange, online shopping, and social networking. These mobile apps are predominantly distributed through Google’s Play Store and Apple’s App Store. Both of these major app stores host a vast number of apps. Mobile apps are now being employed not just on smartphones but also on various electronic devices [1] and have found applications in business-related tasks [2]. Additionally, by using the built-in sensors of smartphones, they play a significant role in users’ medical and daily routines [3]. Hence, mobile apps play a crucial role as a necessity rather than an option.

App markets categorize popular mobile apps in the top of rankings to guide users in their effective use. To rank apps, app markets provide app ratings and reviews. Mobile app users use these app ratings and reviews to evaluate the reliability of an app [4]. If an app receives a low rating and negative reviews, its visibility in the app market diminishes, resulting in a reduced user base. Consequently, low app ratings and negative reviews can adversely impact the reputation and revenue of the company distributing the app. Moreover, given the influence of negative reviews on user decisions, companies consistently update their apps (addressing bugs and enhancing features) to garner positive feedback [5, 6]. Companies routinely monitor app ratings and reviews, collecting user feedback, which then informs their app update strategies [7].

App reviews express users’ opinions about the app. Most are written in one or two short sentences expressing the user’s positive or negative emotions. Some app reviews even contain feedback about the app. However, some app reviews contain wrong information, referred to as false reviews. If app developers reflect these false reviews into app updates, system damage may occur, such as incorrect UI configurations, spreading of inaccurate information, and software bugs [8]. Moreover, app developers may waste significant resources, both time and money, to scrutinize these false reviews. In addition to false reviews, there are reviews with commercial intent and those containing malicious slander, referred to as fake reviews. Fake reviews can cause more severe issues than false reviews. They mainly have the following purposes [9,10,11]:

To employ reviewers to write positive reviews and boost the app to the top of app market rankings;
To employ reviewers to write negative or false reviews for competitors’ apps;
To use macro bot programs to automatically generate positive or false reviews.

Before macro bot programs were introduced, companies with malicious intent hired fake reviewers to write large amounts of fake reviews. Although fake reviews written by fake reviewers are indeed false, they are difficult to distinguish from true human-written reviews because they are written by humans. However, hiring fake reviewers to write fake reviews has a significant disadvantage. Groups with malicious intent that hire fake reviewers to write fake reviews must spend extensive time and money to generate fake reviews [12]. Due to this disadvantage, groups with malicious intent have turned to macro bot programs that generate fake reviews.

Macro bot programs are designed to perform a specific task. As they are automated, no human intervention is required after the initial setup [13]. Most macro bot programs mimic human behavior and perform specific tasks repeatedly [14]. Programs with these features generate fake reviews faster than humans writing reviews [15]. Hence, if macro bot programs are abused to write fake reviews, numerous fake reviews can spread quickly [16]. Given this issue, research to identify quickly spreading fake reviews is essential. In particular, the network information, user information, and review text of the community being created should be used for fake review detection.

Previous studies on fake review detection were mainly focused on the identification of sentences generated by fake reviewers or macro bot programs. To detect fake reviews, researchers have analyzed the relationship between the text features of fake reviews and behavioral patterns of fake reviewers [17], used metadata based on the behavior of fake reviewers [18], and employed advanced natural language processing (NLP) and deep-learning technologies to examine the text features of reviews [19].

Unfortunately, techniques for generating fake reviews have also advanced owing to the rapid development of AI technologies. Thus, new detection approaches that employ AI techniques and are not simply focused on text features are required. The importance of developing these new approaches arises from the advent of language models that combine NLP and AI technologies, such as generative pre-trained transformer (GPT) [20]. These models can generate text similar to human-written text [21]. Accordingly, by combining a language model with a macro bot program, large quantities of text difficult for humans to distinguish can be generated in a short time [22]. Furthermore, if a language model is used for malicious purposes to generate fake reviews, these reviews can be easily exposed to app developers who apply feedback to next app update, causing defects to be reflected in the app system and damage the app system. Fake reviews generated by employing language models feature the same language grammar as that used by humans, making their detection by applying traditional text-mining techniques difficult. Therefore, comprehensive investigations considering new perspectives are needed to detect reviews generated by using sophisticated language models such as GPT.

In this study, we extract features necessary for detecting reviews generated by latest language models such as GPT and propose the best feature combination to achieve the best review detection performance. We use reviews from the mobile app market, as shown in Fig. 1, to evaluate the effectiveness of various feature combinations in detecting fake reviews. Further, we comparatively analyze the impact of text-mining techniques and the probability-based sampling techniques of GPT on the detection of reviews generated by language models.

We refer to app reviews written by humans as “human reviews” and reviews generated by the GPT-2 [23] language model as “machine reviews.” First, we collect human reviews to fine-tune the GPT-2 model. We use the fine-tuned GPT-2 model to generate machine reviews. Subsequently, we extract statistically significant features for detecting machine reviews using text-mining and probability-based sampling techniques. We categorize the features extracted using text-mining as “text features” and those extracted using GPT-2’s probability-based sampling as “probabilistic features.” We statistically analyze the detection efficacy of the extracted features to select the most important ones. We then evaluate the performance of various machine-learning models using different feature combinations. Finally, we discuss the effectiveness of the selected best feature combination for detecting machine reviews and the detect-capability of machine reviews generated by the latest GPT models.

The GPT-2 language model is a particular version of GPT. Although latest GPT models have been developed without fine-tuning for general-purpose text generation, we primarily focus on machine reviews. As these reviews are either positive or negative towards specific apps, fine-tuning the GPT model is necessary to generate fake reviews. Accordingly, we used the GPT-2 model as it is the most easily able to fine-tune the model compared with GPT-3 [24], GPT-3.5, or GPT-4 [25].

The contributions of this paper are as follows:

For machine review detection, we define text features and probabilistic features and provide collection and preprocessing procedures to build a dataset.
We analyze the effects of text and probabilistic features for machine review detection through statistical techniques, identify the features meaningful for machine review detection, and present the results visually.
To identify the best combination of features for machine review detection, we evaluate the performance of machine-learning models and provide the test results.
To provide the result of evaluating the detect-capability of machine reviews generated by the latest version of GPT.

Section 2 introduces the background of GPT-2 and its sampling strategies. Section 3 describes studies of utilizing the GPT-2 language model and research related to the detection of fake reviews generated by GPT-2. Section 4 presents the data collection and preprocessing for distinguishing human and machine reviews and the analysis of how text and probabilistic features affect machine review detection in terms of feature selection. Section 5 presents the configuration of various testing environments using machine-learning models with the features selected in Section 4, performing evaluation of the machine review detection model, and selection of the best feature combination. Section 6 discuss the effectiveness of the best feature combination, selected in Section 5, for detecting machine reviews even when fake reviewers attempt to abuse the analysis results in Section 4 and the possibility of detecting reviews generated by the latest versions of GPT models using our selected features. Finally, Section 7 presents the conclusions

2 Background

2.1 Generative Pre-trained Transformers (GPT)

This section introduces GPT, a language model that can be used for generating machine reviews. The era of large language models (LLM) has been ushered in by the introduction of the transformer model, developed by Google in 2017 [26]. GPT, developed using the decoder structure of the transformer, was released by OpenAI, which is a non-profit research organization in 2018 [20]. As of 2023, OpenAI has continuously upgraded its models from GPT-1 to GPT-4 [20, 23,24,25]. These upgraded GPT models can generate sophisticated machine reviews that are difficult to detect.

GPT models are LLM trained on a vast corpus, such as web text and novels. The models excel in text generation and can effectively generate domain-specific text through fine-tuning. They can also perform NLP tasks such as Q&A and summarization. The machine reviews that we aim to detect can also be generated by using these models. Therefore, understanding the features and capabilities of the evolving GPT models is essential. GPT-1 has 117 million parameters and performs NLP tasks such as generating natural sentences. GPT-2 is trained with 1.5 billion parameters and possesses more advanced natural language understanding capabilities than GPT-1, especially in tasks such as Q&A. GPT-3 has 175 billion parameters and can perform various NLP tasks without fine-tuning, similar to GPT-2. GPT4 is trained with over 1,000 billion parameters and can handle multi-modal data of various types. Thus, GPT-4 can process not only text but also a wide range of real-world problems. When machine reviews are generated, the language model should be selected by considering cost-effectiveness. Fake reviewers are likely to receive financial support from specific companies. In such situations, selecting a cost-effective language model is crucial. Among the GPT series, fake reviewers are the most likely to use GPT-2. GPT-2 and GPT-3 have distinctive differences. The latest version of GPT-2, known as GPT-2 xl, is publicly available, whereas using GPT-3 or GPT-4 involves subscription costs through API provided by OpenAI. Moreover, although GPT-3 can generate general-purpose text across various domains without fine-tuning, it exhibits performance similar to that of fine-tuned GPT-2. Therefore, fake reviewers are likely to use GPT-2 to reduce cost and generate machine reviews tailored to their desired domain.

2.2 Text generation strategies from GPT

In this study, we investigate text generation mechanisms and corresponding strategies in GPT. General language models include an encoder-decoder structure and learn through the vectorization of text information. The encoder passes this vectorized information to the decoder, which interprets the information and converts it into natural language. GPT specializes in text generation by employing an auto-regressive approach. During training with large text datasets, GPT processes each token sequentially and predicts the next token based on previous tokens. GPT can adopt various sampling strategies for predicting the next token in a given context. The sampling mechanism can be considered as a strategy of the decoder structure. Here, we define the unit of input and target or phrases for prediction as “tokens.” One of the key sampling strategies in the decoder structure is “Greedy search,” which selects the next token with the highest prediction probability in a given context. For example, when predicting the next token after a context such as “The dog,” if the prediction probability for “is” is 0.95 and for “was” is 0.84, greedy search will select “is.” As this strategy must select tokens with the highest prediction probability, if one incorrect token is chosen, the likelihood of subsequent token predictions being inaccurate increases.

To overcome these limitations of greedy search, the “beam search” strategy has been introduced. Instead of simply selecting the next token with the highest prediction probability, beam search calculates the cumulative prediction probabilities for multiple token sequences and selects the sequence with the highest cumulative probability. unfortunately, this strategy also has limitations. The limitations ensures that the generated text has less variability compared to human-written text because it always produces consistent results for the same input. To overcome these limitations, some randomness are required to the token selection process because it is important for maintaining variability and a probability distribution in token selection. For example, we consider a group of tokens, namely, “very” (0.81), “good” (0.14), and “bad” (0.02), when predicting the next token after the phrase “The dog is.” In this case, the likelihood of each token being selected is determined by its prediction probability. Accordingly, the selection probability for “very” is represented as 0.81. This method does not only simply select tokens with the highest probability, but also allows token selection considering the predicted probability distribution, thus ensuring variability and randomness of results. Noteworthy probability-based sampling strategies in this method are top-k sampling and top-p sampling (nucleus sampling). GPT models can utilize both of these strategies simultaneously. Fig. 2 presents the visualized examples of these sampling strategies.

Top-k sampling is a strategy that sorts tokens in the descending order of their prediction probabilities, and randomly samples from within the top k tokens when predicting the next token. Eq. (1) represents top-k sampling.

$${\textstyle\sum_{w\in V_{top-k}}}P(w\vert x)$$

(1)

w represents the token, V represents the token vector, and x represents the previous token (word). Thus, w is a randomly selected token from the top k tokens in the V vector. Top-k sampling builds the generated text by randomly selecting from the k tokens with the highest prediction probabilities. Unfortunately, this sampling strategy may have limitations because it randomly selects from a fixed set of k tokens. For example, if the candidate tokens following the “is” token are “very” (prediction probability: 0.81), “good” (prediction probability: 0.14), and “bad” (prediction probability: 0.02) and if k is set to 2, the “good” token with a significantly lower prediction probability than “very” may also be generated. As a result, the likelihood of selecting a meaningless token increases.

By contrast, top-p sampling adopts a strategy that can avoid some of the issues of top-k sampling. Similar to top-k sampling, top-p sampling sorts the candidate tokens for prediction in the descending order of their probability values. It then cumulatively sums the prediction probabilities of the top tokens, and when it reaches p, it samples from those tokens. Eq. (2) represents top-p sampling.

$${\textstyle\sum_{w\in V_{top-p}}}P(w\vert x)$$

(2)

V represents the token vector composed of cumulative probabilities reaching the value of p. Therefore, w is selected from V. For instance, if the candidate tokens following “is” are “very” (prediction probability: 0.81), “good” (prediction probability: 0.14), and “bad” (prediction probability: 0.02) and if p is set to 0.96, V comprises tokens whose cumulative probabilities exceed 0.96. Tokens are sampled from V using each token’s prediction probability as the likelihood of it being sampled. This sampling method can prevent the selection of meaningless tokens. These sampling strategies aim to achieve results similar to human text generation, including generating various text and avoiding meaningless sentences. These sampling strategies have been utilized in previous studies on language models, with the main focus on text generation and detection of specific text. We use these strategies for detecting machine reviews.

3 Related work

3.1 Positive and negative uses of language models

Text generation based on language models can be highly beneficial. Language models can be used in different fields for text generation. For example, a previous study has focused on generating classical poetry [27]. Hu et al. suggested that using language models for generating classical poetry captures the meaning in the lines of the poem, better than RNN-based generation methods. Thus, language models are applicable in the field of literary. Nevertheless, they can have significant adverse impacts on society if abused. Two recent prominent issues are fake news and deepfake. Language models can be used to generate fake news that can rapidly propagate through the media [28]. Moreover, according to an analysis of language model-generated fake news, the reliability of the news was evaluated as similar to that of real news written by humans. Given the potential for abuse in generating human-like news, there is an urgent need for solutions. Additionally, when creating fake news with language models, one can generate not only news content but also news headlines with a satirical style [29]. Experimental results indicated that generated news headlines were more polished than those written by humans. Based on this analysis, fake news created by language models can effectively convey various emotions and sway public opinion. Hence, the generation of hard-to-detect fake news has led many researchers to study detection methods.

In fake-news detection research, fake-news detection methods based on the latest computer security techniques have emerged [30]. Grover—the model proposed by Zellers et al.—is a generator of fake news that is used for detecting fake news. The researchers stated that Grover is the most effective model for detecting fake news. Grover was compared with other models for the detection of fake news, including GPT-2, BERT, and FastText. According to the results, Grover had the best fake-news classification performance among the models. However, language models can generate not only fake news but also deepfakes. As such, research has been conducted on the detection of deepfakes generated by language models. Some of these studies focused on the detection of deepfake tweets on Twitter [31]. To detect deepfake tweets, Fagni et al. collected tweets from 23 bots and generated tweets with text generation techniques and language models such as Markov chains, RNN, RNN+Markov, LSTM, and GPT-2. They generated approximately 25,000 tweets (half written by humans and half generated by bots) and used them to evaluate 13 deepfake text detection methods. The results indicated that applying RNN-based techniques to detect generated tweets improved the performance of the detection model.

3.2 Language model-based generated text detection

Researchers have developed various methods for preventing the abuse of language models. A study was performed on the detection of text articles on the Internet generated by a language model; rather than the problem of classifying generated articles, this study raised the problem of misclassifying articles written by real humans as generated by a language model [32]. The researchers agreed that large-scale language models such as Grover are necessary to prevent the misuse of language models but also mentioned the possibility of these models being misused. They suggested the necessity of a trade-off between false positives and false negatives for the detection model to detect generated text. The authors warned that if the rate of false positives (cases of the model erroneously labeling human-written text as machine-generated) is high, the classification performance for human-written text will be poor.

In another study on preventing language-model abuse, top-k and top-p, which are decoding sampling strategies of language models, were used. These two sampling techniques select probabilities for generating human-like text [33]. The authors found that this strategy of language models can be easily detected by machine classification systems. Thus, there are studies on not only preventing abuse by using language-model’s probability-based sampling techniques but also analyzing the level of text generated by the language model. One study indicated that when text is generated using a language model without considering the language model’s algorithmic features, various defects appear in the generated text [34]. As such, many studies have been performed on preventing the abuse of language models, and to detect text generated by a language model, it is necessary to consider the language model’s probability-based sampling techniques and the perfection of the generated text.

3.3 Generated fake text detection

In this section, we introduce tools for detecting fake text generated by language models. Prominent tools include the Giant Language Model Test Room (GLTR) and GPTZero [35, 36].

GLTR

GLTR is a tool for identifying text generated by language models using the following text identification process. First, the user applies the language model to be used in GLTR (GPT-2, BERT, etc.). Then, the user inputs the text to identify into GLTR. GLTR then tokenizes the input text. Next, each token is sequentially fed into the language model, which then calculates the prediction probability for each input token. It also adjusts the prediction probabilities for all the tokens trained in the language model. As in the top-k sampling described in Section 2.2, all tokens are listed and sorted in descending order of their prediction probabilities. Then, information regarding the position of each input token within this sorted list, specifically the k-ranking, and prediction probability p, is extracted. Finally, the value of k and p for each token sorted with their input order are extracted. GLTR then utilizes these values of k and p for identification. The values of k for each token are grouped based on specific ranges, and these groups are divided into top-k green, yellow, red, and purple. The user can set the range for grouping. Table 1 shows the k-range set for each group in GLTR.

Table 1 Group according to k range in GLTR

Determining the best feature combination through text and probabilistic feature analysis for GPT-2-based mobile app review detection

Abstract

Graphical abstract

Similar content being viewed by others

An Approach of Extracting Feature Requests from App Reviews

AndroParse - An Android Feature Extraction Framework and Dataset

RGF-Bot: A Novel Feature Selection Method to Identify Malicious Bot Accounts on Social Networking Sites Using Machine Learning

1 Introduction

2 Background

2.1 Generative Pre-trained Transformers (GPT)

2.2 Text generation strategies from GPT

3 Related work

3.1 Positive and negative uses of language models

3.2 Language model-based generated text detection

3.3 Generated fake text detection

GLTR

GPTZero

3.4 Fake review detection

4 Feature analysis for detecting generated reviews from language models

4.1 Dataset

4.2 Definitions of text and probabilistic features

4.3 Feature extraction using text-mining and probability-based sampling techniques

Other

Basic

Part-Of-Speech

Sentimental

Top-k

Top-p

4.4 Selection of meaningful features for detecting machine reviews

5 Feature combination for detecting machine review

5.1 Description and setting of machine-learning models

5.2 Selection of best combination of features for machine review detection

First Case

Second Case

Third Case

5.3 Comparative analysis based on statistical tests

6 Discussion

6.1 Experiments for RQ1

6.2 Experiments for RQ2

6.3 Experiments for RQ3

7 Conclusion

Data availability

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Competing interests

Conflict of interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation