1 Introduction

With the widespread adoption of smartphones, the use of the Internet in mobile environments has surged. Applications installed on smartphones are commonly referred to as "mobile apps" and are predominantly used for information exchange, online shopping, and social networking. These apps are mainly distributed through Google's Play Store and Apple's App Store, both of which host a vast number of apps. Mobile apps are now employed not only on smartphones but also on various other electronic devices [1] and have found applications in business-related tasks [2]. Additionally, by using the built-in sensors of smartphones, they play a significant role in users' medical and daily routines [3]. Hence, mobile apps have become a necessity rather than an option.

App markets place popular mobile apps at the top of their rankings to guide users toward them. To rank apps, app markets provide app ratings and reviews. Mobile app users rely on these ratings and reviews to evaluate the reliability of an app [4]. If an app receives a low rating and negative reviews, its visibility in the app market diminishes, resulting in a reduced user base. Consequently, low app ratings and negative reviews can adversely impact the reputation and revenue of the company distributing the app. Moreover, given the influence of negative reviews on user decisions, companies consistently update their apps (addressing bugs and enhancing features) to garner positive feedback [5, 6]. Companies routinely monitor app ratings and reviews, collecting user feedback that then informs their app update strategies [7].

App reviews express users' opinions about an app. Most are written in one or two short sentences conveying the user's positive or negative emotions, and some contain concrete feedback about the app. However, some app reviews contain wrong information; we refer to these as false reviews. If app developers incorporate false reviews into app updates, the app may be damaged through, for example, incorrect UI configurations, the spread of inaccurate information, and software bugs [8]. Moreover, app developers may waste significant time and money scrutinizing these false reviews. In addition to false reviews, there are reviews with commercial intent and reviews containing malicious slander, referred to as fake reviews. Fake reviews can cause more severe issues than false reviews. They mainly serve the following purposes [9,10,11]:

  • To employ reviewers to write positive reviews and boost the app to the top of app market rankings;

  • To employ reviewers to write negative or false reviews for competitors’ apps;

  • To use macro bot programs to automatically generate positive or false reviews.

Before macro bot programs were introduced, companies with malicious intent hired fake reviewers to write large numbers of fake reviews. Although such reviews are indeed deceptive, they are difficult to distinguish from genuine reviews because they are written by humans. However, hiring fake reviewers has a significant disadvantage: groups with malicious intent must spend extensive time and money to generate fake reviews this way [12]. Owing to this disadvantage, such groups have turned to macro bot programs that generate fake reviews.

Macro bot programs are designed to perform a specific task. As they are automated, no human intervention is required after the initial setup [13]. Most macro bot programs mimic human behavior and perform specific tasks repeatedly [14]. Such programs generate fake reviews faster than humans can write them [15]. Hence, if macro bot programs are abused to write fake reviews, numerous fake reviews can spread quickly [16]. Given this issue, research to identify quickly spreading fake reviews is essential. In particular, the network information, user information, and review text of the affected community should be used for fake review detection.

Previous studies on fake review detection were mainly focused on the identification of sentences generated by fake reviewers or macro bot programs. To detect fake reviews, researchers have analyzed the relationship between the text features of fake reviews and behavioral patterns of fake reviewers [17], used metadata based on the behavior of fake reviewers [18], and employed advanced natural language processing (NLP) and deep-learning technologies to examine the text features of reviews [19].

Unfortunately, techniques for generating fake reviews have also advanced owing to the rapid development of AI technologies. Thus, new detection approaches that employ AI techniques, rather than focusing solely on text features, are required. The importance of such approaches arises from the advent of language models that combine NLP and AI technologies, such as the generative pre-trained transformer (GPT) [20]. These models can generate text similar to human-written text [21]. Accordingly, by combining a language model with a macro bot program, large quantities of text that humans find difficult to distinguish can be generated in a short time [22]. Furthermore, if a language model is abused to generate fake reviews, these reviews can easily reach app developers who apply user feedback to the next app update, causing defects to be introduced into the app system. Fake reviews generated by language models follow the same grammar as human writing, making them difficult to detect with traditional text-mining techniques. Therefore, comprehensive investigations from new perspectives are needed to detect reviews generated by sophisticated language models such as GPT.

In this study, we extract features necessary for detecting reviews generated by the latest language models, such as GPT, and identify the feature combination that achieves the best detection performance. We use reviews from the mobile app market, as shown in Fig. 1, to evaluate the effectiveness of various feature combinations in detecting fake reviews. Further, we comparatively analyze the impact of text-mining techniques and GPT's probability-based sampling techniques on the detection of reviews generated by language models.

Fig. 1 Example of 4 types of mobile app reviews

We refer to app reviews written by humans as "human reviews" and reviews generated by the GPT-2 [23] language model as "machine reviews." First, we collect human reviews to fine-tune the GPT-2 model. We use the fine-tuned GPT-2 model to generate machine reviews. Subsequently, we extract statistically significant features for detecting machine reviews using text-mining and probability-based sampling techniques. We categorize the features extracted using text mining as "text features" and those extracted using GPT-2's probability-based sampling as "probabilistic features." We statistically analyze the detection efficacy of the extracted features to select the most important ones. We then evaluate the performance of various machine-learning models using different feature combinations. Finally, we discuss the effectiveness of the selected best feature combination for detecting machine reviews and the detectability of machine reviews generated by the latest GPT models.

The GPT-2 language model is a particular version of GPT. Although the latest GPT models have been developed for general-purpose text generation without fine-tuning, we focus on machine reviews. As these reviews are either positive or negative towards specific apps, fine-tuning a GPT model is necessary to generate such fake reviews. Accordingly, we used the GPT-2 model, as it is the easiest to fine-tune compared with GPT-3 [24], GPT-3.5, and GPT-4 [25].

The contributions of this paper are as follows:

  • For machine review detection, we define text features and probabilistic features and provide collection and preprocessing procedures to build a dataset.

  • We analyze the effects of text and probabilistic features for machine review detection through statistical techniques, identify the features meaningful for machine review detection, and present the results visually.

  • To identify the best combination of features for machine review detection, we evaluate the performance of machine-learning models and provide the test results.

  • We evaluate the detectability of machine reviews generated by the latest versions of GPT and provide the results.

Section 2 introduces the background of GPT-2 and its sampling strategies. Section 3 describes studies utilizing the GPT-2 language model and research related to the detection of fake reviews generated by GPT-2. Section 4 presents the data collection and preprocessing for distinguishing human and machine reviews and analyzes how text and probabilistic features affect machine review detection in terms of feature selection. Section 5 presents the configuration of various testing environments using machine-learning models with the features selected in Section 4, the evaluation of the machine review detection models, and the selection of the best feature combination. Section 6 discusses the effectiveness of the best feature combination, selected in Section 5, for detecting machine reviews even when fake reviewers attempt to abuse the analysis results in Section 4, as well as the possibility of detecting reviews generated by the latest versions of GPT models using our selected features. Finally, Section 7 presents the conclusions.

2 Background

2.1 Generative Pre-trained Transformers (GPT)

This section introduces GPT, a language model that can be used for generating machine reviews. The era of large language models (LLMs) was ushered in by the introduction of the transformer model, developed by Google in 2017 [26]. GPT, built on the decoder structure of the transformer, was released in 2018 by OpenAI, a non-profit research organization [20]. As of 2023, OpenAI has continuously upgraded its models from GPT-1 to GPT-4 [20, 23,24,25]. These upgraded GPT models can generate sophisticated machine reviews that are difficult to detect.

GPT models are LLMs trained on vast corpora, such as web text and novels. The models excel in text generation and can effectively generate domain-specific text through fine-tuning. They can also perform NLP tasks such as Q&A and summarization. The machine reviews that we aim to detect can also be generated using these models. Therefore, understanding the features and capabilities of the evolving GPT models is essential. GPT-1 has 117 million parameters and performs NLP tasks such as generating natural sentences. GPT-2 is trained with 1.5 billion parameters and possesses more advanced natural language understanding capabilities than GPT-1, especially in tasks such as Q&A. GPT-3 has 175 billion parameters and can perform various NLP tasks that GPT-2 handles, but without fine-tuning. GPT-4 is trained with over 1,000 billion parameters and can handle multi-modal data of various types; thus, it can address not only text but also a wide range of real-world problems. When machine reviews are generated, the language model should be selected by considering cost-effectiveness. Fake reviewers are likely to receive financial support from specific companies, and in such situations, selecting a cost-effective language model is crucial. Among the GPT series, fake reviewers are most likely to use GPT-2. GPT-2 and GPT-3 differ in two important respects: the largest version of GPT-2, GPT-2 XL, is publicly available, whereas using GPT-3 or GPT-4 involves subscription costs through the API provided by OpenAI; moreover, although GPT-3 can generate general-purpose text across various domains without fine-tuning, it exhibits performance similar to that of a fine-tuned GPT-2. Therefore, fake reviewers are likely to use GPT-2 to reduce cost and generate machine reviews tailored to their desired domain.

2.2 Text generation strategies from GPT

In this study, we investigate the text generation mechanism of GPT and its corresponding strategies. General language models comprise an encoder-decoder structure and learn through the vectorization of text information. The encoder passes this vectorized information to the decoder, which interprets the information and converts it into natural language. GPT specializes in text generation by employing an auto-regressive approach. During training on large text datasets, GPT processes each token sequentially and predicts the next token based on the previous tokens. GPT can adopt various sampling strategies for predicting the next token in a given context; the sampling mechanism can be considered a strategy of the decoder structure. Here, we refer to the units of input and the units predicted by the model as "tokens." One of the key sampling strategies in the decoder structure is "greedy search," which selects the next token with the highest prediction probability in a given context. For example, when predicting the next token after a context such as "The dog," if the prediction score for "is" is 0.95 and for "was" is 0.84, greedy search will select "is." Because this strategy always selects the token with the highest prediction probability, a single poor choice increases the likelihood that subsequent token predictions will also be inaccurate.
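The following minimal sketch illustrates greedy selection over a toy next-token distribution; the candidate scores are the illustrative values from the example above, not the output of an actual model.

```python
# Greedy search: always pick the candidate with the highest prediction score.
next_token_scores = {"is": 0.95, "was": 0.84, "runs": 0.40}  # toy values from the example

def greedy_pick(scores):
    """Return the single most probable next token for the current context."""
    return max(scores, key=scores.get)

print(greedy_pick(next_token_scores))  # always "is" for the same input
```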

To overcome this limitation of greedy search, the "beam search" strategy was introduced. Instead of simply selecting the next token with the highest prediction probability, beam search calculates the cumulative prediction probabilities of multiple token sequences and selects the sequence with the highest cumulative probability. Unfortunately, this strategy also has a limitation: the generated text has less variability than human-written text because beam search always produces the same result for the same input. To overcome this, some randomness must be introduced into the token selection process, as randomness is important for maintaining variability while still respecting the probability distribution over tokens. For example, consider a group of candidate tokens, namely "very" (0.81), "good" (0.14), and "bad" (0.02), when predicting the next token after the phrase "The dog is." In this case, the likelihood of each token being selected is determined by its prediction probability; accordingly, the selection probability for "very" is 0.81. This method does not simply select the token with the highest probability but instead selects tokens according to the predicted probability distribution, thus ensuring variability and randomness in the results. Noteworthy probability-based sampling strategies of this kind are top-k sampling and top-p sampling (nucleus sampling), and GPT models can utilize both strategies simultaneously. Fig. 2 presents visualized examples of these sampling strategies.

Fig. 2 Process of top-k and top-p sampling

Top-k sampling is a strategy that sorts tokens in the descending order of their prediction probabilities, and randomly samples from within the top k tokens when predicting the next token. Eq. (1) represents top-k sampling.

$${\textstyle\sum_{w\in V_{top-k}}}P(w\vert x)$$
(1)

Here, w represents a token, V_top-k represents the set of the k most probable candidate tokens, and x represents the preceding token (word). Thus, w is a token randomly selected from the top k tokens in V_top-k. Top-k sampling builds the generated text by randomly selecting from the k tokens with the highest prediction probabilities. Unfortunately, this sampling strategy may have limitations because it randomly selects from a fixed set of k tokens. For example, if the candidate tokens following the "is" token are "very" (prediction probability: 0.81), "good" (prediction probability: 0.14), and "bad" (prediction probability: 0.02) and k is set to 2, the "good" token, despite having a significantly lower prediction probability than "very," may also be generated. As a result, the likelihood of selecting a meaningless token increases.
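A minimal sketch of top-k sampling on the toy distribution above is shown below; renormalizing the probabilities within the top-k set is standard practice and is assumed here.

```python
import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng(0)):
    """Sample the next token from the k most probable candidates (Eq. 1),
    renormalizing their probabilities within the top-k set."""
    tokens = np.array(list(probs.keys()))
    p = np.array(list(probs.values()), dtype=float)
    top = np.argsort(p)[::-1][:k]        # indices of the k most probable tokens
    p_top = p[top] / p[top].sum()        # renormalize within the top-k set
    return rng.choice(tokens[top], p=p_top)

# Candidate tokens after "The dog is" (values from the example above).
candidates = {"very": 0.81, "good": 0.14, "bad": 0.02}
print(top_k_sample(candidates, k=2))     # usually "very", occasionally "good"
```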

By contrast, top-p sampling adopts a strategy that can avoid some of the issues of top-k sampling. Similar to top-k sampling, top-p sampling sorts the candidate tokens for prediction in descending order of their probability values. It then cumulatively sums the prediction probabilities of the top tokens and, once the sum reaches p, samples from that set of tokens. Eq. (2) represents top-p sampling.

$${\textstyle\sum_{w\in V_{top-p}}}P(w\vert x)$$
(2)

Here, V represents the set of top tokens whose cumulative probability reaches the value p, and w is selected from V. For instance, if the candidate tokens following "is" are "very" (prediction probability: 0.81), "good" (prediction probability: 0.14), and "bad" (prediction probability: 0.02) and p is set to 0.96, V comprises the smallest set of top tokens whose cumulative probability reaches 0.96. Tokens are then sampled from V, with each token's prediction probability serving as its likelihood of being sampled. This sampling method can prevent the selection of meaningless tokens. Both sampling strategies aim to achieve results similar to human text generation, producing varied text while avoiding meaningless sentences. They have been utilized in previous studies on language models, mainly for text generation and the detection of specific text. We use these strategies for detecting machine reviews.
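A minimal sketch of top-p sampling on the same toy distribution follows; the threshold of 0.90 is chosen here only so that the lowest-probability token falls outside the nucleus, and renormalization within the nucleus is again assumed.

```python
import numpy as np

def top_p_sample(probs, p_threshold, rng=np.random.default_rng(0)):
    """Sample the next token from the smallest set of candidates whose
    cumulative probability reaches p_threshold (Eq. 2)."""
    tokens = np.array(list(probs.keys()))
    p = np.array(list(probs.values()), dtype=float)
    order = np.argsort(p)[::-1]                    # sort candidates by probability
    cumulative = np.cumsum(p[order])
    cutoff = int(np.searchsorted(cumulative, p_threshold)) + 1
    nucleus = order[:cutoff]                       # smallest prefix reaching the threshold
    p_nucleus = p[nucleus] / p[nucleus].sum()      # renormalize within the nucleus
    return rng.choice(tokens[nucleus], p=p_nucleus)

candidates = {"very": 0.81, "good": 0.14, "bad": 0.02}
# With p_threshold = 0.90 the nucleus is {"very", "good"}; "bad" is never sampled.
print(top_p_sample(candidates, p_threshold=0.90))
```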

3 Related work

3.1 Positive and negative uses of language models

Text generation based on language models can be highly beneficial, and language models can be used for text generation in many fields. For example, a previous study focused on generating classical poetry [27]. Hu et al. suggested that language models capture the meaning in the lines of a poem better than RNN-based generation methods, showing that language models are applicable in the literary field. Nevertheless, they can have significant adverse impacts on society if abused. Two recent prominent issues are fake news and deepfakes. Language models can be used to generate fake news that rapidly propagates through the media [28]. Moreover, according to an analysis of language model-generated fake news, the reliability of such news was evaluated as similar to that of real news written by humans. Given the potential for abuse in generating human-like news, solutions are urgently needed. Additionally, when creating fake news with language models, one can generate not only news content but also news headlines with a satirical style [29]. Experimental results indicated that the generated headlines were more polished than those written by humans. Based on this analysis, fake news created by language models can effectively convey various emotions and sway public opinion. Hence, the generation of hard-to-detect fake news has led many researchers to study detection methods.

In fake-news detection research, fake-news detection methods based on the latest computer security techniques have emerged [30]. Grover—the model proposed by Zellers et al.—is a generator of fake news that is used for detecting fake news. The researchers stated that Grover is the most effective model for detecting fake news. Grover was compared with other models for the detection of fake news, including GPT-2, BERT, and FastText. According to the results, Grover had the best fake-news classification performance among the models. However, language models can generate not only fake news but also deepfakes. As such, research has been conducted on the detection of deepfakes generated by language models. Some of these studies focused on the detection of deepfake tweets on Twitter [31]. To detect deepfake tweets, Fagni et al. collected tweets from 23 bots and generated tweets with text generation techniques and language models such as Markov chains, RNN, RNN+Markov, LSTM, and GPT-2. They generated approximately 25,000 tweets (half written by humans and half generated by bots) and used them to evaluate 13 deepfake text detection methods. The results indicated that applying RNN-based techniques to detect generated tweets improved the performance of the detection model.

3.2 Language model-based generated text detection

Researchers have developed various methods for preventing the abuse of language models. A study was performed on the detection of text articles on the Internet generated by a language model; rather than the problem of classifying generated articles, this study raised the problem of misclassifying articles written by real humans as generated by a language model [32]. The researchers agreed that large-scale language models such as Grover are necessary to prevent the misuse of language models but also mentioned the possibility of these models being misused. They suggested the necessity of a trade-off between false positives and false negatives for the detection model to detect generated text. The authors warned that if the rate of false positives (cases of the model erroneously labeling human-written text as machine-generated) is high, the classification performance for human-written text will be poor.

In another study on preventing language-model abuse, the top-k and top-p decoding strategies of language models were examined. These two sampling techniques shape the probability distribution used to generate human-like text [33]. The authors found that text generated with these strategies can be detected relatively easily by machine classification systems. Thus, studies have addressed not only preventing abuse through the probability-based sampling techniques of language models but also analyzing the quality of the generated text. One study indicated that when text is generated without considering the language model's algorithmic features, various defects appear in the generated text [34]. In summary, many studies have aimed to prevent the abuse of language models, and to detect text generated by a language model, both the model's probability-based sampling techniques and the completeness of the generated text must be considered.

3.3 Generated fake text detection

In this section, we introduce tools for detecting fake text generated by language models. Prominent tools include the Giant Language Model Test Room (GLTR) and GPTZero [35, 36].

GLTR

GLTR is a tool for identifying text generated by language models using the following identification process. First, the user selects the language model to be used in GLTR (GPT-2, BERT, etc.) and inputs the text to be identified. GLTR then tokenizes the input text. Next, each token is sequentially fed into the language model, which calculates the prediction probability of each input token as well as the prediction probabilities of all tokens in the model's vocabulary. As in the top-k sampling described in Section 2.2, all vocabulary tokens are sorted in descending order of their prediction probabilities. Then, for each input token, its position within this sorted list (the k-ranking) and its prediction probability p are extracted in input order. GLTR utilizes these values of k and p for identification. The values of k are grouped into specific ranges, labeled top-k green, yellow, red, and purple; the user can set the range for each group. Table 1 shows the k-range set for each group in GLTR.

Table 1 Group according to k range in GLTR

GLTR assigns a group to each token of the text to be identified. It has been shown that the group distributions of human-written and machine-generated text differ statistically. Human-written text contains a larger number of red and purple tokens, whereas text generated by language models has fewer red and purple tokens. This difference serves as a key for distinguishing human-written text from machine-generated text. We believe that not only the top-k values but also the top-p values can serve as keys: top-p represents the prediction probability of an individual token, and the distribution of top-p values over all tokens in a text reflects the features of the language model, making it useful as a key.
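The per-token statistics that GLTR relies on can be reproduced directly with a causal language model. The sketch below uses the Hugging Face transformers implementation of GPT-2 (an assumption; GLTR ships its own backend) to compute, for each token of an input text, its rank k in the model's sorted prediction list and its prediction probability p, and then assigns the color group from Table 1.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_rank_and_prob(text):
    """Return (token, rank k, probability p) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]        # shape: (seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    results = []
    for pos in range(len(ids) - 1):
        next_id = int(ids[pos + 1])
        dist = probs[pos]                                 # prediction made after token `pos`
        p = dist[next_id].item()
        k = int((dist > dist[next_id]).sum().item()) + 1  # 1-based rank of the actual token
        results.append((tokenizer.decode(next_id), k, p))
    return results

def color_group(k):
    # Ranges follow Table 1 (GLTR's default grouping).
    if k < 10:
        return "green"
    if k < 100:
        return "yellow"
    if k < 1000:
        return "red"
    return "purple"

for tok, k, p in token_rank_and_prob("This app is very good and easy to use"):
    print(f"{tok!r:12} k={k:<5} p={p:.3f} group={color_group(k)}")
```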

GPTZero

Similar to GLTR, GPTZero is a tool for identifying text generated by language models and was released in January 2023. It is continuously being updated to detect text generated by recent versions of ChatGPT [37], and within just one week of its release it had been used by approximately 30,000 users. GPTZero classifies text as either generated by a language model or written by a human by measuring its "perplexity" and "burstiness." Perplexity is one of the key metrics used for evaluating language models; it indicates how much effort a language model must devote to inferring each token when generating text. A lower score means the model could generate the text without considering a wide range of tokens and could efficiently select good tokens, suggesting superior text-generating capability, whereas a higher score suggests less effective generation. The other metric, burstiness, measures how often similar tokens appear repetitively in the text; a higher frequency of similar tokens increases the likelihood that the text was generated by a language model. Therefore, the two metrics employed by GPTZero could be important for detecting text generated by language models.
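GPTZero's exact scoring is not public, so the sketch below only approximates the two metrics: perplexity is computed as the exponential of GPT-2's mean token-level cross-entropy, and burstiness is approximated (an assumption) as the spread of sentence-level perplexities.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """exp(mean negative log-likelihood) of `text` under GPT-2."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Supplying labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc.input_ids).loss
    return math.exp(loss.item())

def burstiness(sentences):
    """Rough proxy: standard deviation of sentence-level perplexities
    (machine-generated text tends to be more uniform, i.e., less bursty)."""
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

sentences = ["This app is great.", "I use it every day.", "It never crashes on my phone."]
print(perplexity(" ".join(sentences)), burstiness(sentences))
```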

3.4 Fake review detection

We introduce various studies on detecting fake reviews generated by language models and detecting fake reviews written by fake reviewers.

Lu et al. suggest that pre-trained language models such as BERT do not reflect knowledge about sentiment well [38]. They proposed a fake review detection model, BSTC, to address this drawback. BSTC detects fake reviews by using the context, meaning, and sentiment information of reviews. In particular, SKEP was introduced to extract sentiment information effectively, and an excellent fake review detection accuracy of 93.44% was achieved. Adelani et al. similarly suggest that it is difficult for language models to generate text reflecting emotional information [39]. They showed that several fake review detection models do not perform well in detecting fake reviews generated by a fine-tuned language model; in particular, GLTR, GPT-2PD, and Grover struggled to detect such fake reviews. Wang et al. proposed a fake review detection model that fuses multiple textual features of a review with non-textual features [40]. Sentiment information, semantic information, syntactic information, and the number of words were used as the textual features, and review status information (review rating, etc.) was used as the non-textual features. Alsubari et al. conducted fake review detection on reviews collected from the TripAdvisor website [41]. They utilized a frequency-based approach, term frequency-inverse document frequency (TF-IDF), and sentiment scores as features for detecting fake reviews. The approach was evaluated using four machine-learning models, among which the Random Forest model achieved approximately 95% accuracy in classifying fake reviews.

Previous studies have mainly focused on text-mining methodologies such as TF-IDF and sentiment analysis for detecting fake reviews. However, the effectiveness of these traditional methods against LLMs such as GPT is questionable. A previous study [34] showed that ignoring the algorithmic features of language models could introduce various flaws into the generated text. Considering this, we introduce the probabilistic features described in Section 2.2. Applying probabilistic features to fake text detection is a recent research trend; nonetheless, most studies limit their methods to simple criteria or thresholds, and the effectiveness of this approach on short texts, such as mobile app reviews, is not clearly understood. In this study, we adopt probabilistic features as key features and aim to apply them effectively to mobile app reviews. We evaluate the effectiveness of combining different text and probabilistic features in detecting fake mobile app reviews.

4 Feature analysis for detecting generated reviews from language models

In this section, the effects of text features and probabilistic features on machine-review detection are analyzed, and the most meaningful features for machine review detection are identified. We generated machine reviews by fine-tuning the language model with human reviews and preprocessed the text features and probabilistic features of the generated machine reviews to build the dataset. Each feature is presented in figures and tables for statistical analysis. Fig. 3 shows the data collection, preprocessing, and analysis procedures.

Fig. 3 Entire analysis process for machine review detection

4.1 Dataset

Human reviews were collected, as shown in Fig. 3, and machine reviews were generated using the language model. First, to generate machine reviews, we collected human reviews from Kaggle. The collected data comprise user reviews written on the Google Play Store, including the app's name, the user review, and information related to sentiment classification and sentiment scores. Kaggle is a platform where companies and individual users can publish data from various fields for use and analysis, and it supports the development of models. The data collected from Kaggle include 70,000 human-written reviews, from which we sampled 5,000 reviews. Next, we fine-tuned the GPT-2 small model using the human reviews to generate machine reviews. To this end, we first removed meaningless white space and emoticons (e.g., face icons) from the text of the collected reviews.

We fine-tuned the GPT-2 small model using the preprocessed text and generated text using the fine-tuned GPT-2 small model. Issues were encountered during the text generation process, such as repetitive text and incorrectly structured sentences. To address this, we tuned the learning parameters used in the fine-tuning of GPT-2 small, specifically the epoch and batch_size. Additionally, we tuned the parameters used in text generation, namely top-k and top-p. Table 2 shows the tuning values of the parameters used when generating machine reviews.

Table 2 Parameter of GPT-2 for generating machine reviews

The fine-tuning parameters in Table 2 refer to the parameters used when fine-tuning the GPT-2 small model, and the generating parameters refer to those used when generating text with the fine-tuned model. The parameters listed in Table 2 were tuned through repeated experiments to address the issues described earlier. The fine-tuning parameters were set by observing changes in the loss value while fine-tuning the GPT-2 small model; the epoch and batch_size values listed in Table 2 represent the points at which the loss no longer changed significantly.

For the generating parameters, we conducted tuning through an ablation study. The candidate values were as follows: 10, 40, and 100 for top-k, and 0.86, 0.92, and 0.96 for top-p. During the ablation study, we used perplexity as the metric for evaluating each parameter value. We generated 1,000 machine reviews for each parameter setting and then calculated the perplexity of the machine reviews for each setting. The results are illustrated in Fig. 4, in which the x-axis represents perplexity and the y-axis represents the top-k and top-p values used when generating the machine reviews.

Fig. 4 Perplexity distribution according to top-k and top-p parameter values

Figure 4 displays the distribution of perplexity for the machine reviews generated with each parameter value. The perplexity of the generated text tends to increase as the top-k and top-p values increase, because higher top-k and top-p values mean that the model considers more tokens when generating text. As explained in Section 3.3, a lower perplexity indicates better performance of the language model; however, lower top-k and top-p values produce more consistent text, whereas higher values produce more diverse text. We therefore considered the trade-off between these factors to determine appropriate values, selecting 40 for top-k and 0.96 for top-p so as to generate varied text without inducing significant confusion in the language model. Using the selected parameter values, we produced 5,000 machine reviews. Table 3 shows samples of human and machine reviews. As indicated in Table 3, the machine reviews use grammar similar to that of the human reviews, making it difficult to differentiate between the two.

Table 3 Sample of human and machine reviews
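A minimal sketch of the generation step with the selected sampling parameters (top-k = 40, top-p = 0.96) is shown below; the checkpoint directory and seed prompt are placeholders, and the fine-tuning of GPT-2 small on the human reviews (with the parameters in Table 2) is assumed to have been completed beforehand.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder path to a GPT-2 small checkpoint fine-tuned on the collected human reviews.
MODEL_DIR = "./gpt2-small-finetuned-reviews"

tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR).eval()

prompt = tokenizer("This app", return_tensors="pt")   # short seed text (assumed)
with torch.no_grad():
    outputs = model.generate(
        **prompt,
        do_sample=True,           # enable probability-based sampling
        top_k=40,                 # selected top-k value
        top_p=0.96,               # selected top-p value
        max_length=60,            # app reviews are short
        num_return_sequences=5,   # several candidate machine reviews per prompt
        pad_token_id=tokenizer.eos_token_id,
    )

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```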

To examine whether machine reviews possess text quality similar to that of the human reviews in this study (that is, whether they appear to be human-written), we evaluated them using performance metrics for language models. Common performance metrics for language models include the following scores: Bi-Lingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and perplexity. The BLEU and ROUGE scores are calculated by comparing the original sentences with the sample sentences (in this study, the machine review). Meanwhile, the goal of this study was not to generate text identical to the human reviews used for fine-tuning GPT-2 but rather to generate text similar to the domain of the human reviews. Therefore, we used perplexity as the evaluation metric instead of BLEU and ROUGE. We extracted the perplexity of both the collected human reviews and machine reviews. Figure 5 is a boxplot showing the distribution of perplexity for both types of reviews.

Fig. 5 Difference in perplexity distribution between human and machine reviews

The x-axis in Fig. 5 represents human and machine reviews, whereas the y-axis represents perplexity. As evident from the interquartile range (IQR) of both groups, the machine reviews generated for the experiment show low perplexity scores. This result implies that GPT-2 considered fewer candidate tokens when generating the machine reviews than is typical for human-written reviews; in other words, GPT-2 faced few challenges in generating them.

To determine whether a difference exists in the variance and mean between the human and machine reviews, we conducted an F-test and a T-test. The null hypothesis for the F-test was "there is no significant difference in variance between the two groups." The F-test resulted in a p-value of 0.9546, which is greater than 0.05, and thus the null hypothesis was not rejected; there is no significant difference in variance between the two groups. The null hypothesis for the T-test was "there is no significant difference in the mean between the two groups." The T-test resulted in a p-value of 0.1763, which is greater than 0.05, and thus the null hypothesis was not rejected; there is no significant difference in the mean between the two groups. The results of both tests indicate that the machine reviews are statistically similar to the human reviews. Therefore, we verified that the machine reviews in this study are appropriate and proceeded to preprocess them; specifically, we removed reviews that were shorter than 30 characters, not in English, or duplicated.
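A sketch of the two tests on the perplexity values follows. SciPy has no dedicated two-sample F-test helper, so the F statistic and its two-sided p-value are computed from the variance ratio (an assumption about the exact procedure); the arrays here are random placeholders standing in for the perplexity values of the two review groups.

```python
import numpy as np
from scipy import stats

def f_test(a, b):
    """Two-sided F-test for equality of variances of two independent samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    f = a.var(ddof=1) / b.var(ddof=1)
    dfn, dfd = len(a) - 1, len(b) - 1
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))  # two-sided p-value
    return f, p

# Placeholder perplexity arrays for the two review groups.
human_ppl = np.random.default_rng(0).lognormal(3.0, 0.4, 5000)
machine_ppl = np.random.default_rng(1).lognormal(3.0, 0.4, 5000)

f_stat, f_p = f_test(human_ppl, machine_ppl)
t_stat, t_p = stats.ttest_ind(human_ppl, machine_ppl)
print(f"F-test p = {f_p:.4f}, T-test p = {t_p:.4f}")  # p > 0.05 -> fail to reject H0
```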

4.2 Definitions of text and probabilistic features

The text features are based on POS tagging and sentiment analysis techniques. However, it is difficult to detect human-like text with these text-mining techniques alone, for the following reasons. The text generated by language models can reflect the desired sentiments, and the models can generate text at a level nearly identical to human-written text, which makes the generated text difficult to detect through text mining. Furthermore, by fine-tuning a language model with text for a desired purpose, the model can generate text tailored to the target environment. Thus, to classify and detect language model-generated text, the language model's text-generation process must be analyzed. As mentioned in Section 1, we refer to the features extracted through such text-mining techniques as "text features."

Based on the text generation process of GPT-2 described in Section 2.2, we define features derived from top-k and top-p sampling as "probabilistic features." Both sampling methods are used by GPT-2 to decide each token when generating text. Specifically, to decide a single token, GPT-2 assigns a prediction probability value, p, to every token in its vocabulary; the tokens are then sorted in descending order of p, so each candidate has a position k in this list along with its associated prediction probability p. We used the values of k and p as the data for extracting probabilistic features, considering that these values can serve as keys for detecting text generated by GPT-2.

4.3 Feature extraction using text-mining and probability-based sampling techniques

We extracted the features defined in Section 4.2 for machine review detection. For this purpose, we extracted the text features using text mining techniques and configured them in a format that is easy to analyze. The NLTK package—an NLP support module of Python—was used to extract the text features. NLTK provides a corpus, morpheme analysis, and POS tagging for natural language analysis [42]. Next, we extracted the probabilistic features using top-k and top-p sampling. The extracted features were preprocessed with the same top-k colors used in GLTR (green, yellow, red, and purple). Table 4 presents the various features obtained through data collection and preprocessing. The following are descriptions of the subcategories of the features presented in Table 4.

Table 4 Definitions of text and probabilistic features

Other

Among the feature categories in Table 4, the Other sub-category contains the label of each review and the review text from which the text and probabilistic features are extracted. In the class variable, "Human" denotes a human review and "Machine" denotes a machine review; content is the text of each review.

Basic

The basic text feature (basic) is one of the text features and consists of the text length (basic_str_len), the number of words (basic_word_count), and the number of sentences (basic_sentence_count). We extracted the basic text features using NLTK's word tokenizer and sentence tokenizer. Additionally, we used these features to build clusters of reviews of similar types (i.e., reviews with similar lengths, word counts, and sentence counts). Algorithm 1 presents the function for extracting the basic text features. Lines 1 and 4 of Algorithm 1 import the packages required to split the review into sentences and words and to preprocess it. Lines 2 and 5 extract the sentences and words of the review using the two imported tokenizer functions, and Line 3 obtains the text length.

Algorithm 1: Extract basic features
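The extraction described above can be sketched as follows with NLTK; the example review string is only illustrative.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used below (newer NLTK may also need "punkt_tab")

def extract_basic_features(review):
    """basic_str_len, basic_word_count, and basic_sentence_count for one review."""
    return {
        "basic_str_len": len(review),
        "basic_word_count": len(word_tokenize(review)),       # punctuation counts as tokens
        "basic_sentence_count": len(sent_tokenize(review)),
    }

print(extract_basic_features("Great app. I use it every day and it never crashes."))
```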

Part-Of-Speech

The POS features—one of the three groups constituting the text features—were extracted using NLTK's POS tagging: number of adjectives (pos_adj), number of prepositions (pos_adp), number of adverbs (pos_adv), number of conjunctions (pos_conj), number of articles (pos_det), number of nouns (pos_noun), number of numerals (pos_num), number of particles (pos_prt), number of pronouns (pos_pron), number of verbs (pos_verb), number of punctuation marks (pos_dot), and number of other characters (pos_x). Each POS feature indicates how many times the corresponding POS is used in one review. Algorithm 2 presents the function for extracting these POS features. Line 2 of Algorithm 2 downloads "universal_tagset" [43] using nltk.download so that the pos_tag function in Line 6 can be used. "universal_tagset" is a tagset based on universally used POSs, such as nouns, verbs, and adjectives. The most important code for extracting the POS features appears on Lines 5 and 6: the review is divided into words using word_tokenize, and the POS tags are then attached to the separated words through pos_tag. Lines 7–10 count the POS tags attached in Lines 5 and 6 and return the number of each POS used in the review.

Algorithm 2: Extract POS features
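A sketch of the POS-feature extraction with NLTK's universal tagset is given below; the mapping onto the Table 4 feature names is our own naming and the example sentence is illustrative.

```python
import nltk
from collections import Counter
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)   # coarse tags: NOUN, VERB, ADJ, ...

def extract_pos_features(review):
    """Count how often each universal POS tag appears in one review."""
    tags = pos_tag(word_tokenize(review), tagset="universal")
    counts = Counter(tag for _, tag in tags)
    # Map the universal tags onto the feature names used in Table 4.
    mapping = {"ADJ": "pos_adj", "ADP": "pos_adp", "ADV": "pos_adv",
               "CONJ": "pos_conj", "DET": "pos_det", "NOUN": "pos_noun",
               "NUM": "pos_num", "PRT": "pos_prt", "PRON": "pos_pron",
               "VERB": "pos_verb", ".": "pos_dot", "X": "pos_x"}
    return {feature: counts.get(tag, 0) for tag, feature in mapping.items()}

print(extract_pos_features("This app is really useful and very fast."))
```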

Sentimental

The sentiment feature—one of the three groups constituting the text features—is used to compare the sentiment scores (senti_sentimental_score) of human and machine reviews. For this, NLTK's SentimentIntensityAnalyzer was used. SentimentIntensityAnalyzer relies on a sentiment lexicon called VADER, which provides sentiment analysis—particularly for social media text—and is useful owing to its high execution speed [44]. SentimentIntensityAnalyzer provides the polarityScore (pScore) function for calculating positive, negative, and neutral sentiment scores. A higher sentiment score indicates a stronger positive sentiment, and a lower score indicates a stronger negative sentiment. To obtain more fine-grained sentiment labels, the scores were normalized and expressed as strongly positive (SP), weakly positive (WP), neutral (N), weakly negative (WN), and strongly negative (SN). The corresponding ranges were as follows: less than –0.5, SN; –0.5 to –0.1 (exclusive), WN; –0.1 to 0.1, N; 0.1 to 0.5, WP; ≥ 0.5, SP. Algorithm 3 presents the code for extracting the sentiment score of one review. Line 1 of Algorithm 3 imports the package for using the VADER sentiment lexicon.

Line 3 presents the code for obtaining the sentiment score through the pScore function of SentimentIntensityAnalyzer provided in the VADER sentiment lexicon. The review is passed as an argument to the pScore function, and the returned sentimental_score is a sentiment score rated as positive, neutral, or negative. To further subdivide it, the sentiment score is normalized through Lines 4–14.

Algorithm 3: Extract sentimental score
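A sketch of this step with NLTK's VADER analyzer follows; it uses the compound score returned by polarity_scores (assumed to correspond to the pScore value described above) and the five bins listed in the text.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

def sentiment_label(review):
    """Map the VADER compound score (in [-1, 1]) onto the five bins from the text."""
    score = analyzer.polarity_scores(review)["compound"]
    if score < -0.5:
        return "SN"   # strongly negative
    if score < -0.1:
        return "WN"   # weakly negative
    if score <= 0.1:
        return "N"    # neutral
    if score < 0.5:
        return "WP"   # weakly positive
    return "SP"       # strongly positive

print(sentiment_label("I love this app, it works perfectly!"))  # likely "SP"
```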

Top-k

The top-k feature—one of the probabilistic features—is used to examine the difference between the top-k distributions of human and machine reviews. Top-k sampling of the language model is used to extract this feature, and we define the top-k groups identically to GLTR. Top-k green (top_k_green) indicates 1 ≤ k < 10, and top-k yellow (top_k_yellow) indicates 10 ≤ k < 100; for these two features, a higher value indicates a higher probability that the token was generated by the language model. Top-k red (top_k_red) indicates 100 ≤ k < 1,000, and top-k purple (top_k_purple) indicates k ≥ 1,000; for these two features, a higher value indicates a lower probability that the token was generated by the language model.

Top-p

The top-p feature—another probabilistic feature—is adjusted by the hyperparameter p of the language model. If the p of the language model is set as 0.96, the sum of the sampling cumulative probabilities of each token is adjusted so that it does not exceed 0.96. The top-p features were defined as follows in this study: mean sampling probability of tokens (top_p_mean); maximum sampling probability (top_p_max); variance of token sampling probabilities (top_p_var).

The code in Algorithm 4 preprocesses the probabilistic features top-k and top-p. We extract the top-k and top-p values of each review using the application programming interface (API) provided by GLTR. Line 1 imports the GLTR API, and Line 4 extracts the payload provided by GLTR.api. After the review passed as an argument is tokenized, the extracted payload contains the top-k position and probability information of each token. Lines 5 and 6 extract the top-k and top-p lists from the payload. Lines 7–14 preprocess the top-k features using the extracted top-k list: the position of each token is classified into a specific range, and each top-k feature is computed. Lines 15–18 compute the maximum, mean, and variance of top-p using the extracted top-p list, and Line 19 returns the extracted top-k and top-p features.

Algorithm 4: Extract Top k and Top p features
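Assuming a list of per-token (k, p) pairs, such as the one produced by the extraction sketch in Section 3.3 (the paper itself obtains these through GLTR's API), the probabilistic features of Table 4 can be aggregated as in the following sketch; the example pairs are toy values.

```python
import numpy as np

def aggregate_probabilistic_features(token_stats):
    """token_stats: list of (k, p) pairs, one per token of a review.
    Returns the top-k group counts and the top-p summary statistics of Table 4."""
    ks = np.array([k for k, _ in token_stats])
    ps = np.array([p for _, p in token_stats])
    return {
        "top_k_green":  int(np.sum(ks < 10)),
        "top_k_yellow": int(np.sum((ks >= 10) & (ks < 100))),
        "top_k_red":    int(np.sum((ks >= 100) & (ks < 1000))),
        "top_k_purple": int(np.sum(ks >= 1000)),
        "top_p_mean":   float(ps.mean()),
        "top_p_max":    float(ps.max()),
        "top_p_var":    float(ps.var()),
    }

# Toy (rank, probability) pairs for a short review.
print(aggregate_probabilistic_features([(2, 0.41), (1, 0.77), (35, 0.03), (410, 0.004)]))
```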

4.4 Selection of meaningful features for detecting machine reviews

Box plots were used to analyze the effects of text features and probabilistic features on machine review detection. We performed a T-test on the features extracted in Section 4.3 to investigate whether the means of the two review groups (human and machine reviews) differed. The established null hypothesis was, “There is no difference between the means of the two groups.” Table 5 presents the T-test results for the text and probabilistic features. The null hypothesis was rejected for most of the features with a p-value of < 0.05. Thus, we statistically verified that there was a difference in the means between the two review groups for features excluding the pos_x feature. Therefore, all features except pos_x were found to be meaningful in classifying the two review groups.

Table 5 T-test results for all features analyzed for machine review detection

Fig. 6 presents the distribution of each text feature of the human and machine reviews using box plots. According to the results in Fig. 6(b) to (d), basic_str_len, basic_word_count, and basic_sentence_count are general text features, and their medians are similar in the box plots for the two review types. However, with regard to the IQR, most features exhibited a wider distribution for the machine reviews than for the human reviews. The machine reviews consisted of more diverse sentences than the human reviews: compared with the human reviews, they used multiple sentences and did not omit words as frequently. In contrast to humans, language models rarely omit words, implying that when constructing sentences, language models produce more complete text than humans.

Fig. 6 Box plots showing text feature distributions of human and machine reviews

Figure 6(a) presents the senti_sentimental_score feature, which was similar between the two review groups. This is because, according to the sentiment-score distribution of the collected human reviews, most of the reviews had a strongly positive sentiment, whereas weakly positive, neutral, weakly negative, and strongly negative sentiments appeared with low frequency. When people write reviews, they generally tend to leave short, positive reviews. Because the machine reviews were generated by fine-tuning the language model with these short, positive human reviews, the machine reviews were expected to have a similar distribution. However, if a malicious user fine-tunes the language model with malicious reviews and generates machine reviews, the senti_sentimental_score feature is expected to be useful for distinguishing the two review groups.

Fig. 6(e) to (p) present the distributions of the POS features. In these figures, the box plots of the two review groups exhibit similar patterns; however, as with the other features, the POS features exhibited a wider distribution for the machine reviews than for the human reviews. Figure 6(f), (n), and (p) present the results for pos_verb, pos_adp, and pos_dot, which exhibited slight differences between the two review groups. Figure 6(j) presents the results for the number of nouns, which exhibited similar medians between the two review groups, but the IQR distribution was wider for the machine reviews. This difference suggests that when people write reviews mainly to express their opinions, they write short and concise reviews that omit many nouns. According to the differences in the distributions of all parts of speech in Fig. 6, the language model that generates machine reviews tends to produce completely formed text, in contrast to human-written text. This result can occur because various types of data were collected to train the GPT-2 model.

Fig. 7 Box plots showing top-k feature distributions of human and machine reviews

Figure 7 presents box plots for the top-k features. Figure 7(c) and (d) show the distributions of top_k_red and top_k_purple among the probabilistic features of the human and machine reviews; these two features mainly exhibited large medians for the human reviews. Figure 7(a) and (b) show the distributions of top_k_green and top_k_yellow for the two review groups, which exhibited large medians for the machine reviews. This indicates that when a person chooses the next word, they often select one that the language model is unlikely to predict.

Figure 8 presents box plot visualizations of the distributions of the top-p features among the probabilistic features of the human and machine reviews. Fig. 8(a) to (c) show the distributions of top_p_mean, top_p_max, and top_p_var for each review group. top_p_mean and top_p_var exhibited larger medians for the machine reviews than for the human reviews. The results in Fig. 8(a) and (c) indicate that, for each feature, the medians differed between the two review groups. This is likely because the value of p was adjusted so that the fine-tuned language model would write text at the same level as humans. In general, the p values of the tokens were slightly higher for the machine reviews than for the human reviews. Thus, the two review groups differed with regard to the value of p, which would likely affect the classification performance of a machine review classification model.

Fig. 8 Box plots showing top-p feature distributions of human and machine reviews

To analyze the top-k features more comprehensively, we clustered both basic_str_len and basic_word_count for human and machine reviews. Figure 9 shows the variations in top-k features for clusters organized by basic_str_len and basic_word_count. In this figure, the x-axis denotes cluster numbers organized by basic_str_len and basic_word_count, arranged in ascending order, and the y-axis denotes top-k features. As demonstrated in Fig. 9(a) and (b), the median for basic_str_len was smaller for human reviews compared to machine reviews. However, the observations in Fig. 9(c) and (d) were inconsistent with those in Fig. 9(a) and (b).

Fig. 9 Box plots showing the text-length and word-count clustering results for the top-k features

These two sets of results were consistent with the difference in top-k feature distributions between the two review groups; i.e., regardless of basic_str_len, the human reviews exhibited high frequencies of top_k_red and top_k_purple, and the machine reviews exhibited high frequencies of top_k_green and top_k_yellow. Fig. 9(e) to (h) present the differences in distributions between the two review groups for the basic_word_count cluster. Fig. 9(e) to (h) show similar results to the basic_str_len cluster. However, according to the IQR of the machine reviews, all the cluster results exhibited larger medians than the human reviews. Thus, it is expected that machine reviews use more words and omit fewer words than human reviews.

Similar to the top-k features, we clustered basic_str_len and basic_word_count to analyze the top-p features. Fig. 10 shows the variations in top-p features for clusters organized by basic_str_len and basic_word_count. In this figure, the x-axis denotes cluster numbers organized by basic_str_len and basic_word_count, arranged in ascending order, while the y-axis denotes top-p features. As shown in Fig. 10(a) to (c), top_p_mean, top_p_max, and top_p_var had various distributions regardless of basic_str_len.

Fig. 10 Box plots showing the text-length and word-count clustering results for the top-p features

However, in Fig. 10(d) to (f), each feature exhibits larger medians for machine reviews than for human reviews based on basic_word_count. This is likely because the hyperparameter p was set to 0.96 during the fine-tuning of the language model. In contrast to the language model, no value of p constrains human reviews, so various distributions occurred. Therefore, most tokens in both human and machine reviews tend to reach a cumulative probability of 0.96, and the more varied top-p values appearing in human reviews suggest that tests using these features are effective for classifying the two review groups.

5 Feature combination for detecting machine reviews

Various combinations of the features analyzed in Section 4 were used to train the machine review detector, after which the performance was evaluated. In particular, we compared the effects of the text features and probabilistic features on machine review detection through model experiments and selected the optimal combination of features for detecting machine reviews. To this end, we used text features and probabilistic features among the features selected in Section 4 as the primary features of the machine review classification model. First, we evaluated the machine review classification models using text features. Second, we evaluated the models using the text and probabilistic features (top-k). Finally, we evaluated the models using the text and probabilistic features (top-k and top-p). The models used for evaluation applied typical machine-learning techniques such as logistic regression (LR), random forest (RF), a support vector machine (SVM), AdaBoost (AB), and artificial neural networks (ANN). For a balanced classification-model evaluation, K-fold cross-validation was used. For the evaluation indicators, we used the F1 score, which represented the classification performance for human and machine reviews, and the classification accuracy. The macro F1 score was also used, which was the mean classification performance of the two review groups.
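A sketch of the evaluation protocol with scikit-learn is shown below; the file name, the choice of K = 5 folds, and the Random Forest settings are placeholders, and the per-class F1 scorers use the "Human"/"Machine" labels defined in Section 4.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate

# Preprocessed dataset with the Table 4 features, a "class" label, and the review text.
df = pd.read_csv("review_features.csv")          # placeholder file name
X = df.drop(columns=["class", "content"])
y = df["class"]                                  # "Human" or "Machine"

scoring = {
    "accuracy": "accuracy",
    "macro_f1": "f1_macro",
    "f1_human": make_scorer(f1_score, pos_label="Human"),
    "f1_machine": make_scorer(f1_score, pos_label="Machine"),
}

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # placeholder settings
scores = cross_validate(clf, X, y, cv=5, scoring=scoring)        # K-fold cross-validation
for name in scoring:
    print(name, scores[f"test_{name}"].mean())
```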

5.1 Description and setting of machine-learning models

Before the experiment, we configured the experimental environment for the model evaluation. The data preprocessed as described in Section 4 were used for the experiment. The data contained 5,000 human reviews and 5,000 machine reviews generated by a language model. The Python Scikit-learn package and TensorFlow Keras were used for the models applied in the experiment.

  • SVM—a supervised learning model for pattern recognition—is mainly used for classification and regression analysis. The main hyperparameters of the SVM are the kernel, C, and gamma. Depending on the features used, the SVM feature space becomes high-dimensional and the computational complexity increases; the kernel trick is used to address this problem.

  • RF is an ensemble model of decision trees. It combines the results of multiple decision trees and performs learning by varying the data used in each tree; among the results of the trees trained with different data, the final output is determined via voting. The main hyperparameters are max_depth, min_samples_split, max_leaf_nodes, min_samples_leaf, and n_estimators. The hyperparameters tuned in this experiment were max_depth and n_estimators. max_depth limits the length of the path between the root node and the leaf nodes of each tree; it is used to prevent overfitting, in which a tree keeps splitting until each leaf contains only a few samples. n_estimators determines the number of trees used in the RF. As n_estimators increases, the time complexity of the model increases, and even uninformative trees can be included in learning; tuning n_estimators prevents this.

  • AB is an ensemble model that corrects past incorrect predictions during its learning process. It uses decision trees as weak learners and is highly sensitive to outliers and noisy data; however, compared with the other learning models, it is less prone to overfitting. The main hyperparameters of AB are base_estimator, n_estimators, and learning_rate. n_estimators indicates the number of weak learners used for learning, and learning_rate is the coefficient applied to each weak learner's contribution when errors are corrected sequentially.

  • LR is a binary classification model used to solve classification problems. Although the algorithm is simple and easy to implement, it exhibits high performance when the independent variables are informative; in particular, it performs well when the classes can be linearly separated in the space of the independent variables. The main hyperparameters of LR are the penalty and C. The penalty specifies the type of regularization, and C controls the strength of that regularization.

  • An ANN is a statistical learning model inspired by biological neural networks and is primarily trained via supervised learning. The inputs are combined at each node to produce approximations, and because the nodes are connected in layers, the model can perform machine-learning tasks such as pattern recognition. The key hyperparameters include the number of nodes in each layer, batch_size, the optimization function (optimizer), and learning_rate. The number of nodes in each layer determines the number of weights to be learned, batch_size controls the amount of data used in each training iteration, optimizer is the function that updates the weights during training, and learning_rate sets the magnitude of the weight updates (a minimal sketch of such a network follows this list).
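As a concrete illustration, the following is a minimal sketch of such a binary ANN classifier in TensorFlow Keras, assuming a small feature vector per review; the layer sizes, batch_size, optimizer, and learning_rate shown here are placeholders rather than the tuned values reported in Table 6.

```python
import numpy as np
import tensorflow as tf

n_features = 12  # hypothetical number of text + probabilistic features per review

# Two hidden layers; the output node gives the probability of a machine review.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),    # nodes in the first hidden layer
    tf.keras.layers.Dense(16, activation="relu"),    # nodes in the second hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 0 = human review, 1 = machine review
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # optimizer and learning_rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Dummy data standing in for the preprocessed feature matrix and labels
X = np.random.rand(200, n_features).astype("float32")
y = np.random.randint(0, 2, size=200)
model.fit(X, y, batch_size=32, epochs=5, verbose=0)  # batch_size: data used per weight update
```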

We used the aforementioned machine-learning models to evaluate the impact of the features detailed in Section 4 on machine review detection and to determine the best feature combination. We tuned the hyperparameters of the machine-learning models used in the evaluation to their optimal values. To achieve this, we utilized GridSearch [45], a search technique for optimizing the hyperparameters of machine-learning models: various combinations of hyperparameters are configured for a model and evaluated, and the combination that yields the best result is selected.
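As a hedged sketch of this procedure, the snippet below tunes an RF model with scikit-learn's GridSearchCV; the candidate values in the grid and the data are illustrative placeholders, and the actual candidates and chosen values are those listed in Table 6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the preprocessed review features and labels
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

param_grid = {
    "n_estimators": [100, 200, 500],  # number of trees in the forest
    "max_depth": [5, 10, 20],         # depth limit used to curb overfitting
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1, matching the evaluation indicator used in this section
    cv=5,                # K-fold cross-validation inside the search
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```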

Table 6 lists the machine-learning models used for evaluation and each model’s hyperparameters, along with the hyperparameter combinations supplied to GridSearch to find the optimal values for each model. The values highlighted in bold in the “Values for tuning model” column of Table 6 are the optimal hyperparameters for each model. Finally, we configured each model using the optimal hyperparameters listed in Table 6.

Table 6 Hyperparameter configurations for evaluating machine review detection models

5.2 Selection of best combination of features for machine review detection

For detecting machine reviews, various combinations of the preprocessed and selected features presented in Section 4 were used to train the models, followed by a performance evaluation. Finally, the best combination of features for accurate machine review detection was selected. The objectives of this experiment were as follows: 1) to evaluate machine review detection models trained with text features; 2) to evaluate machine review detection models trained with text features and the probabilistic feature top-k; and 3) to evaluate machine review detection models trained with all the features. Table 7 presents the features used for each objective.

Table 7 Feature selection of each experiment for machine review detection

As shown in Table 7, all the text features except pos_x were used in the first experiment (first case); pos_x was excluded because only the features found to be meaningful according to the T-test results in Section 4 were evaluated. According to the analysis presented in Section 4, machine reviews contain more well-formed text than even the human reviews: unlike humans, the language model rarely omits words and tends to generate grammatically complete text. These results indicated that there were differences in the text features of human and machine reviews. Therefore, we conducted the first experiment to evaluate these differences.

In the second experiment, top-k was used along with the text features from the first experiment. The top-k features presented in Table 7 are defined as green, yellow, red, and purple according to the value of k. The rationale for this experiment was as follows: in the analysis of the probabilistic feature top-k in Section 4, top_k_red and top_k_purple exhibited larger medians for human reviews than for machine reviews, whereas top_k_green and top_k_yellow exhibited larger medians for machine reviews than for human reviews. Given these analysis results, the second experiment was conducted to evaluate the effectiveness of the top-k features for machine review detection.
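For illustration, the sketch below derives top-k “color” fractions from per-token ranks under the language model. The rank thresholds (10/100/1,000) follow the GLTR convention and are assumptions here; the exact k ranges behind top_k_green, top_k_yellow, top_k_red, and top_k_purple are those defined in Section 4.

```python
from collections import Counter

def top_k_bucket(rank):
    """Assign a token's rank under the language model to a top-k bucket."""
    if rank <= 10:
        return "green"    # among the model's 10 most likely next tokens
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"       # far down the model's ranking

def top_k_features(token_ranks):
    """Fraction of tokens in each bucket for one review."""
    counts = Counter(top_k_bucket(r) for r in token_ranks)
    n = len(token_ranks)
    return {f"top_k_{c}": counts.get(c, 0) / n
            for c in ("green", "yellow", "red", "purple")}

# Hypothetical per-token ranks for one review under GPT-2
print(top_k_features([3, 1, 57, 8, 420, 2, 1500, 12]))
```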

Finally, in the third experiment, all the features presented in Table 7 except pos_x were used. According to the analysis results presented in Section 4, the top-k feature has limitations. Specifically, when the language model generates text, it may produce meaningless tokens because only the k most likely tokens among the candidates are considered. To address this problem, top-p sampling is used instead of top-k sampling: tokens are selected from the smallest set of candidates whose cumulative probability reaches the hyperparameter p. Unfortunately, the hyperparameter p of the language models that generate machine reviews is fixed rather than dynamic; it is set to a value intended to make the generated text resemble human writing. Human writing, however, exhibits a wide range of effective p distributions. Given these analysis results, the third experiment was performed to evaluate the effect of the top-p feature on machine review detection.
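To make the sampling strategy concrete, the following is a minimal sketch of top-p (nucleus) sampling as described above: a token is drawn only from the smallest candidate set whose cumulative probability reaches p. It illustrates the decoding idea, not the exact sampler of the review-generating model.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest candidate set whose cumulative probability >= p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy next-token distribution over a six-token vocabulary
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(nucleus_sample(probs, p=0.9))
```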

First Case

Table 8 presents the evaluation results of the machine review detection models trained using only the text features, corresponding to the first objective. The SVM, RF, and ANN achieved relatively high classification accuracies, whereas AB and LR performed poorly. Overall, however, all the models exhibited low performance, likely because the human and machine reviews mostly had similar text features. Consistent with the analysis presented in Section 4, although there were differences in the text features, many reviews in the two groups were similar, resulting in low performance. Furthermore, all the models exhibited low F1 scores for machine reviews, indicating that the text features alone were not sufficient for detecting machine reviews.

Table 8 Evaluation results of machine review detection models trained using only text features

Second Case

Table 9 presents the evaluation results of the machine review detection models trained using the text features and the probabilistic feature top-k, corresponding to the second objective. Compared with the first experiment, in which only text features were used for training, most models exhibited improved performance. All the models achieved macro F1 scores of approximately 0.85, and the ANN achieved a macro F1 score of 0.88. This performance improvement is attributed to the differences in the top-k percentages, one of the probabilistic features. Consistent with the results analyzed in Section 4, the larger medians of top_k_red and top_k_purple for the human reviews appear to have significantly affected the models’ performance. The results of the second experiment indicated that although the text features used in the first experiment were insufficient for classifying the two review groups, using them together with the probabilistic features was effective.

Table 9 Evaluation results of machine review detection models trained using text features and the probabilistic feature top-k

Third Case

Table 10 presents the evaluation results of the machine review detection models trained using the text features and all the probabilistic features, corresponding to the third objective. Compared with the second case, all the models exhibited a performance improvement of approximately 1%–2%, although only the SVM and ANN achieved high performance. The improvement is likely because the human reviews had a more diverse top-p feature distribution than the machine reviews, as discussed in Section 4, which helped the models separate the two groups; this suggests that the top-p feature is suitable for classifying the two review groups. Additionally, when all the features were used, most models exhibited an average classification accuracy of 87%, a maximum classification accuracy of 90%, and a macro F1 score of approximately 0.90. Although detecting generated text using only text features has become difficult since the advent of language models, using probabilistic features together with the text features can significantly improve the detection performance.

Table 10 Evaluation results of machine review detection models trained using text features and all the probabilistic features

According to the experimental results, the best feature combination for detecting machine reviews was the text features combined with all the probabilistic features. We used 10-fold cross-validation to test whether a model trained with this feature combination was biased toward the data used for training and validation. Table 11 presents the results: the average performance per fold after 10-fold cross-validation was conducted with the feature combination from the third experiment. All models exhibited results consistent with those obtained in the third experiment, and the ANN and SVM exhibited a significant detection effect. Taken together, these results indicate that distinguishing human reviews from machine reviews is difficult using only text features. Owing to the emergence of language models, the text-feature approach that was used to detect text generated by earlier macro bot programs has lost much of its detection power; however, combining text features with probabilistic features appears to be effective for detecting machine reviews.
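A hedged sketch of this 10-fold check is shown below, reporting accuracy and macro F1 for an SVM trained on the combined feature set; the data and the SVC settings are placeholders rather than the tuned configuration in Table 6.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for the text + probabilistic feature matrix
X_all, y = make_classification(n_samples=1000, n_features=15, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_validate(
    clf, X_all, y,
    cv=10,  # 10-fold cross-validation, as in Table 11
    scoring={"acc": "accuracy", "macro_f1": "f1_macro"},
)
print(scores["test_acc"].mean(), scores["test_macro_f1"].mean())
```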

Table 11 Evaluation of machine review detection model performance through 10-fold cross-validation

The experimental results showed that the best feature combination for detecting machine reviews includes text features such as POS, sentiment information, and basic features, as well as probabilistic features based on top-k and top-p. When used together, these yielded the best macro F1 scores. Among all the machine-learning models, the ANN and SVM achieved balanced performance, with accuracy and macro F1 scores of 0.89, making them the most effective models for machine review detection. These results suggest that excellent detection performance can be achieved by applying our selected feature combination to an ANN or SVM, and we expect that even more effective machine review detection can be achieved by exploring further uses of this feature combination.

5.3 Comparative analysis based on statistical tests

We selected the best feature combination and models for machine review detection, and we then evaluated whether they achieve meaningful performance in detecting text generated by GPT-2 compared with existing techniques. GLTR and GPTZero, described in Section 3.3, are technologies for detecting text generated by language models and were used as baselines for this comparison. GLTR relies on the top-k and top-p information that also appears among the probabilistic features used in our study. To compare it with our selected feature combination, we reconstructed GLTR’s features from the machine reviews produced in this study: the extracted top-k information was structured into the top-k green, yellow, red, and purple features, using the same ranges of k as in our study, and the average of p was computed to form the top-p mean feature. We configured GPTZero in the same manner, extracting the perplexity and burstiness features from the machine reviews created in our study and using them as GPTZero’s features. Finally, the features of GLTR and GPTZero were used as inputs to all the machine-learning models used for evaluation, with the hyperparameters configured identically to those used in our study.
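As an illustration of the GPTZero-style features, the sketch below computes perplexity from per-token log-probabilities and approximates burstiness as the spread of per-sentence perplexity; this is one common interpretation rather than GPTZero's exact (unpublished) formula, and the log-probabilities shown are hypothetical.

```python
import math
import statistics

def perplexity(token_logprobs):
    """Perplexity of a text span from its per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def burstiness(sentence_logprobs):
    """Spread (population std. dev.) of per-sentence perplexity within one review."""
    per_sentence = [perplexity(lp) for lp in sentence_logprobs]
    return statistics.pstdev(per_sentence)

# Hypothetical log-probabilities for a two-sentence review
review = [[-2.1, -0.4, -1.3, -0.9], [-3.5, -2.8, -0.2]]
print(perplexity([lp for sent in review for lp in sent]), burstiness(review))
```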

Table 12 shows the evaluation results of machine review detection using our selected feature combination and the features of GLTR and GPTZero with the machine-learning models. The feature combinations in Table 12 refer to GLTR, GPTZero, and our selected feature combination. GPTZero’s perplexity and burstiness features do not achieve meaningful performance in detecting machine reviews: most of the macro F1 scores are low, and notably, the performance in detecting machine reviews is significantly worse than that in detecting human-written reviews. These results suggest that, for app reviews, the perplexity and burstiness of machine reviews do not differ enough from those of human-written reviews to separate the two groups. Unlike GPTZero, GLTR’s top-k and top-p features achieve good performance in detecting machine reviews; however, compared with our selected feature combination, this performance is insufficient. Based on these results, we conclude that although top-k and top-p are meaningful features for detecting machine reviews, achieving good performance is difficult without also using text features, as in our selected feature combination. The feature combination we selected achieves excellent machine review detection performance with all the models. These results indicate that combining text features with probabilistic features helps the models identify patterns for machine review detection. Additionally, using top-p in ways beyond the average value is a possible factor in the enhanced detection performance.

Table 12 Statistical significance between our best features, GLTR, and GPTZero using a machine learning model

Finally, we evaluated the performance of the machine-learning models using our selected feature combination in comparison with GLTR and GPTZero. We examined the normality and homogeneity of variance of the macro F1 scores of each model presented in Table 12. Additionally, to investigate the performance differences among the three feature sets, we conducted a repeated measures analysis of variance (RM ANOVA) and Tukey honestly significant difference (HSD) post-hoc tests. Table 13 displays the results of the statistical tests.

Table 13 Results of verifying the performance differences between our best features, GLTR, and GPTZero

To validate the normality of all feature sets, we performed the Shapiro-Wilk test; the p-values for our selected feature combination, GLTR, and GPTZero were all above 0.05, consistent with a normal distribution. We also conducted Levene’s test to verify homogeneity of variance; the variances of the three feature sets were found to be equal. Therefore, the performance of all feature sets in Table 13 satisfies the conditions of normality and homogeneity of variance. Subsequently, we conducted an RM ANOVA test to check for performance differences among the three feature sets. As shown in Table 13, the F-value is 18.3193 and the p-value is 0.0004, which is less than 0.05, indicating a significant difference. Based on these results, we performed the Tukey HSD post-hoc test to estimate the mean differences between the three feature sets. The post-hoc test reveals that the p-value for the mean difference between our selected feature combination and GPTZero is 0.0, the p-value between our selected feature combination and GLTR is 0.0015, and the p-value between GLTR and GPTZero is 0.0229. Therefore, all pairwise mean differences are significant at p < 0.05. Thus, the machine-learning model utilizing our selected feature combination achieves higher performance than GLTR and GPTZero, and the difference is statistically significant.
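The test pipeline described above can be reproduced roughly as follows with SciPy and statsmodels; the macro F1 values used here are illustrative placeholders, not the values reported in Tables 12 and 13.

```python
import pandas as pd
from scipy.stats import shapiro, levene
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

ours    = [0.90, 0.89, 0.88, 0.90, 0.89]   # macro F1 per model (illustrative)
gltr    = [0.84, 0.83, 0.85, 0.82, 0.84]
gptzero = [0.62, 0.60, 0.65, 0.63, 0.61]

# Normality (Shapiro-Wilk) and homogeneity of variance (Levene)
for name, scores in [("ours", ours), ("gltr", gltr), ("gptzero", gptzero)]:
    print(name, shapiro(scores).pvalue)
print("levene", levene(ours, gltr, gptzero).pvalue)

# Repeated measures ANOVA: each ML model is a "subject" measured under three feature sets
df = pd.DataFrame({
    "model": list(range(5)) * 3,
    "features": ["ours"] * 5 + ["gltr"] * 5 + ["gptzero"] * 5,
    "macro_f1": ours + gltr + gptzero,
})
print(AnovaRM(df, depvar="macro_f1", subject="model", within=["features"]).fit())

# Tukey HSD post-hoc test on the mean differences between feature sets
print(pairwise_tukeyhsd(df["macro_f1"], df["features"]))
```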

6 Discussion

In this section, we performed additional analyses to evaluate the performance of the feature combination and models selected in Section 5, particularly when fake reviewers abuse the features selected in Section 4 to produce machine reviews that are challenging to detect.

Fake reviewers generate machine reviews by fine-tuning GPT-2 on reviews written with malicious intent. As indicated by the results presented in Section 5, detectors are likely to identify such machine reviews based on their text features and probabilistic features. Unfortunately, fake reviewers can abuse the features selected in Section 4 to generate machine reviews that are difficult to detect. The process by which fake reviewers make their reviews harder to detect is as follows: first, the fake reviewer generates machine reviews; then, the fake reviewer keeps only the machine reviews whose values for the features selected in Section 4 are similar to those of human reviews. The reviews extracted in this way are called adaptive machine reviews. Adaptive machine reviews can change the performance of machine review detection models. Therefore, we generated adaptive machine reviews with features similar to those of human reviews and used them to evaluate the machine review detection models from Section 5.

Next, we evaluated the efficacy of the features selected in Section 4 for detecting machine reviews produced by the latest GPT models. The latest model, GPT-4, is currently employed in several natural language tasks and is a multi-modal model capable of handling text as well as other forms of media such as images. Unfortunately, GPT-4 cannot yet be fine-tuned. Therefore, we used GPT-3.5, which can be fine-tuned for specific natural language tasks. GPT-3.5 is a fine-tuned version of GPT-3 and is capable of natural language tasks similar to ChatGPT. We assumed that fake reviewers may use GPT-3.5 to generate machine reviews and abuse it to make detection difficult. Based on this assumption, we examined whether the performance of our machine review detection model, which is based on our selected feature combination, changes.

  • RQ 1) If fake reviewers generate machine reviews with the intention of avoiding detection using the text features we have selected, how is the efficacy of our machine review detection model impacted?

  • RQ 2) Considering machine reviews produced by fake reviewers to avoid detection through our selected probabilistic features, is there a noticeable change in the performance of the machine review detection model?

  • RQ 3) When fake reviewers employ GPT-3.5, one of the latest GPT iterations, to generate machine reviews, is our selected feature combination still effective in identifying these reviews?

6.1 Experiments for RQ1

The following is a scenario in which a fake reviewer adapts the text features of machine reviews: 1) the fake reviewer collects the analysis results used in the machine review classifier; 2) the fake reviewer modifies the fine-tuning data so that the language model’s output resembles the text features of human reviews; 3) the fake reviewer generates machine reviews with the fine-tuned language model; and 4) from these generated machine reviews, the fake reviewer resamples the machine reviews whose text features are similar to those of human reviews. We performed an experiment to determine whether the performance of the text feature-based machine review detection models evaluated in Section 5 changes when machine reviews with text features similar to human reviews are generated through this scenario.
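A hedged sketch of step 4 of this scenario is given below: generated machine reviews are kept only if their text feature values fall inside the interquartile range of the human reviews for every feature. The feature names follow Fig. 11, the data are random placeholders, and the IQR rule itself is an illustrative assumption about how a fake reviewer might resample.

```python
import numpy as np
import pandas as pd

FEATURES = ["basic_str_len", "pos_noun", "pos_adp"]

def adaptive_subset(machine_df, human_df):
    """Keep machine reviews whose features lie within the human IQR on every feature."""
    mask = pd.Series(True, index=machine_df.index)
    for f in FEATURES:
        q1, q3 = human_df[f].quantile([0.25, 0.75])
        mask &= machine_df[f].between(q1, q3)
    return machine_df[mask]

# Random placeholder feature tables for the two review groups
rng = np.random.default_rng(0)
human = pd.DataFrame(rng.normal(10, 2, size=(500, 3)), columns=FEATURES)
machine = pd.DataFrame(rng.normal(12, 3, size=(500, 3)), columns=FEATURES)
print(len(adaptive_subset(machine, human)), "adaptive machine reviews kept")
```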

Figure 11 presents box plots showing how the distributions of basic_str_len, pos_noun, and pos_adp, among the text features of the two review groups, were adjusted to be similar. As shown in Fig. 11(a) to (c), we adjusted the feature distributions of the two review groups to have similar mean and median values. The adaptive machine reviews were then used to evaluate the text feature-based machine review detection models from Section 5. The evaluation was performed on several of the machine-learning models used in Section 5, and the results are presented in Table 14.

Fig. 11 Differences in feature distributions between human reviews and adaptive machine reviews with the adjusted text feature distribution

Table 14 Results of model evaluation using adaptive machine reviews with adjusted text features

As shown in Table 14, the performance worsened for all the models. AB and RF showed a performance drop of approximately 10%, and the ANN and SVM showed a drop of approximately 5%. Although LR exhibited the smallest decrease, its results remained poor because its baseline performance was already low. The model evaluation results indicate that when a fake reviewer generates adaptive machine reviews with text features adjusted to resemble human reviews, the models’ detection performance for these adaptive machine reviews declines. Accordingly, machine review classification models that rely only on text features were found to be vulnerable to attacks by fake reviewers.

6.2 Experiments for RQ2

Fake reviewers can adjust not only text features but also probabilistic features. The scenario of adjusting probabilistic features is identical to that for text features. Fig. 12 shows box plots where the probabilistic features top-k and top-p are adjusted in machine reviews to resemble human reviews. We generated adaptive machine reviews with adjusted distributions, as shown in Fig. 12. top_k_green and top_k_yellow exhibited large differences between human and machine reviews, but we adjusted them to resemble human reviews, as shown in Fig. 12(a) and (b). Additionally, top_k_red and top_k_purple exhibited larger medians for human reviews than for machine reviews, but we adjusted them to be similar between the machine and human reviews, as shown in Fig. 12(c) and (d). We also adjusted top_p_mean, top_p_max, and top_p_var to be similar between the human and machine reviews, as shown in Fig. 12(e) to (g). After the adjustments, all the probabilistic features exhibited distributions similar to those for the human reviews.

Fig. 12 Differences in feature distributions between human reviews and adaptive machine reviews with the adjusted probabilistic feature distribution

We then verified the impact of a fake reviewer adjusting the probabilistic features on the models’ classification performance. As in the experiment for RQ1, the evaluation was performed on several of the machine-learning models used in Section 5. The results are presented in Table 15.

Table 15 Results of model evaluation using adaptive machine reviews with adjusted probabilistic features

As shown in Table 15, the model performance declined, similar to the experiment for RQ1. However, the decline differed slightly from that in the previous experiment: although the classification performance decreased as in RQ1, it decreased less when the probabilistic feature distributions were adjusted. The SVM showed a performance drop of approximately 3% relative to its previous performance, and AB and RF showed drops of approximately 3%–4%. The ANN exhibited the most stable values among the models. Hence, the best feature combination selected for detecting machine reviews in Section 5 has a substantial impact on machine review detection and is expected to yield stable defense performance against attacks by fake reviewers.

6.3 Experiments for RQ3

The results of our experiments for RQ 1 and RQ 2 confirmed that we can detect the adaptive machine reviews generated by fake reviewers. However, these adaptive machine reviews were generated using GPT-2. We therefore evaluated the potential changes in detection performance when fake reviewers use the latest GPT models to generate machine reviews. As mentioned in Section 2.1, the latest GPT models outperform GPT-2 in natural language tasks. If fake reviewers use the latest GPT models to generate adaptive reviews, we expect that distinguishing these reviews from human-written ones will be challenging. Additionally, fake reviewers are likely to generate these reviews in an adaptive manner, as in RQ 1 and RQ 2. Therefore, we evaluated whether the best features selected in Section 5 are effective for machine reviews generated by the latest GPT models and for adaptive machine reviews.

For the experiments, we used the GPT-3.5 model, which can generate general-purpose text even without fine-tuning. GPT-3.5 performs well on natural language tasks without fine-tuning and can also be fine-tuned for specific tasks. Therefore, we used both a fine-tuned and a non-fine-tuned GPT-3.5 model to generate adaptive machine reviews, as in RQ 1 and RQ 2.

Table 16 shows the performance evaluation results for detecting adaptive machine reviews using the feature combination selected in Section 5. The results in Table 16 reveal that the adaptive machine reviews generated with the non-fine-tuned GPT-3.5 are detected with a macro F1 score of 0.99. GPT-3.5 can generate general-purpose text without fine-tuning; however, the generated text is less diverse and more uniform than human-written text, tending toward perfectly formed sentences. Such text differs significantly from the varied writing styles of humans, which we believe contributes to the high detection rate. In contrast, the adaptive machine reviews generated with the fine-tuned GPT-3.5 are detected with a macro F1 score of 0.83, approximately 0.07 lower than the performance reported in Section 5. Specifically, the F1 score for human reviews increases by approximately 0.03, whereas the F1 score for machine reviews decreases by approximately 0.18. These results indicate that detecting machine reviews has become more challenging with the latest GPT models. Compared with GPT-2, GPT-3.5 is trained on a significantly larger dataset, which means that when generating machine reviews, it can select from a wider array of words, making its text more similar to the diverse writing styles of humans. Therefore, as the performance of GPT models improves, research on detecting text generated by the latest GPT models becomes increasingly necessary, owing to the potential decrease in the performance of existing detection models.

Table 16 Results of detecting machine reviews generated by the latest GPT using the combination of features we selected

7 Conclusion

The effects of text features and probabilistic features of machine reviews on the detection of machine reviews generated by GPT-2-based language models were analyzed. According to an analysis of human reviews and machine reviews, we found that among the text features, most POS features are used more frequently in machine reviews than in human reviews. Moreover, the probabilistic features based on the decoder sampling strategies of language models differed significantly between the two review groups. Taken together, the analysis results indicated that the text features and probabilistic features of the two review groups are meaningful factors for classifying the two groups. Accordingly, we selected useful features for detecting machine reviews.

Additionally, to select the best combination of features for detecting machine reviews, we evaluated various combinations of features using representative machine-learning models. When only text features were used for training, the models exhibited low detection accuracies of 68%–71%. When both text features and probabilistic features were used, the models exhibited high detection accuracies of 84%–90%, and the ANN achieved the highest macro F1 score of 0.90. Furthermore, we presented a process whereby fake reviewers generate adaptive machine reviews; we then generated adaptive machine reviews and used them to evaluate the models. According to the results, the performance of most models declined by approximately 3%, but the models nonetheless exhibited stable detection accuracy. Moreover, our method using the optimal feature combination for machine review detection achieved superior performance compared with existing techniques such as GLTR and GPTZero.

We analyzed and evaluated machine reviews generated by language models such as GPT-2 and found that using text features and probabilistic features together is effective for detecting them. We used basic metrics such as frequency, maximum, mean, and variance for the top-p features; exploring other configurations of these features could further enhance their detection power. Furthermore, ultra-large language models, trained far more extensively than the language models used to generate the machine reviews in this study, have recently been developed worldwide. Thus, the detection results will likely differ if machine reviews are generated with ultra-large language models. In future work, we plan to detect machine-generated reviews produced by language models larger than GPT-2.