Prediction of customer’s perception in social networks by integrating sentiment analysis and machine learning

Understanding customer behavior and perception is an important issue for improving customer satisfaction in marketing analysis. Customer conversations with customer support services through social network channels provide a wealth of information for understanding customer perception. Therefore, in this paper, a hybrid framework that integrates sentiment analysis and machine learning techniques is developed to analyze interactive conversations between customers and service providers in order to identify the change of polarity of such conversations. This framework aims to detect the conversation polarity switch as well as predict the sentiment at the end of the customer's conversation with the service provider. This would help companies improve customer satisfaction and enhance customer engagement. The effectiveness of the proposed framework is measured on a real dataset of more than 5000 conversational threads between a customer service agent of an online retail service provider (AmazonHelp) and different customers, extracted from the retailer's public Twitter account over the duration of one month. Different classical and ensemble machine learning classifiers were applied, and the results showed that decision trees outperformed all other techniques.


Introduction
Customer Perception (CP) plays a significant role in influencing customer satisfaction in marketing analysis. Companies need to be aware of current customer perceptions to make more accurate and effective plans for product development and marketing. Customer perception is the process whereby a person organizes and interprets the impressions of the senses in order to give meaning to their environment (Stephen, 2005). The opinions expressed by customers in social media can be analyzed to reveal associations between customers' perceptions and a product, making them an important aspect that companies must address to improve customer perception. Through the analysis of these perceptions, the position of a company in its customers' minds (brand image) and the performance of the services provided can be revealed (Karamitsos et al., 2019). Customer feedback is available through multiple channels such as responses to online text messages, conversations with customer support services, phone surveys, emails, and social media. Most of this feedback is in the form of unstructured text, which requires analytical techniques such as Natural Language Processing (NLP) in order to extract meaningful insights that represent customer perception.
Retail services are activities performed by the service provider to assist customers in achieving their goals. Meanwhile, social networking sites have transformed customers from silent, isolated, and invisible individuals into a noisy, public, and even more unmanageable than usual collective (Patterson, 2012). The way companies engage with their customers to handle inquiries can also influence public opinion (Balakrishan et al., 2014). According to the study conducted in Dam (2020), online communities significantly and positively influence brand loyalty. Therefore, social networks affect the customer's perception of the brand image (Budiman, 2021). Thus, customers' opinions, reviews, and comments published online can influence the decision making of both existing and new customers.
Sentiment analysis has been one of the core research topics in Natural Language Processing (NLP) for extracting meaningful insights that represent customer perception (Stephen, 2005; Zhang et al., 2020). It is considered a tool for studying the customer's opinions, feelings, and emotions expressed through different social media channels (Saragih & Girsang, 2017). Many studies have exploited social media analytics for detecting the influence of events on customer behavior (Wu et al., 2015; Welfare et al., 2010), while others focused on changes in population mood in order to detect significant events (Welfare et al., 2010). Most sentiment analysis approaches focus on investigating the polarity of customers' comments at the document, sentence, or aspect level, without considering the interaction between the writer and the speaker (Zhang et al., 2020). Interactive text is an important signal of the customer's perception of a product or service; despite that, interactive sentiment analysis is still considered a challenge. Interactive sentiment analysis aims to investigate the affective state and sentiment changes of each writer in a conversation, and it has therefore attracted researchers in the NLP field as well as industry (Zhang et al., 2020).
As suggested by Gregoire et al. (2015), customers tend to share their positive experiences online and will continue to buy products from a company if service recovery is done appropriately and their problems are addressed satisfactorily. On the other hand, a study shows that users whose mood shifts are mostly negative generate rants, which are angry forms of expression (Martin et al., 2013).
When such content reaches a big audience, the angry content spreads all over the social platform. In an attempt to identify this bad effect among customers who share conversations on the same social platform, this paper proposes a framework to detect a polarity switch (Abas et al., 2020) in customer conversations with a customer service provider. A polarity switch is a change of the user's mood from a positive, optimistic one to a negative one (or vice versa). For example, a customer may start the conversation with the service provider in the form of a complaint (negative polarity). However, the way the service provider handles the conversation might change this polarity and make the customer satisfied and pleased by the end of the conversation (positive polarity), which constitutes a polarity switch. On the other hand, the response may be poor, and the customer might remain annoyed by the end of the conversation (negative polarity), with no polarity switch. Therefore, in this research the Conversation Polarity Change (CPC) concept is introduced to handle this issue. The CPC detection process aims to identify the sentiment of the customer's tweets in order to automatically detect the change in the customer's polarity throughout the conversation. This is accomplished by analyzing the conversational threads between an online retail service provider agent and a customer (source user) to extract insights about the flow of the conversation. Significant conversational features that affect the change of polarity (polarity switch) between the start and the end of the conversation are extracted. The second objective of this paper is the prediction of the sentiment of the source user's text at the end of the conversation.
These objectives arise because service providers can greatly benefit from the detection and prediction of sentiment changes of their customers. First, detecting a negative sentiment at the beginning of the conversation indicates an unsatisfied customer, so the company can give higher priority to solving that customer's problems. Second, detecting the change of sentiment of customers can help the company analyze the reasons behind that change, whether positive or negative, supported by the most influential conversation features revealed by the proposed work. This can help change the policy the company follows when handling customers. Third, predicting the customer's polarity by the end of the conversation allows the company to apply preventive measures to avoid a negative sentiment by attending to such conversations faster and following best practices. This might change the predicted bad outcome of the conversation into a good one.
In the following, the contributions of this research work are summarized, followed by the organization of the rest of the paper.

Contributions
The contributions of this paper are as follows:
• First, extraction and collection of a Twitter dataset that represents conversations between customers and the retailer agent of Amazon (@AmazonHelp). In addition to the conversation tweets, several related features that might affect the customer's sentiment were extracted.
• Developing a machine learning model for the detection of the Conversation Polarity Change (CPC) in order to measure the change in the customer's polarity.
• Using this model to predict the final polarity of the customer for each of the conversations based on the extracted features.
• Performing experimental evaluations to measure the accuracy of the proposed model.

Organization
The paper is structured as follows: Section 2 starts with discussing the related work. Section 3 introduces the proposed conversation analysis framework. Section 4 explains the usage of different machine learning techniques (both classical and ensemble classifiers). Section 5 illustrates the experimental setup used in building and validating that framework and discusses the results, while the paper is concluded in Section 6.

Related work
The massive amount of social media data is considered an effective resource for extracting valuable knowledge to enhance business intelligence (Wu et al., 2015). Many studies have emphasized that analyzing social media can improve the decision-making process (Ibrahim et al., 2017; Wu et al., 2015; Saragih & Girsang, 2017). In general, most sentiment analysis approaches focus on investigating the polarity of customers' comments at the document, sentence, and aspect levels. Most of those works focused on the analysis of customer reviews of products; however, few of them investigated the flow of interactive text between customers and the customer service agent, which may reflect the image of the brand. In this section, research works developed to analyze customer perception through text analysis are presented. The first part is concerned with applying sentiment analysis to customer feedback, while the second part focuses on applying sentiment analysis at the conversational level (either emails or social conversations).

Using sentiment analysis for measuring customers satisfaction
The authors of Saragih and Girsang (2017) investigated customer engagement by analyzing comments on social media (Facebook and Twitter) for online transport companies. Their research mined the comments on the public Facebook pages and the tweets of three online transport companies in Indonesia (Gojek, Grab, and Uber) using the API services provided by both social media platforms. The comments were classified into three categories (positive, negative, and neutral sentiment) based on an Indonesian sentiment word library, then a simple scoring system was applied to count the number of positive, negative, and neutral comments respectively. Finally, the results were compared with the number of followers of the three online transport companies on their Facebook and Twitter pages to measure the correlation between customer engagement with each company and their feedback.
In Al-Otaibi et al. (2018), the authors developed a Tweet advisor system used to measure customer satisfaction through sentiment analysis. They used SVM as the classification algorithm with training data consisting of data objects whose class labels are annotated as positive, negative, or neutral. The system also provided other services such as Account Analysis, which enabled the user to search for a specific Twitter account and analyze its author's activity rate in addition to the followers' engagement with this account over the last ten days. Furthermore, the developed system enables the user to search Twitter for any keyword, hashtag, or mention of interest to check public opinion and other valuable indicators about it. The text extraction process was applied once on single words (unigrams) and again on word pairs (bigrams). The experimental results indicate that unigram text extraction combined with SVM classification reaches a score of 87%, which is higher than the bigram result.
Another work (Fitri et al., 2018) targets measuring the level of public satisfaction with the data service of a telecommunication operator providing internet access in Indonesia. In this research, a system to analyze sentiment on Twitter about the level of user satisfaction with the operator's data service was developed based on a Naive Bayes Classifier (NBC) to classify tweets into positive, negative, and neutral. The analysis data was taken from the AmazonHelp official public Twitter account through the Twitter API. The time span of tweet data retrieval was four months, with an accuracy of 99%.

Another type of work in the literature is concerned with Business-to-Business (B2B) client satisfaction, such as Tarnowska et al. (2019) and Tarnowska and Ras (2021), where both works use surveys to develop recommender systems for the B2B market. In Tarnowska et al. (2019), the authors extracted action rules for companies to increase their clients' loyalty using benchmark customer surveys. Furthermore, they used aspect-based sentiment analysis to detect the polarity of the text fields in the surveys. Another recent work appears in Tarnowska and Ras (2021), where the authors use another type of survey that has unstructured free-text fields. Their results show that the impact of their recommendations was less than that of the recommender system in Tarnowska et al. (2019) due to the loss of knowledge associated with quantitative data.
The work proposed in this paper focuses more on the Business-to-Customer (B2C) market and on customer satisfaction, utilizing unstructured data extracted from the customer's tweets in conversations between the retailer's customer agent and the customer.

Conversational sentiment analysis for measuring customer satisfaction
The work of Borg and Boldt (2020) investigated customer support data from a large Swedish telecommunications corporation with the purpose of determining the sentiment of customers' emails. The data consists of customer support e-mails collected over the course of a year, around 168,010 e-mails grouped into 69,900 conversations (i.e., email threads). VADER (Gilbert, 2014), a sentiment analysis framework, was used for labeling individual emails in the support threads, i.e., providing a sentiment score for each email. The authors investigated the possibility of automatically classifying email content polarity (i.e., negative or positive sentiment) with the aim of enabling customer support to see customer sentiment without reading through complete email threads. Furthermore, they predicted the sentiment of future email responses (i.e., a yet-to-be-received response from a customer, based on the e-mail message the support personnel is about to send). The findings show that VADER would provide a better customer experience: it outperformed two individual human raters who examined a random sample of e-mails, with an F1-score of 0.84 for the customer sentiment classification task and 0.667 for the prediction of future email responses.
The researchers of Ibrahim et al. (2017) proposed an approach for investigating the impact of online brand communities on users' perceptions of brand image. They focused on exploring the patterns of engagement between companies and customers on the Twitter platform. They statistically measured the polarity of people's opinions towards different products among the selected retailers, then characterized customer engagement across five popular online retail brands (Amazon UK, Tesco, Argos, John Lewis, and Asda). Results showed that all brands received more positive than negative tweets about their products. The lowest percentage of negative sentiments was detected for Amazon, and the highest percentage of positive tweets was for Argos; interestingly, the highest percentage of negative sentiments was also for Argos. Moreover, the authors examined the different factors that may affect customer engagement with a company through a microblogging platform. Therefore, another case study on a specific brand was conducted, and tweets between the company's Twitter account and the customers that mentioned that specific brand were analyzed. The results showed that the level, attitude, length, and media types of tweets have a significant impact on the customer's emotional transitions during conversation time and on the change of polarity. The findings showed a 2.1% decrease in negative sentiments and a 0.3% increase in positive sentiments in customers' tweets and retweets between the beginning and the final stage of conversations as a result of the attitude of the AmazonHelp service provider.
The work of Zhang et al. (2020) is concerned with conversational sentiment analysis for the purpose of developing a conversational database called ScenarioSA. ScenarioSA describes the interaction between two speakers and represents the sentiment change of each speaker in the conversation. Moreover, the proposed model utilized different machine/deep learning algorithms for sentiment analysis and its evaluation. The experimental results achieved accuracy levels in the range of 61% to 73%.
Unlike the above-mentioned research works that applied sentiment analysis for measuring customer satisfaction through either emails or conversations, the proposed framework aims to identify the change of customer polarity at the end of a conversation as well as predict the final polarity of the customer by the end of the conversation.

Proposed framework for prediction of customer perception
According to Orben et al. (2019) and Libai et al. (2010), social media has helped companies engage with and recognize their customers and supported them in connecting with loyal customers. During interactions in the online environment, a relaxed atmosphere for both customer-customer and customer-company interactions can be accomplished (Lemon & Verhoef, 2016). This "beyond purchase" behavioral dimension of customer engagement includes manifestations such as social influence through word of mouth and customer recommendations (Appel et al., 2020; Libai et al., 2010). Thus, customer engagement with online retailing over social networks has an important influence on brand image. From the retailers' perspective, they can monitor the online user community's feedback more effectively and take necessary action when needed. On the other hand, the flow of the conversation between a customer and the retail service provider may affect other customers' views of the service provider.
Therefore, this work aims to explore the impact of the customer's online interactions with the service provider on the polarity of customers' opinions towards retail service providers. It is widely known that when a dedicated channel is provided to customers for expressing their opinions, customers are more likely to send negative words through this channel. However, company engagement with customers has a positive impact on customers' sentiments towards the brand: a significant reduction of negative sentiment from the customer can be obtained after appropriate interactions with retail service providers. Therefore, as mentioned in Sect. 1.1, the proposed framework aims to identify the customer's CPC at the end of a conversation with the retail service provider (in this study, AmazonHelp is used as a case study). Furthermore, prediction of the sentiment of the customer's text at the end of the conversation is examined to find out the effect of different conversation features, such as its length and text, on the improvement/deterioration of the customer's opinion of the customer service provider's behavior. To achieve these objectives, a sentiment analysis tool is used to label the conversations (Socher et al., 2013), then different machine learning techniques are trained on the labeled data in order to classify the change of polarity of the customer's conversations. The main phases of the proposed framework, as shown in Fig. 1, are:
1- Conversation Extraction: extracting conversations between the customer and the retail service providers (henceforth referred to as customer conversations).

2- Feature Engineering: this process aims to map raw data, representing attributes of the conversation and the source user extracted through the Twitter API, into feature vectors to be used in the classification and prediction of the customer's sentiment.
3- Conversation Labeling: identifying and labeling the polarity of the customer's tweets at the start and the end of conversations using the Stanford sentiment analysis API tool, and then applying the CPC.
4- Feature Grouping: categorizing both raw data and calculated data extracted from the customer's conversations into four feature groups/sets. These groups are conversation content features, conversation interaction features, conversation activities features, and source user features.
5- Applying Different Machine Learning Models: for the detection of the CPC and for the prediction of the polarity of the customer's tweet at the end of each conversation based on the introduced features.
In the following, the details of each of these phases are explained.

Conversation extraction
Twitter has become a valuable source of data that can be efficiently used for different domains and applications. Such data are mainly used for marketing and social studies (Aswani et al., 2018; Shirdastian et al., 2017). Companies can add a deep link to their tweets that automatically displays a call-to-action button, allowing the customer to send the business a direct message. Thus, we considered only the tweets sent to the customer support of Amazon, "@AmazonHelp" (2009), the official Twitter customer support account of Amazon, one of the world's largest online retailers. Only messages written in English were extracted using the Twitter API, so Twitter was searched for all English tweets mentioning the account "@AmazonHelp". The pre-processing was performed in two steps: first, the whole conversation was extracted, then cleaning of the text was applied. Traditionally, a conversation thread has a tree structure, where the parent is the source tweet while the rest can be split into individual branches, each starting with either a reply from the customer service provider or a reply from an external customer/user. For this research, each conversation was flattened out into only 2 levels: the first level is the source tweet, and the second level contains all of its replies, including replies to replies. Therefore, in order to identify the source tweet representing the start of each conversation, the returned tweets were filtered such that only tweets that are not a reply to any other tweet are considered source tweets. After getting the source tweets of the customer conversations with AmazonHelp, Twitter was searched to get the rest of the conversation for each of the source tweets.
The returned conversations contained tweets from the original customer (source user), AmazonHelp, and any other external customers/users who took part in the conversation. Cleaning of the text was then applied by stripping URLs, usernames, hashtags, and emoticons. Tweets were also normalized by removing special characters and any separators other than blanks. Finally, lemmatization and grammatical tagging were performed on the cleaned tweets.
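The cleaning step described above can be sketched with simple regular expressions. This is an illustrative approximation, not the authors' exact implementation; in particular, the emoticon pattern is a simplified assumption, and lemmatization/tagging would follow with a standard NLP toolkit.

```python
import re

def clean_tweet(text):
    """Strip URLs, usernames, hashtags, and common emoticons, then
    remove special characters and normalize separators to single blanks."""
    text = re.sub(r"https?://\S+", " ", text)            # URLs
    text = re.sub(r"@\w+", " ", text)                    # usernames
    text = re.sub(r"#\w+", " ", text)                    # hashtags
    text = re.sub(r"[:;=8][-~]?[)(DPp/\\|]", " ", text)  # simple emoticons
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # special characters
    return re.sub(r"\s+", " ", text).strip()             # normalize separators

print(clean_tweet("@AmazonHelp my order #late is missing :( https://t.co/xyz !!"))
# -> "my order is missing"
```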

Feature engineering
During this phase, raw data of both the conversation and the source user are mapped into features to be used in the classification and prediction tasks. Based on the previous phase, the following attributes were extracted from the Twitter API for each conversation:
- Tweet features: starting with the source tweet, the extracted features of each tweet are: text, author, time, tweet hashtags, media and URL, retweet count, and favorite count.
- Source user features: namely, the author's verified status (true/false), the author's followers count, and the author's followees count.
- Conversational features: other conversational features are required by the classification model, such as the conversation length (number of tweets in the conversation), the total time or duration of the conversation, the number of external users, and the number of replies/comments from other external users in the conversation. Note that external users are any users participating in the conversation other than AmazonHelp and the original author of the source tweet.
The extracted/calculated features are summarized in Table 1. These are the only features considered in this work, as it has been found that different indicators can affect the emotional transitions between customers and the service provider agent, such as the length of the tweet, the number of replies, etc. (Ibrahim et al., 2017). Furthermore, it is also important to consider the conversation length when studying the factors that would lead to converting the negative polarity of the customer into a positive one (a change of customer attitude). Therefore, it was necessary to calculate some features from the raw data to be included in the feature vector, as will be explained in Section 3.4. All of the previous features, along with the sentiment features in Section 3.3, were manipulated and mapped into several feature groups that will be discussed in Section 3.4.
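As a concrete illustration of this mapping, the sketch below turns the raw attributes of one conversation into a flat numeric feature vector. The field names are hypothetical and chosen only to mirror the kinds of attributes in Table 1, not the exact schema used by the authors.

```python
# Hypothetical field names; the real Table 1 attributes may differ.
def to_feature_vector(conv):
    """Map raw conversation/source-user attributes into a numeric vector."""
    return [
        conv["conversation_length"],   # number of tweets in the conversation
        conv["duration_minutes"],      # total conversation time
        conv["num_external_users"],    # users other than author and AmazonHelp
        conv["num_external_replies"],  # replies/comments from external users
        int(conv["has_hashtags"]),     # binary: conversation contains hashtags
        int(conv["has_media"]),        # binary: conversation contains media
        int(conv["author_verified"]),  # source user's verified status
        conv["author_followers"],      # source user's followers count
        conv["author_followees"],      # source user's followees count
    ]

conv = {"conversation_length": 7, "duration_minutes": 42.0,
        "num_external_users": 1, "num_external_replies": 2,
        "has_hashtags": False, "has_media": True, "author_verified": False,
        "author_followers": 350, "author_followees": 180}
print(to_feature_vector(conv))  # -> [7, 42.0, 1, 2, 0, 1, 0, 350, 180]
```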

Conversation labeling process
The conversation labeling process aims to identify the customer's perception at the beginning and the end of the conversation. This is done by detecting the polarity of the customer's tweets in an interactive conversation, as the customer's tweets carry their perception of the product/service and reflect the brand image. The conversation labeling process provides labels that represent the user's point of view in terms of positive (expressing positive sentiment), negative (expressing negative sentiment), and neutral (expressing unbiased sentiment or not expressing any sentiment). The process depends on applying a sentiment analysis tool to the first and last tweets of the customer in each conversation. The Stanford API (Socher et al., 2013) was applied to identify the customer's attitude in the conversation by labeling the first and last customer tweets. The tool uses a Recursive Neural Network (RNN) model that investigates polarity by considering the sentence structure; the RNN model avoids losing the sentence's meaning and semantics. The conversation labeling process is divided into two main steps, which are explained in detail below:
• Tweet labeling
• The Conversation Polarity Change (CPC) process.

Tweets labeling
The Stanford API was used as it provided an accuracy of up to 89% on Amazon reviews according to Phand and Phand (2017). Sentence sentiment labeling estimates the strength of polarity, which is categorized into five levels: 2, 1, 0, -1, and -2, from very positive to very negative respectively.
On Twitter, it has been found that the source user may express his/her opinion in several tweets before the service provider replies. This occurs for one of two reasons: either the limitation on tweet length, as Twitter allows only 280 characters per tweet, or the delay of the response from the service provider's side. This may drive the customer to deliver his/her perception through many tweets. In such cases, the conversation labeling process considers them one chunk of text representing the user's point of view, provided that the tweets are written sequentially within a specific time window (namely 60 min). This time window was chosen after observing the nature of the collected conversations, where most of the first tweets that actually express one long tweet occur within that period. Tweets occurring after that time but before the customer service reply are usually separate tweets and might carry a different sentiment than the early tweets. A sample conversation is shown in Table 2, where the source user started the conversation with 3 consecutive tweets within a 2-minute duration. The 3 tweets were merged into one piece of text and labeled with a sentiment of -1.
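The merging rule above can be sketched as follows, assuming each tweet is represented as a dict with an author and a timestamp. This is a simplified illustration of the 60-minute window, not the authors' exact code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=60)  # merging window described in the text

def merge_opening_tweets(tweets, source_author):
    """Merge the source user's consecutive opening tweets into one chunk,
    stopping at the first service-provider reply or at a gap > 60 minutes."""
    merged, last_time = [], None
    for t in tweets:
        if t["author"] != source_author:
            break                                  # provider replied
        if last_time is not None and t["time"] - last_time > WINDOW:
            break                                  # outside the time window
        merged.append(t["text"])
        last_time = t["time"]
    return " ".join(merged)

tweets = [
    {"author": "cust1", "time": datetime(2021, 1, 1, 10, 0), "text": "My parcel is lost."},
    {"author": "cust1", "time": datetime(2021, 1, 1, 10, 2), "text": "It was due yesterday."},
    {"author": "AmazonHelp", "time": datetime(2021, 1, 1, 10, 30), "text": "Sorry to hear that!"},
]
print(merge_opening_tweets(tweets, "cust1"))
# -> "My parcel is lost. It was due yesterday."
```

The merged chunk is then labeled as a single piece of text, as in the Table 2 example.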

Conversation polarity change (CPC) process
In order to examine the change of customer insight regarding the customer service provider, it is important to observe the pattern of dynamic emotional transitions that occur during conversations and how different features influence customer perception. Therefore, this step aims to investigate the conversation's emotional direction by taking into consideration the first tweet's sentiment and the last tweet's sentiment. The conversation's emotional direction may change positively, change negatively, or remain the same during the conversation interaction time; this reflects the customer's satisfaction/perception by the end of the conversation. The CPC process ends by assigning each conversation a label indicating the change of the customer's attitude throughout the conversation. This is done by calculating the difference between the first tweet's and the last tweet's sentiments, as shown in Eq. 1.
CPC_i = SL_EndTweet_i - SL_StartTweet_i    (1)

where CPC_i indicates the polarity change of conversation i, SL_EndTweet_i is the sentiment label of the end tweet of conversation i, and SL_StartTweet_i is the sentiment label of the starting tweet of conversation i. The CPC has 9 potential values, ranging from -4 to 4, which show the strength of change in the customer's attitude by the end of the conversation compared to the start. For example, if the start of the conversation was very negative (-2) and it changed to positive (1) by the end, this reflects three steps toward positivity, and the CPC will give the value 3 as the conversation label. Values from 1 to 4 reflect the degree of positive change in customer attitude by the end of the conversation, and negative changes range from -1 to -4. Larger negative values of CPC show a more negative change in polarity, and larger positive values show a more positive change. As in the example shown in Table 2, the CPC will be zero if the sentiment of the last tweet remains the same as that of the first one.
When applying Eq. 1 to the extracted set of conversations, it was found that a sparsity problem appears in the set of labeled conversations, as some CPC values rarely occur. Therefore, it was useful to merge some values and map them into three classes representing the CPC as either Positive, Negative, or No change, as shown in Table 3. Negative CPC values ranging from -4 to -1 are grouped into one class to indicate a negative change of polarity, and positive values are all merged into one class to indicate a positive change of polarity, while zero remains in a separate "no change" class.
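Eq. 1 and the three-class mapping of Table 3 can be expressed directly in code, with sentiment labels as integers in the range -2 to 2:

```python
def cpc(start_label, end_label):
    """Eq. 1: polarity change = end sentiment minus start sentiment."""
    return end_label - start_label

def cpc_class(value):
    """Map the 9 raw CPC values (-4..4) into the three Table 3 classes."""
    if value > 0:
        return "Positive"
    if value < 0:
        return "Negative"
    return "No change"

print(cpc(-2, 1))             # very negative start, positive end -> 3
print(cpc_class(cpc(-2, 1)))  # -> "Positive"
print(cpc_class(cpc(1, 1)))   # same sentiment -> "No change"
```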

Feature groupings
As mentioned earlier, the main target of this research is to identify how the assistance offered by a well-known retail brand can affect its customers positively or negatively, regardless of unpleasant complaints sent through the online conversation. This is achieved by extracting different attributes representing both the online conversation and the source user, as mentioned in Sect. 3.2. The extracted features are split into the following groups, each stored in a separate vector to be used in the experimental section: Conversation Content Features (CCF), Conversation Activities Features (CAF), Conversation Interaction Features (CIF), and Source User Features (SUF). The CCF are related to the content of the whole conversation, such as text, links, etc. The CAF are related to the activities of the conversation itself, such as its time of creation, its length, etc. The CIF are related to the interaction between the source users (authors) and the other users involved in the conversation, such as retweets and mentions. Finally, the SUF consider the characteristics of the customer (the source tweet's author). A summary of all features used in the proposed framework is shown in Table 4, followed by the details of each feature group.

Conversation Content Features (CCF)
One important aspect of a conversation is its content, which is represented by text, images, links, hashtags, etc. Furthermore, since sentiments are closely related to how people behave in different contexts, it is important to include the sentiment of the first tweet in the conversation as one of the input features to the proposed model. This helps in detecting how the conversation's polarity changes by the end of the conversation. Sentiment is defined as an attitude, thought, or judgment prompted by feeling (Pozzi et al., 2017), and is typically measured as positive, negative, or neutral. Therefore, the following features are considered here:
• Sentiment of the Source Tweet: the sentiment label given to the first tweet of the conversation.
• Hashtags: a binary value indicating whether the conversation contains hashtag(s).
• Media: a binary value indicating whether the conversation contains any multimedia.

Conversation Activities Features (CAF)
This group of features is concerned with the body of the conversation:
• Time Span: the duration of the whole conversation, calculated as the difference between its start time and its end time. This is an important feature: for example, imagine a scenario where a customer's initial contact with a human agent occurs during peak hours, resulting in a long waiting time. This previous experience will consequently shape what the customer expects in subsequent encounters (Tran et al., 2021).
• Conversation Length: the total number of tweets in the whole conversation.
• Average User Tweet Length: the average length of the source user's (author's) tweets.
• Average Amazon Tweet Length: the average length of AmazonHelp's tweets.
• Time Till First Reply: the time between the first tweet from the user and the first reply from AmazonHelp.
• Number of First Tweets: the number of tweets in the series posted by the customer before the first reply from the service provider at the beginning of the conversation (if any).
The last two features are included because they give insight into how fast the service provider responds. Modern customer service providers should be able to react to customers professionally and to be proactive, for example by asking clarifying questions when needed and knowing what to do when they do not have the answer right away. These are all skills that help build a positive customer perception.
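The conversation activities features above can be derived directly from a conversation thread. The sketch below assumes a hypothetical tweet layout (dicts with "author", "time", and "text" keys), which is not specified in the paper:

```python
from datetime import datetime
from statistics import mean

# Illustrative extraction of the Conversation Activities Features (CAF).
# The tweet dict layout ("author", "time", "text") is an assumption.
def caf_features(conversation):
    user = [t for t in conversation if t["author"] != "AmazonHelp"]
    amazon = [t for t in conversation if t["author"] == "AmazonHelp"]
    first_reply = amazon[0]
    # Consecutive customer tweets posted before the first AmazonHelp reply.
    number_of_first_tweets = conversation.index(first_reply)
    minutes = lambda delta: delta.total_seconds() / 60
    return {
        "time_span": minutes(conversation[-1]["time"] - conversation[0]["time"]),
        "conversation_length": len(conversation),
        "avg_user_tweet_length": mean(len(t["text"]) for t in user),
        "avg_amazon_tweet_length": mean(len(t["text"]) for t in amazon),
        "time_till_first_reply": minutes(first_reply["time"] - conversation[0]["time"]),
        "number_of_first_tweets": number_of_first_tweets,
    }
```

Tweet lengths are counted in characters and times in minutes, matching the units used later in the prediction experiments.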

Conversation Interaction Features (CIF)
For each conversation thread in the dataset that contains three or more tweets between the customer and the service provider, additional metadata is extracted regarding the interaction inside the conversation. This metadata includes the following features:
• Favorites: the number of people who favorited/liked any of the thread's tweets, which indicates the degree of influence of the tweet (how well the conversation was received).
• Retweets: the number of users who shared the source tweet.
• Number of Replies: the number of reply tweets from external users.
• Number of External Users: the number of users included in the thread's tweets other than Amazon and the source user.
In any conversation between a customer and a retailer, many external users may be involved. It is important to find out whether they affect the flow of the conversation; therefore, only their counts are included.

Source User Features (SUF)
The tremendous positive/negative effect of reviews on product sales enables online retailers to manipulate the information presented on conversation platforms through the posts of the user who initiates the conversation, who may be a regular customer, an influencer, or a sponsor. Therefore, it is important to extract features that represent the source user (main customer) during the time interval of the conversation:
• Number of source user's followers: the number of people who currently follow the source user, used to measure the user's popularity in the network.
• Number of source user's followees: the number of people currently followed by the source user.
• Verified: whether the source user has a verified account.

Applying different machine learning models
As mentioned in this paper, two main objectives are investigated. The first is the classification of the change in the customers' sentiment polarity throughout the conversation with the retail service provider. The second is the prediction of the conversation's polarity change based on the start of the conversation, which can be used to predict the sentiment at the end of the conversation. This section presents the use of machine learning to classify and predict the change of the source user's polarity.

Classification of the conversation's polarity change
In Sect. 3.3, the conversation labeling process was explained, where each conversation was labeled with a Conversation Polarity Change (CPC) value of either negative, positive, or no change. In this phase, different machine learning techniques were applied to detect the polarity change of the conversation, where conversations are the instances in the dataset. Based on the three CPC classes, both classical and ensemble classifiers were applied to classify the change of the conversation polarity. For classical classifiers, we applied C4.5 Decision Trees, K-nearest neighbor, Naive Bayes, Artificial Neural Network (ANN), Bayes Net, Support Vector Machine (SVM), and Logistic Regression. Ensemble classifiers aggregate the classifications/predictions of several models to achieve better performance, and we use them to prevent the proposed models from overfitting; Bagging was applied as an example of an ensemble classifier. The classifiers were evaluated using accuracy and F-measure; the F-measure was used to give an objective evaluation of the classification, especially for an imbalanced dataset. Multiple runs on the extracted dataset were performed, each with a different group of features, as will be shown in detail in the experiment section.
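The paper runs these classifiers in Weka; a rough scikit-learn analogue of the comparison loop is sketched below, on synthetic stand-in data (the real instances are the labeled conversations with the features of Sect. 3.4):

```python
# Hedged scikit-learn analogue of the classifier comparison; the paper
# used Weka, and the data here is synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score

# Three classes, mirroring the Positive / Negative / No change CPC labels.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "Bagging": BaggingClassifier(random_state=0),  # ensemble of trees
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Weighted F1 approximates Weka's class-averaged F-measure.
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred, average='weighted'):.3f}")
```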

Prediction of the polarity change
The previous classification task is based on the full set of features described in Sect. 3.4. Some of those features are only available at the end of the conversation, such as the conversation length, which counts the number of tweets in the conversation thread. However, it is important to predict the sentiment of the customer's future tweets, especially the last tweet, which reflects the customer's satisfaction with the service provider's behavior. If the customer's satisfaction/sentiment can be predicted at the early stages of the conversation, the retail service can provide a better customer experience and turn a predicted negative sentiment into a positive one. Given the initial sentiment of the customer, predicting the polarity change allows the final sentiment of the customer to be predicted.
This prediction part uses only the subset of features that are available at the early stages of the conversation. For instance, the sentiment of the source tweet is in this subset, while the conversation length is not, because the former is available at the beginning of the conversation and the latter is not. The same polarity change classes presented in Table 3 are used for this part as well, and different machine learning techniques were again applied. Features discussed in Sect. 3.4 that can only be calculated after the conversation ends, such as the conversation length and the duration of the conversation (Time Span), are excluded from the prediction experiment.

Experiment
Three main experiments were performed, each with a different objective. The first experiment uses all of the feature groups suggested in Sect. 3.4 with different machine learning algorithms to compare their performance and find the most suitable one for our problem. The second experiment evaluates the contribution of each feature group to the classification of the conversation's polarity change. The final experiment predicts the polarity change of the conversation using a model trained and tested only on the features available at the start of the conversation. Within Experiments 2 and 3, multiple runs were performed to study the effect of each category of features on the accuracy of the proposed framework: a single feature group was used for the first run, and each subsequent run added another feature group until all of the feature groups were included. The feature groups are the conversation content features (CCF), conversation activities features (CAF), conversation interaction features (CIF), and source user features (SUF). The dataset is described in Sect. 5.1, the train and test sets in Sect. 5.2, Experiment 1 in Sect. 5.3, Experiment 2 in Sect. 5.4, and Experiment 3 in Sect. 5.5.
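The cumulative run structure of Experiments 2 and 3 can be sketched as follows; the column names inside each group are illustrative stand-ins, not the paper's exact attribute names:

```python
# Sketch of the cumulative feature-group runs: each run adds one more
# group to the previous set. Column names are illustrative assumptions.
GROUPS = {
    "CCF": ["source_sentiment", "has_hashtags", "has_media"],
    "CAF": ["time_till_first_reply", "number_of_first_tweets",
            "user_tweet_length", "amazon_tweet_length"],
    "CIF": ["favorites", "retweets"],
    "SUF": ["followers_count", "friends_count", "verified"],
}

def cumulative_runs(groups):
    """Yield (last_group_added, accumulated_columns) for each run."""
    runs, columns = [], []
    for name, cols in groups.items():
        columns = columns + cols
        runs.append((name, list(columns)))
    return runs

for name, cols in cumulative_runs(GROUPS):
    print(f"run up to {name}: {len(cols)} features")
```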

Dataset
As mentioned in Sect. 3.1, the conversations were extracted through the Twitter API by searching for English tweets that mention the account @AmazonHelp and then retrieving the rest of the conversation tweets. While extracting conversations, the length of each conversation was calculated as its number of tweets. The conversations had varying lengths, including 2-tweet conversations in which a customer's inquiry was answered by AmazonHelp without any further discussion. Since such short conversations do not provide any insight into the polarity change of the customer throughout the conversation, they were excluded from the dataset, and only conversations of length four or more were included. After this filtration, a total of 6538 conversations were extracted over a period of one month, from December 12th, 2020 until January 12th, 2021.

Train and test sets
The second step in the experiment setup is to prepare the train and test sets. The extracted set of conversations suffered from class imbalance: nearly half (50.94%) of the conversations did not show any polarity change. Part of this issue came from conversations that did not carry any complaint, where the customer simply asked for guidance. To make the dataset more balanced, we excluded those cases, represented by conversations with a neutral start and no change of polarity. A neutral start is indicated by a sentiment label of 0 on the first tweet of the conversation, and no change in polarity is indicated by a CPC value of zero. This second filtration produced a more balanced dataset ready for classification: 4498 conversations with a total of 37,414 tweets. Of the three output classes, 1161 conversations reported a negative change, 1824 a positive change, and 1513 no change in polarity. The conversations and their features were divided into train and test sets with a 70%-30% split, giving 3150 training instances and 1348 test instances. The class balance was preserved during the split so that both sets maintain the same proportion of classes.
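A class-preserving 70/30 split of this kind can be reproduced with a stratified split; a sketch using the paper's reported class counts (the exact 3150/1348 figures depend on rounding, so the split below is only approximately equal to them):

```python
# Stratified 70/30 split preserving the class proportions reported in
# the paper (1161 negative, 1824 positive, 1513 no change).
from sklearn.model_selection import train_test_split

labels = ["negative"] * 1161 + ["positive"] * 1824 + ["no change"] * 1513
conversations = list(range(len(labels)))  # stand-ins for the 4498 conversations

train, test, y_train, y_test = train_test_split(
    conversations, labels, test_size=0.30, stratify=labels, random_state=0)

# Roughly the paper's 3150 train / 1348 test instances.
print(len(train), len(test))
```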

Experiment 1: Comparison between different classification algorithms
To validate the effectiveness of the proposed framework, the following experiment compares the accuracy of the different classification algorithms. Both classical and ensemble classifiers were used, and the accuracy and F-measure of each are shown in Table 5. The Weka tool was used, which provides a wide variety of machine learning algorithms for data mining tasks. All of the features were arranged into a feature vector and used for classification in this experiment. The classifiers used and their Weka equivalents are: 1) classical classifiers: C4.5 Decision Trees, K-nearest neighbor, Naive Bayes, Artificial Neural Network (ANN), Bayes Net, Support Vector Machine (SVM), and Logistic Regression; 2) ensemble classifier: Bagging. Some hyperparameters were tuned after many trials to obtain each algorithm's best results; all of the adjusted values are shown in Table 5, while the remaining parameters were kept at their defaults. The best results were achieved by the classical Decision Trees classifier and the ensemble Bagging classifier.
The performance of the ML algorithms is also shown in Fig. 2, sorted by accuracy. The two best-performing algorithms, Decision Trees and Bagging, were used in Experiment 2 to evaluate the different feature groups.

Experiment 2: Effect of subsets of features on model accuracy
The main objective of this experiment is to measure the effect of each feature group on the accuracy of the proposed framework. As the previous experiment showed, the Decision Tree classifier is one of the best classifiers for our problem; therefore, a Decision Tree was applied to each subset of features to assess the model's performance. As mentioned before, the dataset was split into train and test sets with a 70%-30% split, but different train/test pairs were generated for the different runs of the experiment, each with a different subset of features. The four subsets are: CCF; CCF + CAF; CCF + CAF + CIF; and CCF + CAF + CIF + SUF. Before running this experiment, Weka's "InfoGainAttributeEval" method, which measures the information gain of each attribute with respect to the class, was applied to rank the features. The highest-ranked feature was the sentiment of the source tweet; accordingly, the feature group that contains it, the Conversation Content Features (CCF), was included in all of the feature subsets in this experiment. As shown in Fig. 3, the two most significant feature groups are the CCF and CAF, followed by the CIF and the SUF with minor improvements. Furthermore, the individual effect of each feature subset on the different polarity change classes is evaluated in Fig. 4.
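Information-gain ranking of this kind can be approximated outside Weka with mutual information; a sketch on synthetic data, where a deliberately class-correlated column should outrank a pure-noise column:

```python
# Scikit-learn analogue of Weka's InfoGainAttributeEval: rank features
# by mutual information with the class. Data here is synthetic.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)              # three CPC classes
informative = y + rng.normal(0, 0.3, 500)     # strongly tracks the class
noise = rng.normal(size=500)                  # carries no class information
X = np.column_stack([noise, informative])

scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]            # best feature first
print(ranking)  # → the informative column (index 1) ranks first
```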

Experiment 3: Prediction of the polarity change of the conversation
The focus of this experiment is to predict the polarity change of the customer based on the early indicating features of the conversation. The same subsets of feature groups applied in Experiment 2 are applied here: Experiment 3.1 uses the conversation content features (CCF); Experiment 3.2 adds the conversation activities features (CAF); Experiment 3.3 adds the conversation interaction features (CIF); and Experiment 3.4 adds the last feature group, the source user features (SUF). For the purpose of prediction, both the train and test pairs contain only the subset of features that are available at the beginning of the conversation; the rest are excluded. For example, the time span of the conversation can only be calculated at its end, so it was removed from the feature set in all of the runs, allowing actual prediction at the early stages of a conversation. A sample conversation from the dataset is shown in Fig. 5, where only the first two tweets between a customer and AmazonHelp are shown; only this part of the conversation and its available features are used for the prediction task, and the same applies to the rest of the conversations in the dataset. The detailed features of the train and test sets are listed for each experiment below. The classifiers used are Decision Trees and Bagging, since they were the best-performing algorithms in Experiment 1, and the evaluation metrics are accuracy and F-measure.

Experiment 3.1: Conversation content features (CCF)
Part A. Train and test sets
The train and test sets include all of the conversation content features, since they are available from the very start of the conversation, as shown in Table 6.

Experiment 3.2: Conversation content features + conversation activities features (CCF + CAF)
Part A. Train and test sets
The train and test sets include the CCF + CAF feature groups.
The added CAF features are:
- Time till first reply from the retailer's service representative
- Number of first tweets by the customer
- Tweet length of the customer's tweet
- Tweet length of Amazon's reply tweet
The additional features used for prediction in this experiment are shown in Fig. 5. The feature groups CCF and CAF are used, but the features Time Span and Conversation Length are excluded from the CAF group because they are not available at the start of the conversation. Time Till First Reply measures the duration, in minutes, between the customer's first tweet and Amazon's reply, and both the customer's and Amazon's tweet lengths are counted in characters.

Part B. Prediction
The highest accuracy and the highest F-measure were both 0.68, achieved by the Decision Trees classifier, as shown in Table 6. The F-measure is notably improved by the conversation activities features, which indicates that this group of features strongly affects the customer's polarity, as was also shown in the analysis phase.

Experiment 3.3: Conversation content features + conversation activities features + conversation interaction features (CCF + CAF + CIF)
Part A. Train and test sets
The train and test sets include the CCF + CAF + CIF feature groups. The added interaction features (CIF) are:
- Number of favorites of the source user's tweets
- Number of retweets of the source user's tweets
The feature values for the sample conversation are shown in Fig. 5. The features number of external replies and number of external users are excluded, since they are not available at the early stage of the conversation. The test set at this stage therefore consists of 9 features: 3 from the CCF group, 4 from the CAF group, and 2 from the CIF group.

Part B. Prediction
The Decision Trees classifier achieved an accuracy of 0.67 and an F-measure of 0.68, as shown in Table 6. In this run, the F-measure did not improve and the accuracy is lower than in Experiment 3.2, even though the conversation interaction features improved both accuracy and F-measure in Experiment 2. This is probably due to the removal of the features number of external replies and number of external users, which suggests that those were the significant features among the conversation interaction features.

Experiment 3.4: Conversation content features + conversation activities features + conversation interaction features + source user features (CCF + CAF + CIF + SUF)
Part A. Train and test sets
The train and test sets include the CCF + CAF + CIF + SUF feature groups. The added source user features (SUF) are:
- User's followers count
- User's friends count
- User verified
The 12 features included at this stage are all shown in Fig. 5 with their values for the sample conversation.

Part B. Prediction
Both classifiers achieved 0.66 for both accuracy and F-measure, as shown in Table 6. The performance did not improve and is worse than in Experiment 3.3, which shows that the source user features are not effective in predicting the change of polarity in a conversation with customer support.
The prediction experiments showed that the feature groups most effective for predicting a customer's polarity change are those related to the content of the conversation and to the behavior of both the user and the retail service provider. This gives some insight into the factors that can affect the customer's satisfaction after reaching out to customer support.

Conclusion
This research investigates the impact of social conversations on the change of polarity of customers' replies to retail service providers. It has been shown that a reduction of negative sentiments can be obtained after appropriate interactions with retail service providers. The proposed framework applied sentiment analysis to label the conversations extracted from the AmazonHelp public account on Twitter, and different machine learning techniques were then trained on the labeled data to classify the change of polarity of the customers' conversations. Furthermore, prediction of the change of the conversation polarity based on its start was also presented. Analysis of the