1 Introduction

Traveler satisfaction, personalized service and a better hotel experience are among the top priorities for hotel managers, especially since good hotel service quality enhances travelers’ loyalty and influences their repurchase intentions [1, 2]. Most international travelers place a high value on the quality of their accommodations [3]. The choice of hotel also serves as an example of a difficult decision-making process [4]. Thus, strategic planning and decision-making are critical components of business success, and having a thorough understanding of travelers' preferences can assist hotel managers in these endeavors [5]. According to previous research, personal attitudes toward service are affected by customer satisfaction. In turn, it has been revealed that there is a link between attitudes and travel choices, with factors such as hotels’ performance criteria playing an important role [6].

In the age of social media, consumers of products and services prefer to read feedback from other consumers before making a purchasing decision [7]. Customer feedback is vital for business growth, customer experience improvement, and new product and service development. In the tourism and hospitality context, it is critical for hotel managers to satisfy travelers and translate this satisfaction into behavioral commitment to remain competitive [8]. Customer satisfaction levels and behavioral intentions can be assessed and responded to using a variety of methods. Managers commonly use customer feedback to identify future managerial objectives and to monitor a company's performance through measures such as average customer satisfaction scores. Online customer reviews and ratings are the most accessible and prevalent form of electronic word of mouth. The accessibility of travel websites facilitates customer feedback, and travelers are asked to provide both quantitative and qualitative reviews of hotels, destinations, and other travel experiences. According to previous research, over 30% of internet users rate products online [9], and 82% of US adults read online reviews and ratings before buying a new product or service [10].

Due to the exponential growth in the volume of data transmitted on social networking sites and the recognition of data as an economic asset, efforts are being made to actively utilize big social data in a variety of businesses (e.g., tourism and hospitality) and contexts (e.g., customer satisfaction and preference learning). Although many studies have been conducted on customer satisfaction and preference learning, only a few researchers have investigated satisfaction from the viewpoint of travelers through big social data [7, 11, 12]. Unlike traditional questionnaires, user-generated content is considered free of sample bias, vast, and objective. Travelers can share knowledge, opinions, and information about services, products, and brands online. The online travel content created or posted by users is increasingly read and shared by travelers. These data are also easier for businesses to monitor and manage because they are easily accessible and low cost. Thus, more tourism researchers in recent years have relied on user-generated content to reveal customers’ perceptions of service quality and their level of satisfaction [13]. These studies can be divided into two types based on user-generated content: studies based on textual reviews and studies based on numerical data. Text mining is mainly used to extract useful information (e.g., satisfaction dimensions) from textual reviews [9, 14,15,16]. In contrast, studies based on numerical data mainly rely on clustering and prediction learning approaches [17].

According to previous studies, travelers' preferences and satisfaction can be measured using user-generated content [18,19,20]. Additionally, massive user-generated content can be easily collected online, covering various travelers, regions, and hotel types. Thus, user-generated content is a promising data source for exploring the impacts of performance criteria (e.g., location and rooms) on travelers’ satisfaction. In regard to explaining customer behavior and satisfaction through big social data, machine learning techniques have proven to be effective [21]. Similar to previous studies, this research aims to employ machine learning techniques to analyze online customer reviews of hotel services in Mecca, Saudi Arabia. Accordingly, a new method using clustering and supervised learning with the aid of the text mining approach is developed. The method is able to analyze both qualitative (textual reviews) and quantitative (numerical ratings) data. The contributions of this work are as follows:

  i.

    This study investigates an important issue in the hospitality context: market segmentation through the use of online customer reviews. Market segmentation has been an effective and powerful marketing approach for decades in many business contexts [22,23,24], especially in tourism and hospitality. The concept of market segmentation has been widely accepted by the tourism industry [25,26,27]. Big social data analysis has been an important topic for customer satisfaction analysis and segmentation [28, 29]. Many studies on customer satisfaction through qualitative and quantitative survey approaches have been conducted. However, revealing customer satisfaction through the use of big social data in the form of eWOM (electronic word of mouth) can be an effective way to better understand customers’ demands [30]; through eWOM analysis, customers’ preferences can be better detected. Therefore, this study is the first in the tourism and hospitality context to examine customer satisfaction through an online review analysis of hotels in Mecca, Saudi Arabia. The results demonstrate that the proposed method can provide valuable results from online reviews for customer satisfaction.

  ii.

    Although there have been many studies regarding customer segmentation in the hospitality context, in the case of Mecca hotels on TripAdvisor, there have been no studies on customer satisfaction and preference learning. Previous studies rely on qualitative or quantitative analysis through the use of survey-based questionnaires. In addition, eWOM analysis has rarely been investigated by previous research for customer satisfaction with Mecca hotels on TripAdvisor. Accordingly, a new method for big social data analysis is developed using supervised learning, text mining, and segmentation machine learning approaches to analyze big social data on travelers’ decision-making regarding hotels in Mecca, Saudi Arabia. Specifically, we use support vector regression (SVR), latent Dirichlet allocation (LDA), and k-means approaches to develop the hybrid method. To better predict customer satisfaction, SVR is optimized by sequential minimal optimization (SMO). The data are collected from travelers’ online reviews of Mecca hotels on TripAdvisor. The data are segmented, and the travelers’ satisfaction is revealed for each segment based on their online reviews of hotels.

  iii.

    Evaluation of the market segmentation method is an important task. The method proposed in this study is evaluated through customers’ reviews of Mecca hotels on TripAdvisor. We perform several experiments on and evaluations of real-world data and compare the results with those of other prediction techniques. This comparison is performed to show the effectiveness of the clustering approach for customer segmentation. Specifically, we compare the results of the proposed method with those of multilayer perceptron artificial neural networks (MLP-ANNs), multiple linear regression (MLR), and decision trees. In this comparison, the MLP is optimized with the whale optimization algorithm (WOA).

The remainder of this work is structured as follows. In Sect. 2, we present previous works on online review analysis, customer segmentation and tourism in Saudi Arabia. In Sect. 3, the proposed hybrid method of analysis is presented. In Sect. 4, data collection and analysis are presented. In Sect. 5, discussions of the results and research implications are presented. Finally, this paper is concluded in Sect. 6. The abbreviations used in this research are provided in Table 1.

Table 1 Abbreviations used in this research

2 Literature review

2.1 Related work on online review analysis

There have been many studies on the analysis of online reviews in different contexts. In this section, we provide a discussion of the results of previous research (see Table 2).

Table 2 Previous studies on online review analysis

Chang et al. [3] focused on social media analytics and provided a new framework for extracting and visualizing Hilton hotel reviews and ratings. They collected data from TripAdvisor. They used the support vector machine (SVM) technique for sentiment classification. Ahani et al. [17] focused on market segmentation through the use of spa hotel reviews and ratings. They developed a new method using higher order value decomposition, a self-organizing map (SOM), and decision trees. They used spa hotel reviews and ratings on TripAdvisor for data collection. They revealed that there is a significant relationship between hotels' performance criteria and travelers’ satisfaction. Khorsand et al. [31] focused on the analysis of 8 machine learning models (naïve Bayes, a decision tree, logistic regression, K-nearest neighbors (KNN), a support vector machine, a random forest (RF), a NN, and gradient boosting) to predict rates of new users. They collected data on Tehran hotels from TripAdvisor. Nilashi et al. [32] focused on travelers' decision-making using preference learning for spa hotels in Malaysia. They used TripAdvisor for data collection. They developed a new method using clustering, dimensionality reduction, and decision trees. They performed the traveler segmentation using the k-means technique. Nilashi et al. [33] focused on travelers’ decision-making using travelers’ reviews on SNSs (social network sites). They collected data on hotels from TripAdvisor. They used decision trees and fuzzy logic approaches for method development. Arenas-Márquez et al. [34] focused on online review analysis to identify travel group-type topics. They collected data from TripAdvisor. Their method was developed through convolutional neural encoding. Fernandes et al. [35] focused on measuring restaurant performance by combining historical sales data with online reviews. They developed a new data-driven approach using variable correlation analysis as well as text mining and sentiment analysis.
Taecharungroj et al. [36] focused on traveler review analysis for tourist attractions in Phuket, Thailand. They collected data from TripAdvisor. They developed a new method using LDA and naïve Bayes modeling. Gebbels et al. [37] focused on evaluating customer restaurant reviews of Clink restaurants. They collected data from TripAdvisor and used a qualitative case-study methodology for data analysis. Bigne et al. [38] focused on evaluating the helpfulness of online reviews. They collected data from TripAdvisor. They adopted deep learning for sentiment analysis. Song et al. [39] used online reviews to investigate the sense of place of the Las Vegas Strip. They collected data from TripAdvisor. They developed a machine learning approach using LDA for topic modeling. Borges-Tiago et al. [40] focused on the differences between Booking.com and TripAdvisor in branding co-creation. The authors collected data from both Booking.com and TripAdvisor. They used qualitative and quantitative approaches for data analysis. Ahani et al. [7] conducted a study on customer satisfaction through the use of machine learning approaches, clustering and multi-criteria decision-making (MCDM). They collected data on Canary Island hotels from TripAdvisor. Zhang et al. [41] developed an online review-driven method to prioritize improvements in hotel services. They performed topic modeling using LDA on TripAdvisor data. Korfiatis et al. [42] focused on measuring service quality through the use of unstructured data. The authors developed a topic modeling application by using STMs on online reviews by airline passengers on TripAdvisor.

2.2 Methods of customer segmentation

To achieve business objectives, a company can use customer segmentation to target marketing efforts and resources toward valuable and loyal customers [43]. Customer segmentation is the process of grouping all customers in a given context into diverse and distinct groups based on shared characteristics such as patterns, interests, location, or demographics. Customer segments can be created using psychological, behavioral, demographic, and geographic data. There have been many studies on customer segmentation. To recognize profitable customers for more targeted communications, a previous study [44] examined several demographic variables (e.g., income, experience, education, gender, age) for customer segmentation. Rahim et al. [45] applied data modeling techniques and the RFM (recency, frequency, and monetary) model to detect customers’ behavioral patterns. Their scheme classified customers through repurchase behavior recognition. Huseynov et al. [46] used several behavioral characteristics for customer segmentation. They used price payment, coupon redemption, product diversity, the shopping rate, and the refund rate to find effective segments from the available data. Using these behavioral characteristics, they found five main segments that businesses can use to develop marketing strategies. Wang et al. [47] focused on psychological factors (e.g., attitudes, subjective norms, and personal norms) to identify customer segments through purchase intention. Yadegaridehkordi et al. [48] developed a new method of traveler segmentation through online customer reviews. They used MCDM and a clustering approach to segment travelers for eco-friendly hotels. Ahani et al. [17] focused on traveler segmentation for spa hotels. They performed traveler segmentation using machine learning approaches. Alkhayrat et al. [49] focused on telecom customer segmentation. The authors used deep learning and noise removal techniques to develop the segmentation approach. They used k-means for data clustering and principal component analysis (PCA) to reduce high-dimensional data. Sivaguru et al. [50] focused on dynamic customer segmentation through the use of modified dynamic fuzzy c-means clustering. Sun et al. [51] developed a heuristic clustering method for customer segmentation. They used Gaussian peak heuristic-based clustering (GPHC) and a standardized Gaussian distribution to perform numerical experiments. Wu et al. [52] focused on customer segmentation based on purchase behaviors and developed a method using an RFM model and the k-means algorithm. Wu et al. [53] developed a customer segmentation model using dynamic interval type-2 fuzzy logic and investigated its application in e-commerce. Akar et al. [54] focused on customer purchase intentions and segmentation during the COVID-19 outbreak. Vohra et al. [55] used k-means and SOM clustering based on the RFM model for customer segmentation.

2.3 Tourism in Saudi Arabia

Tourism is the most important part of any economy's service sector, accounting for the majority of revenue. Despite Saudi Arabia’s oil wealth, international demand for petrochemical products is declining year by year, affecting the Saudi economy [56]. As a result, the economy must diversify away from the oil sector to remain competitive. Saudi Arabia's future growth depends heavily on the development of its tourism industry [56]. It is one of the key pillars of Vision 2030's plan to diversify the economy and reduce reliance on oil. Saudi Arabia's tourism revenue reached 5 billion USD in December 2020, compared with 28 billion USD in 2019 (see Fig. 1). Saudi Arabia is well known throughout the world for the holy cities of Mecca and Madinah. Hajjis and visitors from all over the world travel to Mecca and Madinah every year to perform the Hajj and Umrah. The Hajj and Umrah pilgrimages are extremely important to Saudi Arabia's inbound tourism industry. The tourism revenue of the Kingdom of Saudi Arabia is increasing in absolute terms, but its contribution to the country's GDP is decreasing proportionally [56].

Fig. 1 Saudi Arabia's tourism revenue from 2004 to 2020

Saudi Arabia's economic development has benefited significantly from religious tourism [57]. During this time, worshipers are charged a fee. As foreigners, they will have to pay for food and housing, and accordingly, they can contribute to the country's foreign reserves and its income, which improves Saudis' quality of life. The government's Vision 2030 plan includes achieving sustainable economic growth [57]. According to studies, the number of people visiting Mecca has increased, and as a result, the city is playing a key role in Saudi Arabia's implementation of its Vision 2030 plan. Housing construction and other related infrastructure are also considered part of economic development. There has been a considerable amount of pressure to construct residential buildings that can house people during worship hours. There has been an increase in the growth and development of the hospitality industry as a result of the increase in the number of people traveling to Mecca and Madinah. As a result of international operations looking for opportunities and increasing the number of pilgrims in line with the Vision 2030, religious tourism is on the rise. According to industry experts, a significant increase in religious tourism will create enormous opportunities in Mecca's hotel and accommodation sectors. Furthermore, the transportation sector has grown to meet the needs of Hajj pilgrims and their belongings. Millions of Muslims from around the world travel to Saudi Arabia and Mecca each year to participate in the Holy Hajj, which takes place during the month of Dhu al-Hijjah. As a result, religious tourism remains a significant source of revenue for Saudi Arabia.

In 2017, approximately 19 million international tourists visited Saudi Arabia. According to the latest available statistics from the United Nations World Tourism Organization (UNWTO), the number of hotels and similar establishments reached approximately 8.3 thousand units in 2019, an increase from 5.5 thousand units in 2008. According to data published by the UNWTO, the total number of rooms in establishments in 2019 was approximately 606.5 thousand rooms, up from 201.6 thousand rooms in 2008. Statistics also show that Mecca's contribution to Saudi Arabia's gross domestic product (GDP) from 2006 to 2016 was significant. According to the Saudi Arabian government, Mecca's direct contribution to Saudi Arabia's GDP in 2026 will be approximately 6.6 billion US dollars.

3 Method development

The aim of this study is to develop a method to segment travelers in hotels in Mecca, Saudi Arabia. Specifically, we aim to provide a hybrid method to reveal travelers’ satisfaction through a set of hotel performance criteria. The method is developed using unsupervised clustering and supervised learning approaches. The proposed method is presented in Fig. 2. The steps of the proposed method are presented as follows:

  • Step 1: This research relied on online customer reviews. Accordingly, it was important to obtain data from a trusted website. The data were collected from TripAdvisor. This platform is widely used for data collection [40, 58, 59]. TripAdvisor provides complete forms for travelers to provide their experience with regard to the quality of services in restaurants and hotels [60]. In Step 1, data preprocessing was performed on the collected data. In this step, we tried to perform a data cleaning process on the textual and numerical ratings. In this stage, records with no criteria ratings were removed from the dataset. In addition, short reviews were not considered in the next stage of data analysis, and accordingly, we removed them from the dataset.

  • Step 2: In this step, we performed data segmentation through the use of unsupervised machine learning. It has been shown that the clustering of data has a significant impact on the accuracy of supervised learning. This research relied on the k-means clustering technique [61,62,63]. The main advantage of k-means is its simplicity [64]. The computational simplicity of the k-means method makes it suitable for processing large datasets [65]. K-means always finds a local optimum for any given initial centroid locations [66]. Segmentation was performed for different k values, and the quality of clusters was evaluated through the use of the silhouette coefficient (SC) or silhouette score [67, 68].

  • Step 3: To discover the main satisfaction factors from the online reviews, in this step, text mining was performed on the collected data. This research relied on the LDA approach [69,70,71] to analyze textual data and perform topic modeling. LDA has been an effective approach for qualitative data analysis [72]. LDA is insensitive to the noise in the data and has powerful computing capabilities for big data [70]. In contrast to the Bayes classifier, which requires the classes to be predefined, LDA generates possible patterns from the data itself and is a data-driven technique. We performed text mining on each segment to reveal the dimensions of traveler satisfaction.

  • Step 4: This stage was performed after data segmentation. In this step, we performed supervised learning to predict travelers’ satisfaction and identify their preferences for each segment. In this research, we relied on the SVR approach [73, 74]. We performed SVR to predict customer satisfaction from travelers’ ratings of the hotels through a set of performance quality factors. In addition, we optimized SVR using the SMO solver algorithm [75, 76]. In this stage, the relationship between the performance quality factors and traveler satisfaction was examined.

  • Step 5: In this step, the prediction models generated by the SVR approach were evaluated. This research relied on several evaluation metrics, such as a sampling approach, the root mean square error (RMSE) [77,78,79,80,81], the mean absolute error (MAE) [77], and the coefficient of determination [82,83,84]. In addition, all models were evaluated by employing the tenfold cross-validation approach [85, 86].

Fig. 2 Research method
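The tenfold cross-validation mentioned in Step 5 can be illustrated with a minimal sketch. The function name `k_fold_indices`, the shuffling seed, and the interleaved fold assignment are assumptions made for illustration; the source does not state how its folds were formed. The sample count of 1724 matches the number of records reported in the data collection section.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train, test) index pairs covering range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Each record is used for testing exactly once across the ten folds.
splits = k_fold_indices(1724, k=10)
```

In each round, a model would be fitted on the `train` indices and scored on the `test` indices, with the error metrics averaged over the ten rounds.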

3.1 K-means clustering technique

The k-means algorithm [87], which minimizes the sum of squared errors (SSE), is the most widely used clustering algorithm [88,89,90]. K-means is an iterative algorithm [91, 92] that can be described in the following steps, as shown in Fig. 3. The silhouette index (SI) is a reliable indicator for measuring clustering performance because it quantifies how similar an instance of a dataset is to instances within its own cluster versus instances in other clusters. Each instance \(j\) has a silhouette index \({SI}_{j}\). The silhouette index is calculated in a range of [− 1, + 1]. \({SI}_{j}\) values close to − 1 denote that an instance was probably assigned to the wrong cluster. \({SI}_{j}\) values close to 0 denote that the instance does not clearly belong to one cluster or another. A value closer to + 1 denotes that an instance was correctly assigned to its current cluster. \({SI}_{j}\) is estimated in the following manner:

$${SI}_{j}=\frac{b\left(j\right)-a(j)}{\mathrm{max}\{b\left(j\right),a(j)\}}$$
(1)

where \(b\left(j\right)\) indicates the average distance from the \(j\)-th instance to all others assigned to the nearest neighboring segment and \(a\left(j\right)\) represents the average distance from the \(j\)-th instance to all others in its segment. When calculating the distance between objects, Euclidean or Manhattan distances are generally utilized, and a silhouette graph can be constructed to demonstrate the performance of a clustering technique. Averaging \(SI\) s over all \(n\) clustered observations can be performed to determine the global quality of a clustering technique. SIs are technique independent and can be used to find the optimal \(k\), i.e., the number of clusters. The procedure for calculating the silhouette coefficient can be found at the following URL: https://www.mathworks.com/help/stats/silhouette.html.

Fig. 3 K-means procedure
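The clustering step and the silhouette index of Eq. (1) can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the random initialization, the Euclidean distance, and the toy rating vectors are all assumptions. Instances in singleton clusters are given a silhouette of 0, which is a common convention.

```python
import math
import random


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids


def mean_silhouette(points, labels):
    """Average of SI_j = (b(j) - a(j)) / max{b(j), a(j)} over all instances."""
    scores = []
    for j, p in enumerate(points):
        dists = {}
        for i, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[i], []).append(euclidean(p, q))
        own = dists.pop(labels[j], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                            # a(j): own segment
        b = min(sum(d) / len(d) for d in dists.values())   # b(j): nearest other segment
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy two-dimensional rating vectors forming two well-separated groups.
ratings = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centroids = k_means(ratings, k=2)
score = mean_silhouette(ratings, labels)
```

Repeating the last two lines for several values of `k` and keeping the `k` with the highest average silhouette mirrors the selection procedure described above.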

3.2 Topic Modeling Using LDA

In this research, textual reviews are analyzed to examine the relationships between terms and to extract topics from their structures by using text mining with LDA topic modeling [70, 93,94,95,96]. LDA, which is unsupervised and probabilistic in nature, assumes that all documents in a corpus are formed by a specific, manually definable number of topics. In LDA, each document is equally important and is represented as a “bag of words”; that is, the words in each document are assumed to be unordered. Each topic is described by a probability mass function over words, and each document selects topics according to its own probability mass function over topics. The graphical representation of LDA is presented in Fig. 4. As illustrated, the schema contains three distinct levels that correspond to the corpus, the documents, and the terms.

Fig. 4 The graphical representation of LDA

In latent Dirichlet allocation, each document is seen as a distribution over the latent topics, which is described by a Dirichlet distribution Dir(\({\alpha }_{1},{\alpha }_{2},\dots ,{\alpha }_{k}\)). A topic is selected based on the document-topic distribution, i.e., Multinomial(\(\theta \)), where \(\theta \) indicates the probability that a given document \(d\) belongs to a certain topic \(k\). A Dirichlet distribution based on \(\beta \) encodes that a topic has a limited number of words. Given \(\alpha \) and \(\beta \), the joint distribution of the topic mixture \(\theta \), the set of \({N}_{d}\) topics \(z\), and the set of \({N}_{d}\) terms \(w\) is given by the following [70]:

$$p\left(\theta ,z,w|\alpha ,\beta \right)=p\left(\theta |\alpha \right)\prod_{n=1}^{{N}_{d}}p({z}_{n}\left|\theta \right)p\left({w}_{n}|{z}_{n},\beta \right)$$
(2)

By integrating over \(\theta \), summing over \(z\), and then taking the product of the marginal probabilities of the \(M\) individual documents, where document \(d\) contains \({N}_{d}\) terms, the probability of the entire corpus \(D\) can be computed as follows:

$$p\left(D|\alpha ,\beta \right)=\prod_{d=1}^{M}\int p\left({\theta }_{d}|\alpha \right)\left(\prod_{n=1}^{{N}_{d}}\sum_{{z}_{dn}}p\left({z}_{dn}|{\theta }_{d}\right)p\left({w}_{dn}|{z}_{dn},\beta \right)\right)d{\theta }_{d}$$
(3)
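To make Eq. (2) concrete, the following sketch evaluates the logarithm of the joint probability for a single document, given a topic mixture, topic assignments, and terms. The function names and the toy values are assumptions for illustration; `beta` is treated here as a topic-by-term probability table, and topics and terms are encoded as integer indices.

```python
import math


def log_dirichlet_pdf(theta, alpha):
    """Log density of Dir(alpha) evaluated at the topic mixture theta."""
    norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def log_joint(theta, z, w, alpha, beta):
    """log p(theta, z, w | alpha, beta) for one document, following Eq. (2)."""
    total = log_dirichlet_pdf(theta, alpha)          # p(theta | alpha)
    for z_n, w_n in zip(z, w):
        total += math.log(theta[z_n])                # p(z_n | theta)
        total += math.log(beta[z_n][w_n])            # p(w_n | z_n, beta)
    return total


# Toy example: two topics, a two-term vocabulary, a two-word document.
alpha = [1.0, 1.0]
theta = [0.5, 0.5]
beta = [[0.5, 0.5], [0.25, 0.75]]   # beta[k][v] = p(term v | topic k)
z = [0, 1]                          # topic assigned to each word
w = [0, 1]                          # term index of each word
value = log_joint(theta, z, w, alpha, beta)
```

With `alpha = [1, 1]` the Dirichlet density equals 1 everywhere on the simplex, so the joint probability reduces to the product 0.5 × 0.5 × 0.5 × 0.75 = 0.09375.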

3.3 Satisfaction prediction using the SVR model

Let \({\left\{\left({x}_{i}, {y}_{i}\right)\right\}}_{i=1}^{l}\) be a set of \(l\) training samples, where \({x}_{i}\) are the input sample values and \({y}_{i}\) are the corresponding model output values. The regression problem is to estimate the output \(y\) from the input values \(x\):

$$ y = f\left( x \right) + \nu $$
(4)

where \(x\) denotes the input variable, \(\upnu \) denotes noise independent of \(x\), and \(f\left(\cdot \right)\) denotes an unknown nonlinear function. By introducing an alternative loss function (e.g., \(\upvarepsilon \)‐insensitive function, Huber function, Laplace function, quadratic function), support vector machines can be applied to regression problems. It has been shown that the \(\upvarepsilon \)‐insensitive function can be effective in producing sparseness in the support vectors; accordingly, the \(\upvarepsilon \)‐insensitive function is used in this paper for support vector regression. The unknown function \(f\left(x\right)\) in support vector regression is assumed to be expressed as follows:

$$ f_{SVR} \left( x \right) = w^{T} \varphi \left( x \right) + b $$
(5)

where \(\varphi \left( \cdot \right)\) denotes a prespecified nonlinear mapping, \(b\) denotes a bias, and \(w\) denotes a coefficient vector. Accordingly, the bias and the coefficient vector values are identified so that the following functional is minimized:

$$ C\sum\limits_{{i = 1}}^{l} | y_{i} - f_{{{\text{SVR}}}} \left( {x_{i} } \right)|_{\varepsilon } + \left\| {\text{w}} \right\|^{{\text{2}}} $$
(6)

where \( | \cdot |_{\varepsilon }\) denotes an \({\upvarepsilon }\) ‐insensitive loss function, \(l\) is the number of training samples, and \(C\) is a positive constant. The loss function is equal to zero if the difference between the observed outputs and regression is less than \({\upvarepsilon }\). The \({\upvarepsilon }\) ‐insensitive function is calculated as follows:

$$|x{|}_{\varepsilon }=\left\{\begin{array}{ll}0 &\quad if\;|x|<\varepsilon \\ |x|-\varepsilon&\quad \mathrm{otherwise}\end{array}\right.$$
(7)
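The loss in Eq. (7) and the functional in Eq. (6) can be written out directly. This is a small sketch with hypothetical function names and toy numbers; it only evaluates the functional for given predictions and does not perform the minimization.

```python
def eps_insensitive(residual, eps):
    """|x|_eps from Eq. (7): zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(residual) - eps)


def svr_functional(y, predictions, w, C, eps):
    """C * sum_i |y_i - f(x_i)|_eps + ||w||^2, as written in Eq. (6)."""
    loss = sum(eps_insensitive(yi - fi, eps) for yi, fi in zip(y, predictions))
    return C * loss + sum(wi * wi for wi in w)


# Toy values: the first residual (0.05) falls inside the tube, the second (0.5) does not.
objective = svr_functional(y=[4.0, 5.0], predictions=[4.05, 4.5],
                           w=[1.0, 2.0], C=10.0, eps=0.1)
```

Only the second sample contributes to the loss (0.5 − 0.1 = 0.4), which illustrates why the ε-insensitive function tends to produce sparse support vectors.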

The following convex QP problem of the dual form of the above problem with \(2l\) variables is defined to find Lagrange multipliers \(\Psi ={\left[{\Psi }_{1},{\Psi }_{2},\dots ,{\Psi }_{l}\right]}^{T}\) and \({\Psi }^{\prime}={\left[{\Psi }_{1}^{\prime},{\Psi }_{2}^{\prime},\dots ,{\Psi }_{l}^{\prime}\right]}^{T}\), which minimize the following:

$$Q\left(\Psi ,{\Psi }^{\prime}\right)=-\sum_{i=1}^{l}{y}_{i}\left({\Psi }_{i}-{\Psi }_{i}^{\prime}\right)+\varepsilon \sum_{i=1}^{l}\left({\Psi }_{i}+{\Psi }_{i}^{\prime}\right)+\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\left({\Psi }_{i}-{\Psi }_{i}^{\prime}\right)\left({\Psi }_{j}-{\Psi }_{j}^{\prime}\right)K\left({x}_{i},{x}_{j}\right)$$
(8)

subject to the following constraints:

$$\sum_{i=1}^{l}\left({\Psi }_{i}-{\Psi }_{i}^{^{\prime}}\right)=0$$
$$0\le {\Psi }_{i}\le C, i=\mathrm{1,2}, \dots ,l$$
$$0\le {\Psi }_{i}^{^{\prime}}\le C, i=\mathrm{1,2}, \dots ,l$$
(9)

where \(l\) is the number of training samples, \(\upvarepsilon \) denotes a user‐specified positive constant, \(K\left(\cdot ,\cdot \right)\) is called a kernel function, and \(K\left({x}_{i},{x}_{j}\right)=\varphi {\left({x}_{i}\right)}^{T}\varphi \left({x}_{j}\right)\).

Let \(\left({\Psi }^{*}, {\Psi }^{\mathrm{^{\prime}}*}\right)\) be considered the optimal solution to the above problem. Accordingly, the \({\mathrm{f}}_{\mathrm{SVR}}\left(\mathrm{x}\right)\), i.e., the approximating function, derived by support vector regression can be computed as follows:

$${f}_{SVR}\left(x\right)=\sum_{i=1}^{l}\left({\Psi }_{i}^{*}-{\Psi }_{i}^{\prime *}\right)K\left({x}_{i},x\right)+{b}^{*}$$
(10)

where \(l\) is the number of training samples. The Lagrange multipliers satisfy the Karush–Kuhn–Tucker conditions:

$${\Psi }_{i}^{*}\cdot {\Psi }_{i}^{\prime *}=0,\quad i=1,2,\dots ,l$$
(11)

By considering \({\Upsilon }_{i}={\Psi }_{i}-{\Psi }_{i}^{\prime}\) for all \(i\) and \(\Upsilon ={\left[{\Upsilon }_{1},{\Upsilon }_{2},\dots ,{\Upsilon }_{l}\right]}^{T}\), the aim is to minimize the following problem:

$$W\left(\Upsilon \right)=-\sum_{i=1}^{l}{y}_{i}{\Upsilon }_{i}+\varepsilon \sum_{i=1}^{l}|{\Upsilon }_{i}|+\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}{\Upsilon }_{i}{\Upsilon }_{j}K\left({x}_{i},{x}_{j}\right)$$
(12)

subject to the following constraints:

$$\sum_{i=1}^{l}{\Upsilon }_{i}=0;\quad -C\le {\Upsilon }_{i}\le C,\quad i=1,2,\dots ,l$$
(13)

where \(l\) is the number of training samples. By considering \({\Upsilon }^{*}={\left[{\Upsilon }_{1}^{*},{\Upsilon }_{2}^{*},\dots ,{\Upsilon }_{l}^{*}\right]}^{T}\) to be an optimal solution, \({f}_{SVR}\left(x\right)\) is calculated as follows:

$${f}_{SVR}\left(x\right)=\sum_{i=1}^{l}{\Upsilon}_{i}^{*}K\left({x}_{i}, x\right)+{b}^{*}$$
(14)
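Once the coefficients \({\Upsilon }_{i}^{*}\) and the bias \({b}^{*}\) are available (for instance, from an SMO solver), Eq. (14) is a kernel-weighted sum over the training samples. The sketch below assumes a Gaussian (RBF) kernel with a hypothetical width parameter `gamma`; the source does not specify which kernel was used, and the coefficient values are toy numbers chosen to satisfy the constraints in Eq. (13).

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2) (assumed choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, train_x, upsilon, bias, kernel=rbf_kernel):
    """f_SVR(x) = sum_i upsilon_i * K(x_i, x) + b, following Eq. (14)."""
    # Samples with a zero coefficient are not support vectors and are skipped.
    return sum(u * kernel(xi, x)
               for xi, u in zip(train_x, upsilon) if u != 0.0) + bias


# Toy coefficients: they sum to zero and lie within [-C, C] for C >= 1.
train_x = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
upsilon = [1.0, -1.0, 0.0]
prediction = svr_predict((0.0, 0.0), train_x, upsilon, bias=3.0)
```

Here the prediction equals 1 − e⁻¹ + 3, since the first sample coincides with the query point and the second lies at squared distance 2 from it.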

3.4 Evaluating the accuracy of the proposed models

The MAE, RMSE, and \({R}^{2}\) were used to describe the predictive abilities of the proposed models. The following are the definitions of these evaluation metrics:

$$\mathbf{M}\mathbf{A}\mathbf{E}=\frac{1}{n}\sum_{i=1}^{n}|{actual}_{i}-{forecast}_{i}|$$
(15)
$$\mathbf{R}\mathbf{M}\mathbf{S}\mathbf{E}=\sqrt{\frac{\sum_{i=1}^{n}{({actual}_{i}-{forecast}_{i})}^{2}}{n}}$$
(16)
$${\mathbf{R}}^{2}=\frac{{\left[\sum_{i=1}^{n}\left({actual}_{i}-\overline{actual}\right)\left({forecast}_{i}-\overline{forecast}\right)\right]}^{2}}{\sum_{i=1}^{n}{\left({actual}_{i}-\overline{actual}\right)}^{2}\sum_{i=1}^{n}{\left({forecast}_{i}-\overline{forecast}\right)}^{2}}$$
(17)

where \(\mathrm{actual}\) denotes the real overall rating, \(\mathrm{forecast}\) denotes the predicted overall rating, \(\overline{\mathrm{actual} }\) denotes the mean of the observed values, \(\overline{\mathrm{forecast} }\) denotes the mean of the predicted values, and \(\mathrm{n}\) is the number of samples.
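
The three metrics can be sketched in a few lines of Python. The sketch uses the standard definitions (squared residuals under the root for the RMSE, and the squared Pearson correlation between observed and predicted values for R2); the sample ratings are hypothetical and serve only to show the calculation.

```python
import math

def mae(actual, forecast):
    """Mean absolute error (Eq. 15)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean square error (Eq. 16)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def r_squared(actual, forecast):
    """Squared correlation between observed and predicted values (Eq. 17)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_f = sum(forecast) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, forecast))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    return cov ** 2 / (var_a * var_f)

# Hypothetical overall ratings and predictions.
actual = [5, 4, 3, 5, 2]
forecast = [4.5, 4.0, 3.5, 5.0, 2.5]

print(mae(actual, forecast), rmse(actual, forecast), r_squared(actual, forecast))
```

For any set of predictions, the RMSE is at least as large as the MAE, which provides a quick consistency check on reported values.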

4 Method evaluation

In this section, we describe the data collection, data analysis and method evaluation, and then present and discuss the results.

4.1 Data collection

This research collected data from TripAdvisor. TripAdvisor is a trusted source of information for travelers and has been providing quality reviews since 1995 [97]. All reviews pass through a technology-based moderation process and are further screened by the content moderation team. TripAdvisor was therefore considered a trustworthy source for investigating customer satisfaction with hotels in Mecca, Saudi Arabia. Hotels were randomly selected from the 4- and 5-star Mecca hotels registered on TripAdvisor. The online reviews consisted of numerical ratings and textual reviews, together with other information on the hotels, such as hotel stars, the travel type, the date of the stay, and the date of the review. In Fig. 5, we provide an example of travelers’ reviews of Mecca hotels on TripAdvisor.

Travelers provide two types of reviews: textual reviews, and numerical ratings (criteria ratings and an overall rating) in the range of 1 to 5. When crawling the data from TripAdvisor, it was found that many travelers provided only an overall rating without any criteria ratings. These records were removed from the dataset because they could impair the accuracy of predicting the overall rating from a set of criteria ratings, and removing them better presents the impacts of performance quality factors on customer satisfaction. In total, 1724 records were collected from 30 hotels in Mecca, Saudi Arabia. These records of ratings and reviews were further analyzed by the proposed method. A sample of ratings by travelers is provided in Table 3. As shown in Table 3, location, cleanliness, service, and value were the main criteria for overall rating assessment. In addition, the major trip types of travelers in hotels in Mecca, Saudi Arabia, were traveled solo, traveled with friends, traveled with family, traveled on business and traveled as a couple. The trip types and the number of travelers are presented in Table 4. In Table 5, information on hotel ratings based on trip types and performance criteria is presented.

Fig. 5 Travelers’ reviews of Mecca hotels on TripAdvisor

Table 3 A sample of ratings by travelers
Table 4 Trip types and the number of travelers
Table 5 Information on hotel ratings based on trip types and performance criteria

4.2 Data segmentation and satisfaction analysis

The 1724 records of travelers’ ratings were segmented using the k-means algorithm. Different k values were considered, and the optimal number of clusters was selected by the silhouette coefficient approach. The results for the different k values are shown in Table 6. The k-means algorithm provides the best segments for k = 6, where the silhouette coefficient (approximately 0.8969) is higher than for any other number of clusters considered. Previous work has shown a direct relationship between the quality of clustering and the accuracy of prediction learning techniques [98]. Hence, 6 segments were selected for further evaluation by SVR.

Table 6 Silhouette coefficient value segments obtained by k-means
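
The selection of k by the silhouette coefficient can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in this study: the two-dimensional toy points are hypothetical, and a deterministic farthest-first initialization is substituted for the usual random initialization of k-means so that the example is reproducible.

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def farthest_first(points, k):
    """Deterministic initialization (an assumption made for reproducibility)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of rating pairs.
points = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2],
          [5.0, 5.0], [4.8, 5.0], [5.0, 4.8], [4.8, 4.8],
          [1.0, 5.0], [1.2, 5.0], [1.0, 4.8], [1.2, 4.8]]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The value of k with the highest mean silhouette is retained, which mirrors the way k = 6 was chosen for the travelers' ratings.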

The cluster centroids based on the performance criteria are presented in Table 7. High ratings were provided for location, cleanliness, and value in Segment 1. In Segment 2, the ratings were high for all criteria. In Segment 3, cleanliness, service, and value received high ratings from travelers. In Segment 4, location, service and value received high ratings. In Segment 5 and Segment 6, location and value received high ratings from travelers. Overall, the results of Table 7 show that the majority of travelers are in Segment 2, Segment 3, and Segment 4. In Table 8, we present information on the overall ratings and trip types for the six segments. In Segments 1–4, travelers' overall ratings were mainly in the range of 3 to 5; in Segment 5, however, the overall ratings were mostly in the range of 3 to 4. In Segment 6, travelers provided overall ratings mainly in the range of 1 to 3. Furthermore, the results in Table 8 show that the majority of travelers in all segments were in the traveled as a couple group. The segments and the criteria for each segment are visualized in Fig. 6.

Table 7 Cluster centroids and size
Table 8 Overall ratings and trip types for the six segments
Fig. 6 Segment visualization

We also performed textual data analysis on each segment using LDA. The online traveler reviews in the form of textual data were analyzed to discover the main dimensions of satisfaction for each segment. The results are shown in Fig. 7 for the top 15 keywords. The results of textual data analysis show that travelers are satisfied with the quality of services in hotels in Mecca, Saudi Arabia, for 5 segments. Segment 6, in which the satisfaction level is low, is the exception. The results show that travelers are mainly dissatisfied with the price of hotels, the slow service in hotel preparation, room size, wifi and internet access, and their hotel’s distance from shopping centers.

Fig. 7 Top 15 satisfaction dimensions for the six segments

4.3 Satisfaction prediction using SVR

After traveler segmentation through the use of online reviews, customer satisfaction was predicted for each segment from a set of performance criteria. To do so, we implemented SVR on each segment and predicted customer satisfaction based on the service quality on four performance criteria: location, cleanliness, service, and value. The SVR models were trained with the SMO solver algorithm [99]. In addition, we adopted two different kernel functions, RBF and linear functions [100, 101], to find the best SVR models based on the RMSE, MAE and R2 metrics. Tenfold cross-validation was used in training the SVR models. The parameter specification of the SVR models is shown in Table 9.

Table 9 SVR parameters for epsilon-SVR
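
The tenfold cross-validation protocol can be sketched as below. The sketch assumes shuffled folds of roughly equal size, which this study does not specify, and it uses a placeholder learner that predicts the training mean; `fit_mean` and `predict_mean` are hypothetical stand-ins for an SVR trainer and predictor, and the ratings are made up for illustration.

```python
import math
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the RMSE averaged over k held-out folds."""
    fold_rmse = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        sq_err = [(y[i] - predict(model, X[i])) ** 2 for i in test_idx]
        fold_rmse.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_rmse) / len(fold_rmse)

# Placeholder learner (NOT an SVR): always predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

# Hypothetical criteria ratings and overall ratings for one segment.
rng = random.Random(1)
X = [[rng.randint(1, 5) for _ in range(4)] for _ in range(50)]
y = [round(sum(row) / 4) for row in X]

print(cross_validate(X, y, fit_mean, predict_mean, k=10))
```

Passing the fit and predict steps as functions keeps the protocol independent of the learner, so the same loop can compare RBF and linear kernels on identical folds.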

Based on the above SVR parameters, we trained the models on a training set and then provided the results for the test set in each segment for the RMSE, MAE and R2. The results are shown in Fig. 8 for the RMSE and MAE for the RBF and linear kernel functions. The low RMSE and MAE values indicate that SVR effectively constructed the prediction models for the six segments. The results in Fig. 8 also demonstrate that the RBF kernel function provided lower RMSE and MAE values than the linear kernel for the six segments.

Fig. 8 The MAE and RMSE in 150 trials using the training set

The R2 results for the six segments are presented in Fig. 9. In addition, the average evaluation results of the method based on the test sets for the six segments are presented in Table 10. In this table, we also compare our results with those of other prediction learning techniques. Specifically, we compare our results with the multilayer perceptron artificial neural network (MLP-ANN) [102], multiple linear regression (MLR), and decision trees [103]. In this comparison, MLP is optimized with the whale optimization algorithm (WOA) [104, 105].

Fig. 9 R2 results for the six segments

Table 10 Method evaluation results

The R2 results for the six segments show that the method that used clustering and SVR with SMO optimization provided accurate prediction results for the overall rating. In addition, on average, k-means-SVR-SMO (RMSE = 0.1042; MAE = 0.2823; R2 = 0.9526) was more accurate in the prediction of overall ratings than MLP-ANN-WOA (RMSE = 0.6741; MAE = 0.8213; R2 = 0.8278), decision trees (RMSE = 0.8321; MAE = 1.1024; R2 = 0.7925), and MLR (RMSE = 1.2132; MAE = 1.3751; R2 = 0.7324). In this comparison, MLR, decision trees and MLP-ANN-WOA were trained on the whole dataset with tenfold cross-validation, and MLP-ANN-WOA was trained for 150 epochs.

We also aimed to compare the computation time of the proposed method with that of other prediction learning techniques. Accordingly, at each time point, 5% of the data were considered to test the prediction models, and the computation time was calculated. The results are presented in Fig. 10. The results show that the computation time of k-means-SVR-SMO was less than that of MLP-ANN-WOA, decision trees and MLR. This finding indicates the effectiveness of integrating the clustering approach with SVR in reducing the prediction computation time. Note that all implementations were performed in MATLAB software on an Intel(R) Core(TM) i7-6700HQ CPU clocked at 2.60 GHz and with 8 GB RAM.

Fig. 10 Computation time of different methods

5 Discussion and research implications

Tourism has long been acknowledged as one of the world's most important socioeconomic factors [106, 107]. Research on and practical use of online reviews have become more prevalent in recent decades as a result of the expansion of online review sites. In the e-commerce and big data era, online customer reviews of hotels have enormous business value. Online textual reviews have an open structure in which customers can better present their preferences, feedback, and opinions on the quality of services. In fact, online textual reviews are a direct measure of a customer's view of a product or service, and in this way, they can more fully reflect the consumption experience and the customer’s perception. When analyzed with machine learning, they can be useful for revealing customer satisfaction. This has led to the growth of studies on online reviews and the application of new analytical methods. Online review analysis has been effective for customer preference learning. Traveler segmentation has recently been an important topic in the tourism and hospitality context [108,109,110]. Preference learning has a significant impact on effective customer segmentation. Machine learning has played an important role in business, customer preference learning, and segmentation [111, 112]. Data availability, the speed and simplicity of data collection, and nonintrusiveness with regard to human participants are only a few of the benefits of performing customer segmentation through the use of online reviews. Accordingly, there have been several attempts to develop new methods in this context [36, 113, 114]. This study also developed a new method for traveler segmentation with online customer reviews.

The major contribution of this research was the development of a new framework for collecting and processing big social data, extracting the dimensions of customer satisfaction, predicting customer satisfaction, and identifying customer preferences for several segments. The method was evaluated on online traveler reviews: we collected 1724 online reviews of 30 hotels in Mecca, Saudi Arabia. Collecting this amount of data for customer satisfaction analysis through a questionnaire survey could be difficult. The proposed method could be suitable for large dataset analysis, as it includes a clustering approach before predicting customer satisfaction. In fact, the segmentation stage could play an important role in the effectiveness of the proposed method, which can be verified by comparing the results of our method with those of methods that are not based on clustering approaches. In addition, when large datasets of online reviews are available, clustering may be beneficial for decreasing the computation time of predicting customer preferences and satisfaction. Therefore, in future studies, it could be interesting to improve the segmentation part of the proposed method for large datasets with additional performance criteria.

The results of the customer satisfaction analysis through the use of the segmentation approach were interesting. We found six main segments from the collected data. Each segment provided valuable information on travelers’ preferences. It was revealed that location, cleanliness, and value were more important for Segment 1. In Segment 2, the rating analysis showed that all criteria were important to travelers. In Segment 3, cleanliness, service, and value were the most important performance criteria for Mecca hotels. In Segment 4, location, service and value were the most important performance criteria; in Segment 5 and Segment 6, they were location and value. From textual review analysis, it was found that the most important dimensions of satisfaction were hotel, room, stay, service, makkah, view, mecca, breakfast, great, nice, best, staff, amazing, thanks and food for Segment 1. In Segment 2, clean, haram, experience, excellent, helpful, delicious, kaaba, like, restaurant, rooms, really, services, especially, positive, and easy were the most important dimensions of satisfaction. In Segment 3, the most important dimensions of satisfaction were experience, check, enjoyed, fast, friendly, location, reception, recommend, well, access, come, back, kind, manager and special. In Segment 4, they were memorable, customer, brilliant, everything, floor, help, kabaa, palace, convenient, guests, people, receptionist, suite, buffet and coffee. In Segment 5, the most important dimensions of satisfaction were definitely, saudi, fantastic, crowd, happy, holy, love, mosque, distance, outstanding, comfortable, perfect, place, quite and quality. In Segment 6, they were expensive, far, shop, small, slow, issue, wifi, noise, view, complain, dissatisfied, tried, waiting, care and money. Overall, it was found that the results of numerical ratings are in line with the textual review analysis. 
The results of textual review analysis showed that of the six segments discovered by k-means, only the last segment includes travelers with low satisfaction levels, which was also identified from the numerical rating analysis.

Overall, the analysis of online reviews through the implementation of new machine learning approaches is an important task. It is shown that the results of online reviews can effectively help hotel managers better detect traveler demands. In addition, hotel managers can be aware of their level of quality service and limitations in regard to different aspects of their hotels’ performance criteria. Online reviews present a significant eWOM effect, which affects future customers' booking selections. The outcomes of this study can serve as a motivator for hotel managers to extract additional attributes from customer textual reviews and to conduct in-depth investigations into customers’ decision-making behavior through their online reviews and their relationship with satisfaction and purchase intention. As online reviews include both negative and positive feedback on the quality of services, responding quickly and efficiently to negative online customer evaluations is an effective way to retain consumers and conduct service recovery steps. In addition, hotel managers' involvement in online reviews can significantly impact customers’ perception of their hotels' services and increase customers’ trust to obtain future purchases.

6 Conclusion

This research developed a new method of customer segmentation using support vector regression with sequential minimal optimization, latent Dirichlet allocation (LDA) and k-means approaches. The data on travelers’ online reviews of Mecca hotels were collected from TripAdvisor. The data were segmented using k-means, and traveler satisfaction was revealed for each segment. The LDA technique was used for textual data analysis and to discover the dimensions of satisfaction for each segment. Finally, SVR-SMO was applied in each segment to predict customer satisfaction (overall ratings) based on several performance quality factors. The effectiveness of the proposed method was measured using the RMSE, MAE and R2. The results of the evaluations were then compared with those based on the ANN, MLR and decision tree approaches. The results revealed that the predictive accuracy of the method that used k-means and SVR-SMO was higher than that of the other methods. This research has some limitations that need to be addressed in future studies. This study used TripAdvisor for data collection. Although the results were provided based on online reviews of hotels in Mecca, Saudi Arabia, the results could be further verified based on additional data collected by a questionnaire survey. In the case of the proposed method, the R2 results show that the predictive accuracy of the method could still be improved. As standard SVR was used for satisfaction prediction, the use of online SVR could be interesting research for enhancing the predictive accuracy and customer satisfaction analysis. In addition, the use of other clustering approaches for traveler segmentation is suggested. Furthermore, this research did not investigate customer satisfaction with each travel type through hotel performance criteria. Accordingly, it could be interesting to find the relationships between hotel performance criteria and customer satisfaction for each travel group. 
Doing so may reveal customers’ preferences for different trip types based on their experience with hotel services. Moreover, the silhouette coefficient was used in this study to find the optimal number of clusters based on the collected data. Accordingly, other approaches to evaluating clustering quality could be investigated.