Introduction

Forecasts of future sales trends or global demand are essential for any company. In the short term, in fact, the company can organize the resources and the strategic business functions, while in the long term, it is possible to establish the investment programs.

The forecast that allows management to make strategic decisions is configured as a hypothetical construction aimed at reproducing, approximately, a model of behavior concerning one or more phenomena. On the other hand, it is necessary to consider that any model, identified for the study of socio-economic phenomena, cannot include all the variables and/or relationships necessary for a complete interpretation of the results.

Note that random events and the inadequate qualitative and quantitative data used for the forecast could affect the validity of the results obtained.

Moreover, it seems reasonable to assume that where more historical information is available, uncertainty about future events should be reduced. As a consequence, if a company is interested in forecasting sales, it is necessary to estimate the future evolution starting from the current condition and from its past evolution (Tandon et al., 2018).

Furthermore, by identifying future sales, it is possible to reduce unnecessary investments in the production of non-salable products and/or avoid the risk of stock depletion, with consequent loss of profit and competitiveness on the market. Through the forecast, it is possible to identify existing relationships between marketing policies and sales increases, thus making it possible to correct and/or improve sales strategies (Wu et al. 2017).

In addition to marketing policies, in contemporary society, special emphasis should be placed on social media, which have changed the way public opinion is expressed. Indeed, people are increasingly turning to the Internet for information and research purposes, and this behavior could provide a snapshot of the collective consciousness, reflecting the interests, concerns, and intentions of the population.

Therefore, what people are looking for today is predictive of what they will do in the near future. However, the predictive power should be judged in relation to statistical models fit with traditional data sources or prediction markets.

Note that if the demand is stable, the forecast can be performed using reliable statistical methods. On the other hand, if the demand is random, it is unlikely to be predicted accurately. Demand forecasting is, in fact, influenced by the volatility of demand. Furthermore, a forecasting process that uses fewer resources is preferable. Therefore, it is necessary to find a trade-off between the accuracy of the forecasts and the complexity of the system.

In addition, it is important to highlight that in recent years, there has been a significant revolution in the fashion industry due to the development of big data and the use of artificial intelligence technologies, which have contributed to changes in the behavior of both customers and companies. The fashion industry is a dynamic business, characterized by the uncertainty of demand resulting from the high variation in customer tastes, fashion trends, and consumption behavior, which makes it difficult to accurately forecast demand (Ren et al., 2020).

Current forecasting methods should incorporate, in addition to traditional constraints such as price level (Ren et al., 2015), end-of-season sales, and sales promotions (Choi et al. 2011a; Choi et al. 2011b), customer information generated from big data, such as customer reviews, social media, and search traffic (Choi et al. 2018).

The aim of this research is to provide an accurate and reliable approach to forecasting clothing sales. Building on the existing literature and addressing its gaps, the innovative aspect of this paper is the study of how certain exogenous variables, including social communication, may influence clothing sales. The paper is organized as follows. The “Theoretical Background” section analyzes the factors that influence the demand and the accuracy of clothing retail forecasts, and the “Research Hypothesis” section defines the research hypotheses. The “Research Method” section describes the main data sources, the data collection, and the research methodology used in this study, highlighting how it differs from conventional methodologies. In particular, a SARIMA model with external factors (SARIMAX) is proposed to overcome the disadvantages of the traditional SARIMA model in providing sales forecasts. The “Findings” section presents different models for forecasting sales; subsequently, in order to validate the accuracy of the proposed methodology, the prediction performance of the two methods is compared and the best model is chosen on the basis of its goodness of fit. The “Theoretical and Practical Implications” section discusses the main theoretical and practical implications. In particular, from an economic perspective, the proposed model can help retail managers to forecast sales with better accuracy, so that stock-outs can be reduced. Finally, the “Conclusion, Limitations and Future Research” section presents the conclusions and discusses the main limitations and future research.

Theoretical Background

In recent decades, most researchers in the field of sales forecast proposed new forecasting techniques, evaluated the performance of existing techniques, or modified the existing ones.

Asur and Huberman (2010) have shown how social media content can be used to predict results in the real world. In particular, data extracted from Twitter are used to forecast film revenues: the films’ titles were used as search terms to ensure that the tweets referred to the films under investigation. After verifying the existence of a correlation between the box office revenue and the tweet-rate (the number of tweets per hour that refer to a specific film), a regression model was fitted and box office sales were forecast for the weekend in which the film was first released.

According to Tumasjan et al. (2010), tweets extracted from Twitter with content related to German political elections can predict the outcome of the same election. The authors use Linguistic Inquiry and Word Count (Pennebaker et al. 2007) to calculate the sentiment of the analyzed tweets. The analysis shows that the simple volume of tweets reflects voters’ preferences and approaches traditional polls.

Zhang and Li (2010) analyze the correlation between social media controversy and sales performance and provide a measure of the sales forecast of the products through a linear regression analysis.

Choi and Varian (2012) use Google Trends data related to car sales, homes, and travel to predict short-term economic indicators by using simple seasonal autoregressive models and fixed effects models.

The forecast of future iPhone sales was proposed by Lassen et al. (2014). In particular, the hypotheses underlying the research are based on the correlation between smartphone sales and tweets, that is, on the assumption that tweets can be used as a proxy for a user’s attention to a product and the intention to buy and/or recommend it. For the forecast of these sales, a multiple regression model has been identified that transforms iPhone tweets into a quarterly sales forecast with an average error of 5–10%.

To predict monthly car sales, Ahn and Spangler (2014) propose a predictive model based on social media analysis and time series analysis; the predictors of the model are the sentiment and the frequencies of the keywords.

Forecasting the future sales of a product or service from traditional historical series alone produces unreliable results, because it does not take into account the impact that recent events may have on sales. On the other hand, incorporating the opinions expressed by users on social media into the predictive analysis would increase the accuracy of the forecast. Social media prediction techniques and analytical forecasting models, if used separately, each focus on only one aspect, while their integration would improve the analysis.

Dijkman et al. (2015) propose a correlation analysis and a Granger test to predict sales operating printing, using the tweets and the sales data.

Arunraj et al. (2016) propose a seasonal autoregressive integrated moving average model to predict daily sales in the food retail industry using the daily sales of a discount retailer in the region of Lower Bavaria in Germany.

Kim et al. (Kim, 2016) compare the opinions of consumers and the sales performance of two competing products, iPhone 6 and Galaxy S5, using opinion mining, sentiment analysis, and statistical analysis.

On the other hand, Moon et al. (2016) improve the awareness and preference (AP)-based approach in entertainment industries in the movie market in order to forecast a movie’s box-office performance. In particular, the survey data concern 166 movies released during the 2-year period between 2007 and 2009. The movies include 95 imported movies and 71 Korean movies. The data contains responses from 3553 valid panelists representative of the market, and the survey contains questions on weekly changes to APs on new upcoming movies.

Furthermore, information on the movies’ features, taken from a popular Korean movie consumer forum, is used in the analysis. For the analysis, the authors use multiple AP models that vary in terms of the dependent variable and the temporal window of the weekly AP measures. Their research shows that there are distinct segments of movie consumers that react to movie quality and marketing efforts differently in accordance with AP.

Zapata et al. (2017) answer the problem of controlling the logistical expenses based on a technological model through a reengineering process improvement using a predictive model based on the analysis of the sentiment of Twitter. The results demonstrated that a control vision provides utilities for an industry in an indirect way, reducing the losses in logistical expenses, saving time creating packages and money as a result.

Several other authors, such as Kahn et al. (2017), Pai and Liu (2018), Elshendy et al. (2018), and Yuan et al. (2018), have addressed the issue of sales forecasting by regression analysis, autoregressive integrated moving average, neural network regression, and other methods.

Lyu et al. (2019) propose an autoregressive heat-sentiment model to predict daily domestic box-office revenues.

Bogaert et al. (2021) explore the power of social media data in predicting box office sales, indicating which platform and type of data add more value and why and which variables are driving predictive performance. To answer these questions, they use a two-stage social media analytical approach. In the first stage, the predictive performance of different models, including Facebook and Twitter data, is evaluated using 7 algorithms (k-nearest neighbors, decision trees, regularized linear regression, neural networks, bagged trees, random forest, and gradient boosting). In the next stage, they apply information-fusion sensitivity analysis to summarize the information from all algorithms and determine the most important variables.

Finally, Javed Awan et al. (2021) employed different machine learning models, such as linear regression, decision trees, generalized linear regression, random forest, Naïve Bayes, and logistic regression, to predict stock price movements.

Table 1 shows a detailed overview of each article. The information was categorized across different application domains in terms of social media platform (Facebook, Twitter, etc.), independent and dependent variables involved, and statistical method used.

Table 1 Research publications on predictive analytics with social media data

The Retail Clothing Sales

In addition to sales forecasting models for widely used products such as cell phones, vehicles and fuel, and food, clothing sales have attracted considerable interest. Specifically, Mukkamala et al. (2013) conduct an empirical analysis of the relationship between the quarterly revenues of the clothing company H&M and social data from Facebook. In particular, the authors used the Google Prediction API to calculate the positive, negative, or neutral sentiment of posts and comments posted on Facebook. Subsequently, the existence of a correlation between quarterly revenues and sentiment expressed on Facebook was analyzed and confirmed.

The following paragraphs give a brief description about the clothing sales and the factors that influence these sales.

Specificities of Clothing Sales

In the clothing market, consumers show little brand loyalty and generally base their choices on competitive prices (Gazzola et al., 2020). To meet the needs of consumers, companies, particularly in the fashion industry, have to reduce production costs to remain efficient. As a result, distributors need to rigorously manage the supply chain to avoid delays, stock-outs, and unsold items and to maintain the right level of inventory (Muthu, 2020). Many supply chain management tools are now available to improve the planning and synchronization of material and information flows (Lee, 2021). These tools can be customized to the specific retail conditions, but their efficiency depends largely on the accuracy of sales forecasts (Ren et al., 2020). An efficient sales forecasting system is based on knowledge of the product, sales characteristics, and future use of forecasts by distributors.

Life Cycle of Clothing Products

Typically, the evolution of sales follows the product life cycle, through the phases of launch, development, maturity, and decline. However, in the clothing market, different trends can be seen depending on the type of item: basic items are sold all year round or every year; fashion items are sold over a short period and are generally not replenished; best-selling items are sold every year, adapted to fashion trends, and may be replenished during the season. Sales forecasts are performed for basic and best-selling items, while “one-shot” fashion items are often not taken into account in the forecasting process (Thomassey, 2010).

Factors Affecting Demand in Clothing Market

The forecast analysis allows us to analyze two main aspects: the expected demand and the degree of accuracy of the same. The expected demand can be a function of different types of variations (Stevenson et al. 2007), because the market, in particular that of clothing, is strongly impacted by numerous factors, called explanatory variables, often not controlled and sometimes unknown; consequently, it is difficult to exactly identify them and especially to quantify their impact.

The main factors influencing the expected demand (a non-exhaustive list) (van der Vorst et al., 1998) are:

  • Trend and seasonal demand patterns, categorized into day-of-the-week (weekly seasonality), day of the month (monthly seasonality), month-of-the year (yearly seasonality), yearly seasons, and yearly quarters.

  • Price, categorized into normal price and reduced price (Choi et al. 2011a; Choi et al. 2011b). The price reduction is classified into promotion and discount. The price reduction of the selected product may sometimes produce cannibalization effects on similar products in the same store and/or in the competitive stores. The changes in the market may lead to price changes that in turn affect customer demand patterns (Armstrong, 2001).

  • Events: changes in demand may be caused by holiday periods and festivals (Ekambaram et al., 2020). Indeed, retail stores located in or near tourist destinations or festival locations may have higher demand variability due to the greater number of visits. Holiday periods are classified into two main categories, regular holidays and school holidays. Regular holidays are the 12 official holidays in Italy: Epiphany, Easter, Easter Monday, Anniversary of Liberation, Workers’ Day, Republic Day, Mid-August, All Saints’ Day, Immaculate Conception, Christmas, Saint Stephen’s Day, and New Year’s Day. School holidays coincide with the days when compulsory schools are closed in Italy, such as the Christmas, carnival, and Easter breaks, excluding regular holidays. Finally, the before-holiday and after-holiday effects coincide with the day preceding or following a regular holiday or school holiday. The school holidays and the two other variables taken into consideration, the before-holiday effect and the after-holiday effect, are represented as three dummy variables (0 or 1) in the models, i.e., 1 if there is an effect and 0 otherwise.

  • Weather: the customer purchase behavior may be disturbed by extreme weather such as rainfall, snowfall, and very hot and cold temperatures (Ekambaram et al., 2020). In this case, if the weather variables are used in the sales forecasting, the quality of weather forecast influences the forecast accuracy.

  • Substitution: the sales of a product may be affected by the presence of a similar product out of stock or in promotion (Ren et al., 2015).

  • New and expanding distribution channels may change customer demand patterns (Muthu, 2020). This operation could be expensive, but could also lead to an increase in customers.
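The holiday dummy variables described in the Events item above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two calendars are hypothetical and incomplete, the variable names are invented for the example, and the rule that a before-holiday or after-holiday day is not itself a holiday is an assumption.

```python
from datetime import date, timedelta

# Hypothetical calendars for illustration only: two fixed-date regular holidays
# and a made-up school-closure window (real dates vary by year and region).
REGULAR_HOLIDAYS = {date(2021, 12, 25), date(2021, 12, 26)}
SCHOOL_HOLIDAYS = {
    date(2021, 12, 23) + timedelta(days=i) for i in range(15)
} - REGULAR_HOLIDAYS  # school holidays exclude regular holidays


def holiday_dummies(day):
    """Return the 0/1 indicators for one calendar day."""
    all_holidays = REGULAR_HOLIDAYS | SCHOOL_HOLIDAYS
    is_holiday = day in all_holidays
    return {
        "regular_holiday": int(day in REGULAR_HOLIDAYS),
        "school_holiday": int(day in SCHOOL_HOLIDAYS),
        # Assumption: the day before a holiday, provided it is not a holiday itself.
        "before_holiday": int(not is_holiday and day + timedelta(days=1) in all_holidays),
        # Assumption: the day after a holiday, provided it is not a holiday itself.
        "after_holiday": int(not is_holiday and day - timedelta(days=1) in all_holidays),
    }
```

Calling `holiday_dummies` once per day of the sales series yields one column per dummy, which can then be used as external regressors.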

Research Hypothesis

An appropriate forecasting model should integrate the main explanatory variables into the forecast calculation, taking into account that:

  • The variables are many and it is not possible to establish an exhaustive list

  • The impact of these variables is difficult to estimate and is not constant over time

  • These variables can be correlated, complicating the understanding and modeling of their impact on sales

  • Some variables are not available or predictable (e.g., meteorological data) and therefore cannot be integrated into the forecasting

Note that short-term shocks in time series (variation in seasonality) can be caused by a price reduction of a product and/or other related products. On the other hand, long-term variation (shift in existing products demand) may be caused by the changes in distribution channels and introduction of new products.

The forecast accuracy depends on the availability and quality of the data: the availability of complete and longer historical data is significant to identify and understand the external factors which affect the sales. In addition, quality of input data is important to identify the forecast model.

On the basis of the considerations presented in the “The Retail Clothing Sales” section, the research hypotheses (H) that this paper aims to address are the following:

  • H1: Discounted sales significantly and positively affect the clothing sale

The uncertainty of consumer behavior makes it difficult to predict sales in the fashion industry. However, the literature assumes that end-of-season discounts or promotional sales influence sales (Choi et al. 2011b). In this case, it is assumed that discounted sales can be positively related to clothing sales.

  • H2: Regular holiday significantly and positively affects the clothing sale

In general, trend-driven retail sales forecasting, such as fashion, is important for efficient downstream supply chain planning as well as assortment and stock allocation planning (Liu et al., 2013). An influence on sales of certain factors, such as regular holidays, should be foreseen to better manage the abovementioned aspects.

  • H3: School vacation significantly and positively affects the clothing sale

In the literature, it is possible to find some contributions showing that events where young customers and their families have more free time available, such as school holidays, influence the sales of clothing (Ekambaram et al., 2020).

  • H4: Facebook social communication effects significantly and positively affect the clothing sale

Another important aspect to investigate concerns social communication, such as the search for and exchange of information by social media users (Cheung et al., 2011). This information can help formulate more objective judgments about the company and its products or services (Flavián & Guinalíu, 2006). In addition to disseminating information and assessments among customers, a social network can help companies to collect information about customers’ preferences (Hsu, 2012) and to know whether this information exchange affects future sales (Chong et al., 2009; Xu et al., 2019). In fact, numerous studies in the field of computational social sciences have shown how the data resulting from the adoption and use of social media channels can be used to predict future sales of products (Picasso et al., 2019). This study examines how Facebook social data can be used to predict future clothing sales; for this reason, hypothesis H4 is included in the sales forecast model.

Research Method

Data Collection

In this study, a predictive analysis was conducted using the daily clothing sales of an Italian company for the year 2021. The data sets available in the company’s data lake included a variety of data types with different structures, such as item barcode, branch code, receipt ID, date and time of sale, reason for movement (sale or return), quantity sold, unit price, discount applied, and loyalty number.

Among the explanatory variables considered, the effects of product substitution are not taken into account due to a lack of data and information. Moreover, extreme weather conditions play a minor role in sales increases and decreases (Agnew & Thornes, 1995), and meteorological conditions are heterogeneous across a territory as vast as the Italian peninsula. Therefore, the effects of weather conditions are not taken into account either.

In addition to daily sales data, Facebook posts were analyzed in order to consider the opinions expressed by users on social networks. These data were extracted for the same period (2021) from the company’s official Facebook page using Facebook’s Graph API. Since web texts are often very noisy and cover different topics, a text segmentation task was performed to select the sentences around the target brand.

Regarding daily sales data, a phase of data codification and cleaning was carried out. A total daily turnover of clothing items, without distinction by product category, was obtained in order to use it in the time series analysis.
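The aggregation into a total daily turnover can be sketched as below. This is only an illustration under stated assumptions: the field names are hypothetical, the discount is assumed to be an absolute amount per unit, and returns are assumed to be subtracted from the day's turnover; the source does not specify any of these conventions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical receipt lines; the field names and the convention that the
# discount is an absolute amount per unit are assumptions for illustration.
rows = [
    {"sold_at": "2021-03-01 10:15", "movement": "sale", "quantity": 2, "unit_price": 30.0, "discount": 5.0},
    {"sold_at": "2021-03-01 17:40", "movement": "return", "quantity": 1, "unit_price": 30.0, "discount": 5.0},
    {"sold_at": "2021-03-02 09:05", "movement": "sale", "quantity": 1, "unit_price": 80.0, "discount": 0.0},
]


def daily_turnover(rows):
    """Sum net revenue per calendar day; returns are counted as negative sales."""
    totals = defaultdict(float)
    for row in rows:
        day = datetime.strptime(row["sold_at"], "%Y-%m-%d %H:%M").date()
        sign = -1.0 if row["movement"] == "return" else 1.0
        totals[day] += sign * row["quantity"] * (row["unit_price"] - row["discount"])
    return dict(sorted(totals.items()))  # chronological order


series = daily_turnover(rows)
```

The resulting mapping from date to turnover, with no distinction by product category, is the kind of univariate daily series used in the time series analysis.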

In the following subsections, the forecasting sales method and the classification of the sentiment of the analyzed posts are explained in detail.

The Forecast Model: Autoregressive Integrated Moving Average

The analysis of clothing sales follows the well-known Box-Jenkins approach (Box et al., 2015), which includes model identification, parameter estimation, and diagnostic checking of the fitted model. In particular, the identification of the considered series starts with the stationarity analysis through the augmented Dickey–Fuller unit root test (Dickey & Fuller, 1979) and the KPSS stationarity test (Kwiatkowski et al., 1992). The existence of seasonal unit roots can be checked through the DHF test (Dickey et al., 1984).

A non-seasonal autoregressive integrated moving average (ARIMA) \((p, d, q)\) model represents a time series with \(p\) autoregressive terms, \(q\) moving average terms, and \(d\) non-seasonal differences [11], and is expressed as follows:

$$\phi_{p}(B)(1-B)^{d}X_{t}=c+\theta_{q}(B)\varepsilon_{t}$$

where:

  • \(B\) is the backshift (lag) operator, which shifts an observation back by \(k\) periods: \(B^{k}X_{t}=X_{t-k}\)

  • \(\phi_{p}(B)\) is the autoregressive operator of order \(p\): \(1-\phi_{1}B-\phi_{2}B^{2}-\dots-\phi_{p}B^{p}\)

  • \(\theta_{q}(B)\) is the moving average operator of order \(q\): \(1-\theta_{1}B-\theta_{2}B^{2}-\dots-\theta_{q}B^{q}\)

  • \((1-B)^{d}\) is the differencing operator of order \(d\), used to remove non-seasonal non-stationarity

  • \(X_{t}\) represents sales of a product at time \(t\)

  • \(\varepsilon_{t}\) represents the residual error of the ARIMA model

  • \(c\) is the constant of the model
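As a worked illustration of the operator notation, consider the orders \(p=d=q=1\), chosen here only for the example and not taken from the models fitted in this study. The ARIMA equation then reads:

$$(1-\phi_{1}B)(1-B)X_{t}=c+(1-\theta_{1}B)\varepsilon_{t}$$

Since \((1-\phi_{1}B)(1-B)=1-(1+\phi_{1})B+\phi_{1}B^{2}\), applying the backshift operator term by term gives the explicit recursion:

$$X_{t}=c+(1+\phi_{1})X_{t-1}-\phi_{1}X_{t-2}+\varepsilon_{t}-\theta_{1}\varepsilon_{t-1}$$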

Similarly, the seasonal autoregressive integrated moving average (SARIMA) model, an extension of ARIMA, that explicitly models the seasonal element, can be formalized as follows:

$$\phi_{p}(B)\,\Phi_{P}(B^{s})\,(1-B)^{d}(1-B^{s})^{D}X_{t}=\theta_{q}(B)\,\Theta_{Q}(B^{s})\,\varepsilon_{t}$$

where:

  • \(\Phi_{P}(B^{s})\) is the seasonal autoregressive operator of order \(P\)

  • \(\Theta_{Q}(B^{s})\) is the seasonal moving average operator of order \(Q\)

  • \((1-B^{s})^{D}\) is the seasonal differencing operator of order \(D\)

  • \((1-B)^{d}\) is the non-seasonal differencing operator of order \(d\)

  • \(s\) is the seasonal length
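The two differencing operators can be illustrated with a short sketch. The function name and the toy series are hypothetical; the example only shows how seasonal and non-seasonal differencing remove a repeating pattern and a linear trend.

```python
def difference(series, lag=1, order=1):
    """Apply (1 - B^lag)^order: each pass replaces x_t with x_t - x_{t-lag}."""
    values = list(series)
    for _ in range(order):
        values = [values[t] - values[t - lag] for t in range(lag, len(values))]
    return values


# Toy series: a linear trend plus a repeating pattern of period 3.
x = [10, 14, 12, 13, 17, 15, 16, 20, 18]
seasonal = difference(x, lag=3, order=1)     # seasonal differencing with s = 3, D = 1
both = difference(seasonal, lag=1, order=1)  # then non-seasonal differencing with d = 1
```

In this toy case the seasonal difference leaves a constant series and the additional first difference leaves only zeros, which is the behavior expected when the trend and the seasonal pattern are both deterministic.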

The advantage of the SARIMA models derives from the ability to manage both stationary and non-stationary time series with seasonality elements.

The time series forecasting, either using SARIMA models or using other models, is affected by the presence of outliers. Indeed, outliers could have a potential impact on the estimates of the model parameters. Moreover, outliers in a time series may indicate significant events or exceptions and provide useful information.

For this reason, it is important to consider external variables, which can account for outliers. Therefore, as an alternative to modeling the time series \(Y_t\) with only a combination of its past values, \(Y_t\) can be explained by both SARIMA terms and external variables (regressors). In this study, the SARIMAX model is used to predict daily time series by combining the SARIMA Box-Jenkins approach with multiple linear regression (MLR). The SARIMAX model is thus a SARIMA model with external variables, denoted SARIMAX \((p, d, q)(P, D, Q)_{s}(X)\), where \(X\) is the vector of external variables. External variables can be modeled with a multiple linear regression equation formalized as follows:

$$Y_{t}=\beta_{0}+\beta_{1}X_{1,t}+\beta_{2}X_{2,t}+\dots+\beta_{k}X_{k,t}+\omega_{t}$$

where:

  • \({X}_{1,t}, {X}_{2,{\text{t}}}, \dots , {X}_{{\text{k}},{\text{t}}}\) are observations of the \(k\) number of external variables corresponding to the dependent variable \({Y}_{{\text{t}}}\)

  • \({\beta }_{0}, {\beta }_{1}, \dots , {\beta }_{{\text{k}}}\) are regression coefficients of external variables

  • \({\omega }_{{\text{t}}}\) is the stochastic residual

Note that the residual series \(\omega_{t}\) can be represented as a SARIMA model formalized as follows:

$$\omega_{t}=\frac{\theta_{q}(B)\,\Theta_{Q}(B^{s})}{\phi_{p}(B)\,\Phi_{P}(B^{s})\,(1-B)^{d}(1-B^{s})^{D}}\varepsilon_{t}$$

The general SARIMAX model equation can be formalized as follows:

$$Y_{t}=\beta_{0}+\beta_{1}X_{1,t}+\beta_{2}X_{2,t}+\dots+\beta_{k}X_{k,t}+\frac{\theta_{q}(B)\,\Theta_{Q}(B^{s})}{\phi_{p}(B)\,\Phi_{P}(B^{s})\,(1-B)^{d}(1-B^{s})^{D}}\varepsilon_{t}$$
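As a worked instance, assume a single external variable \(X_{1,t}\) (for example, a holiday dummy), a weekly seasonal length \(s=7\), and the orders \((1,0,0)(0,1,0)_{7}\). These choices are made only for illustration and are not the model selected in this study. The regression and its error process then become:

$$Y_{t}=\beta_{0}+\beta_{1}X_{1,t}+\omega_{t},\qquad (1-\phi_{1}B)(1-B^{7})\,\omega_{t}=\varepsilon_{t}$$

Expanding \((1-\phi_{1}B)(1-B^{7})=1-\phi_{1}B-B^{7}+\phi_{1}B^{8}\) gives the recursion followed by the regression residual:

$$\omega_{t}=\phi_{1}\omega_{t-1}+\omega_{t-7}-\phi_{1}\omega_{t-8}+\varepsilon_{t}$$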

In the “Findings” section, different SARIMAX models will be presented, in order to identify the best model.

In particular, models with different parameters are compared with each other, and the best model is the one that minimizes the Akaike information criterion (AIC) [12] and the Bayesian information criterion (BIC) [13].

AIC (see Footnote 1) provides a measure of the quality of a statistical model that considers both its goodness of fit and its complexity, estimating the amount of information lost when the model is used to describe reality. When the parameters of a model are estimated by maximum likelihood, the likelihood can always be increased by adding parameters, but this leads to over-fitting.

The BIC (see Footnote 2), also called the Schwarz criterion (SBC or SBIC), penalizes over-fitted models. Therefore, when two models are compared, the one with the higher BIC value is penalized.

BIC and AIC differ in the theoretical foundation; in fact, the purpose of the Bayesian approach of BIC is the identification of the model that has the highest probability of being the true model that gave rise to the data.

In contrast, the AIC does not assume the existence of a true model and evaluates a model’s adequacy by its ability to predict the data. Therefore, the comparison among models is based on the BIC, while differences in AIC are considered only to choose the better of two models when \(\Delta\mathrm{BIC}<2\).
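The selection rule described above can be sketched as follows, using the standard definitions \(\mathrm{AIC}=2k-2\ln L\) and \(\mathrm{BIC}=k\ln n-2\ln L\), where \(k\) is the number of estimated parameters, \(n\) the number of observations, and \(L\) the maximized likelihood. The function names and the candidate values used to exercise them are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def select_model(candidates, n_obs, threshold=2.0):
    """Pick the lowest-BIC model; if the two best BICs differ by less than
    `threshold`, choose between those two by the lower AIC.

    `candidates` is a list of (name, log_likelihood, n_params) tuples.
    """
    scored = sorted((bic(ll, k, n_obs), aic(ll, k), name) for name, ll, k in candidates)
    best = scored[0]
    if len(scored) > 1 and scored[1][0] - best[0] < threshold:
        best = min(scored[:2], key=lambda item: item[1])
    return best[2]
```

With 365 daily observations, for instance, a model with one extra parameter and a log-likelihood higher by 2 has a BIC only about 1.9 above the simpler model, so the AIC decides between the two.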

On the other hand, the predictive capacity of the models is evaluated using the mean absolute percentage error (MAPE; see Footnote 3) (Brown & Rozeff, 1979; Hyndman & Koehler, 2006) and the root mean square error (RMSE; see Footnote 4) (Willmott & Matsuura, 2005; Dickey & Fuller, 1979).

MAPE provides a relative measure, because it is divided by the observed value, so this index does not depend on the unit of measurement of the series. This index can only be used if the phenomenon is measurable on a ratio scale. For the calculation of MAPE, difficulties arise when the observed series contains null or near-zero values.

RMSE takes only non-negative values, since it is calculated as the square root of the mean squared prediction error. Its theoretical minimum value is zero, which would occur if the predictions reproduced the observations perfectly. Therefore, the best forecast is associated with the lowest RMSE.

Finally, Theil’s U-statistic (see Footnote 5) (Theil et al., 1966) is used to measure the predictive capacity of a model relative to the naïve model. Theil’s U can be interpreted as the ratio between the RMSE of the proposed prediction model and the RMSE of the naïve model; therefore, the lower the value of Theil’s U, the more accurate the forecasts. In particular, the naïve model produces \(U = 1\); values lower than 1 indicate an improvement compared to the naïve model, while values higher than 1 indicate a deterioration.
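The three accuracy measures can be sketched as follows. The sketch assumes that the naïve model forecasts each value with the previous observation (a random-walk forecast); the exact definition used in the study is given in its footnotes and may differ. The sample values are hypothetical.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent; undefined if any actual value is zero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error: square root of the mean squared forecast error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def theil_u(actual, forecast):
    """RMSE of the model divided by the RMSE of the naive forecast.

    Assumption: the naive forecast for period t is the observation at t - 1,
    so the first period is dropped from both RMSE computations.
    """
    naive = actual[:-1]
    return rmse(actual[1:], forecast[1:]) / rmse(actual[1:], naive)


actual = [100.0, 110.0, 120.0, 130.0]    # hypothetical observed sales
forecast = [100.0, 108.0, 123.0, 128.0]  # hypothetical model forecasts
u = theil_u(actual, forecast)            # below 1: better than the naive forecast
```

Because MAPE divides by the observed value, the function fails with a division by zero when the series contains null sales, which mirrors the limitation noted above.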

Facebook Data Analysis: Sentiment Classification

Sentiment analysis (SA) studies, analyzes, and classifies documents containing opinions expressed by the people about a product, a service, an event, an organization, or a person.

The main classes of SA algorithm, lexicon-based and machine learning-based, have different fields of application and are based on different requirements. In the lexicon-based approach, the creation of a dictionary of terms previously classified has low accuracy of results, considering the continuous changing of the language and the change in the meaning of words in different contexts.

Surely, the strength of this type of approach is the speed with which the results can be obtained, allowing obtaining real-time analysis.

The supervised machine learning approach allows obtaining an improved accuracy but requires a training phase. Using specific training sets, it is possible to adapt the method to the context of interest. If a supervised machine-learning algorithm is trained on a language model, specific for a given context and for a particular type of document, it will provide less accurate results by moving the analysis to another context of interest.

To obtain a context-independent classification and, above all, to avoid the use of specific training sets, an unsupervised algorithm was used (Bisconti et al., 2019).

In addition to the text input (a string representing an arbitrary text), the classifier needs an auxiliary input consisting of:

  • One dictionary of pre-classified sentiment words, mainly containing adjectives and nouns, with each item consisting of the following key-value structure: word: sent score (Basile et al., 2013). Sent score is a complex number, whose argument denotes the sentiment polarity (from 0, positive, to 90, negative)

  • One dictionary of modulator words (mainly adverbs), with the same item structure, and numeric coefficients which can be positive (sentiment amplification) or negative (sentiment decrement)

For a detailed explanation of the sentiment classification procedure, see the paper by Bisconti et al. (2019). The output of this classifier consists of a floating-point value indicating the overall sentiment score of the text object; the value varies between −1 (totally negative) and 1 (totally positive).
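To make the input and output structure concrete, the following toy Python sketch scores a text with two small dictionaries. It is not the procedure of Bisconti et al. (2019). The dictionary entries, the reading of the argument in degrees, the use of the modulus as intensity, the way a modulator rescales the next sentiment word, and the averaging and clipping step are all assumptions made only for illustration.

```python
import cmath
import math

# Hypothetical sentiment dictionary: word -> complex sent score.
# Assumption: the argument is in degrees, from 0 (positive) to 90 (negative),
# and the modulus is read as the intensity of the sentiment.
SENTIMENT = {
    "beautiful": cmath.rect(1.0, math.radians(0)),
    "comfortable": cmath.rect(0.8, math.radians(10)),
    "ugly": cmath.rect(1.0, math.radians(90)),
}

# Hypothetical modulator dictionary: positive values amplify, negative
# values attenuate the sentiment word that follows.
MODULATORS = {"very": 0.5, "slightly": -0.5}

def polarity(score):
    """Map an argument in [0, 90] degrees linearly onto [+1, -1]."""
    angle = math.degrees(cmath.phase(score))
    return (1.0 - angle / 45.0) * abs(score)

def classify(text):
    """Return an overall score in [-1, 1]; 0.0 if no sentiment word occurs."""
    scores = []
    factor = 1.0
    for token in text.lower().split():
        word = token.strip(".,;:!?")
        if word in MODULATORS:
            factor *= 1.0 + MODULATORS[word]
            continue
        if word in SENTIMENT:
            scores.append(polarity(SENTIMENT[word]) * factor)
        factor = 1.0  # a modulator only affects the word right after it
    if not scores:
        return 0.0
    mean = sum(scores) / len(scores)
    return max(-1.0, min(1.0, mean))  # keep the result inside [-1, 1]

print(classify("A very beautiful dress"))
print(classify("Slightly ugly, but comfortable"))
```

The sketch only mirrors the interface described above: a text string plus two dictionaries in, a single floating-point score between −1 and 1 out.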

Findings

After pre-processing the data received from the Italian clothing company and selecting the most significant data for analysis, the company’s average daily clothing sales were calculated to be 1,425,029.57 euros. In addition, the time trend of daily sales for the year 2021 was plotted, as shown in Fig. 1.

Fig. 1 Time series of daily clothing sales for a retail store

In particular, Fig. 1 shows strong periodicity; however, seasonal patterns are difficult to observe directly. The seasonal variability can be investigated by analyzing the impact of day-of-the-week and month-of-the-year seasonality on daily sales data. In this case, the average sales of clothing during the winter and summer months are higher than in other seasons (Fig. 1). In addition, the box-and-whisker plot displays the median daily sales from Monday to Sunday, as shown in Fig. 2.

Fig. 2 Box-and-whisker plot for daily sales of clothing from Monday to Saturday

From Fig. 2, it is evident that the highest sales usually occur on Sunday and Saturday; Friday is the next highest sales day. The box-and-whisker plot also shows extreme values (o) and outliers (*), which occur mainly due to promotions. Although the median daily sales for Friday and Saturday are similar, Saturday shows greater dispersion than Friday.
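The day-of-the-week comparison summarized by the box-and-whisker plot can be reproduced numerically with a short pandas sketch. The sales values below are synthetic placeholders rather than the company’s data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for 2021 (placeholder values, not the study's data).
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
sales = pd.Series(rng.gamma(shape=5.0, scale=1000.0, size=len(dates)), index=dates)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Group the daily values by weekday name and summarize each group with the
# quantities a box plot displays: the median and the interquartile range.
by_day = sales.groupby(sales.index.day_name())
summary = pd.DataFrame({
    "median": by_day.median(),
    "iqr": by_day.quantile(0.75) - by_day.quantile(0.25),
}).reindex(weekday_order)

print(summary.round(1))
```

Comparing the `median` column across days shows which days sell most, while the `iqr` column gives the dispersion that distinguishes, for example, Saturday from Friday.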

On the other hand, with respect to Facebook social communication, Fig. 3 shows the daily distribution of the sentiment expressed through posts and comments on the company’s official Facebook page for the same period, the year 2021. It is interesting to note that sentiment was predominantly positive over the period investigated, with few negative values, recorded mainly in March, June, and September.

Fig. 3 Time series of the daily distribution of sentiment expressed on Facebook official page

To predict future clothing sales, a SARIMAX model was developed. SARIMAX modeling starts from the development of a SARIMA model. Subsequently, the predicted values of the SARIMA model are included as input variables together with external variables. Finally, parameter significance is tested and the residuals are checked to be white noise.

Note that, imagining the buying process of a generic consumer, it can be assumed that the potential buyer seeks information through social media before buying, at time \(t-k\). To identify the value of \(k\), it is necessary to calculate the cross-correlation. If the exogenous variable influences the reference variable, then one or more significant correlations will be present in the periods preceding \(t\). On the other hand, if significant correlations occur for the times following \(t\), then the hypothesized relationship should be reversed.

To analyze the relationship between the company’s total turnover and the Facebook sentiment score, the existence of a significant correlation is verified through the cross-correlation function (CCF). The CCF identifies the time lag at which the correlation between the two variables is greatest. This makes it possible to determine whether, and by how much, the exogenous variable (the Facebook sentiment score) precedes or follows the total turnover of the company in the case study.

The analysis of the CCF results shows a delay of 15 days, since the highest correlation is reached at the fifteenth time instant before the purchase. This result confirms the hypothesis expressed previously and validates the existence of a lag in the series of overall revenue with respect to the average sentiment expressed on Facebook.
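The lag search described above can be sketched in Python with NumPy. The example uses synthetic series in which sales are built to follow sentiment with a 15-day delay, purely to show the mechanics; the function name, the maximum lag of 30 days, and the data are assumptions, not the study’s setup.

```python
import numpy as np

def lagged_correlations(exog, target, max_lag):
    """Pearson correlation between exog at t-k and target at t, k = 0..max_lag."""
    exog = np.asarray(exog, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target)
    corrs = []
    for k in range(max_lag + 1):
        x = exog[: n - k]  # exogenous variable observed k periods earlier
        y = target[k:]     # reference variable observed at time t
        corrs.append(np.corrcoef(x, y)[0, 1])
    return np.array(corrs)

# Synthetic data: sales respond to sentiment with a 15-day delay.
rng = np.random.default_rng(42)
n_days, true_lag = 365, 15
sentiment = rng.normal(size=n_days)
sales = np.empty(n_days)
sales[:true_lag] = rng.normal(size=true_lag)
sales[true_lag:] = 2.0 * sentiment[:-true_lag] + rng.normal(scale=0.5, size=n_days - true_lag)

corrs = lagged_correlations(sentiment, sales, max_lag=30)
best_lag = int(np.argmax(np.abs(corrs)))
print(f"lag with the highest correlation: {best_lag} days")
```

Repeating the calculation with the two series swapped would test the opposite direction, that is, whether sales precede sentiment.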

After the preliminary analyses, a suitable model is identified to represent the time series related to the total turnover of the company, including the exogenous variables. To determine whether differencing is required, the stationarity of the time series can be checked using the Augmented Dickey-Fuller (ADF) test. From the ADF test results, the time series of clothing sales is found to be stationary, since the p-value is less than 0.05.

Based on the insights from the cross-correlation analysis, different sales predictive models have been compared, focusing on SARIMAX models.

Using the forecast package in Gretl for building SARIMAX models, two different experiments have been performed.

In the first experiment (Experiment 1), given the sales and the external variables (excluding average sentiment score), the best model parameters \((p, d, q, P, D, Q)\) have been selected and the related coefficients for SARIMAX \((p, d, q) \times (P, D, Q)\) have been estimated.

In the second experiment (Experiment 2), the sales forecasting model that also includes the average sentiment score as an exogenous variable is identified.

In both experiments, the external variables are Epiphany, Easter, Anniversary of Liberation, Workers Day, Republic Day, Mid-August, All Saints’ Day, Immaculate Conception, Christmas, New Year’s Day, school vacation, before holiday, after holiday, and discounted sales; Experiment 2 additionally includes the average sentiment score. These variables with their notation and units are listed in Table 2.

Table 2 Overview of variables in the analysis
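One possible way to encode such calendar variables is as daily 0/1 dummy columns, as in the pandas sketch below. The column names are illustrative, the encoding of the before and after holiday variables as the adjacent day is an assumption, and the sentiment column is a placeholder. School vacation and discounted sales are omitted because their dates are not given here.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")

# Italian public holidays in 2021 as (month, day); Easter fell on 4 April 2021.
HOLIDAYS = {
    "new_year": (1, 1), "epiphany": (1, 6), "easter": (4, 4),
    "liberation": (4, 25), "workers_day": (5, 1), "republic_day": (6, 2),
    "mid_august": (8, 15), "all_saints": (11, 1),
    "immaculate_conception": (12, 8), "christmas": (12, 25),
}

exog = pd.DataFrame(index=dates)
for name, (month, day) in HOLIDAYS.items():
    exog[name] = ((dates.month == month) & (dates.day == day)).astype(int)

# Assumption: "before holiday" and "after holiday" flag the day immediately
# before and after any of the holidays listed above.
any_holiday = exog[list(HOLIDAYS)].max(axis=1)
exog["before_holiday"] = any_holiday.shift(-1, fill_value=0)
exog["after_holiday"] = any_holiday.shift(1, fill_value=0)

# Placeholder sentiment series, lagged by the 15 days suggested by the CCF.
# Whether the study enters the score with this lag is an assumption.
sentiment = pd.Series(0.0, index=dates)
exog["sentiment_lag15"] = sentiment.shift(15, fill_value=0.0)

print(exog.loc["2021-12-23":"2021-12-27", ["christmas", "before_holiday", "after_holiday"]])
```

A frame of this shape, with one row per day, is the usual form in which exogenous regressors are passed to a SARIMAX estimation routine.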

The day-of-the-week effects were not included in the SARIMAX model, since the seasonal AR and MA parameters were considered in the SARIMA model.

In this study, models with different parameters are compared, and the best model is selected by minimizing AIC and BIC, as well as the RMSE and MAPE indexes. In particular, to identify the orders of the SARIMAX model for both experiments, the autocorrelation (ACF) and partial autocorrelation (PACF) functions were investigated. After investigating the ACF and PACF plots, the SARIMAX \((1,0,0)(0,1,0)\) model is identified as the best model from the SARIMAX family for Experiment 1. On the other hand, for Experiment 2, the best model is SARIMAX \((2,0,2)(0,1,0)\). The parameters of the SARIMAX models are estimated using the method of maximum likelihood. Table 3 presents the performance measures of the models.

Table 3 Performance of SARIMAX for data set of daily sales of clothing

The comparison between the two models shows that the best model is the SARIMAX \((2,0,2)(0,1,0)\) model identified for Experiment 2, which considers all the external variables, including the average sentiment score expressed by users on the official Facebook page of the company. In terms of both BIC and AIC, the model for Experiment 2 is the best. This model also presents lower values for both the MAPE and RMSE indexes. Furthermore, the accuracy of the estimate using this SARIMAX model increases by 3.28 percentage points. Moreover, even considering Theil’s U-statistics (\(U_{1}\) and \(U_{2}\)), the SARIMAX model of Experiment 2 is preferred. Table 4 reports the coefficients of the SARIMAX model factors for Experiment 2, estimated by the ordinary least squares method. The variables analyzed are significant at the 5% level, excluding the Anniversary of Liberation, before holiday, and discounted sales variables. Therefore, the first research question (H1), aimed at verifying the positive effect of discounted sales on turnover, is supported.

Table 4 Coefficients, standard error, t, and p-value of the model of Experiment 2

With reference to the second research question (H2), regular holidays such as Easter, Workers Day, August 15th, Christmas, and New Year’s Day have a negative coefficient, as shown in Table 4, which suggests that clothing sales fall on such holidays when the other variables remain constant. On the other hand, regular holidays such as Republic Day, All Saints’ Day, and Immaculate Conception have a positive coefficient, meaning that sales increase on such holidays when the other variables remain constant. Therefore, research question H2 is partially supported.

The third research question (H3) is verified; in fact, the school vacation and after holiday variables have positive effects on clothing sales.

Finally, the fourth research question (H4) on the positive effects of social communication on Facebook on clothing sales is supported. In fact, social communication on Facebook, measured by the average sentiment score, has a positive coefficient; therefore, what is written by users on the official Facebook page of the analyzed company increases the sale of clothing.

Figure 4 shows the actual sales and the sales predicted by the two models. In particular, the red line shows clothing sales from January to December, the blue line represents the estimated sales of clothing based on the model fitted for Experiment 1, and the black line shows the estimated sales of clothing based on the model fitted for Experiment 2.

Fig. 4 Sales history and forecasts of the two models

Theoretical and Practical Implications

This study provides empirical results on the main factors influencing future clothing sales. The analysis focuses on an Italian clothing company. The fashion industry is a dynamic business, characterized by the uncertainty of demand resulting from the high variation in customer tastes, fashion trends, and consumption behavior, which makes it difficult to accurately forecast demand (Ren et al., 2020).

To contribute in such a direction, the internal (such as sales turnover) and the external characteristics (such as Epiphany, Easter, Liberation Anniversary, Labor Day, Republic Day, August Bank Holiday, All Saints Day, Immaculate Conception, Christmas, New Year’s Day, school holidays, before the holidays, after the holidays, discounted sales, and average sentiment score) were analyzed to explain consumer buying behavior and generate more accurate sales forecasts in the clothing industry.

What emerges from the analysis is a higher accuracy of sales forecasts by including the opinion sentiment expressed by consumers on the company’s official Facebook page, as well as external variables such as price (discounted sales) and events (regular and school vacations).

With regard to theoretical implications, the innovative nature of the proposed model is to be found in the use of Facebook social communication in the forecasting model together with the traditional variables for forecasting clothing sales. Specifically, the Facebook social communication variable searches for content corresponding to posts or reviews published on the analyzed company’s official Facebook page and, using a rule-based approach, quantifies intrinsic sentiments (Bisconti et al., 2019). This research not only determines the polarity of the text of the post or review, but also captures the strengths of the sentiments expressed. The proposed sentiment quantification method is generic and can be applied by researchers/companies in new domains with little customisation.

On the other hand, with reference to practical implications, the proposed model can assist the manufacturers in decision-making in several ways. First, it is possible to assist the manufacturers with more effective sales forecasting, enabling better production planning, efficient supply chain management, and better relationships with end customers (Ivert et al., 2015; Mascle & Gosse, 2014).

Indeed, trend-based retail sales forecasting, as in the case of fashion, is very important for efficient downstream supply chain planning as well as assortment and stock allocation (Liu et al., 2013).

In particular, some practical implications emerge from the study that can be a starting point for future strategic choices, such as the tendency of users to search for information on social media on average 15 days before actually purchasing products. In this regard, dynamic marketing strategies should be adopted; for example, marketers could encourage consumers to share their experiences online. Various discussion forums could be activated and, with some dynamic tactics, the discussion could be kept constantly active.

In addition, the results show us that some variables belonging to the event category, such as the Anniversary of Liberation and days before the vacations, are found to be non-significant. On the other hand, discounted sales, school vacations, some ordinary holidays, such as Republic Day, All Saints Day, and Immaculate Conception, and social media communication have a positive impact on sales. Finally, the variables related to ordinary holidays, such as Easter, Labor Day, August Bank Holiday, Christmas, and New Year’s Day, have a negative effect on sales.

This is certainly a new element compared to the results presented by the various studies in the literature so far and is driven by a different application context (Ekambaram et al., 2020; Ren et al., 2020).

Through this study, an attempt is made to avoid providing overly optimistic sales forecasts, which could cause serious economic damage. For these reasons, the proposed model can be used for marketing planning in apparel markets without specific econometric expertise. In addition, endogenous estimates, that is, estimates of variables that are explained by other variables belonging to the same model, can provide useful insights into the existence and intensity of forward-looking buying behavior. This particularly supports medium- and long-term planning of future social communication in the context of profit maximization.

The results also have several direct and indirect implications for the company under study. Predicted sales from social media data are the direct implications; strategy, analysis, and management of social media platforms are the indirect implications. Conscious and intelligent use of social media for organizational purposes generates competitive advantages.

Conclusion, Limitations, and Future Research

In recent years, the Internet has become an important element in everyday life. Today, it represents an important source of information. In this research, particular attention is paid to data extracted from the Internet and in particular from the social media platform Facebook. These data, in fact, reflect the needs and interests of people, thus representing a potential for improving the prediction of some economy-related variables. From the analysis of the state of the art, therefore, a new research line emerged, which is based on the use of unstructured textual data generally available on the Internet in order to predict economic variables.

This paper aims to contribute to this new field of research paying particular attention to the use of these data to forecast future sales of an Italian clothing company. With this aim, the Box-Jenkins time series approach has been used and integrated with the inclusion of explanatory variables on the market demand side.

More specifically, the effects of price (discounted sales), events (regular holidays and school vacation), and Facebook social communication have been included in the analysis. These variables are found to be statistically significant in explaining clothing sales, and their estimated coefficients reveal that discounted sales, school holidays, the days after holidays, and some regular holidays, such as Epiphany, Republic Day, All Saints’ Day, and Immaculate Conception, as well as social communication on Facebook, have a significant and positive impact on clothing sales. On the other hand, some regular holidays, such as Easter, Workers Day, August 15th, Christmas, and New Year’s Day, significantly but negatively affect clothing sales; in fact, a decline in clothing sales is recorded on these holidays. This last result suggests that people buy clothing before the aforementioned holidays and not on the feast day itself. Therefore, the results obtained confirm the potential of the proposed approach, as it significantly improves the estimate and forecast of clothing sales. Thus, overall, the evidence presented highlights the importance and usefulness of internet search–related data for predicting economic variables. These results are particularly interesting because more and more activities will take place online, and internet-related data will increase in volume and importance. It will be appropriate to further explore the application of these new data sources in order to improve forecasts of economic statistics.

As with any other study, this one has some limitations that should be taken into account when applying its results more generally. In particular, a first limitation of this study is the lack of a longer time history on which to perform the analyses. Second, sentiment polarity is necessary but not sufficient for predicting business outcomes. Analytical approaches based on set theory can help better understand not only the categories of sentiments but also their dimensions.

The findings from the present study also suggest several avenues for fruitful further research. Future research studies are needed to investigate the role of sales in other sectors. Even a focus on the type of product may prove to be interesting.

Finally, the sentiment algorithm could also be improved by using a supervised approach: given a training set of short texts manually classified by sentiment, and taking a set of modulators as the words to classify, one could use a multivariate classifier to obtain the modulating vectors to associate with each word, based on the occurrences of the various phrases in which each modulator appears in the Italian language.