1 Introduction

The theatrical market has accounted for about 30% of films’ total revenue since the 2000s [56] and is a strong indicator of a film’s economic success, since it drives significant consumption of related goods during and after release, as well as consumption of the film through complementary channels such as TV and cable. According to the Motion Picture Association of America, in 2019 ticket revenues in the US and Canada alone were around $11.4 billion, while 76% of their populations could be classified as moviegoers [54]. At the same time, the motion picture and television industries support more than 2.5 million jobs in the United States.

According to economic theory, film is both an information and an experience good. As an information good, it has a high fixed cost (actors, directors, editors, and others) and almost zero reproduction (marginal) cost [70]. As an experience good, its quality is not known until the time of consumption, which explains the uncertainty in its production [64]. These characteristics and recent technological changes make it difficult for an entrepreneur to know in advance whether a new film will be successful as an economic venture [3].

The rapid growth of the Internet and digitization, led by technological innovations in information and communication technologies (ICTs), has reduced production and distribution costs, creating a golden age for creative economic endeavors involving information goods such as music, movies, and books [71]. For example, today a film can be consumed on any device with Internet access, such as mobile phones and tablets.

In addition, there are multiple substitute ways to consume a movie: it can be watched at home or virtually anywhere, at any time shortly after the theatrical release, or in some cases even simultaneously with it. In particular, competition with movie theaters has intensified due to Internet downloads and online streaming platforms [35, 71]. The same ICT developments that reduced film costs also encourage other markets to compete. Netflix, for example, uses consumer data and artificial intelligence to target consumption tastes and maximize its returns.

Given the effects of these new technologies and the high risk of film production [35, 48, 65], we employ a decision support system to produce guidelines for film producers and their stakeholders, such as studios, distributors, and shareholders. A film is a risky endeavor: it is very expensive to produce – including expenses for actors, directors, and marketing, among others – and may not attract enough viewers to pay for itself. An application that indicates how producers can adjust decisions such as budget, distributor, and film runtime can therefore reduce the risk of an unprofitable release, preventing heavy losses and improving productivity. To build this tool, we revisit the literature and focus on three main issues in order to contribute to, and improve on, previous studies of whether a film will attract enough consumers to be profitable.

First, following the economic literature, we sample over short and long periods to deal with a potential change in the “regime” of the model that generates film profits. The model describing film success can change as ICTs evolve over time, which can be seen as an exogenous shock to the model’s parameters [10]. Innovations in a sector can change the model that yields the best classification/prediction, so simply increasing the number of observations by reaching further back in time, as is usual in film studies, will not necessarily improve accuracy. In this sense, small samples (drawn near a specific date) can be more homogeneous and freer of outliers. We also explore results using only wide-released films, since these are more alike (a wide-released movie differs substantially from a limited-released one in costs, audience, and so on), again creating a more homogeneous sample [22].

Second, we measure a film’s success by its profit deflated by the US Consumer Price Index (CPI). Using profit as the success measure allows us to account for both revenues and production costs, since even a colossal box office may not be profitable if production costs are also high. The literature mainly uses total revenue as the measure of economic success and does not control for inflation (at least not explicitly). Failing to correct for inflation can lead to inaccurate classification of success, since more recent films have higher profits and revenues in current values. Following the still scarce literature, we investigate two profit-based measures of theatrical success in two experiments. The first is a binary measure of film profit, comparing box office revenues against production costs (budget); the second is a 6-class classification into profit ranges, which is closer to reality and more directly comparable with the literature.

Third, we evaluate whether economic success in the theatrical film market can be predicted from a small set of readily observable features available once the film’s financial plan is set and the project is green-lit, that is, before or at the time of production and release [22]. The literature, by contrast, tends to ignore when features become available, mixing features observed before and after release – such as critic reviews, consumer reviews, and the time a film stays on screens – leaving no room to change features before release to improve the outcome. Using variables available at production time gives producers more freedom in timing to adjust investment decisions [76].

In connection with our first contribution, we feed uniquely configured datasets – sliced into shorter and longer periods, and into total and wide-released films – to the three most popular machine-learning (ML) algorithms: Random Forest (RF), Support Vector Machine (SVM), and Neural Network (NN). This design allows us to properly compare the performance of the methods and datasets with the existing literature. To our knowledge, addressing these three issues jointly, as we propose, is itself an additional contribution to the literature.

Employing a dataset scraped from the Box Office Mojo and IMDB sites and only features available before a film’s release, we obtain about 96% accuracy and 97% F1-score in binary break-even (BE) classification, and about 90% Average Percent Hit Rate (APHR) for profit ranges (PR). These results improve on the accuracy reported in the literature. Moreover, they are more compelling – given our stricter measure of economic success (profits in constant, deflated dollars) and a reduced feature set – and more reliable, owing to the several tests with different numbers of observations and cuts in time. Finally, the results suggest that our models outperform those in the literature to date, indicating that our small set of features was appropriately chosen and that RF may be a better tool for predicting movie profitability.

Given the limited number of movies released per year, increasing the sample size means widening the time window in years, which risks ignoring shocks that change the conditions of film consumption from year to year.

Therefore, enlarging the sample in time should be handled with the same caution found in the econometric literature, which opens a new agenda for future studies. The ML literature on film success suggests a trend: small datasets (few years), contrary to expectations, perform as well as or even better than larger ones (more observations over longer periods) in classifying films by economic success, and our results support this conclusion. We attribute this to a possible change in the “regime” driving economic performance at theaters; technological innovations, changes in individual preferences, and other shocks, such as COVID-19, could cause these regime changes.

Following this introduction, Section 2 summarizes the literature; Section 3 presents our data and methodological strategy; Section 4 comprises our results and discussion, while the last section summarizes our main findings.

2 Literature on movie success

Due to the uncertain returns of films, many scholars have attempted to predict the economic success of films at theaters in order to guide producers, studios, distributors, and theater chains. Most of these studies are explanatory, investigating factors and their relation to box office performance through regression analysis, and have been published in different fields: Economics [11, 16, 25, 28, 37, 45, 58, 66], Business and Information [41, 52, 55], Marketing [14, 21, 44, 53], and Computer Science [2, 6, 19, 51, 69].

Recent ICT developments have reduced the cost of producing films and increased the number of films produced, resulting in more film data being available. These data, together with new computational methods, have increased the number of studies predicting movie success [71]. Most of these ML studies use features from across the whole movie lifecycle. Yet, since the greater part of the data becomes available only after a film’s release, most studies rely on post-release data to predict success; in that case, however, there is no room to change film production decisions.

In this sense, the literature on predicting movie success usually employs post-release features such as critic reviews, ratings, nominations, awards, other forms of word-of-mouth (WOM), and awareness information [18]. For example, studies employ social media microblogging to forecast box office revenues using ML in China [63] and in the Korean market [34]. There are also similar studies using other classification methods. One uses online user reviews with Support Vector Regression (SVR) to predict box office revenues by genre [33]. Another mines Twitter text for insights on customer preferences to predict box office revenues with CART and NN regression [47], both in the US movie market. Some authors transform box office prediction into a classification problem [23, 38]; notably, these authors also employ user opinion mining. For example, one study uses critic ratings and visual elements from movie posters, besides other movie metadata, in a deep NN for 6-class box office classification [78]. Another uses visual elements from trailers and text features from film abstracts in an NN to predict box office revenue [73]. Yet another explores daily box office patterns through a clustering approach and post-release features [72]. The literature also reports studies using alternative success measures, such as critic reviews; for instance, some implement ML methods on social media data to predict movie ratings [1, 5, 17].

Among studies exploring pre-release features, some use the “hype” generated online immediately before release through comments, search patterns, and other “buzz” around the movie. Even in this case, however, production and marketing expenditures have already been made, leaving no time to reverse decisions. For example, studies use social media mentions as proxies for WOM to predict box office returns in the Korean market [39, 40, 43]. Another mines popularity and purchase intentions from social media in China to predict box office revenues [49]. Yet another uses Gradient Boosting Decision Trees and daily gross revenues to predict daily box office gross [75]. Finally, [32] employs ML binary classifiers and Tweet patterns to predict US movie gross.

Still considering post-release features, a study that predicts economic success with profit classes instead of gross revenues develops a multilayer backpropagation NN for binary profitability classification [60]. The authors include user and critic ratings and per-film review volumes in their model for 375 movies released in the US and achieve an accuracy of 88.8%. Along the same line, [68] employs SVM and post-release features to treat a film’s return on investment (ROI) as a 4-class problem; with data from 138 movies released in the US market in 2015, the result is about 56% accuracy.

A seminal ML study reduced the information set to variables observed before a film’s release [65]. The authors employ a Multilayer Perceptron NN to solve a 9-class box office problem. Their feature set comprises degree of competition, genre, MPAA rating, star power, number of screens in the first release week, and a binary sequel indicator for 834 movies released in the US market, and they reach 36.9% APHR accuracy. A comparison study improves on [65]’s results with backpropagation, reporting 68.1% APHR on a 6-class, 241-observation dataset [76]. Likewise, [24] improves on [65]’s results using a Dynamic NN on a smaller dataset, obtaining 74.4% Bingo APHR accuracy on the same 9-class box office gross problem. The authors also run an additional test on an even smaller dataset (354 movies), adding marketing expenditures to the feature set, which yields 94.1% Bingo APHR accuracy with the same Dynamic NN.

More recently, other studies have updated the methods and features for box office prediction at earlier stages of the film lifecycle. One applies pruned RF and several comparative ML classifiers to predict 8-class first-week box office using Chinese theater-level data, with theater revenues as the economic success measure [27]. A second focuses on animated movie gross, with a 3-class NN and basic movie metadata [61]. A third uses CART to predict 7-class box office revenue in the Chinese market [77]. A fourth analyzes differences between movie features using RF regression, with early box office as the economic success measure [4]. Lastly, [3] develops an ensemble of several ML classifiers to predict box office revenues in nine classes.

Finally, very few studies simultaneously explore profit as the success measure and features available before the film’s release or during its production (Table 1 – bolded). Employing SVM and NN to predict profitability in five range classes, [57] uses budget, number of screens, release month, MPAA rating, and star and director power on a 755-observation dataset, reaching 49.54% Bingo APHR. The work most similar to ours, however, uses a sample of 2506 movies to predict who, what, and when a film could be profitable [42]. The authors explore cast relationships, movie abstracts, and release season to classify American movies by raw profit and ROI. They perform several experiments, including binary classification for ROI and profit and 3-class classification for ROI; their best result is 90.4% accuracy for binary profit.

Table 1 Literature summary

This study distinguishes itself from the closest previous studies, summarized in Table 1, in three main aspects addressed simultaneously. First, we account for the effects of ICT advances and other possible shocks in the recent period, designing sub-datasets that capture short and long runs and resemble the datasets of those studies as closely as possible, so that performance can be compared. Second, profits are deflated and used as the measure of a film’s economic success. Third, our feature sets are smaller and more intuitive than those used in previous studies and are available at film production time (see the arguments in Section 3.2).

In addition, as Table 1 shows, we employ a decision support system to classify and forecast film profits using RF, SVM, and NN. This set of tools differs from the literature and can also be viewed as a marginal contribution (see Section 4).

3 Data and methodological strategy

3.1 Data

Around 22% (3167) of the movies released between 1980 and 2019 (14,510) available on the Box Office Mojo and IMDB sites – the most common data sources in the literature – have budget information. The scarcity of film cost information is due to “industry trade secrets” [76]. The collected sample, however, is far larger than the average dataset size used in the literature, which is 361 observations/movies [39].

All monetary values in this study were deflated by the 2019 US Consumer Price Index (CPI-2019) to control for inflation, so that prices from 1980 through 2019 are expressed in constant dollars. This procedure is not usual in this literature. Not correcting for inflation, however, can mislead decision makers and compromise results, since comparing revenues over time requires controlling for price inflation to avoid wrongly classifying the most recent films as the most profitable or the highest grossing. Figure 1 shows the evolution of revenues and budgets by year between 1980 and 2019, both controlled for inflation.

Fig. 1 Distribution of deflated gross and budget over 40 years of data
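As a concrete illustration of the deflation step, the following minimal sketch (in Python) converts nominal dollar amounts into constant 2019 dollars; the CPI values shown are approximate and for illustration only, and a real implementation would load the full annual US CPI series.

```python
import pandas as pd

# approximate annual US CPI-U values, for illustration only
cpi = pd.Series({1980: 82.4, 2000: 172.2, 2019: 255.7})

def to_2019_dollars(value, year):
    """Convert a nominal dollar amount into constant 2019 dollars."""
    return value * cpi[2019] / cpi[year]

# a $50M gross in 1980 corresponds to roughly $155M in 2019 dollars
print(to_2019_dollars(50e6, 1980))
```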

We collected budgets and worldwide gross revenues to construct the profit measures, meaning we consider box office revenues from all countries where the film premiered. According to Box Office Mojo, all information received from countries is reported. We then follow the scarce literature that uses profit measures to create success classes, both binary [42, 73, 78] and multiclass [60].

3.2 Methods

Figure 2 presents the general workflow of the methodology described in the following sections.

Fig. 2 Methodology workflow

3.2.1 Variable selection

Unlike most previous studies, we use a reduced set of features easily observable during film production to classify film success. Furthermore, we limit the features to those available before the film’s release, particularly at the production stage. Thus, differently from previous studies, we can offer policy guidance for producers and stakeholders that allows changes while a movie is still in production. Additionally, the variables were chosen carefully, based on the literature, to provide the most meaningful features for optimal classification given the curse of dimensionality. Table 2 summarizes the features and their preprocessing steps based on the literature.

Table 2 Variable names and description

Compared to [42], we employ more straightforward, less costly, and directly observable features – or at least the ones industry agents must bet on. For instance, during the production of a film it is already possible to know the planned runtime, the release season, the distributor, and the genres. Thus, if the proposed tool predicts a poor outcome, there is still time to change these characteristics and increase the chances of success. Among prior studies, [42] is the most similar to ours in using binary profit (though not deflated) as the success measure and pre-release variables (see Table 1).

3.2.2 Predicting methods

We choose the three most popular ML classification algorithms in the film literature – SVM, Multilayer Perceptron Neural Network (MLP-NN), and RF – to conduct our experiments.

SVM is a supervised classifier based on the statistical framework proposed by Vapnik and Chervonenkis (VC theory). It finds the hyperplane that maximizes the separation between data points and can perform linear and nonlinear classification by applying the kernel trick. For further information and the underlying mathematics, see [7, 12].

The MLP-NN is also a supervised classifier; it approximates the function mapping the input data to the output class by adjusting the weights between layers through forward and backward passes. For further information about the MLP, see [30].

The RF classifier [8] is an ensemble of decision trees and performs very well across tasks [26, 36, 59], particularly on heterogeneous data mixing continuous and discrete variables, such as the binary/dummy features we employ. Besides being versatile in binary and multiclass classification, RF is simple to build, train, and tune, and the method is robust and less sensitive to noise [29]. It can also outperform non-ensemble methods [62]. Finally, since most previous movie prediction studies focused on NN, RF remains little explored in this domain (Table 1); its suitability for predicting movie financial success, given its capacity to handle mixed data, can thus be considered a marginal contribution.

In our samples, RF is less sensitive to noise (giant blockbusters or flops) and is explainable, allowing us to assess feature importance in the models and evaluate whether samples from different time ranges matter for predicting success; this works as an indirect measure of shock effects. We borrow this idea from the time-series economics literature [10, 50], which holds that the process generating a model – here, film profits – can change regime over time due to shocks. To implement and test this, we created datasets over different time windows, plus a complete dataset including year dummies to test the Gini importance of the years in RF (Fig. 7). RF’s lower sensitivity to noise also suits the comparison between total and wide-released film sets.
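For reference, a minimal sketch of how the three classifiers can be instantiated with scikit-learn; the hyperparameter values shown are placeholders, not the tuned values used in our experiments (those come from the grid search described in Section 3.2.3).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# placeholder hyperparameters; the actual values are selected by grid search
models = {
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0),
    "SVM": SVC(kernel="rbf", C=1.0, random_state=0),
}
```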

3.2.3 Experiments

Following the literature, the prediction problem was transformed into a classification problem: classifying a movie as a profit success or failure based on its worldwide gross revenues and budget. Two class arrangements were designed: Break-Even (BE) and Profit Ranges (PR).

Break-Even (BE): Similar to [42, 60], the output is binary: 1 when the film’s profit is zero or positive – its worldwide gross equals or exceeds its budget – and 0 otherwise. In this arrangement, a movie only has to collect (in box office gross) the exact amount spent on production (the announced budget).

Profit Ranges (PR): To get results closer to actual profit values and comparable with the previous literature, we created a 6-class problem based on a movie’s total profit, following [73, 78].
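A minimal sketch of the two labeling rules, assuming a DataFrame with deflated `gross` and `budget` columns; the PR bin edges below are placeholders for the actual thresholds given in Table 3.

```python
import numpy as np
import pandas as pd

def label_break_even(df):
    """BE: 1 if worldwide gross covers the announced budget, else 0."""
    return (df["gross"] >= df["budget"]).astype(int)

def label_profit_ranges(df, edges):
    """PR: 6-class label from profit = gross - budget, given bin edges."""
    profit = df["gross"] - df["budget"]
    return pd.cut(profit, bins=edges, labels=range(1, len(edges)))

# placeholder edges yielding six classes; the real cut points are in Table 3
edges = [-np.inf, 0, 10e6, 50e6, 100e6, 200e6, np.inf]
```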

3.2.4 Sets

Although the results for the full dataset (1980–2019) were good (see Section 4), the literature uses much smaller datasets. We therefore also analyze different slices of the data to explore possibly heterogeneous results between the smallest and largest samples in time – which could capture changes in consumer behavior due, for example, to technical change – and between and within datasets. We also explore wide-released film subsets, since they are more homogeneous in box office revenues. In total, we created 12 subsets of data based on release years and on wide versus total releases. Tables 3 and 4 present the classification thresholds and the rules used to slice the data into these subsets.

Table 3 Classification thresholds for each class arrangement (break-even and profit ranges)
Table 4 Datasets, slice rules and number of observations for binary class
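A minimal sketch of the slicing logic, assuming a scraped `movies` DataFrame with `year` and `wide_release` columns; only a few of the 12 subsets are shown, with the K and L windows following the description in Section 4.

```python
import pandas as pd

# toy stand-in for the scraped data
movies = pd.DataFrame({
    "year": [1999, 2002, 2012],
    "wide_release": [False, True, True],
    "gross": [30e6, 120e6, 90e6],
    "budget": [40e6, 60e6, 70e6],
})

subsets = {
    "A": movies,                                        # full sample, 1980-2019
    "B": movies[movies["wide_release"]],                # all wide releases
    "K": movies[movies["wide_release"]
                & movies["year"].between(2000, 2004)],  # wide releases, 2000-2004
    "L": movies[movies["wide_release"]
                & movies["year"].between(2010, 2014)],  # wide releases, 2010-2014
}
```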

Figure 3 shows the class distributions of the full sample (A) over the years for BE (panel a) and PR (panel b), while Fig. 4 presents the corresponding class distributions for the wide-released movies (B).

Fig. 3 a Break-Even class distributions of dataset A (full sample) over the years. Profitable movies are the majority, especially in the last two decades. b Profit Ranges class distributions of dataset A (full sample) over the years. Class 1 diminishes over time, while class 6 grows

Fig. 4 a Break-Even class distributions of dataset B (wide releases) over the years. Profitable movies are the majority. b Profit Ranges class distributions of dataset B (wide releases) over the years. The distributions are similar to those in Fig. 3b

Fig. 5 Average accuracy (10-fold CV) of the three classifiers (RF, MLP, and SVM) in the binary experiment (BE) for sets B, G, H, I, and L. The confidence intervals show that RF has both the highest lower bound and the highest upper bound

Fig. 6 Random Forest feature importance for the break-even experiment in dataset K (left) and dataset L (right). In both, the First Week Theaters feature is the most important, while the other features vary (genres, budget, and runtime, among others)

Fig. 7 Random Forest feature importance for the break-even experiment on the full sample (A) with year dummies added to capture their importance. The years 1999, 2014, and 2001 figure among the top 20 features

To classify films into the BE profitability classes without biasing results toward the success/positive class, it is necessary to address the imbalanced class problem observed in Fig. 3a (862 unsuccessful vs. 2305 successful films for dataset A). This imbalance in our sample is mainly due to budget information, a feature generally disclosed only by big studios. We use SMOTE to oversample the minority (negative) class. Rather than duplicating observations, SMOTE creates synthetic new ones: it selects a minority-class record and perturbs its feature values by random amounts within the differences to neighboring minority-class records [13]. Note that the synthetic instances are used only in the training folds. We balanced all BE experiment datasets this way, and in our tests SMOTE proved better than class weighting and near-miss methods.
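A minimal sketch of this step, using imbalanced-learn’s pipeline so that SMOTE is fitted only on the training split of each fold; the names `X` and `y_be` are assumptions standing for the feature matrix and the BE labels.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# the sampler runs only during fit, so validation folds are never resampled
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
# scores = cross_val_score(pipe, X, y_be, cv=10, scoring="f1")
```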

To obtain the best hyperparameter sets, we use grid search to optimize all experiments, models, and datasets. We start with wide ranges and different hyperparameter configurations and refine them toward the best scenario. The best hyperparameter sets are given in the footnotes following the results.

For both experimental setups (BE and PR) and all datasets (A to L), we use 10-fold cross-validation, which reduces dependence on any particular train/test split and makes the methods more fairly comparable [67]. The results are therefore reported as averages over these 10 runs.
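A minimal sketch of the tuning and validation loop, combining an exhaustive grid search with stratified 10-fold cross-validation; the grid below is a placeholder, since the tuned values are reported in the footnotes following the results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# placeholder search space for the RF classifier
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
# search.fit(X, y_be); search.best_params_
```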

Finally, to evaluate, present, and discuss the results properly, we use the accuracy (Eq. 1) and F1-score (Eq. 2) metrics for both the binary (BE) and 6-class (PR) experiments. In addition, APHR is used for the multiclass sets (PR), following the most common approaches in the literature. APHR (Eq. 3) is the share of a class’s samples that are correctly classified – the per-class hit rate – averaged over all classes in the classification problem.

$$ \mathrm{Accuracy}=\frac{\text{True Positives}+\text{True Negatives}}{\text{True Positives}+\text{True Negatives}+\text{False Positives}+\text{False Negatives}} $$
(1)
$$ F1=2\cdot \frac{\text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}} $$
(2)
$$ \mathrm{APHR}_{\text{Bingo}}=\frac{\text{number of samples of a class correctly classified}}{\text{total number of samples of that class}} $$
(3)
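A minimal sketch of the three metrics computed with scikit-learn on toy labels; per-class recall corresponds to the per-class hit rate of Eq. 3, and its unweighted mean over classes gives the APHR.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = np.array([0, 1, 1, 2, 2, 2])  # toy multiclass labels
y_pred = np.array([0, 1, 2, 2, 2, 2])

acc = accuracy_score(y_true, y_pred)                    # Eq. 1
f1 = f1_score(y_true, y_pred, average="macro")          # Eq. 2, macro-averaged
hit_rates = recall_score(y_true, y_pred, average=None)  # Eq. 3, one per class
aphr = hit_rates.mean()                                 # average percent hit rate
```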

4 Results and discussion

Table 5 presents the BE results under all datasets (A to L) for the three ML methods: RF, SVM, and NN.

Table 5 Break-Even (BE) experiment: 10-fold cross-validation median Accuracy (Acc) and average F1-score results for RF, MLP, and SVM on each dataset

The results show good model performance, relative to the literature, in predicting whether a movie will cover its production costs. The best BE accuracy is 96.7% on dataset B (wide releases only), followed by datasets G, H, I, and L with 95%, 93.3%, 92.1%, and 94.2%, respectively – all with RF. For parameter details, see Table 10 in the Appendix. These datasets also have F1-scores above 95%. Except on set J, RF performed better than MLP and SVM. Figure 5 presents the performance of the three classifiers, with confidence intervals, for the best-performing sets; the confidence intervals reinforce the superiority of RF in these cases.

Most studies classify film success using revenues as the main measure; among the studies that use revenue net of costs, our best binary result, 96.77%, outperforms the literature by significant margins: 88% in [60] and 90.4% in [42].

For the multiclass experiment (PR), the best average accuracy comes from dataset I, at roughly 50% with RF, followed by sets A, C, D, H, and L, all at about 46%. As Table 6 shows, RF has the best performance on all datasets. The APHR results for PR-I are presented in Table 7.

Table 6 Profit Ranges (PR) experiment: 10-fold cross-validation median Accuracy (Acc) and average F1-score results for RF, MLP, and SVM on each dataset
Table 7 APHR of experiment PR in set I with RF

As Table 7 shows, we obtain an average APHR-Bingo of 89.8%, better than the 56% APHR of [68] and the 49.5% APHR of [57]. Comparing broadly with the literature that classifies using pre-release information – noting that their success measure is raw revenue while ours is deflated profit – our models also predict better than the 54.4% of [78], the 36.9% of [65], and the 68.1% of [76]. Considering that those authors use NN architectures as predictors in multiclass problems, we conclude that RF performs better in supporting movie stakeholders’ decisions. Table 8 summarizes the comparison between our results and the literature.

Table 8 Summary table of results and comparison

Overall, the four best BE results (B, H, I, and L – Table 5) show excellent scores in predicting movie profitability at theaters. The results also suggest that profit is a more adequate measure of a film’s success because it accounts for the tradeoff between revenues and costs. Moreover, since the exclusive use of features available before release or during production significantly reduces the number of features available to the classifiers, the results are all the more compelling: we use no information such as critic reviews, user reviews, or WOM data.

The best time-sliced datasets – H, I, and L – include only wide-released films over brief periods after 2000, which explains their similarity (see Table 4). They outperform dataset F, which contains all wide-released movies after 1999. This difference may shed light on how the time window is sliced, and consequently on sample size: smaller and more recent datasets performed better. Another way to read these findings is through the homogeneity of the data slices: set B covers all wide-released movies, with no time slice, and achieved the best performance, and the same holds for set G, from which outliers were removed (Isolation Forest), reinforcing the importance of homogeneity for prediction.
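Set G’s outlier removal can be sketched with scikit-learn’s IsolationForest; the contamination rate below is a placeholder, not the value used in our experiments, and `movies` is assumed to be the DataFrame from Section 3.1.

```python
from sklearn.ensemble import IsolationForest

# flag outliers on the monetary features; -1 marks an outlier, 1 an inlier
iso = IsolationForest(contamination=0.05, random_state=0)  # placeholder rate
labels = iso.fit_predict(movies[["gross", "budget"]])
movies_no_outliers = movies[labels == 1]
```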

These results suggest the data-generating model may have changed due to structural breaks [10]. Shocks – technological innovations, changes in consumer preferences, political and economic interventions, and natural shocks like COVID-19 – can cause structural breaks. To evaluate this possibility, we examine the RF feature importance, via the Gini index, for BE experiments on datasets of different sizes and periods, checking for changes in relative feature importance, since such changes indicate a different model. Using datasets distant in time – K (wide releases between 2000 and 2004, 567 observations) and L (wide releases between 2010 and 2014, 536 observations) – we extract the Gini feature importance for each case, as Fig. 6 shows.

Comparing the features of the K and L datasets reveals a clear change in the relative importance of budget, runtime, the crime and adventure genres, number of markets, and other features. This change in theatrical consumption may result from technological innovation – alternative ways of consuming a movie brought by streaming video – or from the availability of other new goods, like games, shifting consumption behavior. To check the robustness of these changes and better understand a possible shift in the “regime” governing data generation, we ran an additional BE experiment that adds year dummy variables (from 1981 to 2019) as features to dataset A and performed RF classification. We include all years because our worldwide revenues span many countries, making it difficult to attribute a shock to a specific year. If the year dummies prove to be relevant determinants of film success, this is evidence of regime change, since time itself should not affect the classification. The relative importance scores of the top 20 features are shown in Fig. 7.
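A minimal sketch of the year-dummy robustness check; `X_base` and `y_be` are toy stand-ins for the Table 2 feature matrix and the BE labels, and dropping the first year leaves dummies for 1981 onward.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for the real feature matrix and labels
X_base = pd.DataFrame({"budget": [40e6, 60e6, 70e6], "runtime": [95, 120, 110]})
years = pd.Series([1980, 2001, 2014], name="year")
y_be = [0, 1, 1]

# one dummy per release year, the first year dropped as the baseline
X = pd.concat([X_base, pd.get_dummies(years, prefix="year", drop_first=True)], axis=1)

rf = RandomForestClassifier(random_state=0).fit(X, y_be)
top_features = (pd.Series(rf.feature_importances_, index=X.columns)
                  .sort_values(ascending=False)
                  .head(20))  # relative (Gini) importances, as in Fig. 7
```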

Additionally, by exploring different data samplings and the feature importance within each dataset, we find that the number of theaters, budget, runtime, and the number of market releases are the main features explaining a movie’s economic success. Note, however, that the number of theaters may bias the results toward wide-opening films, since these movies disclose budget information more often. At the other extreme, the two least important features are the NC-17 and G MPAA ratings, likely because of their low representation in the data. Apart from these two, our models classify films very well using a few variables that are easily observable or available during a movie’s planned production/pre-production period.

5 Concluding remarks

Uncertainty in new film production is high, with failure rates between 25% and 45% [46]. A large portion of movies are therefore unprofitable, and large budgets and impressive star power are no guarantee of profit [15]. We thus evaluate three classifiers for determining the economic success of theatrical releases, measured by profits free of price inflation, using a few simple types of information observable at the production stage. We treat economic success as movie revenue over costs (profit) in two approaches: binary classification (BE) and 6-class classification (PR). For the binary classification, we use SMOTE to address class imbalance.

Forecasting film profitability from the early stages of production alone is a complex task, mainly because it excludes several relevant determinants of film quality and economic performance that become available only at or after release. Nevertheless, our results outperform previous studies, mainly with RF and small datasets (96% accuracy and 97% F1-score for the binary problem and about 50% accuracy for the 6-class problem).

In addition, the feature importance analysis suggests that the movie market model changes over time. The theoretical literature in ML and statistics [74] indicates that more data (more instances/information) improves performance. Our findings, like those of the applied film success literature, show that limiting the data to brief periods preserves patterns of similarity over time, resulting in better learning. We therefore argue that shocks such as technological innovations, which change supply and demand behavior, can alter the regime of the model that classifies film success.

Our study therefore contributes to both the productive sector and related academic work. It can guide studios, producers, and other stakeholders toward better investments and decisions while there is still room to change plans. They can count on the low cost of obtaining prediction inputs (directly observable features), excellent predictive accuracy, and enough time to alter movie plans if an unfavorable prediction occurs.

Regarding the contribution to the literature, we envisage five novelties that can be summarized in three main issues. First, we use deflated profits as the measure of film success instead of the non-deflated revenues used in most of the literature, which allows us to balance the trade-off between film revenues and costs; moreover, the few studies that do employ profits do not deflate them, which can mislead classification toward treating the most recent films as the most profitable. Second, to predict success, the proposed tool uses a small number of simple, directly observable features that require no elaborate pre-processing and are available mainly at production time; thus, when a poor result is predicted, there is room to change the production course and increase the film’s chances of success. Third, the study calls attention to potential changes over time in the regime describing the model, due to shocks such as technological innovations. Considering all these items, together with the sample cuts designed for comparison with the literature, the use of RF, and the higher scores obtained, we believe we have contributed to the literature.

More investigation of these potential “regime” changes is needed. Structural breaks should be analyzed with specific statistical tests – future work to be explored – in order to develop exogenous tests that safeguard the future predictability of the film market and other time-related social domains. Another line of investigation is the difference between more homogeneous and more heterogeneous samples for predicting film success. Eliminating outliers, for instance, makes a sample more homogeneous and improves binary predictions; restricting the sample to a shorter period likewise makes films more homogeneous and improves success prediction, as we and other authors have found, and samples of only wide-released films – a more homogeneous set – also yielded better predictions in our results. Finally, future work might improve feature selection by removing minimally informative features and adding others, such as sequel and/or star power indicators, in line with the literature, and might experiment with different computational models to estimate missing budget data and enlarge the dataset.