1 Introduction

Predicting the effects of advertising is a critical issue for advertisers, but the factors that shape viewers' impressions and emotions are still poorly understood. In TV advertising, the gross rating point (GRP) is the standard measure for estimating effectiveness. GRP is simply the sum of the viewer ratings of the individual broadcasts of an advertisement. GRP is believed to correlate with the recognition rate (i.e., the percentage of people who remember the advertisement), because the more frequently an advertisement is aired and the more popular the TV program is, the more people watch the advertisement and, in turn, recognize (remember) it. However, as we will show in Section 5, the correlation coefficient between GRP and the actual recognition rate is quite small. Other factors, such as how an advertisement influences viewers' desire to buy the products, are even harder to predict because such impressions and emotions have little to do with GRP. To better understand the impact on audiences, we propose new metrics that evaluate an advertisement based on real impressions reported by the audience.

The contributions of this paper are as follows. First, we construct a large-scale dataset of TV advertisements. The dataset contains 14,490 video clips with 23 kinds of subjective labels and can readily be used for supervised learning in the computer vision community. As far as we know, this is the largest TV advertisement video dataset with such rich labels. All the advertisement videos were created by professionals and actually broadcast within TV programs. The videos were evaluated by actual residents of the same country rather than by crowd workers from all over the world, which means the advertisements were assessed by their intended target audience without any cultural gap. The dataset can be shared with researchers in academia under a proper contract. Second, we present a multi-modal baseline that takes multi-modal features into consideration, together with an analysis of the contribution of each modality. In this paper, we tackle the task of predicting the following four impressional and emotional effects:

  • Recognition rate: how much participants remember the advertisement.

  • Willingness rate: how much participants feel like buying the product/service.

  • Interestingness rate: how much participants become interested in the product/service.

  • Favorableness rate: how much participants like the content of the advertisement itself.

According to professionals from the advertisement industry, these four dimensions are the most significant of the 23 kinds of labels. To the best of our knowledge, predicting such high-level impressional effects for advertisement videos has never been tackled before.

Experimental results using 11,373 videos in our dataset show that our model can predict the impressional and emotional effects with correlation coefficients of 0.69-0.82, which is a vast improvement compared to the current de-facto standard of predicting using the GRP. When using individual models to predict each target, the results can be further improved to 0.69-0.85. Our experimental results show that the high-level impressional and emotional effects can be predicted with moderately high correlation coefficients, which we believe is beneficial for the advertisement industry and can be used as a baseline for future studies.

The remainder of this paper is organized as follows. Section 2 summarizes the related work in this research field. Section 3 describes our dataset. Our baseline multi-modal network architecture is explained in Section 4, and the experimental results are shown in Section 5. Possible applications are discussed in Section 6, and Section 7 discusses limitations and future work. Concluding remarks are given in Section 8.

The techniques used in this paper may not be novel in themselves; the main contributions are the world's largest TV advertisement dataset with detailed labels and metadata, and a strong baseline for predicting such subjective effects.

2 Related work

2.1 Machine learning on advertisements

Hussain et al. [37] constructed a dataset describing what actions the viewer is prompted to take, the reasoning that the ad presents to persuade the viewer, and the symbolic references that the ad makes. The authors concluded that the prediction of topic and sentiment in video advertisements is still a very difficult problem. They only achieved 35.1% accuracy for predicting the video topic, and 32.8% accuracy for predicting the sentiment.

Impact analyses in advertisements have been studied in [76, 77]. In [76], immediate recall, day-after recall, and user experience are evaluated in addition to conventional effective measures such as valence and arousal. As compared to these works, the number of our raters is larger (620 vs 17 [76]), and much more detailed affective effects are labeled (23 vs 5 [76]) in our dataset. Adamov and Adali [2] employed sentiment analysis to find the most relevant advertisement to the main topic of the web page and the sentiment (positive or negative) of the author. This can be used for more context-aware advertisement, but [2] only analyzed the sentiment and not the effects of the content as we do in this paper.

One of the major issues with previous works is that their video datasets are collected from YouTube and the questions are answered by crowd workers, so the cultural backgrounds of the advertisements and the viewers are not necessarily aligned. On the other hand, our study directly predicts the effects of the advertisements, such as how much consumers would be aware of the advertisements and how much the advertisements would succeed in persuading viewers to buy the products. This has been made possible by constructing a larger TV advertisement dataset and conducting extensive research on their impressional and emotional effects.

A lot of affective advertising generation methods have also been discussed. The authors in [95, 100] proposed video-in-video frameworks. In [95], object recognition and human detection are used for retrieving the optimal advertisement in terms of the advertisement’s attractiveness and intrusiveness to the viewers. Zhang et al. [100] used eye-gaze tracking data to minimize the user disturbance because of advertisement insertion while enhancing the user engagement with the advertising content. Similar works aiming at text insertion onto videos can also be found in [53]. Yashima et al. [97] proposed a joint image and language DNN model to learn from the noisy online data and produce product descriptions that are closer to human-made descriptions with more subjective and impressive attribute expressions. They succeeded in generating more emotionally impressive descriptions according to the crowd workers. A similar study can be found in [56].

Advertisement video classification is also a fundamental problem in video categorization [36]. In our setting, such category information can instead be used as one of the input features.

Studies on infographics have also been conducted for designing more visually appealing poster advertisements. Saleh et al. [74] demonstrated that the combination of histograms of oriented gradients (HoG) [19] and color histograms was the most efficient method in retrieving similar style infographics, which can also be used for classification. Bylinskii et al. [10] proposed using neural networks trained on human clicks and importance annotations on hundreds of designs for predicting the relative importance of different elements in data visualizations and graphic designs. The prediction model of importance could be used for automatic design retargeting and thumbnailing [11]. However, it focused only on making more eye-catching static posters. In [102], a deep learning framework was proposed for exploring the effects of various design factors on perceived personalities of graphic designs and applications to element-level design suggestion and example-based personality transfer were demonstrated. Color enhancement for advertising images has also been discussed [14].

Recently, research on online banners has also been reported [16, 28, 40, 65, 91, 92]. The first DNN-based click-through rate (CTR) model was reported in [16]. In [28], the influence of an image ad's dimensions, such as width and height, on CTR was discussed. Xia et al. reported that the CTR of online image advertisements can be predicted with high accuracy [91, 92]; the correlation coefficient of their multi-modal regression model was 0.828. A similar idea of multi-modal fusion is also reported in [65]. CTR prediction for videos was recently achieved with a correlation coefficient of 0.70 [40].

Producing suitable advertisements for a target audience is also close to the field of recommendation systems, which likewise try to show appealing information to users. Deep autoencoder-based methods [61, 62] can handle sparse data and produce good feature representations. Pan et al. [63] used a stacked denoising autoencoder to obtain compact representations. For sparse input data, graph convolutional networks (GCNs) have also been applied in this research field and shown to be effective [35, 98].

2.2 Physiological analysis on advertisements

Multiple physiological studies on impressions and emotions invoked by advertisements have been conducted. Aaker et al. [1] analyzed the “informativeness” of 524 prime time TV advertisements that were rated by more than 250,000 viewers. The correlation between informativeness and other personal relevance adjectives suggested that an informative advertisement tends to be worth remembering, convincing, effective, and interesting. Vaughn [87] proposed a model based on a matrix of consumer thinking-feeling and high-low involvement behaviors. He suggested that some purchase decisions are well thought out whereas others are more impulsive, requiring less involvement on behalf of the consumer.

Yang et al. [96] demonstrated that users’ zapping behavior has a close relationship with their facial expressions. Based on such investigations, an advertising evaluation metric, Zapping Index (ZI), was proposed to measure a user’s zapping probability. ZI could also be used to measure a user’s preference for different categories of advertisements, which would assist advertisers as well as publishers in understanding users’ behavior.

The relationship between higher-level impressions such as those we focus on in this paper and eye-gaze [100], or multimodal features including the RR intervals of heart beats [60], has also been investigated.

These technologies can be used to predict the emotional effects of advertisements, but their experimental setups are cumbersome. From that point of view, automatic prediction of effects without extra interaction with humans is preferable.

Our work is different from the above in that our DNN-based model can predict the effects even before actually broadcasting (or watching) the content or conducting a large-scale subjective study. Therefore, it would become possible to give insightful suggestions to the advertisement industry.

2.3 Emotion, Sentiment, and Memorability

Sentiment classification and affective analysis have been conducted for text [3, 6, 22, 39, 64], speech [31, 44, 57, 89, 101], audio [12, 80, 88], image [50, 52, 55, 66, 99], face [15, 17, 25, 72, 103], and video [7, 41, 58, 59, 94]. There are also large-scale datasets of images annotated with their emotion/sentiment, such as the BAM! dataset [90], EMOTIC [49], SentiBank [9], and so forth. SentiBank is a large-scale Visual Sentiment Ontology (VSO) dataset consisting of more than 3,000 Adjective Noun Pairs (ANPs). The authors also presented a visual concept detector library that can detect the presence of 1,200 ANPs in an image.

Most of these works use Ekman's atlas of emotions [23] or Plutchik's wheel of emotions [68]. Plutchik's wheel is similar to Russell's model in this regard and differs from Ekman's basic emotions; Plutchik's theory allows us to clearly perceive the proximity between arbitrary pairs of emotion categories. In either case, however, the target emotions are elemental ones such as joy, anger, and so on. Higher-level impressions such as those in our study have rarely been discussed so far.

Some researchers have discussed the memorability of images [5, 21, 26, 47, 67] and image/video manipulation techniques to enhance memorability [27, 78]. Kim et al. [48] showed that heart rate and galvanic skin response (GSR) are major predictors of memorability for photographers. In many cases, memorability is measured after showing the content once. On the other hand, we assume that the participants have watched the advertisements multiple times somewhere else before answering the questionnaires.

Compared to the memorability of images, the memorability of videos is still an emerging field of research [18, 30, 43, 75]. Han et al. [30] proposed a computational model that correlates low-level audiovisual features with brain activity decoded using fMRI. Shekhar et al. [75] discussed efficient features for video memorability and applied them to video summarization.

On the other hand, the memorability of advertisements is affected in a complex way by their visual content, audio, narration, GRP, and so on.

3 TV advertisement dataset

3.1 Data

As is well known, a large amount of data is required to properly train DNN models, but as far as we know, no such dataset of TV commercials is available, mainly because of copyright issues. We constructed a large-scale TV advertisement dataset that contains 14,490 video clips actually broadcast on TV in Japan from January 2006 to April 2016. All the advertisements were broadcast repeatedly; those broadcast only a few times (e.g., special advertisements designed specifically for a big sports event) are not included. The quality of the archived videos is also carefully controlled: the advertisements were digitized using the same series of analog-digital converters of the same brand for nine years in order to keep the encoding quality deviation as small as possible. After the digital broadcasting service started, we used digital-analog converters to encode the videos. The resolution of the videos is 320 × 240. The number of videos per clip length is summarized in Fig. 1, and examples of videos in our dataset are shown in Fig. 2. Most of the videos are 15 seconds long, which is the standard length in Japan.

Fig. 1
figure 1

The number of advertisement video clips in our database. Most TV advertisement videos are 15 seconds long, which is standard in the TV advertisement industry in Japan

Fig. 2
figure 2

Sample images of TV advertisements in our dataset

The dataset also includes information on the advertiser, GRP, characters/animals/objects in the video, speech script, service category (e.g., food, car, etc.), and broadcasting pattern. There are eight types of broadcasting patterns: a typical pattern is broadcasting all day for the whole week, and another is broadcasting only in the evening on weekdays and all day on weekends. Advertisements for compact or family cars usually take the first strategy, and those for sports or luxury cars take the latter, depending on the target consumers. There are 16 business categories: material, foods, medicine, cosmetics, fashion, publishing, industrial machines, office supplies, electronics, transportation, house supplies, houses, shops, finance, service/leisure, and others. All of this information is treated as metadata, and the twenty-three kinds of metadata are listed in Table 1.

Table 1 Available metadata

The popularity of the cast in an advertisement, such as actors, actresses, and singers, is also an important factor that can influence the recognition rate. Therefore, the recognition rate and popularity rate of famous talents were also investigated by asking a different set of 565 people twice a year.

We generated the cast features as follows. We first extracted the recognition rate and the favorability of each famous talent appearing in the advertisement. The number of famous talents in the advertisement is then used as an input feature, together with their average popularity rate. As a result, a two-dimensional feature vector represents the cast of each advertisement.
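A minimal sketch of assembling this two-dimensional cast feature is shown below; the record structure and the `popularity` field name are hypothetical, introduced only for illustration.

```python
from statistics import mean

def build_cast_feature(cast_list):
    """Two-dimensional cast feature: [number of famous talents, average popularity rate].

    `cast_list` is assumed to hold one record per famous talent appearing in the
    advertisement, each with a surveyed popularity rate in [0, 1] (hypothetical field name).
    """
    num_talents = len(cast_list)
    avg_popularity = mean(c["popularity"] for c in cast_list) if cast_list else 0.0
    return [float(num_talents), avg_popularity]

# Example: two famous talents with popularity rates of 0.62 and 0.48
print(build_cast_feature([{"popularity": 0.62}, {"popularity": 0.48}]))  # [2.0, 0.55]
```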

Advertisements contain visual and audio information, and text information exists both in the frames and in the audio. Although visual and audio feature extractors can capture semantic information, they cannot capture detailed descriptions that go beyond color combinations or audio rhythms, and text-related information also plays an important role in TV advertisements. Therefore, we treat text as another important element of the dataset. Because text appears in both frames and audio, we collected two kinds of text information according to the source: texts in frames and narration data.

To summarize, our dataset contains 14,490 Japanese TV advertisements that were broadcast repeatedly on TV at some time between 2006 and 2016. Each advertisement in the dataset includes not only the video and audio but also metadata, the popularity of the cast, and other related text information. The impressions of a TV advertisement arise from different perceptions and should be modeled by combining different data modalities, and we have tried our best to make the dataset informative.

3.2 Annotation

Each video was evaluated by a different set of around 620 participants, who are divided into 11 categories according to their age and gender. Eight of the eleven categories are males and females aged 13-19, 20-34, 35-49, and 50-59 (males and females form separate categories). In addition, we use three global categories to represent the interests of broader groups: all males, all females, and all ages/genders.

The questions are listed in Table 2. Participants answered each question on a scale whose number of points depends on the question. The survey measures advertising recognition (question ID 1) on a three-point scale (3 = have seen it, 2 = probably have seen it, 1 = have not seen it), purchase intention (question ID 2) on a five-point scale (5 = want to buy very much, 4 = want to buy somewhat, 3 = neutral, 2 = do not really want to buy, 1 = do not want to buy at all), and advertising likability (question IDs 3 and 4) on a five-point scale (5 = like very much, 4 = like somewhat, 3 = neutral, 2 = dislike somewhat, 1 = dislike very much). The other advertising perceptions were surveyed with multiple-choice questions.

Table 2 Questions in the evaluation form

We would like to note that this dataset was collected over a period of more than 10 years, and therefore the social conditions at the time, e.g., tie-ups with famous music and the popularity of actors/actresses, are also reflected in the dataset.

Our dataset is four times larger, more quality-controlled, and contains much richer labels than the dataset presented in [38]. The other differences are summarized in Table 3. The data are available to researchers in academia under a proper contract with the data provider.

Table 3 Comparison of the dataset

4 Prediction of advertisement effects using attention-based multimodal framework: A baseline

Because our dataset has 23 kinds of annotations on impressional/emotional effects, 23 individual prediction tasks can be defined in total. We consulted professionals in the Japanese TV advertisement industry, and the following four tasks are considered the most valuable: recognition rate, willingness rate, interestingness rate, and favorableness rate.

Here we provide an attention-based multimodal framework that combines the information of a TV advertisement for predicting these targets, so that the model can focus on the important parts for better inference.

4.1 Components in the Attention-based Multimodal framework

The network structure of our baseline is shown in Fig. 3. Six kinds of multimodal features are used in our model: visual, audio, metadata, cast data, and two kinds of text data, i.e., texts in frames and narration. These parts differ from one another because of their different data formats and modalities, so we apply a network that has been validated as effective for each data modality. By taking advantage of the attention mechanism, our framework can efficiently make use of the different data, and the attention weights indicate which part is more important, which can also inspire the TV advertising industry to produce more appealing advertisements.

Fig. 3
figure 3

Our DNN structure for advertisement effect prediction. Six kinds of input data are taken into consideration for final prediction

For visual features, key frames are extracted every second and fed to a ResNet-50 [33]. The ResNet is pre-trained on ImageNet [71] and no fine-tuning is conducted, because this is sufficient to extract the necessary semantic features from images; ResNet is simply treated as an image feature extractor. We also considered taking the chronological order of the frames into account using networks such as recurrent neural networks (RNNs). However, TV advertisements change scenes within seconds, which makes them more complex. We also tried a three-dimensional convolutional neural network (3D-CNN) [42, 69, 83], but confirmed in preliminary experiments that this approach did not work as well as our architecture.
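A minimal sketch of this frozen feature-extraction step, assuming a recent torchvision with ImageNet weights, is as follows.

```python
import torch
import torchvision

# Frozen ResNet-50 used purely as a per-frame feature extractor (no fine-tuning).
# Replacing the classification head with Identity yields a 2048-d feature per key frame.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()
resnet.eval()

frames = torch.randn(15, 3, 224, 224)  # 15 key frames, one per second (dummy tensor)
with torch.no_grad():
    frame_features = resnet(frames)    # shape: (15, 2048)
```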

For the audio features, we employed SoundNet [4] because the network is designed for scene classification using sound rather than human voice recognition. Although there are many other DNN-based and non-DNN-based approaches to sound [12, 24, 80, 88], most of them focus on the human voice and did not work well for our purpose, because advertisement videos usually contain not only human voices but also music and sound effects. SoundNet was first pre-trained on UrbanSound8K [73] and then fine-tuned on our dataset.

The metadata and cast data have no spatial or temporal structure like video and audio, so they are directly fed to a multi-layer perceptron (MLP) with ReLU activation and dropout.

Text information consists of sequences of sentences. The traditional processing procedure first splits sentences into words and applies word embedding to construct word vectors; then long short-term memory (LSTM) or recurrent neural networks (RNNs) are used to encode the word vectors into a sentence vector. Recently, the transformer [86] has drawn great attention in natural language processing, and Bidirectional Encoder Representations from Transformers (BERT) [20] has become one of the most popular networks for word embedding. We make use of a pretrained Japanese BERT model (Footnote 1) to embed sentences and use the [CLS] token as the sentence embedding for further processing.
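A minimal sketch of this sentence-embedding step using the Hugging Face transformers library is shown below; the specific Japanese checkpoint name is an assumption (the actual model is given in the paper's footnote), and such checkpoints may require additional Japanese tokenizer dependencies.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical Japanese BERT checkpoint; the paper's footnote specifies the actual model.
MODEL_NAME = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME)
bert.eval()

text = "新発売のコーヒーを今すぐお試しください"  # example narration text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = bert(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token embedding, shape (1, 768)
```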

4.2 Two-step attention

The most important technical contribution in this paper is the two-step attention mechanism, represented as α and β in Fig. 3: one for visual feature importance and the other for multimodal feature importance. Attention-α assigns weights to the key frames to conduct weighted-average pooling:

$$ \mathbf{F}_{\text{frame}} = \sum_{i=1}^{15} \alpha_{i} \mathbf{f}_{i}, $$
(1)

where Fframe is the visual feature of the input video and fi is the feature extracted by ResNet for each frame. Fframe is calculated as a (weighted) sum of the key-frame features rather than by concatenating them, as is done in the literature [46, 54]. α is the attention vector indicating the importance of each frame, i.e., αi denotes the attention weight of the ith frame. α is calculated by the following equation:

$$ \alpha = \sigma_{2}\!\left(\mathbf{W}_{2}\, \sigma_{1}\!\left(\mathbf{W}_{1} \mathbf{f} + b_{1}\right) + b_{2}\right), $$
(2)

where σ1 and σ2 are the sigmoid and softmax functions, respectively, and W1 and W2 are the weight matrices of the neural network. f is the concatenation of [f1,...,f15], and b1 and b2 are the biases of the linear layers.
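For concreteness, a minimal PyTorch sketch of attention-α under (1) and (2) is given below; the hidden width of 256 is an arbitrary assumption, and attention-β over the six modality features can be built in the same way.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Sketch of attention-alpha: a two-layer scorer with a sigmoid in between and a
    softmax at the end, applied to the concatenated frame features (Eq. 2), followed by
    weighted-average pooling over the frames (Eq. 1). Hidden width 256 is an assumption."""

    def __init__(self, num_frames=15, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(num_frames * feat_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_frames)

    def forward(self, frame_feats):           # frame_feats: (batch, 15, feat_dim)
        f = frame_feats.flatten(start_dim=1)  # concatenate [f_1, ..., f_15]
        alpha = torch.softmax(self.fc2(torch.sigmoid(self.fc1(f))), dim=-1)
        f_frame = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # weighted-average pooling
        return f_frame, alpha
```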

The other attention, attention-β, is used when combining the different multimodal features:

$$ \mathbf{F}_{\text{multimodal}} = \upbeta_{1} \mathbf{F}_{\text{visual}} + \upbeta_{2} \mathbf{F}_{\text{audio}} + \upbeta_{3} \mathbf{F}_{\text{meta}} + \upbeta_{4} \mathbf{F}_{\text{cast}} + \upbeta_{5} \mathbf{F}_{\text{text}} + \upbeta_{6} \mathbf{F}_{\text{narration}}, $$
(3)

where Fvisual, Faudio, Fmeta, Fcast, Ftext, and Fnarration are the deep features for visual, audio, metadata, cast data, text in frames, and narration data, respectively, and Fmultimodal is the resulting multimodal feature of the input video, which is fed to the fully connected (FC) layers for advertisement effect prediction. β is the attention vector indicating the importance of each kind of feature and is calculated in the same manner as α:

$$ \upbeta = \sigma_{2}\!\left(\mathbf{W}_{2}\, \sigma_{1}\!\left(\mathbf{W}_{1} \mathbf{F}^{T} + b_{1}\right) + b_{2}\right), $$
(4)

where W1 and W2, b1 and b2 are the weights and biases of the linear layers. Note that these come from a different linear layer than those in (2); we do not introduce additional symbols to distinguish them. F is obtained by

$$ \mathbf{F} = \operatorname{concat}\left(\mathbf{F}_{\text{visual}}, \mathbf{F}_{\text{audio}}, \mathbf{F}_{\text{meta}}, \mathbf{F}_{\text{cast}}, \mathbf{F}_{\text{text}}, \mathbf{F}_{\text{narration}}\right). $$
(5)

The four kinds of impressional and emotional effects listed in Section 1 (IDs 1-4 in Table 2), which are represented as percentages, are trained jointly, and all the errors are back-propagated through these networks. Note that ResNet-50 and BERT are used only as feature extractors and their parameters are not changed during training. We also conducted experiments in which each task is trained individually, meaning that one model is trained per target, requiring Ntarget models for Ntarget targets. In this way, each model focuses on a single target, so for a targeted emotional effect, individual models can usually achieve better performance than a single model that predicts all effects.
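As a sketch of the joint training described above, the four rates can be regressed with a shared head and a single MSE loss; the fused feature dimension of 512 and the hidden width of 128 are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

class EffectHead(nn.Module):
    """Regression head mapping the fused multimodal feature to the four targets (R, W, I, F)."""
    def __init__(self, in_dim=512, num_targets=4):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, num_targets))

    def forward(self, fused):
        return self.head(fused)

criterion = nn.MSELoss()
head = EffectHead()
fused = torch.randn(16, 512)            # fused multimodal features (dummy batch of 16)
targets = torch.rand(16, 4)             # ground-truth R/W/I/F rates
loss = criterion(head(fused), targets)  # errors for all four targets are back-propagated jointly
loss.backward()
# For the individual-model setting, num_targets=1 and four such models are trained separately.
```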

4.3 Detailed settings

For a better understanding of the network architecture, we list the detailed information for each data part in Table 4, which corresponds to the parts in Fig. 3.

Table 4 Details of each part in our solution. Texts in bold indicate that pre-trained models are used for feature extraction without parameter tuning. Texts in frames and narrations are under the same processing procedure, and we use Texts here to show the corresponding network architectures

The different parts handle different data modalities, so their input dimensions vary. For example, the visual input consists of 15 frames of shape 224 × 224 × 3, where 224 is the width and height and 3 represents the RGB channels; the audio is converted into a raw waveform and then sampled to a fixed-dimensional (i.e., 661,500) vector using a global pooling strategy [85], as in the original SoundNet [4]; the dimensions of the metadata and cast data correspond to their construction in our dataset; and for texts in frames and narrations, sentences are split into tokens and converted to token IDs, where the vocabulary size is 32,000 and the maximum sequence length is 512.

Many pre-trained models have proven effective for feature representation, so we directly use pre-trained models for pre-processing. ResNet-50 is used for frame feature extraction and BERT-base for sentence embedding, with their parameters fixed and no fine-tuning. The pre-trained SoundNet (5-layer) model weights are also used for faster convergence. The output dimensions of these models differ, so additional FC-BN-ReLU layers are used to project them to the same shape.
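For illustration (the common dimension of 512 is an assumption, not a value from the paper), such a projection block can be written as follows.

```python
import torch.nn as nn

def adapter(in_dim: int, common_dim: int = 512) -> nn.Sequential:
    """FC-BN-ReLU block that projects a modality-specific feature (e.g. a 2048-d ResNet-50
    output or a 768-d BERT-base [CLS] vector) to a common dimension; 512 is an assumption."""
    return nn.Sequential(nn.Linear(in_dim, common_dim),
                         nn.BatchNorm1d(common_dim),
                         nn.ReLU())

visual_adapter = adapter(2048)   # for the ResNet-50 frame feature
text_adapter = adapter(768)      # for the BERT-base [CLS] embedding
```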

In the two-step attention modules, attention weights are generated using a sigmoid function for each branch. After aggregating all features, a projection head is used to predict our targets.

During training, the batch size is set to 16. The initial learning rate is 0.1 and is multiplied by a decay factor every 5 epochs so that it reaches 0.01 in the last epoch. The model is trained for 150 epochs. Unless otherwise mentioned, the model is trained to predict the recognition rate, willingness rate, interestingness rate, and favorableness rate at the same time. When training individual models, only one prediction output is activated.
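As an illustration only (the paper does not state the exact decay factor), the schedule above can be realized with a PyTorch StepLR whose factor is derived from the stated endpoints.

```python
import torch

model = torch.nn.Linear(512, 4)  # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 150 epochs with a decay every 5 epochs gives 29 decay steps before the last epoch
# (at epochs 5, 10, ..., 145); a factor of (0.01 / 0.1) ** (1 / 29) ≈ 0.9237 brings the
# learning rate from 0.1 down to 0.01 for the last epoch.
gamma = (0.01 / 0.1) ** (1 / 29)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=gamma)

for epoch in range(150):
    # ... one training epoch with batch size 16 ...
    scheduler.step()
```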

5 Experimental results

In the experiments, we used the 11,373 video clips in our dataset whose length was 15 seconds, out of the 14,490 videos, since 15-second clips are dominant. We eliminated two kinds of advertisements: informative advertisements that were not designed to sell products, and those whose GRP was less than 10 (i.e., broadcast only a few times). The metadata employed in our experiments were GRP, business category (i.e., 14 categories), series type (i.e., 4 types), and broadcasting pattern (i.e., 7 time slots), forming 26 dimensions of input data in total. That is, only metadata available before actually broadcasting were used in order to facilitate before-broadcast prediction. Images were sampled every second, so 15 key frames were extracted from each video clip and resized to 224 × 224. The numbers of video clips used for training, validation, and testing were 9,373, 1,000, and 1,000, respectively.
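For clarity, the 26 input dimensions decompose as 1 (GRP) + 14 (business category) + 4 (series type) + 7 (broadcasting time slots). A minimal sketch of this encoding is shown below; treating the time slots as a multi-hot vector is our assumption.

```python
import numpy as np

def encode_metadata(grp: float, category: int, series_type: int, time_slots: list) -> np.ndarray:
    """Assemble the 26-dimensional metadata vector: 1 (GRP) + 14 (business category, one-hot)
    + 4 (series type, one-hot) + 7 (broadcasting time slots, assumed multi-hot)."""
    cat_vec = np.eye(14)[category]        # one of 14 business categories
    series_vec = np.eye(4)[series_type]   # one of 4 series types
    slot_vec = np.zeros(7)
    slot_vec[time_slots] = 1.0            # time slots in which the ad is broadcast
    return np.concatenate([[grp], cat_vec, series_vec, slot_vec])

print(encode_metadata(350.0, category=2, series_type=0, time_slots=[1, 3, 5]).shape)  # (26,)
```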

The relationships between the predicted and actual recognition (R), willingness (W), interestingness (I), and favorableness (F) rates are shown in Fig. 4. As mentioned in Section 3, we calculated the percentage of participants who answered 1 (strongly yes) or 2 (weakly yes) to the questionnaire and regarded it as the ground truth. Since the rates are predicted by regression, the agreement ratio among participants is not discussed in this paper. In addition, the correlation coefficients and mean squared errors (MSE) in terms of the percentage of the impressional/emotional effects are summarized in Table 5. We can observe that GRP is not a good measure for predicting the impressional/emotional effects: even the correlation coefficient for the recognition rate, which one would expect to be correlated with GRP, is only 0.35, and the other rates show only a negligible or very weak correlation with GRP. We therefore have to conclude that there is little correlation between GRP and our target impressions.
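For reference, the correlation coefficient and MSE reported in Table 5 can be computed as follows; this is a minimal sketch with dummy data standing in for the actual predictions.

```python
import numpy as np

def evaluate(predicted: np.ndarray, actual: np.ndarray):
    """Pearson correlation coefficient and MSE between predicted and actual rates
    (both expressed as percentages), as reported in Table 5."""
    r = np.corrcoef(predicted, actual)[0, 1]
    mse = np.mean((predicted - actual) ** 2)
    return r, mse

# Dummy example with random values standing in for 1,000 test-set recognition rates
rng = np.random.default_rng(0)
actual = rng.uniform(0, 100, size=1000)
predicted = actual + rng.normal(0, 10, size=1000)
print(evaluate(predicted, actual))
```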

Fig. 4
figure 4

Predicted and actual effects for our proposed model

Table 5 Correlation coefficients and MSE of each approach on the test dataset. R, W, I, F represent rates for recognition, willingness, interestingness, and favorableness, respectively. For the individual model setting, one number represents the result from the corresponding model, and totally four models will be trained for R, W, I, F, respectively

The image features show the greatest predictive performance among the three individual features. We do not conduct experiments using cast data, texts in frames, or narrations individually, because it is unreasonable to analyze the impact of a TV advertisement from cast data alone, and the text data always come together with the visual/audio content. Although each modality seems of limited use on its own, the prediction accuracy is greatly improved when the modalities are combined with the image features. Because we are dealing with videos, there are many existing video-understanding methods that use 3D convolutional networks (3D CNNs) to extract motion features [13, 32, 81,82,83,84, 93]. These methods are usually trained and tested on action recognition datasets [45, 51, 79]. We also applied C3D [83] to test its effectiveness on our TV advertisement dataset; this toy experiment used only visual, audio, and metadata. The results are shown in Table 6. We find that better results are obtained with image-level features for the visual embeddings. This may be caused by the complexity of TV advertisements: compared to action recognition datasets, a TV advertisement contains many scenes, complex actions, and human-object interactions, and the impression on the audience may also rely on color, atmosphere, and so on.

Table 6 Comparison between ResNet-50 (2D CNN) and C3D (3D CNN). R, W, I, F represent rates for recognition, willingness, interestingness, and favorableness, respectively

Looking back at Table 5, the best predictive performance is obtained when all features are used. Our proposed model achieves a remarkable improvement over a naive model that uses only a single modality. For the single model that predicts all four targets, the best correlation coefficients are 0.75, 0.85, 0.73, and 0.73, respectively. It is also interesting to see that predicting the interestingness rate is exceptionally difficult compared to predicting the recognition, willingness, and favorableness rates; this might be because interestingness is judged by a different layer of emotions. With our model, the prediction of willingness to buy and favorableness becomes as accurate as that of recognition, whereas GRP has little correlation with those factors. We also trained our models for each target task individually, reaching even higher performance (the last row in Table 5).

To better understand our proposed baseline prediction model, the attention weights used in our network are visualized in Figs. 5 and 6. Figure 5 shows the attention values (α) assigned to each frame for 10 randomly sampled videos. It can be observed that the attention values change dynamically depending on the content. For the 4th, 5th, and 10th samples, the attention weights on the first frames are very high; for the 1st, 5th, and 9th samples, the attention weights on the last two frames are clearly higher than average. These frames are usually where the product or the company logo appears. Other highly attended parts, such as frame 10 of the 2nd sample and frame 4 of the 8th sample, are closely related to the specific content. We can therefore conclude that the first and last few parts of a TV advertisement are comparatively important. We also confirmed that our model pays attention only to a representative frame when a similar scene lasts for several seconds.

Fig. 5
figure 5

Attention values for video frames. The importance of different frames varies from one to another, and our model applies adaptive weights to each frame for better performance

Fig. 6
figure 6

Attention values for different data modalities. The most important features are visual and audio, which matches common sense because TV advertisements are videos

The attention values for each data modality are shown in Fig. 6. As the figure shows, visual and audio play very important roles, because attractive visuals or sound can capture the audience's attention at the very beginning of an advertisement. Texts in frames and narration data exist in both the visual and audio modalities and cannot be ignored. Compared to metadata, cast data is more important, which makes it reasonable to pay high prices for superstars in advertising. We discussed these results with some professional advertisement creators. They told us that they usually pay more attention to sound and music than to visual content, whereas movie creators have the opposite tendency. Our findings therefore coincide with the creators' common sense.

6 Applications

6.1 Online A/B testing

Advertisers usually create two or three versions of each advertisement and conduct A/B testing to decide which one to broadcast. Many people are recruited to watch the different versions and report their responses, and the final version is chosen according to those responses. However, conducting such surveys is usually costly and time-consuming, with a risk of information leakage. Our system can predict the recognition rate, willingness to buy, etc., giving advertisers a rough idea of how much impact each advertisement would have. An example is shown in Fig. 7, where two pet food advertisements are compared. Predicting which advertisement will be remembered better is a very difficult task, but our model successfully predicts that the right one is remembered better than the left one.

Fig. 7
figure 7

Example of A/B testing. Given two candidate TV advertisements, our model can predict the impression score of each, which helps to choose the better one for release

Creators would also benefit from our system by obtaining objective opinions when they have multiple ideas to compare. Of course, this does not necessarily lead to a single fully optimized advertisement, because creators still have choices to make, and naturally the prediction accuracy cannot be 100%.

6.2 Scene factorization

The questionnaires asked about the entire video, not about each key frame. By introducing the technique of [8], we can estimate how much each scene or frame contributes to the impressional/emotional effect. The flow is illustrated in Fig. 8. By blackening one of the key frames and comparing the predicted impressional/emotional effects, the effect of a single key frame can be estimated. We define the difference between the value predicted for the original video and the value predicted without a particular key frame as the importance score of that key frame. The calculated importance score of each key frame for the recognition rate is shown in Fig. 9.
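A minimal sketch of this procedure is given below; the model signature and the use of an all-zero (black) frame are placeholders consistent with the description above, not the authors' exact implementation.

```python
import torch

def keyframe_importance(model, frames, other_inputs, target_idx=0):
    """Scene factorization sketch: blacken one key frame at a time and take the drop in the
    predicted rate as that frame's importance score. `model` and `other_inputs` are
    placeholders for the trained multimodal network and its non-visual inputs."""
    with torch.no_grad():
        base = model(frames, *other_inputs)[target_idx].item()
        scores = []
        for i in range(frames.shape[0]):          # iterate over the 15 key frames
            masked = frames.clone()
            masked[i] = 0.0                       # replace frame i with a black image
            pred = model(masked, *other_inputs)[target_idx].item()
            scores.append(base - pred)            # positive score: the frame helps the rate
    return scores
```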

Fig. 8
figure 8

Approach for scene factorization. This is only conducted based on frame-level regression

Fig. 9
figure 9

Importance score for each key frame in terms of recognition rate. (a) advertisement of a canned coffee (predicted: 0.45, actual: 0.41) and (b) that of beer (predicted: 0.41, actual: 0.45)

Strictly speaking, our model estimates the impressional/emotional effects for the case where a certain frame is replaced with a black scene, not for the case where the frame does not exist. Therefore, the absolute score is not reliable, but the relative scores and their signs (negative or positive) can give us some insights. For instance, the presence of actors or actresses is said to be important for an advertisement to be remembered [29], and a similar tendency can be observed in Fig. 9: the scenes where the cast appears yield higher scores for recognition. It is interesting to see that our DNN model produces insights similar to the conventional rules of thumb shared among professional creators. In addition, the company logo or the zoom shot of the product in the last scene has a relatively large impact. Our system also detected scenes that might contribute negatively to the recognition rate, as shown in Fig. 9. Note that the scores are relative importance values and their sum does not match the predicted final score.

7 Limitations and future work

The advertisements were all broadcast on TV, and the participants are assumed to have watched them several times before the survey; this is why we can ask them whether they have watched (or remember) the ads. In other words, our dataset is not designed for before/after comparison. Investigating such short-term effects is left for future work.

Because our model is a learning-based approach, it may not properly evaluate advertisements in completely new styles; an algorithm that can evaluate novelty or creativity [70] might be needed.

We would also like to work on the optimization of advertisements and the creation of new advertisements using our proposed model. A TV advertisement contains many scenes, and in the current solution a scene may not be sampled if it lasts less than one second; scene changes may also need to be considered.

8 Conclusions

In this paper, we presented a new dataset of TV advertisements with 23 kinds of annotations on impressional/emotional effects, which we believe is the largest and richest dataset on the impressional/emotional effects of advertisements. Our experiments using 11,373 15-second video clips showed that by combining visual data, audio data, metadata, cast data, and texts, we can predict impressional/emotional effects such as the willingness to buy the advertised product with a correlation coefficient of 0.84. We also showed some applications of our regression model.

We believe that our paper contributes to the multimedia and computer vision communities by providing a large impression-related video dataset and baseline results. Furthermore, our technology can give creators insightful information to assist them in their creative work.