Multi-modal humor segment prediction in video

Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is an interesting but challenging task for an intelligent system. Previous methods predict humor at the sentence level given some text (e.g., a speech transcript), sometimes together with other modalities, such as video and audio. Such methods ignore humor caused by the visual modality by design, since their prediction is made for a sentence. In this work, we first give new annotations to humor based on a sitcom by setting up temporal segments of ground-truth humor derived from the laughter track. Then, we propose a method to find these temporal segments of humor. We adopt a sliding-window approach, where the visual modality is described by pose and facial features and the linguistic modality is given as the subtitles in each sliding window. We use long short-term memory networks to encode the temporal dependency in poses and facial features and pre-trained BERT to handle the subtitles. Experimental results show that our method improves the performance of humor prediction.


Introduction
Humor provokes laughter and provides amusement. It is an important medium to demonstrate our emotions and has become an essential tool in our daily life [1]. Humor can be used to draw people's attention and relieve stressful or embarrassing situations. By properly using humor, communication between people will become easier and smoother.
Understanding humor is also important for human-machine communications (e.g., robots [2,3] and virtual agents [4,5]). A machine that understands humor may interact with us in a more comprehensive manner, ultimately taking our emotions into account in its decision-making to respond to our various needs. Meanwhile, understanding humor is a challenging task for a machine in both the computer vision and natural language processing communities because it requires deeper knowledge of signals from people in the visual (e.g., poses, gestures, and appearances), vocal (e.g., tones), and linguistic (e.g., puns) modalities, as well as their combinations [6], any of which can induce humor.
In recent years, several methods have been proposed to predict humor using either a single modality or multiple modalities, often accompanied by a dedicated dataset [7-12]. Single-modal humor prediction mainly uses the linguistic modality [13-15], while multi-modal humor prediction combines information from different modalities [6, 16-18]. The ground-truth labels of these methods are usually associated with blocks of text, like sentences and dialogues, while signals from other modalities are often treated as supplementary. In the real world, however, humor is not necessarily tied to text; it can be invoked even in silence by funny actions and facial expressions, which are often ignored in tasks driven by the linguistic modality. To cover broader variations of humor, we need a different problem formulation for humor prediction.
In this work, we present a new humor prediction task. Unlike previous tasks that provide humor-related annotations based on a single sentence or a set of dialogues [16,17], our proposed task provides temporal segments that are associated with humor as ground-truth labels, as shown in Fig. 1. We also propose a new method for humor prediction, which makes predictions with a sliding window. The method uses multimodal data within each window, i.e., video frames and subtitles. Our method aggregates subtitles as well as pose and facial features from video frames, which are then fed into our model. We convert these sliding-window predictions to temporal segments comparable with the ground-truth segments.
The main contributions of our work are three-fold. 1. We give a new definition of humor prediction by setting up temporal segments associated with humor as ground-truth labels. Such a definition covers a wider variety of humorous moments, even those without associated text (or utterances). 2. We propose a method to find these temporal segments, which can handle humor invoked solely by the visual modality. Our method uses the visual modality through poses and facial features in video frames as well as the linguistic modality through subtitles as input. Prediction is done over sliding windows, which makes the output comparable with our ground truth. 3. We compare different combinations of input features to show which combination is best for our humor prediction task.
The rest of this work is arranged as follows: Sect. 2 reviews previous work related to humor prediction; Sect. 3 introduces our task and dataset; Sect. 4 presents our method to predict humor; Sect. 5 shows the experimental results; and Sect. 6 concludes the paper.

Related work
Methods for humor prediction usually take features obtained from text, images, and audio as inputs, giving a prediction of whether the input is associated with humor or not as an output. Single-modal humor prediction methods mainly use the linguistic modality. For example, Weller et al. [13] proposes a task that takes the text from Reddit pages as input and judges whether it is humorous or not based on the ratings. Fan et al. [14] uses an internal and external attention neural network for short text humor detection. Czapla et al. [15] applies a pre-trained language model to predict humor in Spanish tweets. All these methods make their prediction based only on text input. However, in the real world, humor can be invoked by other modalities. A multi-modal approach is necessary to broaden the application of humor prediction. Multi-modal humor prediction methods combine information from different modalities together. For example, Hasan et al. [6] uses subtitle, visual, and audio features in TED talk videos. Patro et al. [17] builds a dataset based on a famous sitcom The Big Bang Theory and gives several baselines to predict humor based on both visual and language modalities. Kayatani et al. [16] also uses the same TV drama series as their testbed and presents a model to predict whether an utterance of a character causes laughter based on subtitles as well as facial features and the identity of the character. Yang et al. [18] obtains humor labels in videos based on user comments together with visual and audio features. The ground-truth humor labels in these methods are mainly associated with texts and a prediction is made for a sentence. Our ground-truth annotation, in contrast, is given as a segment specified by start and end time stamps, which allows covering humor invoked by various modalities.

Dataset and task
In this work, we give new annotations to humor labels by setting up temporal segments of humor based on the datasets of Patro et al. [17] and Kayatani et al. [16], which use the famous sitcom The Big Bang Theory. The videos of this sitcom TV series contain canned laughter (i.e., a laughter track). Though such canned laughter is not equivalent to humor in general, we believe that laughter is added if and only if humor is present in a sitcom. This means that, at least in such a designed circumstance, laughter can be a good proxy for the presence of humor and gives a relatively objective criterion to identify where humor happens. Hence, we use the canned laughter to make ground-truth humor segments automatically (i.e., our ground-truth humor segment annotations are derived from the laughter track).
To do this, we follow Kayatani et al. [16] and subtract the left and right channels of the audio track to cancel the characters' speech. Then we apply low-pass filtering to the subtracted signal, followed by the Hilbert transform, to obtain its wave envelope. This envelope generally gives larger values for canned laughter, jingles, music, and so on. Unlike [16], which annotates a humor label for each sentence, we want to make temporal segments of humor as shown in Fig. 2. We thus set a threshold on the wave envelope and define the samples above the threshold as humor to form raw temporal segments of humor. We then review all the extracted segments manually and remove non-laughter segments to finalize the humor segments (i.e., the fixed humor segments). Our dataset thus consists of video frames, subtitles, and humor segments with start and end time stamps.
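As a rough illustration of this pipeline, the sketch below subtracts the channels, low-pass filters the difference, takes the Hilbert envelope, and thresholds it to form raw segments; the filter order, cutoff frequency, and threshold are placeholders, not the values used for the dataset.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def raw_humor_segments(audio, sr, cutoff_hz=400.0, threshold=0.1):
    """Extract raw humor (laughter) segments from a stereo track.

    audio: float array of shape (num_samples, 2); sr: sampling rate in Hz.
    Returns a list of (start_sec, end_sec) tuples where the envelope is above the threshold.
    """
    # Subtract the left and right channels to cancel center-panned speech.
    diff = audio[:, 0] - audio[:, 1]

    # Low-pass filter, then take the magnitude of the analytic signal as the envelope.
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    filtered = filtfilt(b, a, diff)
    envelope = np.abs(hilbert(filtered))

    # Threshold the envelope and group consecutive above-threshold samples into segments.
    above = envelope > threshold
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / sr, i / sr))
            start = None
    if start is not None:
        segments.append((start / sr, len(above) / sr))
    return segments
```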
The statistics of the dataset are shown in Table 1. The number of humor segments is quite large: a single episode has almost 140 humor segments on average, and more than one-fourth of the total duration contains laughter. As for the linguistic modality, we call subtitles that end within a humor segment humor subtitles. Subtitles that start within a humor segment but end outside any humor segment are not counted as humor subtitles and are referred to as non-humor subtitles. The table shows that more than 44% of the subtitles are associated with one of the humor segments. Figure 3 shows the distributions of the top-20 words (counted over humor sentences) for humor and non-humor sentences, where stop words and characters' names are removed. Considering the difference in the numbers of humor and non-humor subtitles, these two distributions do not differ much.
Different from previous work that merely judges whether a sentence or a set of dialogues is humorous or not, our task requires localizing humor segments based on video frames and subtitles. Note that there can be humor segments caused solely by acoustic signals (e.g., making a funny noise that cannot be transcribed); however, our task does not use the audio tracks since they have canned laughter, which is used to obtain the ground-truth humor segments. Figure 4 is an overview of our method. We cast our humor segment prediction task to humor/non-humor prediction over sliding windows to model the dependency among them. To predict humor over sliding windows, we represent video frames by sequences of poses and faces of characters, which are handled by the pose flow and the face flow, respectively. The subtitles within each sliding window also go through the language flow. We use late fusion to summarize the prediction scores from different flows to obtain the per-window predictions. Then, per-window predictions are converted to temporal segments.

Finding humor segments in video
For the $i$-th window $w_i$, we aggregate the video frames $V_i = \{v_{ij} \mid j = 1, \dots, J\}$ and subtitles $S_i = \{s_{ik} \mid k = 1, \dots, K_i\}$ within it as input, where $J$ and $K_i$ are the numbers of frames and subtitles in $w_i$ ($K_i$ can vary across windows), respectively. Note that, as in Sect. 3, we include the subtitles that end inside the window, while we do not include those that start inside but end outside the window. We use a neural network-based model to make a humor/non-humor prediction $h_i$ for $w_i$.
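As a minimal sketch of this aggregation rule, the snippet below keeps exactly the subtitles whose end time falls inside a window; the `Subtitle` structure and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str

def subtitles_in_window(subtitles: List[Subtitle],
                        win_start: float, win_end: float) -> List[Subtitle]:
    """Keep only the subtitles that end inside the window [win_start, win_end);
    subtitles that start inside but end outside the window are excluded."""
    return [s for s in subtitles if win_start <= s.end < win_end]
```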
Humor is sometimes induced by funny poses and facial expressions. Previous work [19] found that non-verbal humor based on gestures, facial expressions, or whole-body movement makes a robot more human-like and more entertaining. Motivated by this finding, we use two flows in the visual modality to represent the poses and facial expressions of the characters in $w_i$, respectively. For the linguistic modality, previous work [16, 17] used BERT, a well-known Transformer language model, to represent subtitles and achieved good performance. Thus, we follow them and model the dependency among all subtitles $s_{ik}$ in $S_i$ with BERT.

Fig. 2 We use the wave envelope from the audio track and set up a threshold to extract raw humor segments (thresholding). We then review the whole episode manually and remove music and noise (non-laughter removal) to form the fixed humor segments, which are finally adopted as our humor segment annotations.

Pose flow
Some funny actions can make people laugh. Poses in the video frames can be seen as reflections of such actions, and their features can be crucial for humor prediction. We compute two kinds of pose features: 1) We use OpenPose to detect the 2-D coordinates of $M = 25$ body joints for each character in a video frame, together with a confidence score $c^P_m$ for each joint $m = 1, \dots, M$, and concatenate the coordinates into a single vector. 2) We convert the 2-D joint coordinates to 3-D coordinates with the 3-D baseline [21] pre-trained on the Human3.6M dataset [22], which maps the $M = 25$ joints to $M' = 17$ joints to fit the Human3.6M skeleton; we obtain a 51-D vector containing all joint coordinates in 3-D space. In either kind of pose feature, the entries in the vector for undetected joints are set to 0. The confidence score $c^P_m$ of joint $m$ reflects the visibility of the key point. We believe such confidence scores can roughly represent the importance of the corresponding person in the scene, since the main characters in a scene tend to be placed around the center of the frame in bigger sizes. We thus calculate the average confidence score $c^P = \frac{1}{M}\sum_{m=1}^{M} c^P_m$ for each person. Then, we rank the characters in the scene based on $c^P$ and select the top-3 characters for both 2-D and 3-D poses. Note that we still use the confidence scores obtained with OpenPose for the 3-D poses (i.e., the same $c^P$ is used for both 2-D and 3-D poses) because the 3-D poses are derived from the 2-D poses.
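A small sketch of the character-selection step, assuming the OpenPose output for one frame is an array of shape (num_people, M, 3) holding (x, y, confidence) per joint; the helper name is ours.

```python
import numpy as np

def select_top_characters(keypoints: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Rank detected people by their average joint confidence c^P and keep the
    top_k of them (the paper uses the top-3 characters for both 2-D and 3-D poses).

    keypoints: array of shape (num_people, M, 3) with (x, y, confidence) per joint.
    """
    avg_conf = keypoints[:, :, 2].mean(axis=1)    # c^P per person
    order = np.argsort(avg_conf)[::-1][:top_k]    # highest average confidence first
    return keypoints[order]
```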
Let $x^P$ denote the vector of pose features (either 2-D or 3-D) for a single character. We feed $x^P$ into FC layers and max-pool over the selected characters to obtain a 128-D pose vector $p_{ij}$ ($j = 1, \dots, J$, where $J$ denotes the number of video frames in the sliding window) for each frame. We concatenate the pose vectors of all frames and feed them into a long short-term memory (LSTM) layer. The hidden state corresponding to the last frame is fed into an FC layer to obtain score vector $e^P_i$ for the pose flow.
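A hedged PyTorch sketch of such a flow is given below; the 128-D per-frame vector follows the text, while the FC stack and the LSTM hidden size are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FlowNet(nn.Module):
    """One visual flow: per-character FC layers, max-pooling over characters,
    an LSTM over frames, and an FC head that outputs a 2-D score vector."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.fc_in = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames J, characters, feat_dim), e.g. feat_dim = 51 for 3-D poses
        h = self.fc_in(x)               # (batch, J, characters, 128)
        h = h.max(dim=2).values         # max-pool over characters -> (batch, J, 128)
        out, _ = self.lstm(h)           # (batch, J, hidden_dim)
        return self.fc_out(out[:, -1])  # score vector from the last frame's hidden state
```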

Face flow
Exaggerated facial expressions can also cause laughter. We model such facial expressions in a similar way to the pose flow. We adopt two kinds of facial features, landmark positions and action units (AUs) [23]. For landmark positions, we use a variant of OpenPose to detect facial landmarks in the video frames in $V_i$; for each person, we obtain a 3N-D vector containing the 2-D coordinates of the $N = 70$ face landmarks together with their confidence scores $c^F_n$ given by OpenPose. For AUs, we use OpenFace [24] to extract an $N'$-D vector of AUs from each character in a video frame along with an average confidence score $c^F$, where $N' = 35$. For landmark positions, we calculate the average confidence score $c^F = \frac{1}{N}\sum_{n=1}^{N} c^F_n$ of each person. For both types of features, we select the three characters with the largest $c^F$ scores. We feed their landmarks or AUs, $x^F$, into FC layers and max-pool over the selected characters to obtain a 128-D face vector $f_{ij}$ ($j = 1, \dots, J$, where $J$ denotes the number of video frames in the sliding window) for each frame. We then concatenate the face vectors of all frames and feed them into an LSTM layer. The hidden state corresponding to the last frame is fed into an FC layer to obtain score vector $e^F_i$ for the face flow.
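Under the same assumptions, the face flow can reuse the module sketched above for the pose flow, with the input dimensions stated in the text (3N = 210 for landmarks with confidences, N' = 35 for AUs):

```python
# Reusing the FlowNet sketch from the pose flow; dimensions follow the text.
landmark_flow = FlowNet(feat_dim=3 * 70)  # 70 landmarks x (x, y, confidence)
au_flow = FlowNet(feat_dim=35)            # 35-D action-unit vector
```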

Language flow
The subtitles in the video contain the transcript of what the characters say, which is the primary source of laughter. We use BERT [25], which has been widely applied to similar tasks with outstanding results [16, 17, 26-28], to encode all subtitles $s_{ik}$ in $S_i$ within the sliding window and obtain score vector $e^S_i$ for the language flow.
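A minimal sketch of such a language flow with the Hugging Face transformers library is shown below; pooling the [CLS] representation and the final FC layer are our assumptions about how the score vector $e^S_i$ could be produced, not a description of the authors' exact architecture.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LanguageFlow(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, subtitles):
        # Join all subtitles that end inside the sliding window into one sequence.
        text = " ".join(subtitles)
        inputs = self.tokenizer(text, return_tensors="pt",
                                truncation=True, max_length=512)
        outputs = self.bert(**inputs)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.fc(cls)                    # score vector e^S_i
```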

Prediction and training
Our model uses late fusion for the final prediction: the prediction scores of the flows are summed to obtain the final score vector $e_i = e^P_i + e^F_i + e^S_i$. This score vector contains per-window scores for humor and non-humor, i.e., $e_i = (e^h_i, e^n_i)$. When $e^h_i > e^n_i$, the final binary prediction is $h_i = 1$; otherwise, $h_i = 0$.
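Concretely, the late fusion and binary decision can be sketched as follows, assuming index 0 of each 2-D score vector is the humor score.

```python
import torch

def fuse_and_predict(e_pose: torch.Tensor,
                     e_face: torch.Tensor,
                     e_lang: torch.Tensor) -> torch.Tensor:
    """Late fusion: sum the per-flow score vectors e^P_i, e^F_i, e^S_i and
    predict humor (h_i = 1) if the humor score exceeds the non-humor score."""
    e = e_pose + e_face + e_lang       # e_i = (e^h_i, e^n_i), shape (batch, 2)
    return (e[:, 0] > e[:, 1]).long()  # assumes index 0 holds the humor score
```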
The per-window ground-truth label $y_i$ for $w_i$ can be derived from the ground-truth temporal segments in the dataset. Considering that laughter happens after a triggering part, we deem window $w_i$ to be associated with humor ($y_i = 1$) if the end time of $w_i$ falls within any ground-truth humor segment, and non-humor ($y_i = 0$) otherwise. Note that if $w_i$ contains a triggering part but ends before the laughter segment, it is still considered non-humor.
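This labeling rule can be written down directly; `segments` is a list of ground-truth (start, end) pairs in seconds.

```python
def window_label(win_end: float, segments) -> int:
    """y_i = 1 iff the end time of window w_i falls within a ground-truth
    humor segment, otherwise y_i = 0."""
    return int(any(start <= win_end <= end for start, end in segments))
```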

Fig. 4 Our model for multi-modal humor prediction. We cast this task to humor/non-humor prediction over sliding windows to model the dependency among different modalities and within the window. The model consists of the pose, face, and language flows to handle the visual and linguistic modalities.

Converting frame predictions to temporal segments
We call the prediction $h_i$ of window $w_i$ a frame-level prediction and convert these frame-level predictions to humor segments. Let $t$ be the shift amount, in seconds, between consecutive windows.
If $w_i$ is predicted as humor, we deem the following $t$ seconds a humor segment. Consecutive humor segments are merged into a single segment.
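A sketch of this conversion is given below; we read "the following $t$ seconds" as the $t$-second interval starting at the end of each positive window, which is our interpretation of the anchoring.

```python
def predictions_to_segments(predictions, window_ends, shift=2.0):
    """Convert per-window binary predictions h_i into humor segments.

    Each positive window contributes the `shift` seconds following its end time;
    consecutive or overlapping contributions are merged into one segment.
    """
    segments = []
    for h, end in zip(predictions, window_ends):
        if h != 1:
            continue
        start, stop = end, end + shift
        if segments and start <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], stop)  # merge with the previous segment
        else:
            segments.append([start, stop])
    return [tuple(seg) for seg in segments]
```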

Experimental results
We resample the video frames to 2 fps and split the dataset into training (80%), validation (10%), and test (10%) sets. We implement our model with Python 3.7 and PyTorch. The hyper-parameters in the experiments are shown in Table 2.
For the subtitle input, we use the bert-base-uncased model with 12 layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters for feature representation. The model makes no distinction between upper-case and lower-case tokens in the input sequence. Cross-entropy loss is applied for training. We set the length of our sliding window to 8 s and the shift to 2 s. These values are based on the average durations of humor segments (2.254 s) and non-humor segments (6.377 s) shown in Table 1. Since a humor segment is always preceded by a non-humor segment, and the non-humor segment serves as the context (setup and punchline) of the humor segment (laughter), a window of 8 s, which roughly corresponds to the sum of the average humor and non-humor segment durations, can cover most of such consecutive non-humor and humor segments. The shift of 2 s is sufficiently small for this window length.
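As a toy illustration of the training objective (not the actual training script), the snippet below applies cross-entropy between per-window score vectors and labels; the model, batch, and learning rate are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder for the fused model; in the real system this combines the three flows.
model = nn.Sequential(nn.Linear(10, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder

# Dummy mini-batch standing in for per-window features and labels y_i.
features = torch.randn(4, 10)
labels = torch.randint(0, 2, (4,))

scores = model(features)          # per-window score vectors e_i
loss = criterion(scores, labels)  # cross-entropy between e_i and y_i
optimizer.zero_grad()
loss.backward()
optimizer.step()
```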

Quantitative results
We show the performance of frame-level and segment-level predictions in Tables 3 and 4, respectively. We use accuracy (Acc), precision (Pre), recall (Rec), and F1 as metrics for frame-level predictions, and precision (Pre), recall (Rec), and F1 under different IoU thresholds for segment-level predictions. For comparison, we show the performance of two naive baselines (all-positive and all-negative labels), as well as the subtitle baseline, which solely uses the language flow for prediction, i.e., the prediction is based on $e^S_i$. The method of Kayatani et al. [16] can also serve as a baseline, although their task is to predict humor at the sentence level. To make their sentence-level predictions comparable to ours, we convert them into frame-level predictions by using the time stamp of each sentence: for a sentence predicted as humor, we label as humor all frames within the range from the end of the sentence to 2 s after it. We show the frame-level and segment-level performance in Tables 5 and 6, respectively, where "Char" denotes the character features in Kayatani et al.'s method.

Table 3 shows our ablation study results. The accuracy of all input modality combinations is above 60%. The recall and F1 scores are low when we use only pose and/or facial features. The language features improve the accuracy, recall, and F1 score. The best-performing modality under our metrics is the linguistic one. We think this is because most humor in the dataset is triggered by what the actors are saying, while visually induced humor segments are fewer than linguistically induced ones. We also find that the visual modality does contribute to humor prediction, as the precision scores with the visual modality alone are above 65%. The model that uses 3-D poses, face landmarks, and subtitles as input achieves better accuracy, while the model that uses face landmarks and subtitles as input achieves better recall and F1 scores than the other input combinations.
We report the segment-level evaluation in Table 4 by setting different IoU thresholds between the predicted segments and the ground-truth segments. We can see that the language features contribute much to improving the predictions: when the input contains subtitles, the recall and precision scores are much better than those with only visual features. Among the results, the combination of 3-D poses, face landmarks, and subtitles has the best recall at IoU = 0.25, 0.50, and 0.75. It also has the best F1 scores at IoU = 0.50 and 0.75, making it the best-performing input combination in the segment-level evaluation.
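To make the segment-level metrics concrete, the sketch below matches each predicted segment to an unmatched ground-truth segment by temporal IoU and counts it as correct when the best IoU exceeds the threshold; this greedy matching is our assumption about the evaluation protocol.

```python
def temporal_iou(a, b):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_precision_recall(pred, gt, iou_thr=0.5):
    """Precision/recall over lists of predicted and ground-truth segments."""
    matched_gt = set()
    tp = 0
    for p in pred:
        candidates = [(temporal_iou(p, g), j) for j, g in enumerate(gt)
                      if j not in matched_gt]
        if candidates:
            best_iou, best_j = max(candidates)
            if best_iou >= iou_thr:
                tp += 1
                matched_gt.add(best_j)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall
```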
We also compare our results with the method by Kayatani et al. [16]: our sliding-window-based method improves the frame-level accuracy and precision by 4.41% and 10.47%, respectively, compared with [16]. However, the F1 score is not as good as expected compared with [16] using BERT only or the combination of action units and BERT. For the segment-level predictions, when IoU is 0.75, the recall of our method using 3-D poses, face landmarks, and subtitles is 4.15% higher than that of the sentence-level method. This implies that our method aligns predicted humor segments with the ground truth better than the sentence-level method does.

Training with different lengths of sliding window
We evaluate the performance for different lengths of sliding windows to quantify their impact. We use 3-D poses and face landmarks as the pose and face flow inputs along with the language flow input, and show the frame-level and segment-level results for sliding-window lengths of 4 s, 8 s, 12 s, and 16 s in Tables 7 and 8, respectively. Note that we keep the shift of the sliding window at 2 s. From Tables 7 and 8, we can see that as the sliding window gets longer, the scores at both the frame and segment levels increase at first and then decrease. When the length of the sliding window is 8 s, the results at both levels have the highest scores. This means that an overly long sliding window may lead to a performance drop. We think the reason is that humor is often triggered within a relatively short period; when the sliding window is long, information unrelated to humor is fed into the network, hurting performance. Also, our dataset contains many language-based humor segments; when the sliding window is too short, the model only sees a single subtitle without any context that builds up the humor. Thus, the sliding window should be of an appropriate length for better predictions.

Training time
Our experiments are performed on a computer with an Intel Core i7-8700K CPU, 32 GB of RAM, and an NVIDIA Titan RTX GPU. We show the training time for different inputs in Table 9. From the table, when subtitles are not used, training only takes no more than 3 min; however, when we take subtitles as input, it costs more than 19 min. This is because BERT is a much larger network than the other parts of our method.

Qualitative results
We show some example predictions to qualitatively demonstrate the superiority of our method.

Conclusion and outlook
In this work, we proposed a multi-modal model to predict humor segments in videos. We first automatically annotated temporal segments of humor in a dataset and then presented a framework to predict humor in videos from multiple modalities. Our method used features from subtitles along with different kinds of pose and face features in the videos to make predictions. BERT was used to model the subtitles, and LSTM networks were set up to model the pose and facial expression features, respectively. Experimental results showed that our method outperformed the previous method by 4.41% in accuracy at the frame level and by 4.15% in recall at the segment level, and gave a better understanding of humor. However, this work still has some limitations. First, our model can hardly predict humor based on specific relationships between characters and other objects. Second, the data source is limited to sitcom videos, whose ground-truth laughter is easy to find. Third, our method only sums the results from multiple modalities for the final predictions, which leaves room for more sophisticated fusion. In future work, we want to model the relationship between objects and people in the videos. We also plan to broaden the source of the dataset, apply new fusion techniques, and migrate the method to other kinds of emotions to cover a wider range of applications.