FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground truth data in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames as they contain visual information indispensable for scene understanding, (b) audio as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses and (c) text automatically extracted with a speech-to-text model as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets the new state of the art for funny moment detection with multimodal cues on all datasets with and without using ground truth information.

We understand the world by using our senses, especially in multimedia areas. All signals can stimulate one's feelings and reactions. Funniness is universal and timeless: in 1900 BC Sumerians wrote the first joke and it is still funny nowadays. However, whereas humans can easily understand funny moments, even from different cultures and eras, machines cannot. Even though the number of interactions between humans and machines is growing fast, identifying funniness is still a brake on making these interactions spontaneous. Indeed, understanding funny moments is complex since they can be purely visual, purely auditory, or they can mix both cues: there is no recipe for the perfect joke.
Recently, there have been attempts to understand the nature of jokes, humour, and funny moments [2,99]. However, most of these works have relied solely on textual cues, with only a few incorporating videos [71,41]. The limitation of these approaches lies in their dependence on external transcripts in the form of manual subtitles, which are not naturally available with raw video data. In contrast, advancements in the field of speech-to-text have made it easier to extract accurate transcripts from raw audio waveforms that naturally accompany videos. This enables processing natural language to better understand the overall context. Furthermore, including audio as a modality in the funny moment detection pipeline is essential, as raw audio carries complementary cues, including tones, pauses, pitch, pronunciation, and background noises [105,9]. When speaking, the way people convey their message is as important as the actual content being delivered. Similarly, visual content plays a crucial role. For example, the same phrase spoken by the same person can elicit different emotional responses depending on the context (see Figure 1). Facial expressions, body gestures, and scene context contribute to a better understanding of the intended meaning, thereby influencing the perceived funniness.
Therefore, in this paper, we introduce FunnyNet-W, a multimodal model for predicting funny moments in videos. It comprises three encoders: (a) a visual encoder, which captures the global contextual information of a scene; (b) a textual encoder, which represents the overall understanding of a scene; and (c) an audio encoder, which captures voice and language effects. It also includes the Cross Attention Fusion (CAF) module, a new module that learns cross-modality correlations hierarchically so that features from different modalities can be combined to form a unified feature for prediction. FunnyNet-W is trained to embed all cross-attention features in the same space via self-supervised contrastive learning [11], in addition to classifying clips as funny or not funny. To obtain labeled data, we exploit the laughter that naturally exists in sitcom TV shows. We define as 'funny-moment' any n-second clip followed by laughter, and as 'not-funny' any clip not followed by laughter. To extract laughter, we propose an unsupervised labeling approach that clusters audio segments into laughter, music, voice and empty, based on their waveform difference¹. Moreover, we enrich the Friends dataset with laughter annotations.
Our extensive experimentation and analysis show that combining audio, visual and textual cues (that all come naturally with videos) is suitable for funny-moment detection. Moreover, we compare FunnyNet-W to the state of the art on five datasets including sitcoms (TBBT, MHD, MUStARD, and Friends) and TED talks (UR-Funny), and show that it outperforms all other methods for all metrics and input configurations. Note that even by using only automatically generated text from audio, FunnyNet-W outperforms all other methods that rely on ground-truth text in the form of subtitles. Furthermore, we examine the difference between our proposed FunnyNet-W and automatic chatbots based on Large Language Models (LLMs). Our findings show that without specific prompt engineering under the few-shot setting, chatbots cannot understand the funniness of texts. In contrast, FunnyNet-W significantly outperforms chatbots in prediction accuracy, highlighting the importance of specific multimodal training for this task. We also apply FunnyNet-W to data from other domains, i.e., movies, standup comedies, and audiobooks. For quantitative evaluation, we apply FunnyNet-W to a manually annotated sitcom without canned laughter. The results show that FunnyNet-W predicts funny moments without fine-tuning, revealing its flexibility for funny-moment detection in the wild.
Our contributions are summarized as follows: (1) We introduce FunnyNet-W, a model for funny moment detection that uses audio, visual, and textual modalities that come automatically with videos. FunnyNet-W combines features from the three modalities using the proposed CAF module relying on cross- and self-attention; (2) Extensive experiments and analysis highlight that FunnyNet-W successfully exploits audio, visual and textual cues; (3) FunnyNet-W achieves the new state of the art on five datasets. We also demonstrate its generalizability by comparing it to automatic LLM chatbots and its flexibility by showcasing in-the-wild applications. The code is available online on the project page: https://www.lix.polytechnique.fr/vista/projects/2024_ijcv_liu/.
A preliminary version of this work has been published in ACCV 2022 [53]. We significantly extend it in the following ways:
- Motivation. We propose FunnyNet-W, a multimodal model for funny moment detection in videos. FunnyNet-W follows the same motivation as FunnyNet, i.e., leveraging modalities that come with videos for free. Given that most funny moments are inherently associated with language, in addition to the audiovisual features of FunnyNet, FunnyNet-W leverages speech-to-text features. For this, we automatically generate text from speech by leveraging Automatic Speech Recognition methods, and then pair it with the rich representation capability of Large Language Models (LLMs), thus enabling a better understanding of the specificities of language. This is motivated thoroughly in the Introduction and in Section 3.4, and experimentally evaluated in the new Sections 5.2.1, 5.2.2, 6.1, and 6.2.
- Architecture. FunnyNet uses audio, visual and face encoders to process the multimodal signals. The face encoder, however, is cumbersome and requires an external face detection model. For this reason, in FunnyNet-W we do not use a face encoder. Instead, FunnyNet-W uses an LLM text encoder to process textual data that are automatically transcribed. Moreover, in FunnyNet-W, we use a more modern visual encoder. The differences between the two models are described in Section 3.4 and experimentally compared in Sections 5.1 (Table 1) and 5.2.1.
- Experiments and analysis. We provide more insights and content to explain the performance of FunnyNet-W. Specifically, we experimentally demonstrate and discuss the benefits of the new encoders, of each modality and their fusion module, of the length of the input time window, and of the losses used as opposed to alternative ones (Section 5.2). Furthermore, we provide a thorough qualitative and intuitive analysis of each modality and their fusion, as well as failure cases (Section 6).
- In-the-wild applications. In addition to experimenting on other domains as in FunnyNet (Section 7.1), we perform two in-the-wild applications: first, we compare FunnyNet-W against chatbots based on LLMs and show that relying solely on language with or without prompt engineering is insufficient for detecting funniness (Section 7.2); and second, we replace real speech by synthetic speech and showcase the importance of real vocals for funny moment detection (Section 7.3).

Related Work
Sarcasm and Humor Detection. Sarcasm and humor share similar styles (irony, exaggeration and twist) but also differ from each other in terms of representation. Sarcasm usually relates to dialogues; hence, most methods detect sarcasm by processing language using human efforts. For instance, [14] collects a speech dataset from social media using the hashtag and manual labeling, while others [76,89] study the acoustic patterns related to sarcasm, like slower speaking rates or higher volumes of voice. In contrast, a humorous moment is defined as the moment before laughter [9,31]. Hence, such methods [7,9,31,41,30] process audio to extract laughter for labeling. Nevertheless, for prediction, most such approaches focus solely on language models [2,99] or on multiple cues including text [31,30]. For instance, LaughMachine [41] proposes vision and language attention mechanisms, while MSAM [71] combines self-attention blocks and LSTMs to encode vision and text.
[32] first use an advanced BERT [17] model to process long-term textual correlation and then vision for the prediction. Following this, [75] propose a Multimodal Adaptation Gate to efficiently leverage textual cues to explore better representation for sentiment analysis. OxfordTVG-HIC [48] proposes a dataset with 2.9M image-text pairs for humor detection. A few methods also explore audio. For instance, MUStARD [9] and UR-Funny [31] process text, audio and frames using LSTMs to explore long-term correlations, while HKT [30] classifies language (context and punchline) and non-verbal cues (audio and frame) to learn cross-attention correlations for humor prediction. These methods combine audio with other information (video and texts) in a simple feature fusion process without investigating the inter-correlations in depth. Specifically, they stack multimodal features to learn the global weighting parameters without considering the biases in different domains.
In contrast, we believe that funny scenes can be triggered by mutual signals from multiple modalities; hence, in this work, we explore the cross-domain agreement of cues with contrastive training. Moreover, FunnyNet-W eliminates the need for external textual annotation by relying solely on raw audiovisual cues, and extracts textual cues directly from the audio that naturally accompanies videos.
Sound Event Detection and Laughter Detection. Sound event detection aims to identify and timestamp sound events within audio recordings. Most attempts either rely on annotated data [58] or use source separation techniques [15,77]. The choice of input representation is crucial, and most methods use Mel spectrograms [57,96,64,65,81] instead of audio waveforms. This choice is motivated by their computational efficiency, interpretability, and effortless integration into conventional vision models. In our work, we focus on a specific acoustic event: laughter. We leverage the detected laughter segments as pseudo-labels to train FunnyNet-W.
Laughter detection. The literature in this domain remains relatively scarce. Some methods rely on physiological sensors [5,85], while others [79,25] follow the conventional supervised learning paradigm to train deep neural laughter detectors. Nevertheless, the latter approach requires annotated datasets, a challenging endeavour in the context of this specialized domain. For instance, the authors of [25] experiment with the Switchboard dataset [34], which contains manually annotated laughter timestamps from phone conversations, and also manually annotate laughter timestamps from 1000 clips of the AudioSet dataset [24]. In contrast, our laughter detector is unsupervised, robust and straightforward, as it leverages the specific attributes of multichannel audio data. Our method sidesteps the need for complex annotations, presenting a promising alternative within the laughter detection landscape.
Multimodal tasks. Over the past decade, the number of tasks that require multiple modalities has increased either due to their intrinsic multimodal nature or due to the potential performance enhancements of adding extra modalities. Here, we review some approaches that are directly related to our work in terms of multimodality and modality fusion.
Audio+Video. For instance, [22,1] recognize facial movements to separate the speaker's voice in the audio. [83,90] temporally align the audio and video using attention to locate the speaker. The former [83] proposes a triplet network that processes the query, positive and negative samples to encourage the query to be close to positive samples and far from negative samples. The latter [90] collects an Audio-Visual Event (AVE) dataset to better handle audio-visual alignment. Several methods extend this to other applications, such as audiovisual generation [106] that generates an audio-driven talking face from a single source image and pose video.
Video+Language. Several tasks involve combining language and visual (in particular video) modalities. One notable category encompasses video-to-text tasks, including video captioning [97,50], which entails generating natural language captions for video sequences. A more challenging, yet very similar task is video question answering [49,103,107], where the goal is to comprehend the content well enough to respond to queries effectively. In contrast, Singer et al. [86] propose an approach focusing on text-to-video generation. Finally, video-text retrieval [18,4,21] aims to facilitate bidirectional exploration of both video and textual content.
Audio+Language. Numerous research directions focus only on audio and language modalities. A major audio-language task is speech emotion recognition [104,72], which aims to connect audio and text to categorize emotions. A speech emotion recognition pipeline consists of modality fusion followed by classification. Parallel to the well-established image-text retrieval task, the domain of audio-text retrieval has also received substantial attention [55,43,101,56]. This task employs similar techniques based on measuring feature similarity between the modalities. Another complex audio-language challenge is audio captioning, where the objective is to generate textual descriptions from acoustic inputs. Most approaches rely on the classical encoder-decoder architecture [44,84,42].
Video+Language+Audio. Some works have extended previous tasks by combining the three modalities: audio, video and text. For example, certain approaches incorporate the acoustic modality into the conventional video captioning pipeline [37,52]. Additionally, [16] introduce an acoustic modality to enhance emotion recognition. [78] learns a shared audio-visual embedding space directly from raw video inputs via self-supervision. [29] use CLIP [73] to align audio-visual signals to produce audio descriptions. [35] propose a hyperbolic loss to align audio-visual features in a tree-shaped space. All these works show improvements in comparison to unimodal baselines. In contrast to previous methods that depend on distinct annotated sources and ground-truth modalities (for instance subtitles for text or ground-truth annotations), our proposed FunnyNet-W extracts multiple additional modalities (audio and text) from a single modality source, namely video, using non-perfect extraction techniques, such as speech-to-text models.
Modality alignment. Recently, many works [73,28,26] have shown promising efforts for acquiring shared multimodal embeddings by leveraging large-scale datasets. Notably, the first breakthrough of text and image embedding was achieved with CLIP [73]. Comparable milestones have been reached in diverse modalities; for instance, [60] proposes to learn a powerful audio-visual representation from videos. Other works extend the original CLIP language-image representation with new modalities such as audio in [28], or video in [51,102]. Recently, ImageBind [26] unifies six distinct modalities into a shared embedding space. The key to success consists in aligning features from the different modalities.
Attention mechanisms [95] are a natural choice for connecting multimodal signals. For instance, [98] employ cross-attention to model inter- and intra-modality relationships, [88] leverage contrastive cross-attention, and [38,47] use iterative cross-attention. In addition, Nagrani et al. [61] introduced attention bottlenecks with randomly initialized bottleneck tokens for modality fusion. Our fusion mechanism builds on this idea but differs by (i) employing modality projections as bottlenecks and (ii) integrating an extra self-attention block to capture fused token correlations. These works illustrate the natural strength of attention mechanisms in aligning multiple modalities within a unified space. Numerous multimodal tasks benefit from this capability of modality alignment, with applications including summarization [63], retrieval [23,4], audiovisual classification [61], predicting goals [20], and human replacement [19]. [88,98] iteratively apply self- and cross-attention to explore correlations among modalities. Instead, FunnyNet-W both fuses all modalities and in parallel learns the cross-correlation among different modalities; this avoids any biases that may be caused by one dominant modality.

Method
Here, we present FunnyNet-W, its training process and losses (Sections 3.1-3.2). For training labels, we propose an unsupervised laughter detector (Section 3.3).
Overview. FunnyNet-W consists of (i) three encoders: the visual encoder with videos as input, the audio encoder with audio as input, and the text encoder with subtitles as input, where the subtitles are obtained with an automatic speech recognition (ASR) system [74]; and (ii) the proposed Cross-Attention Fusion (CAF) module, which explores cross- and intra-modality correlations by applying cross- and self-attention to the encoders' outputs. Then, the fused feature is fed to a binary classifier. The overall architecture is illustrated in Figure 2. FunnyNet-W is trained to embed all modalities in the same space via a self-supervised contrastive loss and to classify clips as funny or not. For training, we exploit laughter that naturally exists in TV shows: we define as 'funny-moment' any audiovisual snippet followed by laughter, and as 'not-funny' any audiovisual snippet not followed by laughter.
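The labeling rule in the overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `positive_windows` and `is_funny`, the default clip length of 8 seconds, and the tolerance `tol` that decides what counts as "followed by laughter" are all hypothetical choices made for the example.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def positive_windows(laughs: List[Span], n: float = 8.0) -> List[Span]:
    """Return one n-second window ending exactly where each laugh starts.

    Windows that would start before time 0 are skipped (an assumption).
    """
    windows = []
    for t_s, _t_e in laughs:
        if t_s - n >= 0.0:
            windows.append((t_s - n, t_s))
    return windows


def is_funny(window: Span, laughs: List[Span], tol: float = 0.5) -> bool:
    """Label a window 'funny' if some laugh starts within `tol` seconds
    after the window ends; `tol` is a hypothetical tolerance."""
    _, end = window
    return any(0.0 <= t_s - end <= tol for t_s, _ in laughs)


laughs = [(5.0, 7.0), (20.0, 23.0)]
print(positive_windows(laughs))            # windows that precede laughter
print(is_funny((30.0, 38.0), laughs))      # a window with no laughter after it
```

Because each positive window ends at the laughter onset, the clip itself contains no laughter, so a model cannot learn the label from the laugh track directly.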

FunnyNet-W Architecture
FunnyNet-W utilizes raw inputs from videos, including the audio waveform and frames.
Audio Encoder. First, the audio waveform is transformed into a Mel spectrogram². This spectrogram, denoted as $X_{audio}$, is then passed through an audio encoder to generate a 1D feature vector. Finally, a projection head is applied to obtain an $N$-dimensional vector $F_A \in \mathbb{R}^N$.
Text Encoder. The corresponding transcripts, denoted as $X_{text}$, are extracted from the audio waveform using an automatic speech recognition model [74]. These transcripts are then encoded into a feature vector using the text encoder. Subsequently, a projection is performed to obtain an $N$-dimensional vector $F_T \in \mathbb{R}^N$.
Visual Encoder. The visual encoder employs a transformer-based architecture to process video frames. The input frames, denoted as $X_{visual}$, are divided into patches from several consecutive frames. Unlike conventional approaches that use a 'classification token' to obtain a general representation, we compute the representation by averaging the pooled features from all patches. This process results in a feature vector, which is then projected to an $N$-dimensional vector $F_V \in \mathbb{R}^N$ using a projection head. The video context complements the audio in providing richer content [31]. Additionally, in the absence of sound and transcripts, visual cues can also elicit laughter.
Projection Head. This module consists of two linear layers separated by a GeLU activation function. Dropout and normalization layers are applied after the linear layers. It takes the features output by each encoder and projects them into a shared $N$-dimensional multimodal feature space.
Cross-Attention Fusion (CAF). It learns the cross-domain correlations among vision, audio and text (yellow box in Figure 2). It consists of (a) three cross-attention (CA) modules and (b) one self-attention (SA) module, described below:
- Cross-attention is used in cross-domain knowledge transfer to learn across-cue correlations by attending the features from one domain to another [59,62,98]. In CAF, it models the relationship among vision, audio, and textual features. We stack all features as $F_S \in \mathbb{R}^{3 \times 512}$, and then feed $F_S$ into three cross-attention modules to attend to vision, text, and audio, respectively (Figure 2). Next, the scaled attention per modality is computed as $\sigma\left(Q_S K_i^\top / \sqrt{d}\right) V_i$, where $i \in \{V, T, A\}$ for {vision, text, audio}, $\sigma$ is the softmax, and $d$ is the feature dimension. The query $Q$ comes from the stacked features, $Q_S = F_S W_{Q_S}$, while the key $K$ and value $V$ come from a single modality as $K_i = F_i W_{K_i}$ and $V_i = F_i W_{V_i}$. Next, we obtain three cross-attentions and sum them to a unified feature $F_U$ as:
$$F_U = \sum_{i \in \{V, T, A\}} \sigma\left(\frac{Q_S K_i^\top}{\sqrt{d}}\right) V_i \qquad (1)$$
- Self-attention computes the intra-correlation of the $F_U$ features, which are further summed with a residual $F_U$ as:
$$F_{CAF} = F_U + \sigma\left(\frac{Q_U K_U^\top}{\sqrt{d}}\right) V_U \qquad (2)$$
where $Q_U = F_U W_{Q_U}$, $K_U = F_U W_{K_U}$, and $V_U = F_U W_{V_U}$.
Finally, we average the $F_{CAF}$ tokens and feed the result to a classification layer.
Discussion. CAF differs from existing methods [59,98] in the computation of the cross-attention. Using the stacked features $F_S$ as the query $Q_S$ to attend to each modality brings three benefits: (a) it is order-agnostic: for any modality pair we compute cross-attention once, instead of twice by interchanging queries and keys/values; this results in reduced computation; (b) each modality serves as a query to search for tokens in other modalities; this brings rich feature fusion; and (c) it generalizes to any number of modalities, resulting in scalability³.
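To make the data flow of CAF concrete, the following NumPy sketch runs one forward pass: a shared query from the stacked features, one cross-attention per modality, a sum, and a residual self-attention followed by token averaging. It is only an illustration under assumptions: the weights are random, the dictionary keys such as `W_Q_S` are hypothetical names, and single-head attention without dropout or normalization is used for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def caf(f_v, f_t, f_a, params):
    """f_v, f_t, f_a: (N,) projected features; params: dict of (N, N) weights."""
    f_s = np.stack([f_v, f_t, f_a])            # stacked features, shape (3, N)
    q_s = f_s @ params["W_Q_S"]                # one shared query for all modalities
    f_u = np.zeros_like(f_s)
    for name, f_i in (("V", f_v), ("T", f_t), ("A", f_a)):
        f_i = f_i[None, :]                     # a single token, shape (1, N)
        k_i = f_i @ params[f"W_K_{name}"]
        v_i = f_i @ params[f"W_V_{name}"]
        # With one token per modality the softmax over keys has a single
        # entry equal to 1; more tokens per modality would make it non-trivial.
        f_u += attention(q_s, k_i, v_i)        # sum of the three cross-attentions
    q_u, k_u, v_u = (f_u @ params[w] for w in ("W_Q_U", "W_K_U", "W_V_U"))
    f_caf = f_u + attention(q_u, k_u, v_u)     # self-attention plus residual
    return f_caf.mean(axis=0)                  # average tokens for the classifier


rng = np.random.default_rng(0)
N = 8                                          # small N for the demo (paper: 512)
names = ["W_Q_S", "W_K_V", "W_V_V", "W_K_T", "W_V_T",
         "W_K_A", "W_V_A", "W_Q_U", "W_K_U", "W_V_U"]
params = {n: rng.normal(scale=0.1, size=(N, N)) for n in names}
fused = caf(rng.normal(size=N), rng.normal(size=N), rng.normal(size=N), params)
print(fused.shape)
```

Adding a fourth modality in this sketch only requires one more key/value weight pair and one more loop entry, which mirrors the scalability argument above.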

Training Process and Loss Functions
Subtitle extraction. To extract transcripts from the raw waveform, we use the WhisperX system [3]. WhisperX enforces alignment of the automatic speech recognition model Whisper [74] with an external voice activity detection model to produce accurate word-level timestamps. This approach results in a Time-Accurate Speech Transcription, very similar to manually transcribed subtitles.
Positive and Negative Samples. To create samples, we exploit the laughter that naturally exists in episodes. We define as 'funny' any n-sec clip followed by laughter, and as 'not-funny' any n-sec clip not followed by laughter. More formally, given a laughter at timestep $(t_s, t_e)$, we extract an n-sec clip at $(t_s - n, t_s)$ and we split it into audio and video. For each video, we sample n frames (1 FPS). For the audio, we resample it at 16000 Hz and transform it to a Mel spectrogram. Thus, each sample corresponds to n sec and consists of a Mel spectrogram for the audio and an n-frame long video. In practice, we use 8-sec clips, as this is the average time between two canned laughters and it also leads to better performance (ablations of n-sec clips and n-frames per clip in the supplementary). Note that we clip the audio based on the starting time of the laughter so the positive samples do not include any laughter.
Self-Supervised Contrastive Loss. To capture 'mutual' audiovisual information, we solve a self-supervised synchronization task [12,45,69]: we encourage visual features to be correlated with true audios and uncorrelated with audios from other videos. Given the i-th pair of visual $v_i$ and true audio features $a_i$ and $N$ other audios from the same batch, $a_1, ..., a_N$, we minimize the loss [11,13,66]:
$$L_{cont}(v_i, a) = -\log \frac{\exp\left(S(v_i, a_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(S(v_i, a_j)/\tau\right)} \qquad (3)$$
where $S$ is the cosine similarity and $\tau$ the temperature factor. Equation 3 accounts for audio and visual features. Here, we compute the contrastive loss between all three modality pairs, i.e., visual-audio, text-audio, and visual-text. Thus, our self-supervised loss is:
$$L_{ss} = L_{cont}(v, a) + L_{cont}(t, a) + L_{cont}(v, t) \qquad (4)$$
Final Loss. FunnyNet-W is trained with a Softmax loss $Y_{cls}$ to predict if the input is funny or not, and with $L_{ss}$ to learn 'mutual' information across modalities. Thus, the final loss is:
$$L = \lambda_{ss} L_{ss} + \lambda_{cls} Y_{cls} \qquad (5)$$
where $\lambda_{ss}$, $\lambda_{cls}$ are the weighting parameters that control the importance of each loss.
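The contrastive objective and its combination with the classification loss can be sketched in NumPy as follows. This is an illustrative sketch, not the training code: the function names, the temperature `tau=0.1`, and the default loss weights of 1.0 are assumptions, since the text does not give these values, and the denominator here sums over all audios in the batch including the true one.

```python
import numpy as np


def contrastive_loss(x, y, tau=0.1):
    """Batch contrastive loss between paired features x, y of shape (B, N).

    Row i of x is the positive match of row i of y; every other row of y
    in the batch acts as a negative.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / tau                              # cosine similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))           # -log p(positive)


def self_supervised_loss(f_v, f_a, f_t, tau=0.1):
    """Sum over the three modality pairs: visual-audio, text-audio, visual-text."""
    return (contrastive_loss(f_v, f_a, tau)
            + contrastive_loss(f_t, f_a, tau)
            + contrastive_loss(f_v, f_t, tau))


def final_loss(l_ss, l_cls, lam_ss=1.0, lam_cls=1.0):
    """Weighted sum of the self-supervised and classification losses."""
    return lam_ss * l_ss + lam_cls * l_cls


rng = np.random.default_rng(0)
v, a, t = (rng.normal(size=(4, 16)) for _ in range(3))
print(final_loss(self_supervised_loss(v, a, t), l_cls=0.7))
```

In this sketch the loss is small when matching rows are the most similar pair in the batch and grows when a feature is closer to another sample's feature, which is the behaviour the synchronization task relies on.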

Unsupervised Laughter Detection
To detect funny moments, we design an unsupervised laughter detector consisting of 3 steps (Figure 3).
1. Remove Voices. Background audios include sounds, music, laughter; instead, voice (speech) is part of the foreground audio. We remove voices from audios by exploiting multichannel audio specificities. Given raw waveform audios, when the audio is stereo (two channels), the voices are centered and are common in both channels [36]; hence, by subtracting the channels, we remove the voice and keep the background audio. In surround tracks (six channels), we remove the voice channel [36] and keep the background ones.
2. Background Audios. The waveforms from step 1 are mostly empty with sparse peaks corresponding to audio: laughter and music. To split them into background and empty segments, we use an energy-based peak detector⁴ that detects peaks based on the waveform energy. Then, we keep background segments and convert them to log-scaled Mel spectrograms.
3. Cluster Audio Segments. For each laughter and music segment, we extract features using a self-supervised pre-trained encoder. Then, we cluster all audio segments using K-means to distinguish the laughter segments from the music ones.
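The three steps above can be sketched as a small NumPy pipeline. This is a simplified stand-in rather than the authors' detector: the frame-energy threshold in `energy_segments` replaces the energy-based peak detector referenced in the text, `kmeans` is a plain Lloyd's-algorithm implementation, and the segment features passed to it are assumed to come from some pre-trained encoder that is not shown here. All function names and default values are hypothetical.

```python
import numpy as np


def remove_voice(stereo):
    """Step 1: stereo has shape (2, T). Content that is identical in both
    channels (center-panned voice) cancels when the channels are subtracted."""
    return stereo[0] - stereo[1]


def energy_segments(wave, sr, frame_s=0.05, thresh=1e-3):
    """Step 2: return (start_s, end_s) spans whose frame energy exceeds thresh."""
    hop = int(round(sr * frame_s))
    n_frames = len(wave) // hop
    frames = wave[: n_frames * hop].reshape(n_frames, hop)
    active = (frames ** 2).mean(axis=1) > thresh
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                  # segment opens
        elif not is_active and start is not None:
            segments.append((start * frame_s, i * frame_s))
            start = None                               # segment closes
    if start is not None:
        segments.append((start * frame_s, n_frames * frame_s))
    return segments


def kmeans(x, k=2, iters=20, seed=0):
    """Step 3: cluster segment embeddings x of shape (M, D) into k groups."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels


sr = 1000
t = np.arange(2 * sr) / sr
voice = np.sin(2 * np.pi * 5 * t)                      # common to both channels
background = np.zeros_like(t)
background[500:1000] = 0.5 * np.sin(2 * np.pi * 50 * t[500:1000])
residual = remove_voice(np.stack([voice + background, voice]))
print(energy_segments(residual, sr))
```

Which of the two resulting clusters corresponds to laughter still has to be decided after clustering, for example by listening to a few segments per cluster; the sketch does not automate that choice.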

Differences to the ACCV 2022 version [53]
FunnyNet [53] and FunnyNet-W use multimodal input signals from videos and identify whether a video input is funny or not. Here, we describe the three main methodological differences between the two models. FunnyNet [53] uses three encoders: (a) a visual encoder for video frames (Timesformer [6]), (b) an audio encoder for voice and background audio (BYOL [64]), and (c) a face encoder for facial expressions (ResNet [82]). However, the face encoder is cumbersome as it requires an external model for face detection, thus leading to higher runtime and decreasing its applicability. For this reason, in FunnyNet-W we remove the face encoder, which leads to slight performance drops but increased gains in applicability and scalability. Moreover, FunnyNet-W uses the more modern visual encoder VideoMAE.
Furthermore, for FunnyNet-W, we made the following three-fold observation. First, most funny moments are inevitably related to language. Second, recent advances in Automatic Speech Recognition (ASR) [74,3] have rendered it possible to exploit the vocal part of the audio (i.e., the part where people are speaking) and automatically transcribe existing dialogues. Third, the recent explosion of Large Language Models (LLMs) offers remarkable capabilities in processing text and dialogues across a wide range of tasks. Combining these would mean transcribing dialogues via ASR for free, then using an LLM encoder to detect funniness. However, FunnyNet does not rely on textual data. We consider this a wasted opportunity, as the textual data via LLMs can boost the representation capability of the model and, in turn, its performance. To this end, FunnyNet-W differs from FunnyNet in two aspects: it relies on ASR to transcribe dialogues for free and it uses an LLM text encoder for processing the text (Llama-2 [94]).
⁴ https://github.com/amsehili/auditok

Datasets and Metrics
Datasets. We use five datasets.

[Figure: laughter detection pipeline. Recoverable labels: CNN Encoder, Embedding Vectors, Clustering Audio Segments, Laughter Cluster, Music Cluster.]

Experiments
In this section, we provide experiments for FunnyNet-W. First, we compare to the state of the art (Section 5.1), then we provide an ablation of each component of FunnyNet-W (Section 5.2), and finally, we ablate our unsupervised laughter detector (Section 5.3).
Implementation Details. We implement FunnyNet-W in PyTorch [70] and train it using the Adam optimizer [54] with a learning rate of $1 \times 10^{-4}$ and a batch size of 32. The input audio is first downsampled to a fixed sampling frequency (16000 Hz) and then transformed to a log-scaled Mel spectrogram with $F = 64$ Mel-spaced frequency bins. At training, we use data augmentation: for frames, we randomly apply rotation and horizontal/vertical flipping, and randomly set the sampling rate to 8 frames; for audios, we apply random forward/backward time shifts and random Gaussian noise. For subtitles, we tokenize them with a maximum length of 64 and send them to the language models.
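The two audio augmentations mentioned in the implementation details can be sketched as follows. The text does not specify how the gap left by a time shift is filled or the range of the shift and noise level, so zero-filling, the `max_shift` parameter, and the noise standard deviation `std` are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np


def random_time_shift(wave, max_shift, rng):
    """Shift the waveform forward (positive) or backward (negative) by a
    random number of samples in [-max_shift, max_shift], zero-filling the gap."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.zeros_like(wave)
    if shift > 0:
        out[shift:] = wave[:-shift]
    elif shift < 0:
        out[:shift] = wave[-shift:]
    else:
        out[:] = wave
    return out


def add_gaussian_noise(wave, std, rng):
    """Add zero-mean Gaussian noise with standard deviation `std`."""
    return wave + rng.normal(0.0, std, size=wave.shape)


rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s at 16 kHz
augmented = add_gaussian_noise(random_time_shift(wave, 1600, rng), 0.01, rng)
print(augmented.shape)
```

Both functions keep the waveform length unchanged, so the augmented audio can be converted to a Mel spectrogram of the same size as the original.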
Setting. In our experiments, we train FunnyNet-W on Friends. For MUStARD and UR-Funny, we fine-tune FunnyNet-W on their respective train sets. For TBBT and MHD, we fine-tune it only with a subset of the training set from TBBT (32 random episodes). These datasets come with data samples of uneven lengths. If the sample length is larger than 8 seconds (our best setting), we crop the last 8-second sequence to fit our model; otherwise, we pad zeros to the end. For UR-Funny, we exclude from training the data samples with no sound. For audio, textual, and visual encoders, we use the corresponding pre-trained models for feature extraction.
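The length normalization described in the setting (keep the last 8 seconds of longer samples, zero-pad shorter ones at the end) can be sketched for the audio stream as below. The function name `crop_or_pad` and its defaults are hypothetical; the 16000 Hz rate and the 8-second target come from the text.

```python
import numpy as np


def crop_or_pad(wave, sr=16000, seconds=8.0):
    """Return a waveform of exactly sr * seconds samples.

    Longer inputs keep their last `seconds` seconds; shorter inputs are
    zero-padded at the end.
    """
    target = int(round(sr * seconds))
    if len(wave) >= target:
        return wave[-target:]
    padding = np.zeros(target - len(wave), dtype=wave.dtype)
    return np.concatenate([wave, padding])


long_clip = np.ones(10 * 16000, dtype=np.float32)    # 10 s sample
short_clip = np.ones(3 * 16000, dtype=np.float32)    # 3 s sample
print(crop_or_pad(long_clip).shape, crop_or_pad(short_clip).shape)
```

Keeping the end of a long sample, rather than the beginning, preserves the audio closest to the punchline that precedes the laughter.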
To understand funny moments in the wild, we consider that subtitles do not come naturally with other modalities. Unlike text-driven funny moment detection methods [9,30,31], we use off-the-shelf speech-to-text models, like WhisperX [3], to automatically generate text from audio. Hence, in addition to audiovisual data, we experiment with both real and automatically generated texts (in the form of subtitles) for funny moment detection.

Comparison to the State of the Art
Here, we evaluate FunnyNet-W on five datasets: TBBT, MHD, MUStARD, UR-Funny and Friends and compare it to the state of the art: MUStARD [9], MSAM [71], MISA [32], HKT [30] and LaughM [41]. Table 1 reports the results (including random, positive and negative baselines) for both metrics. We indicate the modalities each method uses as A: audio, V: video, $T_{gt}$: ground-truth text, $T_a$: automatically generated text (speech-to-text), and F: face. Furthermore, we also indicate in the 'Wild' column the methods that can run automatically without requiring ground-truth information (either for training or testing). Note that most methods require ground-truth labels (mostly in the form of textual subtitles or transcripts) either for training or testing ($T_{gt}$). This is in contrast to FunnyNet-W, which can automatically process videos in the wild by exploiting speech-to-text models ($T_a$).
The first part of Table 1 (no wild) demonstrates that overall the proposed FunnyNet-W (V+A+$T_{gt}$) outperforms all methods on all five datasets. For TBBT, it outperforms LaughM by a notable margin of +10% for F1 and Acc, and FunnyNet by +3% in both metrics. For MHD, it outperforms MSAM by 3% in F1 and 7% in Acc, and LaughM and FunnyNet by 3% and 1%, respectively, in Acc. Furthermore, FunnyNet-W outperforms MUStARD, MISA, HKT, LaughM and FunnyNet by 3-12% in F1 and 2-15% in Acc for MUStARD and 1-15% in F1 and 1-10% in Acc for UR-Funny. For Friends, we observe similar patterns, where we outperform LaughM by 15% in F1 and 26% in Acc and FunnyNet by approximately +1% in both F1 and Acc. These results confirm the effectiveness of FunnyNet-W compared to other methods.
Table 1: Comparison to the state of the art on five datasets. Modalities used per method: A: audio, V: visual frames, $T_{gt}$: ground-truth text (subtitles or transcript), $T_a$: automatically generated text (text extracted from speech), F: face. The column 'Wild' signifies the methods that can run in the wild, i.e., automatically without requiring ground-truth information either for training or for testing. Note that most methods require ground-truth labels (mostly in the form of textual subtitles or transcripts) both for training and testing. This is in contrast to FunnyNet-W, which can automatically process videos in the wild. † Reproduced results: we use the exact model as in [41], pre-train it on Friends and fine-tune it on the other datasets.
The major advantage and motivation of FunnyNet-W and its predecessor FunnyNet [53] is the fact that [...] Table 1 reports results when experimenting in the wild. We observe that for TBBT, remarkably, FunnyNet-W outperforms its predecessor FunnyNet by 5-10% in F1 and Acc, while it is inferior by 1% for MHD and similar for MUStARD. For UR-Funny and Friends, FunnyNet-W outperforms FunnyNet consistently by 1-3% in all metrics. When we compare FunnyNet-W with $T_a$ to the first part of the table, we observe that it still produces on par or superior results to all other methods. This clearly shows the superiority of FunnyNet-W even when compared to methods that have access to manually annotated ground-truth data.
Our remarks are as follows. First, FunnyNet-W outperforms most methods in both metrics in both settings, i.e., when using ground-truth text (T gt) or when running in the wild (T a). Second, the performance on the out-of-domain UR-Funny is notably high. Third, for TBBT and MHD our results are much less optimized than those of LaughM or MSAM: we do not have access to the exact same test videos as either work, so inevitably there are some time shifts or wrong labels, and we use much less training data (32 vs 183 episodes in LaughM vs 84 episodes in MHD). These observations highlight that FunnyNet-W is an effective model for funny moment detection.
Note that in the remainder of this work, unless stated otherwise, using the automatically generated text (in the form of subtitles) is the default setting of FunnyNet-W. For simplicity, we denote T a by T.
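As an illustration of the two metrics and of the trivial baselines mentioned for Table 1, the following minimal sketch computes F1 and accuracy for an all-positive and an all-negative predictor. The toy labels are hypothetical and are not taken from any of the five datasets; the function name is an assumption made for this example.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1, accuracy) for binary labels, where 1 means 'funny'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, correct / len(pairs)


# Hypothetical toy labels (balanced: four funny, four not-funny clips).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Positive baseline: every clip is predicted as funny.
positive_f1, positive_acc = f1_and_accuracy(y_true, [1] * len(y_true))

# Negative baseline: every clip is predicted as not funny.
negative_f1, negative_acc = f1_and_accuracy(y_true, [0] * len(y_true))

print(f"positive baseline: F1={positive_f1:.3f}, Acc={positive_acc:.3f}")
print(f"negative baseline: F1={negative_f1:.3f}, Acc={negative_acc:.3f}")
```

On balanced data, the positive baseline reaches a non-trivial F1 while the negative baseline has an F1 of zero, which is why both metrics are reported side by side.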

Ablation of FunnyNet-W
In this section, we provide ablations of FunnyNet-W. Specifically, we ablate the encoders (Section 5.2.1), the modalities (Section 5.2.2), the cross-attention fusion module (Section 5.2.3), the length of input videos (Section 5.2.4) and the losses (Section 5.2.5).

Ablation of Encoders
Visual encoder. Table 2 ablates two video encoders on Friends, i.e., Timesformer [6] and VideoMAE [91], for two scenarios: using automatically generated text (T a) and using ground-truth text (T gt). Given the same video sequence, we use the best settings for each (8 frames for Timesformer and 16 frames for VideoMAE). We observe that VideoMAE outperforms Timesformer by about 1-3% in F1 score and 2-3% in Acc. This is expected because VideoMAE is a larger model, and it also uses a masked autoencoder for unsupervised learning; hence, it can generalize better than Timesformer. When comparing the results between using ground-truth and automatically generated texts, we observe that the improvements of using VideoMAE are consistent, and the differences are very small (1-2% in both F1 and Acc).
Text encoder. Table 3 ablates three text encoders in FunnyNet-W on Friends: Bert [17], GPT2 [87] and LlaMa-2 [94] (7B model), for the same two scenarios (T a and T gt). Given the other ablation studies, we choose VideoMAE and BYOL-A as the best visual and audio encoders, respectively. We observe that LlaMa-2 gives the best results in both F1 and Acc. Interestingly, GPT2 performs worse than Bert. This finding is consistent with what we observe in LLMs. For LlaMa-2, we note that the differences between using ground-truth and automatically generated texts are minor, about 0.8-1% in both F1 and Acc.
Audio encoder. Table 4 ablates four audio encoders on Friends: BEATs [10], CAV-MAE [27], BYOL-A-v2 [65] and BYOL-A [64]. Given the previous ablation studies, we choose VideoMAE and LlaMa-2 as the best visual and text encoders and operate directly with automatically generated text (T a). The results show that CAV-MAE, BYOL-A-v2 and BYOL-A perform on par (approximately 1% difference in F1 and Acc). In our experiments, we use BYOL-A, as it results in the best F1 and Acc and also requires fewer parameters than the other models.
Subtitle sources. Tables 2 and 3 report results for two scenarios: using automatically generated text (T a) and using ground-truth text (T gt). Consistently, we observe that using the ground-truth text outperforms using the automatically generated one. This is expected, as T a includes imperfect transcripts. We note, however, that the differences in both F1 and Acc are minor (1-3% for both metrics). This highlights that substituting ground-truth text with an automatic speech-to-text model is a good trade-off between good performance and the ability to run in the wild, i.e., without requiring manual ground-truth labels.

Ablation of Modalities
Table 5 ablates all modalities of FunnyNet-W on the Friends test set. Using text alone (third row) produces better results than using the visual or audio modality alone (first and second rows). This highlights the efficiency of large dataset pre-training and the representation power of Large Language Models (since we use LlaMa-2 as the textual encoder). Using audio alone (second row) leads to the second-best performance among single modalities, underlining that audio is more suitable than visual cues for our task, as it encompasses the way of speaking (tone, pauses). Combining modalities outperforms using single ones: combining visual and audio (fourth row) or visual and text (sixth row) increases the F1 by approximately 1.3-10% and the Acc by 0.2-15%. This is expected, as audio or text bring complementary information to the visual modality [60,73] and their combination helps discriminate funny moments. Combining audio and text (fifth row) leads to larger boosts than audio+visual or text+visual (fourth and sixth rows), as audio and text contain complementary information regarding character dialogues, expression in voices and background music. Overall, using all modalities achieves the best performance.

Ablation of Cross-Attention Fusion (CAF)
Table 6 reports results with various cross- and self-attention fusions in CAF. We observe that including either self- or cross-attention (second and third rows) brings improvements over not having any (first row), indicating that they enhance the feature representation.
The fourth row shows that using them both for feature fusion leads to the best performance. For completeness, we also compare CAF against the state of the art: MMCA [98] and CoMMA [88]. CAF, MMCA and CoMMA all use self- and cross-attention jointly for feature extraction. Their main difference is that both MMCA and CoMMA first use self-attention to individually process each modality, then concatenate all modalities together and process them using cross-attention to output the final feature representation. Instead, CAF uses cross-attention to gradually fuse one modality with the rest of the modalities to fully explore cross-modal correlations. The results (fourth, fifth, and last rows) show that CAF outperforms MMCA [98] and CoMMA [88] by 0.1-0.4 in F1 score and 0.03-0.2 in accuracy. This reveals the importance of the gradual modality fusion, and hence the superiority of CAF.
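To make the order of operations in this kind of fusion concrete, the following is a minimal sketch, not the authors' implementation: each modality attends to the stacked tokens of all modalities (cross-attention), and the results are then refined jointly (self-attention). It assumes a single attention head, no learned projections, and that the modality tokens act as queries; all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def cross_attention_fusion(audio, visual, text):
    """Cross-attend each modality to the stacked tokens, then self-attend."""
    unified = audio + visual + text  # stacked tokens of all modalities
    fused = []
    for modality in (audio, visual, text):
        # Cross-attention: modality tokens query the unified token set.
        fused += attention(modality, unified, unified)
    # Self-attention over the fused tokens produces the joint representation.
    return attention(fused, fused, fused)


# Hypothetical 2-D embeddings, one token per modality.
audio_tokens = [[1.0, 0.0]]
visual_tokens = [[0.0, 1.0]]
text_tokens = [[1.0, 1.0]]
joint = cross_attention_fusion(audio_tokens, visual_tokens, text_tokens)
print(joint)
```

In a trained model, the queries, keys and values would come from learned linear projections and several heads would be used; the sketch only shows how cross-attention precedes self-attention.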

Impact of time
In this section, we examine the impact the length of the time window has on the final results, as well as the number of sampled frames within the time window.
Influence of Time Window. Following [6], our proposed FunnyNet-W is trained on fixed-length inputs of multiple modalities that last 8 seconds. Here, we examine the impact that the length of the time window has on FunnyNet-W and illustrate results on four datasets (as well as their average in a dashed red line) in Figure 4.
For this, we use input time windows of varying lengths (from 2 to 16 seconds) in either the visual encoder of FunnyNet-W (referred to as FunnyNet-W V, top in Figure 4) or the audio encoder of FunnyNet-W (referred to as FunnyNet-W A, bottom in Figure 4). When ablating the length of the visual input (top in Figure 4), we observe that using approximately 8 seconds achieves the best performance compared to all other settings. Specifically, for F1 (a, left), we observe that for all datasets, the best result is achieved when using 8 seconds, whereas the second- and third-best results are achieved when using inputs of 10 and 12 seconds. For Accuracy (b, right), the performance follows the same trend: the best accuracy is reached for 8-second inputs, while the 10- and 12-second inputs reach the second- and third-best accuracies. Interestingly, for both F1 and Accuracy, for the average over all datasets (red dashed lines), we observe that both metrics degrade when using longer visual input windows (e.g., more than 15 seconds). This is probably because longer inputs contain too much visual or audio information across both positive and negative samples, which confuses the model and leads to more incorrect predictions.

Varying time window lengths
When ablating the length of the audio input (bottom in Figure 4), we observe that, similar to the previous conclusions, the time window of 8 seconds leads to the best performance in both F1 and Accuracy. Nevertheless, using a longer time window improves the prediction accuracy, in contrast to the visual ablation. Specifically, the best time window setting is between 8 and 12 seconds. For any time window outside this range, the performance degrades. In our experiments, we use a time window of 8 seconds as a good trade-off between the performance of the visual and audio encoders.
Influence of Sampled Frames. Given the input time window of 8 seconds, we test the scenario where we sample different numbers of frames within this fixed window. In particular, we examine the impact of sampling from 8 to 100 frames. The results are shown in Figure 5, where we illustrate the (left, a) F1 score and (right, b) accuracy over the number of sampled frames. Our results suggest that the number of frames has no or only a trivial impact on the final performance. This is expected, since more frames in a fixed time window mainly produce redundancy without introducing new relevant information. Furthermore, in this ablation, we also compare the results obtained when using Timesformer (red points and dashed line) and VideoMAE (magenta points and dashed line) as the visual encoder. We observe that VideoMAE outperforms Timesformer in all settings; hence, the final FunnyNet-W uses VideoMAE as the visual encoder.
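As an illustration of sampling a varying number of frames from a fixed 8-second window, the following minimal sketch picks evenly spaced frame indices. The frame rate of 25 fps and the centred uniform spacing are assumptions made for this example; the text above does not specify the sampling strategy.

```python
def sample_frame_indices(fps, window_seconds, num_frames):
    """Return num_frames evenly spaced frame indices inside the window."""
    total_frames = int(round(fps * window_seconds))
    if not 0 < num_frames <= total_frames:
        raise ValueError("num_frames must be in [1, total_frames]")
    step = total_frames / num_frames
    # Take the frame at the centre of each of the num_frames equal segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


# Assumed frame rate (hypothetical): 25 fps, i.e. 200 frames in 8 seconds.
indices_8 = sample_frame_indices(fps=25, window_seconds=8, num_frames=8)
indices_16 = sample_frame_indices(fps=25, window_seconds=8, num_frames=16)
indices_100 = sample_frame_indices(fps=25, window_seconds=8, num_frames=100)

print(indices_8)
print(len(indices_16), len(indices_100))
```

With 100 frames in an 8-second window, neighbouring samples are only two frames apart, which is consistent with the observation that denser sampling mostly adds redundancy.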

Ablation of losses
FunnyNet-W uses the classification loss L cls and the self-supervised contrastive loss L ss. Here, we examine their impact by training FunnyNet-W with and without L ss. Table 7 reports the results on Friends, where we observe that adding L ss yields an improvement of more than +10 in all metrics (first two rows). This reveals that the auxiliary self-supervised task of syncing audiovisual data helps to identify the funny moments in videos. Recently, Koleo [80] (L koleo) and CLIP [73] (L clip) have been proposed for improving unsupervised feature clustering. To examine the impact of these two losses, we train FunnyNet-W with different loss combinations and show the results in Table 7. We observe that including the Koleo and/or CLIP losses (third to fifth rows) results in a small drop in both F1 and accuracy compared to the proposed loss configuration (second row). Regarding the Koleo loss, this drop is probably because Koleo encourages a uniform span of the features within a batch, which maximizes the variance of the features and affects the binary decisions at the boundaries. Regarding the CLIP loss, the drop can be explained by the fact that CLIP is widely used for multi-class feature projection, which may complicate the funny or not-funny classification.
Model complexity. We also compare in Table 8 the complexity of FunnyNet and FunnyNet-W to the other state-of-the-art models. Note that both models use pre-trained visual, audio and text encoders. For completeness, we also report the metrics when including the complexity of the visual, audio and text backbone encoders. We observe that the gain in performance and the unsupervised aspect of FunnyNet-W come at the cost of higher complexity. Indeed, FunnyNet-W is a huge model, with an increase of approximately 52 GFLOPS, 16M parameters and 11 ms of runtime in comparison to the second-heaviest model [32]. Additionally, when comparing FunnyNet to FunnyNet-W, the latter replaces the face encoder with a text encoder and uses larger visual and text encoders, VideoMAE and LlaMa-2, respectively. These lead to higher complexity in GFLOPS and parameters. However, the overall inference time is reduced because it does not require online per-frame face detection, masking, and feature extraction.
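To illustrate how a classification loss and a contrastive self-supervised loss of the kind ablated in this section can be combined, the following is a minimal sketch under stated assumptions: L cls is taken to be binary cross-entropy, L ss is taken to be an InfoNCE-style loss over paired embeddings, and the two are summed with equal weight. The exact formulation and weighting used in FunnyNet-W are not given in this section, so every name and constant here is hypothetical.

```python
import math


def binary_cross_entropy(prob_funny, label):
    """Assumed form of L cls for one clip (label 1 = funny, 0 = not funny)."""
    eps = 1e-12
    return -(label * math.log(prob_funny + eps)
             + (1 - label) * math.log(1.0 - prob_funny + eps))


def info_nce(anchors, positives, temperature=0.1):
    """Assumed form of L ss: anchors[i] should match positives[i] in the batch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [sum(a * p for a, p in zip(anchor, pos)) / temperature
                  for pos in positives]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical unit-norm embeddings of two clips from two modalities.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
visual_emb_synced = [[1.0, 0.0], [0.0, 1.0]]    # matches the audio clip order
visual_emb_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs

loss_synced = info_nce(audio_emb, visual_emb_synced)
loss_shuffled = info_nce(audio_emb, visual_emb_shuffled)

# Assumed equal weighting of the two terms.
total_loss = binary_cross_entropy(0.9, 1) + loss_synced
print(loss_synced, loss_shuffled, total_loss)
```

The contrastive term is small when embeddings of the same clip agree across modalities and large when they are mismatched, which is the synchronisation signal the auxiliary task relies on.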

Analysis of Unsupervised Laughter Detector
Comparison to the state of the art. We compare our laughter detector with the state of the art: the LD [79] laughter detector used in [9] and RLD [25]. The results on the Friends dataset are presented in Table 9. Overall, our detector demonstrates superior performance compared to both supervised methods. Notably, our detector combined with BEATs features consistently demonstrates superior performance, excelling for instance in temporal precision (78.4%) and in detection precision for both thresholds (95.2% for 0.3 and 55.1% for 0.7). Our method combined with BYOL-A and BYOL-A-v2 features also showcases a balanced performance, maintaining high temporal accuracy (86.0% and 82.1%, respectively). In comparison, LD exhibits high temporal recall (99.0%) but lower temporal precision (35.7%), highlighting a bias in its predictions. While RLD achieves a better balance between temporal precision and recall (58.9% and 62.0%, respectively), it is still far from our results.
Furthermore, we evaluate our detector using five audio feature extractors: Wav2CLIP [100], CAV-MAE [27], two versions of BYOL-A [64,65], and BEATs [10]. Among these, the BEATs encoder exhibits the most suitable audio representation capacity for our detector, providing the best results (last row). During the analysis of the laughter detection, we make three important observations: (i) The majority of false positives are unfiltered sounds that are not easily separable using K-means clustering. (ii) The majority of false negatives correspond to intra-diegetic laughter, which is typically less loud and therefore more challenging to detect. (iii) The peak detector fails in scenarios where music overlaps with laughter, such as in party settings.
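The first two stages of the detector (removing voices by subtracting the stereo channels, then detecting peaks) can be illustrated with the following minimal sketch. It assumes that dialogue is mixed identically into both channels so that it cancels under subtraction, and it uses a simple windowed mean-amplitude threshold as a stand-in for the peak detector, whose details are not given here. Function names, window size and threshold are hypothetical.

```python
def subtract_channels(left, right):
    """Cancel content that is identical in both stereo channels (e.g. centred speech)."""
    return [l - r for l, r in zip(left, right)]


def detect_active_segments(signal, window, threshold):
    """Return (start, end) sample ranges whose mean absolute amplitude exceeds threshold."""
    segments = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        level = sum(abs(s) for s in chunk) / window
        if level > threshold:
            if segments and segments[-1][1] == start:
                # Extend the previous segment when windows are contiguous.
                segments[-1] = (segments[-1][0], start + window)
            else:
                segments.append((start, start + window))
    return segments


# Hypothetical toy signal: speech is identical in both channels,
# audience laughter appears only in the left channel.
speech = [0.5] * 8
laughter = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
left = [s + l for s, l in zip(speech, laughter)]
right = list(speech)

residual = subtract_channels(left, right)
segments = detect_active_segments(residual, window=2, threshold=0.5)
print(segments)
```

Anything that survives the subtraction, including background music, is kept as a candidate, which is why a final clustering stage is still needed to separate music from laughter.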

Influence of Clustering on the Detection Performance.
Here, we examine how the choice of the cluster count parameter K in the K-means algorithm influences the performance of our laughter detector. In practice, laughter chunks significantly outnumber music chunks. Consequently, in the third stage of Figure 3, we exclude the smallest cluster (identified as the music cluster through empirical assessment) and retain the clusters comprising the laughter chunks.
Fig. 6: Evolution of the temporal (blue) and detection (red) F1 scores according to the number of clusters chosen for the K-means algorithm at the end of the laughter detection pipeline.
Figure 6 shows the performance of the detection pipeline both at the detection level (red lines) and at the temporal level (blue lines) as a function of the number of clusters (x-axis). Overall, we make the following three observations: (1) Using 1 cluster is equivalent to no clustering. (2) Between 2 and 4 clusters, F1 scores are higher than for 1 cluster. Here, there are enough degrees of freedom for the K-means algorithm to correctly detect the centroid of the music cluster. (3) For more than 7 clusters, F1 scores tend to converge to the same value as for 1 cluster. Here, there are too many degrees of freedom for the K-means algorithm, and therefore it detects multiple centroids for the music cluster. Thus, the higher the number of clusters, the smaller the music sub-cluster we have, with the extreme case of having one cluster per sample, which has the same effect as no clustering.
Moreover, Figure 6 shows that the detection F1 score (red line) is less sensitive to the number of clusters than the temporal F1 score (blue line). This can be explained by the fact that music chunks are generally longer than laughter chunks. Thus, by removing longer false positive chunks, we improve temporal metrics, whereas the impact is less important at the sample scale for detection metrics.
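The cluster-filtering step discussed in this section can be sketched as follows: cluster the chunk features with K-means and discard the smallest cluster, which is assumed to contain music. This is a minimal illustration with a naive, deterministic K-means (first K points as initial centroids) and hypothetical 2-D features; it is not the pipeline used for the reported results.

```python
from collections import Counter


def kmeans_labels(points, k, iterations=20):
    """Naive K-means; the first k points serve as initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def drop_smallest_cluster(points, labels):
    """Keep every chunk except those in the smallest cluster (assumed to be music)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return list(points)  # a single cluster is equivalent to no clustering
    smallest = min(counts, key=counts.get)
    return [p for p, l in zip(points, labels) if l != smallest]


# Hypothetical chunk features: four laughter-like chunks, two music-like chunks.
features = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.1, 5.0)]

kept_k2 = drop_smallest_cluster(features, kmeans_labels(features, k=2))
kept_k1 = drop_smallest_cluster(features, kmeans_labels(features, k=1))
print(kept_k2)
print(len(kept_k1))
```

With K=1 nothing is removed, matching observation (1); with a very large K only a small fragment of the music chunks would be removed, matching observation (3).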
Analysis of FunnyNet-W

Modality Impact
To visualize the impact of modalities, we compute the average attention values on the three CA modules (CA boxes in Figure 2) and then show the average weights for each modality in the pie chart of each example in Figure 7. For this, we show (a-d) four positive and (e, f) two negative samples from Friends with frames, subtitles, audio spectrogram (left) and pitch (right). We observe that the contribution of each modality varies; the commonality, though, is that audio contributes more than half, followed by text and finally visual features. Specifically, in cases where there is a strong audio signal, the contribution of audio increases significantly. This is illustrated when the character yells ('Chandler' in positive example (a)), pauses the speech ('Chandler' in positive example (b)), speeds up the speech rate ('Phoebe' in positive example (c)), or suddenly changes the speech volume ('Chandler' and 'Joey' in positive example (d)). In contrast, in negative examples (e) and (f), the tone, volume, pitch or rhythm do not change greatly, so the text starts to play a bigger role in determining them as non-funny scenes. Furthermore, we observe that in examples (c) and (e), the visual features play a very small role in the final prediction, probably because the scenes do not capture the characters' whole bodies and their movement, so the visual model can offer only little information.
Table 9: Evaluation of Laughter Detection on Friends. We compare five versions of our laughter detector, denoted as 'Ours', employing different feature encoders, along with two external audio laughter detectors. The last row corresponds to the actual configuration used in FunnyNet-W.
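The per-modality shares shown in such pie charts can be obtained by averaging attention mass over the token positions of each modality. The following is a minimal sketch under stated assumptions: attention rows are distributions over stacked tokens, and the token ranges of each modality are known. The layout and numbers are hypothetical and do not correspond to any example in Figure 7.

```python
def modality_shares(attention_rows, spans):
    """Aggregate attention mass per modality and normalise to shares summing to 1.

    attention_rows: list of attention distributions over the stacked tokens.
    spans: mapping from modality name to its (start, end) token range.
    """
    totals = {name: 0.0 for name in spans}
    for row in attention_rows:
        for name, (start, end) in spans.items():
            totals[name] += sum(row[start:end])
    grand_total = sum(totals.values())
    return {name: value / grand_total for name, value in totals.items()}


# Hypothetical layout: tokens 0-1 are audio, token 2 is text, token 3 is visual.
token_spans = {"audio": (0, 2), "text": (2, 3), "visual": (3, 4)}
rows = [
    [0.5, 0.1, 0.3, 0.1],
    [0.7, 0.1, 0.1, 0.1],
]

shares = modality_shares(rows, token_spans)
print(shares)
```

Each share can then be drawn as one slice of the pie chart for the corresponding clip.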

Feature Visualization
Figure 8 shows the t-SNE [33] visualization of features: (a, b, c) display the unimodal distributions of audio, text, and visual features, respectively, while (d) corresponds to all modalities for four datasets. Blue corresponds to funny samples and red to not-funny ones. All single features, and in particular the visual and textual ones, are scattered in the 2D space without clear boundaries between positives and negatives. Interestingly, for Friends, TBBT and MUStARD, the audio features alone exhibit a notable ability to discriminate positive and negative samples; this is probably because of the punchlines used in these shows, which typically occur at the end of sentences. For these three datasets, we observe that the joint embedding of all modalities results in the best separation between positives and negatives. Interestingly, for UR-Funny, a dataset without ending punchlines, all combinations of modalities (either single or joint) fail to distinguish funny from not-funny moments. This is probably due to the domain shift between samples from this dataset (TED-talk segments) and the samples used at training (sitcoms).

Impact of CAF Module
To examine the effect of CAF, we visualize in Figure 9 the learned attention maps: red indicates higher and blue lower attention. (a, b, c) display the cross-attention between the unified F U and (a) audio, (b) visual, (c) text features. Since F U is stacked from audio, vision, and text, we observe that each modality highly attends to itself (especially text). We also observe that the audio encoder attends to the text encoder, indicating that there is mutual information shared between text and audio. Finally, (d) displays the self-attention map of F U, where we observe that F U attends to all tokens with different weights. The small color differences on the diagonal and anti-diagonal areas suggest that the joint features have approximately uniform representations for the final classification.

Failure Analysis
By examining the results, we observe three main groups of failure cases. First, strong emotional responses that characters express only through single words (such as 'haha', 'no!') are not always funny. However, all modalities incorrectly, yet confidently, predict them as funny. [...] expression or grimace, surprise, pause in dialogue, phrase, joke). In such cases, FunnyNet-W may fail to discriminate these subtle cues that usually come with human-level understanding.
Audio-Only. As audio is the most discriminative cue, we examine its impact on out-of-domain audio: narrating jokes and reading books. Our model detects funny punchlines from jokes, mostly when they are accompanied by a change of pitch or a pause; for the audiobook, it successfully detects funny moments when the reader's voice imitates a character.

FunnyNet-W against LLM Chatbot
Recently, several large language models (LLMs) [68,93] have been fine-tuned and are used as chatbots [67,46]. With their expansive knowledge and context-aware responses, they have significantly advanced language understanding and generation, which enables them to perform a wide range of language-related tasks. In this context, we compare the proposed FunnyNet-W against a chatbot to assess its performance relative to these general models. Specifically, we use the LlaMa-2 [94] chatbot on the Friends dataset.
Prompting. We evaluate the LlaMa-2 chatbot in two setups. First, with or without prompt training:
- zero-shot setting, where we prompt the chatbot with a transcript sample and ask it to determine whether it is funny or not. We do that iteratively for all samples of the Friends test set. The prompt we use is "Is the following sentence funny or not? ≪subtitles≫", where ≪subtitles≫ corresponds to each test sample. However, given the popular nature of the 'Friends' sitcom, the chatbot may have already seen samples or even the whole transcript of the TV show during training. We hypothesize that this impacts its performance positively, as the chatbot not only has knowledge of the dialogues that follow, but also knows the comments of the community for each pun or joke.
- few-shot setting, where we prompt the chatbot with some training samples followed by the testing sample within the token context limit. The prompt we use is twenty training samples (ten positives and ten negatives): "This sentence is funny: ≪subtitles≫. This sentence is not funny ≪subtitles≫.", followed by the testing sample: "Is the following sentence funny or not? ≪subtitles≫". In this case, the chatbot uses the training samples to better distinguish the specific type of humour of the TV show.
Second, by performing simple prompt engineering on the system prompt (i.e., the part of the prompt that gives context to the chatbot):
- general system prompt, where we prompt the chatbot with the general system prompt (referred to as 'Generic'): "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. [...]". This system prompt makes the chatbot act as a general chatbot without any prior on the task.
- specific system prompt, where we prompt the chatbot with the task-specific system prompt (referred to as 'Specific'): "I will give some sentences, and you need to say if it's funny or not, reply only by yes or no.". This kind of system prompt helps the chatbot limit its range and focus on the task only. Moreover, it forces the chatbot to answer, whereas the general system prompt sometimes leads to hesitating answers.
Experimental results. Table 10 reports the results when prompting the LlaMa-2 chatbot. We observe that without prompt training, the chatbot's performance drops both with and without prompt engineering. Additionally, we observe the importance of prompt engineering: when using the specific prompt (with or without training), the performance is higher than 50% in both metrics, whereas the generic prompt (no prompt engineering) results in very low performance. This is in line with the current literature on LLMs, where prompt engineering is crucial for higher accuracy. We believe that more prompt engineering will increase the performance; yet, this is outside the scope of this work. Overall, we observe that FunnyNet-W outperforms all examined cases with the chatbot. This highlights the need for specific model training for funny moment detection. Interestingly, we note that the performance of FunnyNet-W using text only is close to that of the LlaMa-2 chatbot, thus showcasing the impressive representation power of LLM chatbots. Figure 12 illustrates four examples: two positive (first and second rows) and two negative samples (third and fourth rows). For the positive samples, we observe that LlaMa-2 correctly understands their funniness (green) with prompt training, whereas when there is no prompt the results are incorrect. For the two negative samples, most predictions from the chatbot are incorrect, most likely because it picks up words with strong emotional expressions, like "weird" and "burn", resulting in false positives. In all examples, FunnyNet-W correctly predicts the results because it does not solely rely on text, but also on audio and visual features.
Overall, our findings are twofold: (1) using only subtitles is insufficient to understand the funniness in video scenes, and (2) since we only do minor prompt engineering (generic and specific), the results of LLMs cannot outperform the proposed FunnyNet-W. Potentially, by improving the prompts, we can further improve the performance of LLMs.
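The prompt construction described in this section can be sketched as follows. The system prompt and question strings are quoted from the text above; the helper names, the interleaving of positive and negative examples, the colon after "not funny", and the reply parser are assumptions made for this illustration. No chatbot API is called.

```python
SPECIFIC_SYSTEM_PROMPT = (
    "I will give some sentences, and you need to say if it's funny or not, "
    "reply only by yes or no."
)
QUESTION_TEMPLATE = "Is the following sentence funny or not? {subtitles}"


def zero_shot_prompt(test_sample):
    """Build the zero-shot user prompt for one test sample."""
    return QUESTION_TEMPLATE.format(subtitles=test_sample)


def few_shot_prompt(positives, negatives, test_sample):
    """Prepend labelled training samples, then ask about the test sample."""
    parts = []
    for funny, not_funny in zip(positives, negatives):
        parts.append(f"This sentence is funny: {funny}.")
        parts.append(f"This sentence is not funny: {not_funny}.")
    parts.append(zero_shot_prompt(test_sample))
    return " ".join(parts)


def parse_reply(reply):
    """Map a yes/no reply to True/False; return None for hesitating answers."""
    answer = reply.strip().lower()
    if answer.startswith("yes"):
        return True
    if answer.startswith("no"):
        return False
    return None


# Hypothetical placeholder sentences, not taken from the Friends transcripts.
prompt = few_shot_prompt(
    positives=["funny line one", "funny line two"],
    negatives=["plain line one", "plain line two"],
    test_sample="some test line",
)
print(SPECIFIC_SYSTEM_PROMPT)
print(prompt)
```

Returning None for replies that are neither yes nor no makes hesitating answers, which the generic system prompt sometimes produces, explicit instead of silently counting them as one class.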

Impact of Audio
In the context of funny moment detection, audio is more relevant than text [53] because it contains more information, including vocals, pauses, pitch variations, speech rate variations, rhythm and timing, accent and pronunciation, emotional tone, music and background noise. To highlight the importance of audio, in this section, we test FunnyNet-W by replacing the ground-truth audio with automatically generated machine sounds.
For this, we generate corresponding synthetic audio from the ground-truth subtitles of the Friends dataset with a text-to-audio model. Note that the synthetic audio only mimics the vocals between characters without any background sounds and, more importantly, without including the additional voice cues that help identify the emotional state of the character and the dialogue. Then, we train FunnyNet-W with the synthetic voices and test it on both the real and synthetic voices; similarly, we test FunnyNet-W trained on real voices on both real and synthetic voices. Table 11 reports the results. When training with synthetic voices (first and second columns), we observe that testing on real voices (second row) outperforms testing on synthetic ones (first row) by a large margin, i.e., approximately 10-15% for both metrics. Similarly, when training with real voices (third and fourth columns), we observe a significant difference in performance (of approximately 20% in both metrics) between testing on synthetic and real data. These results show that simply replacing real voices with synthetic ones omits other important information, such as background audio and music; hence, the model makes more correct predictions when the test set contains additional auditory information (real) rather than a simple voice (synthetic). When we test on synthetic voices (first row), we observe that training either with synthetic or real voices produces similar results. This is because the test set contains synthetic data, and therefore learning the specificities of voice is not necessary for good performance. However, when we test on real voices (second row), we observe that training with real voices (columns 3-4) outperforms training with synthetic ones (columns 1-2) by a large margin (e.g., for Acc, 79.6% for synthetic vs. 85.6% for real). This clearly shows that the real voice includes important additional cues (pause, intonation, etc.) that help FunnyNet-W discriminate funniness.
To further analyze the effect of voice, we perform a qualitative comparison using Spleeter. Specifically, Figure 13 illustrates the spectrum heatmaps of (a) real vocals, (b) real accompaniment (non-vocal parts, such as background music, sounds, talks, audio), (c) synthetic vocals, and (d) the differences between real and synthetic audio. We visualize the heatmaps of two examples (two rows), where in both cases FunnyNet-W correctly predicts the funniness when using real audio and incorrectly when using synthetic audio. The first row shows the funny moment when Phoebe tries to shush the wind bell while the wind bell keeps ringing. This contrast between vocals and non-vocal sounds (in this case, the bell ringing) is missing from the synthetic vocals. The second row shows that when Ross screams excitedly ('Call Mum!'), his voice triggers the smartphone to dial Pete's mum. This strong vocal expression does not appear in the synthetic vocals. Both examples indicate that audio plays a key role in funny moment detection because it contains not only background sounds but also expressions and feelings from the characters, leading to better scene understanding.
Furthermore, we also use t-SNE to visualize the data clustering in Figure 14. This visualization shows that when the train and test data come from the same domain (either real or synthetic, figures c and d), the positive and negative distributions (blue and red points) are clearly separable. This is in contrast to figures (a, b), where the synthetic and real domains are mixed (i.e., training on one domain and testing on another); in this case, the two distributions overlap more, as expected due to domain shift [39,92].

Ethical discussion
Practical Impact. There are various potential applications for FunnyNet-W. First, it may be useful to collect a large dataset of funny moments (similar to [48]), so that, for example, cognitive researchers could study funniness mechanisms at a large scale. Next, it may be useful to enable artists to edit films more easily, without relying on a live audience. Finally, it may be useful to enhance human-machine interactions. For instance, adding a sense of humor to conversational agents would make the relation more natural and spontaneous.
However, FunnyNet-W is part of artificial intelligence systems that tend to analyze complex human specificities and behaviors (e.g., conversational agents). Given the nature of these systems, their usage and deployment should be done with caution. For instance, in the particular case of FunnyNet-W, it could enhance identity fraud methods by better mimicking the sense of humor of victims.
Societal Impact. FunnyNet-W is trained mainly with Western cultural materials, especially from the USA, which do not necessarily represent uniform demographics. In particular, we mainly tackle funniness in American sitcoms, which covers a very specific type of humor. Therefore, without fine-tuning, FunnyNet-W might have difficulties in generalizing to funny moments from other cultures, as humor is highly thematic, and themes vary from one culture to another. Moreover, the audio modality might also be highly impacted by cultural bias, as expressiveness is strongly related to culture, e.g., actor performances change a lot from one country to another, leading to misinterpretations. In addition to the cultural barrier, FunnyNet-W includes language bias. Indeed, the audio as well as the textual modality are trained with the English language. This can be a limiting factor for generalization and transferability across languages, as jokes or puns often rely on language specificities. We also note that the textual modality is limited by alphabets that vary among languages.
Environmental Impact. All experiments are done on NVIDIA RTX4090 and A100 GPUs, each of which requires 215 W of power. For this project, we use approximately 800 GPU hours. Training a FunnyNet-W model with all three modalities requires around 6 GPU hours on an NVIDIA RTX4090, which amounts to 1.29 kWh and 300.75 g of CO2 emitted.
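The energy figure above follows directly from power multiplied by time. The short sketch below reproduces this arithmetic and derives the carbon intensity implied by the reported emissions; the intensity is back-computed from the numbers in the text and is not a value stated by the authors. The project-level estimate assumes, for illustration only, that all 800 GPU hours draw 215 W.

```python
POWER_WATTS = 215            # per-GPU power reported in the text
TRAINING_HOURS = 6           # one full three-modality training run
PROJECT_GPU_HOURS = 800      # approximate total reported for the project
REPORTED_CO2_GRAMS = 300.75  # emissions reported for one training run

# Energy of one training run: 215 W * 6 h = 1290 Wh = 1.29 kWh.
training_energy_kwh = POWER_WATTS * TRAINING_HOURS / 1000

# Carbon intensity implied by the reported figures (derived, not stated).
implied_intensity_g_per_kwh = REPORTED_CO2_GRAMS / training_energy_kwh

# Rough project-level energy under the 215 W assumption.
project_energy_kwh = POWER_WATTS * PROJECT_GPU_HOURS / 1000

print(f"training run: {training_energy_kwh:.2f} kWh")
print(f"implied intensity: {implied_intensity_g_per_kwh:.2f} g CO2/kWh")
print(f"project estimate: {project_energy_kwh:.0f} kWh")
```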

Conclusions
We introduced FunnyNet, an audiovisual model for funny moment detection. In contrast to works that rely on text, FunnyNet exploits audio that comes naturally with videos and contains high-level cues (pauses, tones, etc.). Our findings show audio is the dominant cue for signaling funny situations, while video offers complementary information. Extensive analysis and visualizations also support our finding that audio is better than text (in the form of subtitles) when it comes to scenes with no or simple dialogue but with hilarious acting or funny background sounds. Our results show the effectiveness of each component of FunnyNet, which outperforms the state of the art on the TBBT, MUStARD, MHD, UR-Funny and Friends datasets. Future work includes analyzing the contribution of audio cues (pitch, tone, etc.).

Fig. 1: What is funny? Audio cues along with visual frames and textual data are a rich source of information for identifying funny moments in videos. Video scene from Pulp Fiction, 1994, source video https://www.youtube.com/watch?v=4L5LjjYVsHQ

Fig. 2: Architecture of FunnyNet-W. Given audio-visual clips, FunnyNet-W predicts funny moments in videos. It consists of the audio (blue), textual (red), and visual (green) encoders, whose outputs pass through the Cross Attention Fusion (CAF), which consists of cross-attention (CA) and self-attention (SA) for feature fusion. It is trained to embed all modalities in the same space via self-supervision ($L_{ss}$) and to classify clips as funny or not funny ($L_{cls}$).
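To illustrate the kind of fusion the caption describes, here is a minimal sketch of generic scaled dot-product attention used first across two modalities and then within the fused result. It is not the authors' CAF module: it omits learned projections, multiple heads and normalization, and the ordering (cross-attention followed by self-attention), the function names and the toy token values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def cross_then_self(target, source):
    """Cross-attention (target queries source), then self-attention."""
    fused = attention(target, source, source)  # queries come from the other modality
    return attention(fused, fused, fused)      # refine the fused tokens


# Hypothetical toy tokens: 2 audio tokens and 3 visual tokens of dimension 2.
audio_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused_audio = cross_then_self(audio_tokens, visual_tokens)
print(len(fused_audio), len(fused_audio[0]))  # one fused vector per audio token
```

The output keeps one vector per query token, so each modality can be enriched with information from another without changing its sequence length.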

Fig. 3: Proposed laughter detector. It takes raw waveforms as input and consists of (i) removing voices by subtracting channels (here, the audio is stereo with 2 channels), (ii) detecting peaks, and (iii) clustering the detected audio segments into music and laughter.
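The three steps in the caption can be sketched on a toy signal as follows. This is an illustrative sketch under stated assumptions, not the authors' detector: the frame-energy threshold, the non-overlapping framing, the 1-D two-cluster k-means and every function name are hypothetical, and a real system would cluster richer audio features than the single scalar used here.

```python
def remove_voices(left, right):
    """Step (i): subtract stereo channels so content identical in both cancels."""
    return [l - r for l, r in zip(left, right)]


def frame_energy(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def detect_peaks(energies, threshold):
    """Step (ii): (start, end) frame ranges whose energy exceeds the threshold."""
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


def two_means(features, iters=20):
    """Step (iii): 1-D k-means with k=2; returns a 0/1 label per feature."""
    lo, hi = min(features), max(features)
    labels = [0] * len(features)
    for _ in range(iters):
        labels = [0 if abs(f - lo) <= abs(f - hi) else 1 for f in features]
        group0 = [f for f, c in zip(features, labels) if c == 0]
        group1 = [f for f, c in zip(features, labels) if c == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels


# Toy stereo clip: the voice is identical in both channels, while a
# background event (e.g., audience sound) appears only in the left one.
voice = [0.5, -0.5] * 8
event = [0.0] * 4 + [0.8, -0.8] * 2 + [0.0] * 8
left = [v + e for v, e in zip(voice, event)]
right = voice[:]

residual = remove_voices(left, right)
segments = detect_peaks(frame_energy(residual, frame_len=4), threshold=0.1)
print(segments)

# Hypothetical per-segment features (e.g., durations in seconds).
print(two_means([1.0, 1.2, 5.0, 5.5]))
```

Note that k-means only separates the segments into two groups; deciding which group corresponds to laughter and which to music requires an extra rule or a manual check.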

Fig. 4: Comparison of various time window lengths used as input of the (top) visual encoder of FunnyNet-W (referred to as FunnyNet-W V) and (bottom) audio encoder of FunnyNet-W (referred to as FunnyNet-W A). We illustrate (left, a) the F1 score and (right, b) the accuracy on different datasets. The average results are plotted as red lines.

Fig. 5: Comparison of different lengths of time windows for the visual encoder of FunnyNet-W (referred to as FunnyNet-W V). We illustrate (a) the F1 score and (b) the accuracy on different datasets. The average results are plotted as red points and lines for Timesformer and in magenta for VideoMAE.

Fig. 7: Visualization of (a, b, c, d) funny and (e, f) non-funny predictions on the Friends test set. We show the audio, visual and text inputs, the learned average weights of cross-attentions from CAF (pie chart), and the subtitles (for better understanding).

Figure 10-(c) indicates such an example, where Ross gives a sarcastic response to Joey without changing facial expression or tone; in this case, FunnyNet-W incorrectly predicts the scene as negative. Third, in most cases, all modalities fail to understand inside jokes that depend on long-term context. For instance, Figure 10-(d) is a case where the relevant context is so far back (the previous awkward moment between Ross and Rachel) that the model wrongly predicts the scene as a not-funny moment. None of the audio, visual or text modalities gives a discriminative signal of funniness.

7 Funny Scene Detection in the Wild

7.1 Applications from other domains

In this section, we show applications of FunnyNet and FunnyNet-W in videos from other domains.
[Fig. 10 panels show dialogue excerpts from Friends; the labeled panels include (b) False positive and (d) False negative.]

Fig. 10: Failure cases on Friends split into three main groups: (a, b) strong emotional responses expressed by a single word, (c) subtle sarcastic comments with a straight face and no follow-up indications, and (d) inside jokes depending on long-term understanding.
[Fig. 13 panels: (a) Real vocal, (b) Real accompaniment, (c) Synthetic vocal, (d) Residues between (a) and (c); each panel is shown with its transcribed speech.]

Fig. 13: Visualization of real and synthetic audio on Friends. We show real vocals (a), real accompaniment (b), synthetic vocals (c) and the residues between real and synthetic vocals (d).

Fig. 14: t-SNE visualization of real and synthetic audio on Friends. We show positive (blue) and negative (red) samples to indicate the feature distributions.

Table 2: Ablation of visual encoders on Friends.

Table 3: Ablation of text encoders on Friends.

Table 4: Ablation of audio encoders on Friends. (A: audio, V: visual frames, $T_{gt}$: ground truth text, $T_a$: automatically generated text, i.e., text extracted from speech)

Table 5: Ablation of modalities of FunnyNet-W on Friends test set.

Table 6: Ablation of CAF of FunnyNet-W on Friends test set. (A: audio, V: visual frames, T: text)

Table 7: Ablation of losses used to train FunnyNet-W.

Table 8: Comparison to the state of the art in terms of FLOPs count (FLOPs), number of parameters (Params) and average inference runtime (Runtime). We report two versions of model complexity: FunnyNet (V+F+A) and FunnyNet-W (V+F+T) including pre-trained encoders, and FunnyNet$_b$ and FunnyNet-W$_b$ excluding pre-trained encoders.

Table 11: Ablation of synthetic and real voice when training and testing FunnyNet-W on Friends.