Visual Learning of Semantic Concepts in Social Multimedia
- Cite this article as:
- Borth, D. Künstl Intell (2014) 28: 333. doi:10.1007/s13218-014-0328-x
Traditional media is currently experiencing a major shift towards social media. At the same time, interaction via social media is increasingly enriched with images and videos, as seen during the Arab Spring in the Middle East in 2012 or the Boston Marathon Bombings on April 15, 2013. This combination gives rise to a new type of content, which is called social multimedia.
Unfortunately, this content is of little use unless users can access it, e.g., by retrieving videos through keyword-based search. Keyword-based search, however, requires each individual video to be annotated with a set of keywords describing its content. Given the vast amount of video content being created nowadays (YouTube, for example, stores about 100 hours of new video content every minute), this is an impossible task for human annotators.
Worse, while humans perceive their surroundings visually with ease, interpreting visual content is quite challenging for machines. This lack of correspondence between the low-level features that machines can extract from videos (i.e., the raw pixel values) and the high-level conceptual interpretation a human associates with perceived visual content is referred to as the semantic gap.
In recent years, great effort has been spent on content-based methods that directly analyze the video stream to bridge this gap. Following this line of research, the thesis focuses on concept detection, the task of detecting semantic concepts in visual content. Given an input video clip, concept detection systems use statistical learning to infer the presence of a target concept by estimating its probability of appearance from low-level features extracted from the content. For this purpose, the set of all concepts (the concept vocabulary) should cover a broad spectrum of entities, such as objects (“chair”, “telephone”), scene types (“cityscape”, “desert”), and activities (“interview”, “people singing”), requiring concept detection systems to provide detectors for hundreds or even thousands of target concepts.
This, however, is considered a major challenge in concept detection, as it demands labeled training samples for supervised machine learning, the underlying technology of current systems.
Such ground-truth training samples are usually acquired manually, i.e., a human annotator labels videos according to whether the concept occurs. This time-consuming and cost-intensive effort creates a scalability problem: the resulting small-scale, fixed concept vocabularies are useful in research setups but cannot satisfy the changing demands of users’ information needs. As a consequence, state-of-the-art systems still focus on generic concepts such as “quadruped” or “hand” instead of providing detectors for concepts of interest, e.g., sports events such as “Olympics 2012”, incidents such as the “Costa Concordia” accident, or product releases such as the new “iPhone”. Finally, while there are approaches that can infer affect in visual content, no methods have yet been described in the literature for sentiment prediction from visual content. This kind of automatic assessment would lead to more comprehensive descriptions of social multimedia, where people express their opinions and sentiments on a regular basis.
The thesis presents strategies to address the challenges outlined above by proposing a novel combination of visual learning of semantic concepts and social media analysis. First, social media streams are mined for trending topics to synchronize concept detection with real-world events matching users’ information needs. Second, web video from platforms such as YouTube is exploited as an alternative training source for concept detection, with YouTube’s user-generated tags used as positive labels for supervised machine learning. Third, concept detection is extended by a large-scale visual sentiment ontology (VSO). The resulting SentiBank detectors are constructed from the analysis of emotions expressed on YouTube and Flickr.
2 Dynamic Vocabularies from Trending Topics
The first contribution of the thesis is its novel approach to forming dynamic vocabularies for concept detection. The key idea is to expand concept vocabularies with trending topics that are mined automatically from media like Google, Wikipedia, and Twitter. To achieve this, topics from different media channels are clustered and aggregated to form daily trending topics (see Fig. 2).
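The clustering and aggregation step can be illustrated with a minimal sketch. It is not the method used in the thesis: the greedy single-pass grouping, the token-overlap (Jaccard) similarity, the `threshold` value, and the function name `cluster_daily_topics` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def cluster_daily_topics(
    observations: List[Tuple[str, str]], threshold: float = 0.5
) -> List[Dict]:
    """Greedily group (channel, label) pairs from one day into topics.

    Assumption: two labels belong to the same topic when the Jaccard
    overlap of their lower-cased tokens reaches `threshold`.
    """
    clusters: List[Dict] = []
    for channel, label in observations:
        toks = set(label.lower().split())
        if not toks:
            continue  # skip empty labels
        for cluster in clusters:
            union = toks | cluster["tokens"]
            if len(toks & cluster["tokens"]) / len(union) >= threshold:
                cluster["labels"].append(label)
                cluster["channels"].add(channel)
                cluster["tokens"] = union
                break
        else:
            # No sufficiently similar cluster exists yet: start a new topic.
            clusters.append(
                {"labels": [label], "channels": {channel}, "tokens": toks}
            )
    return clusters
```

In such a setup, the number of distinct channels in a cluster could serve as a simple indicator of how widely a topic is trending.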
Next, the thesis presents the first comprehensive study of trending topic characteristics across three major online and social media streams, covering thousands of trending topics during an observation period of an entire year. Results from this study show that a typical trending topic “lives” for up to 14 days, with an average of 5 days. Surprisingly, the analysis indicates that Wikipedia as a media channel is as quick as Twitter when it comes to the first appearance of a trending topic.
An important condition for constructing concept vocabularies dynamically is the capability to predict the most popular trending topics for detector training. This is done by forecasting the life cycle of trending topics at the very moment they emerge. The presented fully automated approach is based on a nearest-neighbor forecasting technique, exploiting the assumption that semantically similar topics exhibit similar behavior. Experimental results show that this approach forecasts the progression of trending topics more accurately than state-of-the-art autoregressive moving-average methods.
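The idea behind nearest-neighbor forecasting can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from the thesis: token overlap stands in for a real semantic similarity measure, and the names `jaccard`, `forecast_life_cycle`, `history`, `topic_tokens`, and the parameter `k` are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity (assumed stand-in for semantic similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def forecast_life_cycle(
    new_topic: Set[str],
    history: Dict[str, Sequence[float]],
    topic_tokens: Dict[str, Set[str]],
    k: int = 2,
) -> List[float]:
    """Forecast a daily popularity curve as the similarity-weighted mean
    of the curves of the k most similar historical topics."""
    if not history:
        raise ValueError("at least one historical topic is required")
    neighbors = sorted(
        ((jaccard(new_topic, topic_tokens[name]), name) for name in history),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0.0:
        # No similar topic found: fall back to an unweighted mean.
        weights = [1.0 / len(neighbors)] * len(neighbors)
    else:
        weights = [sim / total for sim, _ in neighbors]
    # Only forecast as many days as every selected neighbor covers.
    horizon = min(len(history[name]) for _, name in neighbors)
    return [
        sum(w * history[name][day] for w, (_, name) in zip(weights, neighbors))
        for day in range(horizon)
    ]
```

The forecast is available as soon as a topic emerges, because it relies only on the topic's description and on past topics rather than on the new topic's own history.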
Once identified, the trending topics can be mapped to a static concept vocabulary, trained as a single detector on the fly, or used to expand an existing static vocabulary. These three strategies for establishing dynamic vocabularies are evaluated on 6,800 YouTube videos and the top 23 target topics from the dataset. The results show that direct visual classification of trends (by “live” learning from trending topic videos) outperforms inference from static vocabularies, and can be further improved by combining the first and second strategies.
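One simple way to combine two such strategies is late fusion of their detector scores. The sketch below is an assumption for illustration only: the weighted average, the `live_weight` parameter, and the function name `rank_by_fused_score` are not taken from the thesis, which does not specify the combination rule here.

```python
from typing import Dict, List


def rank_by_fused_score(
    static_scores: Dict[str, float],
    live_scores: Dict[str, float],
    live_weight: float = 0.5,
) -> List[str]:
    """Rank video ids by a weighted average of two detector scores.

    static_scores: scores inferred from a static concept vocabulary.
    live_scores: scores from a detector trained directly on the trend.
    Videos missing a static score are treated as having score 0.0.
    """
    if not 0.0 <= live_weight <= 1.0:
        raise ValueError("live_weight must lie in [0, 1]")
    fused = {
        video: live_weight * live_scores[video]
        + (1.0 - live_weight) * static_scores.get(video, 0.0)
        for video in live_scores
    }
    # Highest fused score first; ties are broken by video id.
    return sorted(fused, key=lambda video: (-fused[video], video))
```

Setting `live_weight` to 1.0 or 0.0 recovers the ranking of either strategy on its own, which makes the two baselines easy to compare against the fused result.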
3 Web Video and Active Relevance Filtering
One major challenge in concept detection based on web video is how to retrieve proper visual training content from web platforms like YouTube. Similar to a textual web search, where a user has to define a set of keywords to formulate a search query, a concept detection system must define a proper set of keywords for API query construction. For a concept like “car race”, for example, the videos retrieved should not include remote-controlled cars or interviews with racecar drivers. The thesis presents an approach that offers an automatic concept-to-query mapping for training data acquisition from YouTube, where queries are automatically constructed by keyword selection and category assignment. Results demonstrate that the proposed method allows the system to automatically retrieve training videos that are as relevant as those retrieved manually by humans.
4 Adjective Noun Pairs for Visual Sentiment
The third contribution of the thesis is to tackle the challenge of sentiment analysis based on visual content as illustrated in Fig. 4.
This is accomplished by introducing a large-scale ontology of 3,000 adjective noun pairs (ANPs). This ontology is based on psychological theory, and the proposed construction method is fully data-driven, i.e., it automatically mines online sources such as Flickr and YouTube for sentiment words, which serve as the building elements for ANP discovery. Further, the thesis presents SentiBank, a novel mid-level representation framework, which is built on the ontology and encodes the visual presence of 1,200 ANPs. This bank of concept detectors is able to differentiate between visual concepts such as “cute dog” and “dangerous dog” (Fig. 5) and therefore provides a unique understanding of visual sentiment.
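A data-driven ANP discovery step might look roughly like the following sketch. It is an illustration under assumptions, not the construction method of the thesis: candidate pairs are formed from adjectives and nouns that co-occur in the tag list of one item, and the `min_count` frequency filter, the word lists, and the function name `discover_anps` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def discover_anps(
    tag_lists: Iterable[Iterable[str]],
    adjectives: Set[str],
    nouns: Set[str],
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Count adjective-noun co-occurrences in per-item tag lists and keep
    the pairs seen at least `min_count` times, most frequent first."""
    counts: Counter = Counter()
    for tags in tag_lists:
        present = {tag.lower() for tag in tags}
        for adjective in present & adjectives:
            for noun in present & nouns:
                counts[(adjective, noun)] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (f"{adjective} {noun}", count)
        for (adjective, noun), count in ranked
        if count >= min_count
    ]
```

For instance, tag lists containing “cute” and “dog” together on several items would yield the candidate “cute dog”, while a pair seen only once would be filtered out.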
In experiments on sentiment analysis with real-world Twitter data covering 2,000 photo tweets, the proposed mid-level representation improves prediction accuracy by 13 % (absolute gain) in a joint visual-text setup over state-of-the-art text-only methods.
It has also been demonstrated that this approach outperforms state-of-the-art porn detection baselines on real-world pornographic and child sexual abuse (CSA) content. Additionally, the set of detected ANPs makes it possible to explain detection results to law enforcement, which is an important system requirement in this domain.