Collection

S.I. - Multi-modal Transformers

With the development of the Internet, social media, mobile apps, and other digital communication technologies, the world has entered an era of multimedia big data. Millions of multimedia items, including images, text, audio, and video, are uploaded to social platforms every day. To enable artificial intelligence to better understand the world around us, it is essential to teach machines to understand multimodal messages. Multimodal machine learning, which aims to build models that can process and relate information from different modalities, has become a vibrant field of increasing importance and extraordinary potential. In this young and promising area, extensive efforts have been dedicated to seamlessly unifying computer vision and natural language processing, covering multimedia content recognition (e.g., multimodal affect recognition), matching (e.g., cross-modal retrieval), description (e.g., image captioning), indexing (e.g., multimedia event detection), summarization (e.g., video summarization), reasoning (e.g., visual question answering), and more. Although fruitful progress has been made with deep learning-based methods, performance on the above tasks still falls short of users’ expectations, owing to several well-known challenges posed by heterogeneous data: (1) how to represent and summarize multimodal data; (2) how to identify and model the connections and interactions between different modalities; (3) how to learn and infer adequate knowledge from multimodal data; (4) how to translate data or knowledge from one modality to another; and (5) how to understand and evaluate the heterogeneity of multimodal datasets.
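
To make the scope of the special issue concrete, the following is a minimal sketch of the cross-attention fusion that many multi-modal transformers build on, illustrating challenge (2): text tokens act as queries over image-patch features so the two modalities can interact. All class names, dimensions, and the toy inputs are hypothetical placeholders, not the method of any particular submission.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Minimal cross-attention block: text tokens attend to image patches."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text, image):
        # Queries come from the text tokens; keys/values from image patches,
        # so each word gathers the visual evidence it needs.
        fused, _ = self.attn(query=self.norm1(text), key=image, value=image)
        text = text + fused                       # residual connection
        text = text + self.ffn(self.norm2(text))  # position-wise feed-forward
        return text

# Toy usage: batch of 2 samples, 16 text tokens, 49 image patches (7x7 grid).
text_feats = torch.randn(2, 16, 256)   # e.g., output of a text encoder
image_feats = torch.randn(2, 49, 256)  # e.g., projected ViT patch embeddings
fused = CrossModalBlock()(text_feats, image_feats)
print(fused.shape)  # torch.Size([2, 16, 256])
```

Real systems typically stack several such blocks (often with attention in both directions) and pre-train them on large image-text corpora, but the query/key/value split across modalities shown here is the basic interaction primitive.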

Submission Guidelines: Authors should prepare their manuscript according to the Instructions for Authors available from the Multimedia Systems website. Authors should submit through the online submission site at Multimedia Systems and select “S.I. - Multi-modal Transformers” when they reach the “Article Type” step in the submission process. Submitted papers should present original, unpublished work relevant to the topics of the special issue. All submitted papers will be evaluated on the basis of relevance, significance of contribution, technical quality, scholarship, and quality of presentation by at least three independent reviewers. It is the policy of the journal that no submission, or substantially overlapping submission, be published or be under review at another journal or conference at any time during the review process. Final decisions on all papers are made by the Editor-in-Chief.

Editors

  • Feifei Zhang

    Feifei Zhang is currently a professor at the School of Computer Science and Engineering, Tianjin University of Technology. Her research interests include multimedia content analysis, understanding, and applications, especially cross-modal image retrieval, visual question answering, and image captioning. She has authored or co-authored over 20 academic papers in international conferences and journals, including IEEE TIP, IEEE TMM, IEEE TCSVT, ACM TOMM, IEEE CVPR, and ACM MM.

  • An-An Liu

    Dr. An-An Liu is currently a professor in the School of Electronic Information Engineering, Tianjin University, China, and the director of the Institute of Image Information & Television, Ministry of Education. He was previously a visiting professor in the School of Computing, National University of Singapore, working with Prof. Mohan Kankanhalli, and a visiting scholar at the Robotics Institute, Carnegie Mellon University, working with Prof. Takeo Kanade. He received his B.E. and Ph.D. degrees from Tianjin University, China, in 2005 and 2010, respectively. His research interests include cross-media computing and machine learning.

  • Xiaoshan Yang

    Xiaoshan Yang received his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2016. He is currently an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. His research focuses on data-driven and knowledge-guided multimedia content understanding. He has authored or co-authored more than 50 journal and conference papers, most of them in IEEE/ACM transactions or at CCF-A conferences, e.g., IEEE TMM, IEEE TIP, IEEE TCYB, ACM TOMM, IEEE CVPR, ACM MM, and AAAI.

  • Min Xu

    Dr. Min Xu is an Associate Professor at the School of Electrical and Data Engineering (SEDE), Faculty of Engineering and Information Technology (FEIT), University of Technology Sydney (UTS). She is currently the Leader of the Visual and Aural Intelligence Laboratory within the Global Big Data Technologies Center (GBDTC) at UTS. Dr. Xu is a researcher in the fields of multimedia, computer vision, and machine learning. She has published 170+ research papers in prestigious international journals and conferences, including IEEE T-PAMI, IEEE T-NNLS, IEEE T-MM, IEEE T-MC, PR, ICLR, CVPR, ICCV, ACM MM, and AAAI.

Articles (16 in this collection)