1 Introduction

With the development of social media, more and more human activities have become public and accessible. In addition, the widespread deployment of acquisition equipment and sensors has made it easier to obtain information about the surrounding world. Humans collect these multi-source data in chronological order and obtain multi-view sequential data (MvSD). MvSD has broad research value and applications in various domains, including smart transportation, climate science, social media, health care, crime analysis, etc. However, as the volume and scale of MvSD increase, classical data mining methods are no longer applicable. On the one hand, the construction of hand-crafted features is restricted by limited human knowledge, so it is difficult for conventional methods to represent such complex data. On the other hand, MvSD changes dynamically over time and exhibits self-correlation; traditional machine learning methods cannot fully mine the knowledge in sequential data and struggle to analyze its hidden attributes effectively. Simultaneously, MvSD collected from various domains or obtained from diverse sensors leads to heterogeneity among views. Thus, how to make full use of the diversity among different views and fuse the latent knowledge in MvSD has attracted extensive research.

In recent years, deep learning has swept many fields and achieved remarkable results, such as object detection (Girshick 2015; Ren et al. 2015; He et al. 2016; Redmon et al. 2016; Liu et al. 2016a), image segmentation (Long et al. 2015; Ronneberger et al. 2015; Lin et al. 2017; He et al. 2017; Zhao et al. 2017c; Badrinarayanan et al. 2017), natural language processing (Kiros et al. 2014; Bahdanau et al. 2014; Cheng et al. 2016; Vaswani et al. 2017), etc. Deep learning brings the possibility of solving the above problems with its general data understanding and parallel computing capabilities. First of all, the superiority of deep learning rests on its feature extraction ability, which surpasses human-engineered features through end-to-end learning. Among deep models, the convolutional neural network (CNN) achieves excellent performance on regular raster data, while the recurrent neural network (RNN) is adapted to sequence data and models temporal correlations. Second, classical methods based on small datasets are untenable on MvSD. In contrast, the performance of deep models improves further with massive data samples, so deep models achieve better feature representation and learning on larger datasets. Third, conventional machine learning methods generally exploit linear functions to fit the latent data structure and are unable to express complex patterns. As is well known, deep networks have nonlinear approximation capabilities and learn underlying rules by optimizing the loss function.

It is worth mentioning that MvSD is collected from multiple domains. For such multi-source sequential data, analysis methods based on multi-view learning have stronger feature representation capability than single-view learning. Multi-view deep learning can not only be used to analyze the implicit feature correlations and internal dynamic changes in sequential data, but also help address the incompleteness and uncertainty in sequential data analysis. Taking video sentiment analysis as an example, it usually involves three kinds of data: text sentences, images and audio clips. Analyzing the expression only in the text stream is not comprehensive, because language is ambiguous in different situations. By combining facial features and the speaker's pronunciation, the speaker's attitude can be inferred more accurately. As for the traffic flow forecasting task, external factors such as weather, holidays and events also affect prediction performance. Injecting these factors into the network further assists prediction.

In the past few decades, a large number of machine learning techniques have been applied to multi-view data, resulting in multi-view representation, multi-view clustering, multi-view fusion, etc. With the large amount of multimedia data available in recent years, multi-view learning has become a promising research direction. Benefiting from the large body of research and sufficient theory on multi-view learning (Khan et al. 2022a; Yin and Sun 2019), deep learning-based multi-view algorithms have gradually been studied (Yin and Sun 2019; Sun and Zong 2020; Mao and Sun 2020), forming deep multi-view learning (Yao et al. 2018; Wang et al. 2015; Kan et al. 2016; Sun et al. 2020d) and deep multi-view clustering (Li et al. 2019; Khan et al. 2022b; Xia et al. 2022). For example, Deep CCA (Andrew et al. 2013), which extends canonical correlation analysis (CCA) (Hotelling 1992), learns non-linear mappings between different views through stacked multi-layer neural networks. Deep matrix factorization (Zhao et al. 2017b; Huang et al. 2020a) applies non-negative matrix factorization (NMF) from traditional clustering to the deep framework. In addition, deep subspace clustering (Ji et al. 2017; Wang et al. 2020b) extends traditional subspace clustering and further expands its applications (Abavisani et al. 2020; Cai et al. 2021; Wang et al. 2020c). Therefore, adopting ideas from multi-view clustering can facilitate the study of MvSD.

In the past few years, some works have investigated multi-modal data, demonstrating the effectiveness of deep learning for multi-modal data fusion. Several surveys have reviewed the progress of multi-view deep learning (Wang 2021; Baltrušaitis et al. 2018; Chen et al. 2020c; Summaira et al. 2021; Rahate et al. 2021; Ramachandram and Taylor 2017; Zhao et al. 2017a). Wang (2021) discussed recent research on deep multi-modal models from the aspects of clustering and classification, focusing on the application of generative adversarial networks (GAN) in clustering and cross-modal learning. Baltrušaitis et al. (2018) investigated the latest developments in multi-modal machine learning and identified five challenges: representation, translation, alignment, fusion and co-learning. Chen et al. (2020c) analyzed prevailing multi-modal network structures and existing problems, including multi-modal feature extraction and latent feature learning. Summaira et al. (2021) discussed the latest advancements and trends in multi-modal deep learning and adopted a new fine-grained taxonomy to classify existing multi-modal networks. Rahate et al. (2021) reviewed the relevant literature in multi-modal deep learning and categorized multi-modal co-learning from multiple perspectives. These surveys are instructive for our MvSD investigation; at the same time, some spatio-temporal data (STD) analysis works provide us with application fields and current progress for reference. Wang et al. (2020d) reviewed the recent development of deep learning technology for STD and classified the existing literature according to the types of STD, data mining tasks, and deep learning models, dividing spatio-temporal data into five types: event, trajectory, point reference, raster, and video. Alam et al. (2021) classified STD analytics systems into three categories and provided definitions and related applications for STD; moreover, it surveyed and discussed existing programming languages, development tools, and data platforms. Atluri et al. (2018) summarized traditional machine learning methods for STD and discussed related data mining problems in analyzing different STD. Mazimpaka and Timpf (2016) summarized the application of deep learning in trajectory data and traffic prediction.

In this paper, we conduct research on MvSD. Existing multi-view surveys mostly focus on applications, neural networks and fusion methods, and do not specifically consider combining multi-view and sequential data. In addition, the aforementioned STD studies do not investigate from a multi-view perspective. Our contributions are as follows:

  • This paper conducts research from the perspective of multi-view sequences and discusses the challenges in MvSD.

  • This survey reviews recent deep learning techniques for MvSD and categorizes MvSD into four data types. Then we organize different deep learning models for representing and learning specific types of view data.

  • This survey summarizes some application domains and emerging tasks of MvSD, and points out some potential future research directions.

The rest of this paper is organized as follows. In Sect. 2, we divide the source data that constitutes MvSD into four categories and discuss the characteristics and challenges in each of these categories. In Sect. 3, we illustrate the existing deep representation methods for various types of view data. In Sect. 4, we investigate the deep learning models for MvSD. In Sect. 5, we summarize the applications of MvSD and related tasks. Finally, we discuss the future trends and conclude the survey. The taxonomy diagram of MvSD is presented in Fig. 1.

Fig. 1 Taxonomy diagram of deep learning on MvSD

2 Multi-view sequential data

As illustrated in Fig. 2, we give the paradigm of MvSD, which is composed of m views from different sources. These view data consist of various types, such as consecutive images (video clips), character descriptions (text sentences), and spatial position changes (trajectories), and the lengths or time steps of these views may not be aligned with each other. Each view has order constraints and is arranged according to certain rules. The sequential data of different views usually come from various domains and have different statistical characteristics; therefore, it is difficult for a single-view model to handle such heterogeneous data. We need to formulate corresponding representation methods for different views and select appropriate models for feature representation, extraction and fusion.

Fig. 2 Illustration of MvSD. MvSD can be composed of m views; we enumerate three types of view data, including video clips, text sentences, and mobile data. Each view is arranged in a certain order (for example, chronological order, grammatical rules, etc.)

2.1 Data types

There are many types of sequential data, such as meteorological data, time-series data, gene sequences, sensor data, audio clips, etc., all of which are research objects of MvSD. In order to facilitate subsequent work, we first introduce four data types: point, sequence, graph and raster. Each data type can be directly or indirectly converted to sequential data. Different from the classification in Atluri et al. (2018), we generalize point data into two categories: individual instances regarded as points (e.g., event data), and instances that are themselves point sets (e.g., LiDAR data). In addition, we categorize trajectory data and text data as sequence data.

2.1.1 Point

Point data describes discrete points in space with specific location coordinates (e.g., geographic latitude and longitude), indicating their existence in space together with some additional information. A point is usually represented by a tuple (\(p_{i}\), \(e_{i}\), \(t_{i}\)), where \(p_{i}\) represents the position of the point, \(t_{i}\) represents the time when the point occurs and \(e_{i}\) represents additional features (such as temperature, humidity, color, etc.). Event-type data regards an individual instance (e.g., a traffic incident) as a point, which usually occurs at a certain location and is accompanied by information such as time and event category. Figure 3a shows an example of event data. In addition, some instance data are themselves point sets scanned by sensors. Point cloud data is usually represented by three-dimensional coordinates, with additional information such as reflectivity, intensity, and color. Figure 3b shows an illustration of 3D point cloud data (Hackel et al. 2017). Point data has applications in many fields, such as transportation (e.g., traffic accidents), criminology (e.g., crime incidents), social media (e.g., social events), autonomous driving (e.g., point cloud data), etc.

Fig. 3 An illustration of event and laser data
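
To make the point representation concrete, the sketch below defines a minimal point record following the (\(p_{i}\), \(e_{i}\), \(t_{i}\)) tuple described above; the field names and the example values are our own illustrative assumptions rather than a fixed standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Point:
    """A single point observation (p_i, e_i, t_i)."""
    position: Tuple[float, ...]   # p_i: e.g. (latitude, longitude) or (x, y, z)
    timestamp: float              # t_i: time of occurrence (e.g. unix seconds)
    attributes: Dict[str, float] = field(default_factory=dict)  # e_i: extra features

# A hypothetical traffic-accident event and a LiDAR return, both stored as points.
accident = Point(position=(39.9042, 116.4074), timestamp=1.6e9,
                 attributes={"category": 2.0, "severity": 1.0})
lidar_pt = Point(position=(12.3, -4.1, 1.7), timestamp=0.05,
                 attributes={"intensity": 0.82})
```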

2.1.2 Sequence

Time series is a typical type of sequence data, obtained at consecutive and evenly spaced time points. For example, in mechanical fault diagnosis, the frequency of equipment is sampled at equal intervals. Figure 4a shows an example of an audio signal. As shown in Fig. 4b, video data is viewed as a series of images arranged in chronological order. Trajectory data is also treated as a time series, which periodically records the moving position of a target. Figure 4c shows an example of trajectory data. Time series is not the only case of sequence data; other cases, such as text data, need to consider the logic of language. We group trajectory data, audio, video, time series, and text into sequence data.

Fig. 4 Illustration of sequence data

2.1.3 Graph

Graph data is a collection of vertices connected by a series of edges, each of which is assigned a weight. Graph data is used in many fields, including traffic networks, social networks, and recommendation systems. In a social network, each person is a vertex, and people who have a relationship with each other are connected by edges; each edge has a direction, forming a directed graph. In traffic forecasting, traffic road networks are naturally modeled as graphs. Taking the road network as an example, road segments are represented as edges, and nodes embedded in the spatial map represent the intersections of these road segments.

2.1.4 Raster

Raster data is presented as a grid of pixels, and each pixel has a value that represents information at a specific location (color or other statistics). Figure 5a shows an example of image data, where the position of each pixel is regarded as a fixed point and each pixel is an observation value. In neuroscience, functional magnetic resonance imaging (fMRI), a relatively new neuroimaging method, measures changes in hemodynamics caused by neuronal activity; the scanned signals form raster data used to analyze brain activity. In urban big data, various fixed-position sensors collect data to form spatial maps, air quality data, and weather data. Figure 5b shows an example of raster traffic data (Zhou et al. 2020).

Fig. 5 Illustration of raster data

2.1.5 Converting data format

The data formats mentioned above often need to be converted into appropriate formats according to specific tasks and models, and these formats are often convertible to each other. Point data is naturally converted to raster data by quantizing points into grid cells. For example, the events (e.g., traffic accidents, crimes, etc.) that occur in each grid cell are aggregated into event raster data, which in turn can be converted back to point data. In autonomous driving, a point cloud is converted into a 3D voxel grid or a 2D bird's eye view (BEV) through quantization. Further, point data can be treated as nodes in graph data: in a spatial map, traffic sensors are viewed as nodes of a graph, and the distances between sensors are used to construct an adjacency matrix. In some cases, sequence data is viewed as a series of observations in continuous time (e.g., sensor data), and such data is converted to point data by sampling at equal intervals. In addition, some types of sequence data (e.g., trajectory data) are converted into raster data, where the positions at different time instants correspond to grid coordinates in the raster. For some raster data (such as meteorological data), sequence data is obtained by performing continuous-time statistics on the observations at each site.
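
As a minimal sketch of the point-to-raster conversion described above, the snippet below quantizes event points into an n×n count grid per time interval; the grid size, bounding box, and variable names are illustrative assumptions rather than a fixed recipe from any cited work.

```python
import numpy as np

def points_to_raster(points, bbox, n=32, t_bins=24):
    """Quantize (lat, lon, t) event points into a (t_bins, n, n) count raster.

    points: array of shape (N, 3) holding latitude, longitude and a normalized
            timestamp in [0, 1); bbox = (lat_min, lat_max, lon_min, lon_max).
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    raster = np.zeros((t_bins, n, n), dtype=np.float32)
    for lat, lon, t in points:
        row = min(int((lat - lat_min) / (lat_max - lat_min) * n), n - 1)
        col = min(int((lon - lon_min) / (lon_max - lon_min) * n), n - 1)
        k = min(int(t * t_bins), t_bins - 1)
        raster[k, row, col] += 1.0   # each cell counts the events in that interval
    return raster

# Example: three synthetic events in one day over a small bounding box.
pts = np.array([[39.95, 116.35, 0.10], [39.97, 116.40, 0.12], [39.90, 116.45, 0.80]])
grid = points_to_raster(pts, bbox=(39.8, 40.0, 116.3, 116.5))
print(grid.shape)  # (24, 32, 32)
```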

2.2 Challenges

In this section, we discuss the challenges of MvSD and summarize the existing literature. We state the following five problems in MvSD: temporal dynamics, heterogeneity, cross-view dynamics, missing data, and misalignment of asynchronous multi-view sequences.

2.2.1 Temporal dynamics

For each sequential data stream, successive changes are recorded in chronological order, showing the dynamics at different time slots. The data points at different times in sequential data depend on each other. If the dynamic information at a temporal granularity is ignored, the regularity of sequence change becomes difficult to model and accuracy decreases. For example, in the sentiment analysis task, the semantics of expressed opinions such as "I think it's...but..." may be inconsistent or further reinforced over time, which is also known as intra-modality dynamics, and information at certain moments drives sentiment recognition. In traffic forecasting, the detected flow on the same road section is affected by human travel and often shows closeness, period and trend. In crime prediction, the factors that cause crimes may change over time; for example, there are different crime patterns on weekdays and weekends. In air quality forecasting, monitoring stations record changes over the next few hours or a day. These time series show dynamic changes, and even unexpected factors may lead to sudden changes.

Early sequence modeling methods were proposed for specific tasks. For example, Prophet (Taylor and Letham 2018) was proposed by Facebook in 2017 for the company's internal business time series, and early air quality prediction tasks were modeled by random forests (Fawagreh et al. 2014) and inverse distance weighting (Lu and Wong 2008). Autoregressive (AR) models are used to describe certain time-varying processes, such as stock forecasts (Ferenstein and Gasowski 2004) and climate changes (Janjua et al. 2014). In addition, AR models and their variants, such as the autoregressive moving average (ARMA) (Pham and Yang 2010) and the autoregressive integrated moving average (ARIMA) (Ordóñez et al. 2019), are used in prognostics and health management (PHM) (Barraza-Barraza et al. 2017). Further, methods based on Gaussian processes (Zhao and Sun 2016a, b), Markov chain models (Sun et al. 2015), ARIMA (Chen et al. 2011), etc., have been proposed for traffic prediction.
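
For illustration, a minimal single-view baseline with a classical AR-family model might look like the following; it assumes the statsmodels package and a synthetic series, and is not tied to any specific work cited above.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic hourly traffic-flow series with a daily cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
series = 100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, t.size)

# Fit an ARIMA(p, d, q) model on the first 13 days and forecast the last day.
model = ARIMA(series[:-24], order=(2, 0, 1)).fit()
forecast = model.forecast(steps=24)
print(forecast[:5])
```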

The current scheme for modeling the temporal dynamics of multi-view sequences is to use networks based on RNNs and their derivatives. In sentiment analysis, Refs. (Zadeh et al. 2017; Verma et al. 2020; Wang et al. 2019c) employed independent LSTMs to model intra-modality dynamics separately for each view sequence. To model the context of the sequence, Refs. (Hazarika et al. 2020; Xu et al. 2019) introduced bi-directional LSTMs to obtain feature representations for each view. In order to obtain the temporal dependence of each weather sequence, DAQFF (Du et al. 2019) utilized a bi-directional LSTM to learn long-term temporal characteristics from multivariate time series. DeepAir (Yi et al. 2018) followed the method of DeepSD (Wang et al. 2017), using an RNN to embed sequence data and find similarities between different time slots.

In addition, some studies combine attention-based structures to tackle temporal dynamics. Pham et al. (2018) learnt common representations for different modalities via sequence-to-sequence (Seq2Seq) models and introduced an attention mechanism to handle long-term dependencies. To address temporally duplicated content within the same view, Tian et al. (2020) adaptively aggregated useful information through self-attention. DCRNN (Li et al. 2017) captured temporal dependencies among time series through gated recurrent units (GRU). ST-MetaNet (Pan et al. 2019) proposed a meta RNN, which uses meta-knowledge to generate GRU weights from node embeddings to model diverse temporal dependencies. Forecaster (Li and Moura 2019) applied a graph Transformer to model long-term temporal dependencies. In order to alleviate cumulative error amplification in sequence prediction, GMAN (Zheng et al. 2020) directly encoded historical inputs and generated future time steps through transform attention, thereby mitigating the error propagation problem.

2.2.2 Heterogeneity

MvSD consists of a sequence of views from multiple domains, and these views are often heterogeneous. As shown in Fig. 6, the various data mentioned in Sect. 2.1 have their own distributions. For example, image and text data are presented in different forms. Images are usually composed of raster pixels, and their content is intuitive to humans. Textual data, in contrast, usually consists of words and symbols, follows linguistic logic, and is therefore more complex than images.

Fig. 6 Heterogeneity of MvSD. Different view data have different distributions

In order to handle the heterogeneity of different views, models trained on specific domains are usually used to extract feature representations for the corresponding views. For example, Refs. (Zadeh et al. 2017; Verma et al. 2020) extracted language, audio and visual features through three independent modality-specific LSTMs and then explored the relationships between these modalities in the feature space. ADAIN (Cheng et al. 2018) combined a feedforward neural network (FNN) and an RNN, where the FNN extracted static features and the RNN learnt time-series features; the obtained features from different views were combined for subsequent networks. stMTMV (Liu et al. 2016b) introduced linear functions to deal with spatial and temporal features separately and then aligned the spatio-temporal views on nodes. DeepCrime (Huang et al. 2018) proposed a category-dependent encoder that encoded regions and crimes separately and finally mapped them into a common latent space.

An encoder-decoder structure can also be used to implement translation between modalities to address view heterogeneity. Pham et al. (2018) translated two modalities into another joint representation via a Seq2Seq model. MCTN (Pham et al. 2019) converted one modality to another via cyclic translation; furthermore, for three modalities, the representation learned between two modalities was further translated into the third modality, thereby forming the final joint representation. Forecaster (Li and Moura 2019) adopted an encoder-decoder architecture, taking spatial information and auxiliary information as the encoder input, while the decoder predicted the future spatial information.

The heterogeneity gap between different views can also be minimized in a common feature space. ARGF (Mai et al. 2020) introduced an adversarial approach to transform the distribution of the source modality into the distribution of the target modality. Inspired by domain adaptation, MISA (Hazarika et al. 2020) mapped multiple modalities into a shared subspace via a weight-sharing encoder and aligned these features by introducing metric distances.

2.2.3 Cross-view dynamics

MvSD has temporal dynamics within a single view sequence, while there are also dynamic interactions between different view sequences. We divide cross-view dynamics into two categories: spatio-temporal correlations and semantic interactions.

Spatio-temporal correlations MvSD changes continuously in time and manifests differently in space, with spatio-temporal dynamics within a single view sequence or across views. For example, each image in a video can be viewed as a continuous change in space over a time period. As another example, in traffic forecasting, the observations of each traffic sensor are closely related to the observations of the surrounding space, and each observation is also related to its own historical observations. Many studies explore spatio-temporal correlations. A conventional way is to model the local space first and then mine the temporal dynamics with recurrent networks, such as combining local convolution with recurrent networks (Yao et al. 2018; Zhou et al. 2020; Bai et al. 2019; Yu et al. 2017; Song et al. 2020; Wang et al. 2020e; Yuan et al. 2018; Chen et al. 2019; Zhang et al. 2017). Bai et al. (2019) combined a graph convolutional network (GCN) with LSTM, where the local spatial correlations captured by the graph convolutional network were fed into a multi-layer LSTM to model the temporal relationships. In addition, combining attention mechanisms with encoder-decoder structures is also used for spatio-temporal dynamic modeling (Li and Moura 2019; Shi et al. 2020; Yin et al. 2021a; Wu et al. 2020). APTN (Shi et al. 2020) used an attention-based encoder to model spatial, temporal, and periodic dependencies, and its decoder introduced temporal attention to explore the dependencies between time steps. Forecaster (Li and Moura 2019) integrated the dependency graph into the Transformer for forecasting spatially and temporally related data.

Semantic interactions Semantic interactions are often manifested as interactions between multiple views. For specific tasks, these views contain supplemental information that enhances specific views. Taking video sentiment analysis as an example, language is usually treated as the primary modality, with images and audio as auxiliary modalities. To model the semantic dynamics across views, memory-based methods are usually employed (Tian et al. 2020; Zadeh et al. 2018b, c; Ismail et al. 2020; He et al. 2020b). Furthermore, encoder-based methods transform multiple views to a specific view to learn a common representation (Pham et al. 2018; Xu et al. 2019; Mai et al. 2020; Hazarika et al. 2020). In addition, some studies employ contrastive learning to achieve feature-level and semantic-level interactions (Mai et al. 2021; Liu et al. 2021; Kim et al. 2021). Mai et al. (2021) performed intra-modal/inter-modal contrastive learning and semi-contrastive learning simultaneously to ensure that intra-modal and inter-modal dynamics are fully learned.
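
To sketch how contrastive learning can enforce inter-modal semantic interactions, the following is a minimal InfoNCE-style loss between two view embeddings; it is a generic formulation under our own assumptions, not the exact objective of any paper cited above.

```python
import torch
import torch.nn.functional as F

def inter_modal_infonce(z_a, z_b, temperature=0.1):
    """Contrast paired embeddings from two views.

    z_a, z_b: (batch, dim) embeddings of the same samples in two views;
    matching rows are positives, all other cross-view pairs are negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # cosine similarities
    targets = torch.arange(z_a.size(0))           # i-th row matches i-th column
    # Symmetric loss: view A -> view B and view B -> view A.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = inter_modal_infonce(torch.randn(8, 64), torch.randn(8, 64))
```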

2.2.4 Data missing

MvSD collects data from different sources; human factors, communication delays, and sensor failures often cause temporal data to be partially or fully missing, so missing data is a very common phenomenon. In other words, ideal, complete MvSD is rare. As shown in Fig. 7, we illustrate different types of missing data. Figure 7a is the complete multi-view sequence, where each view is intact during training and testing and different views are paired with each other. Figure 7b and c represent missing data during the training and testing phases, respectively. Figure 7d indicates that data are missing in both the training and testing phases. In this section, we introduce recent deep learning methods for missing data in MvSD.

Fig. 7 Different missing types of MvSD. Taking text and video sequences as examples, (a) is the complete multi-view sequence. We summarize the missing types as missing in the training phase (b), missing in the testing phase (c), and missing in both the training and testing phases (d)

Reconstructing missing data using autoencoders is one solution. The purpose of the autoencoder is to encode the source data into latent features and then use a decoder to decode the latent features into target-domain data. DCC-CAE (Dumpala et al. 2019) combined deep canonical correlation analysis (DCCA) with cross-modal autoencoders; it assumes that the audio and visual modalities are available during training but only one modality is available during testing, and it is composed of two decoders, which take the available modality as input and reconstruct the corresponding missing modality representation. CPM-Nets (Zhang et al. 2020) reconstructed the complete views by constructing latent representations through structural constraints; in the unsupervised setting, it proposed adversarial strategies to further improve the complete representation. MCTN (Pham et al. 2019) introduced a cyclic consistency loss in the process of modality translation so that the learned joint representation contains as much information from all modalities as possible. MFM (Tsai et al. 2018) decomposed the multimodal representation into two factors: multimodal discriminative factors and modality-specific generative factors, where the discriminative factors contain the shared joint features used for the task, and the generative factors contain the modality-specific information used to generate specific modalities.
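
A minimal sketch of the cross-modal reconstruction idea: encode the available view and decode the missing one, training with a reconstruction loss on paired data. The dimensions and layer sizes below are illustrative assumptions and do not reproduce DCC-CAE exactly.

```python
import torch
import torch.nn as nn

class CrossModalAE(nn.Module):
    """Encode the available view (e.g. audio) and reconstruct the missing view (e.g. visual)."""
    def __init__(self, in_dim=74, latent_dim=32, out_dim=35):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))

    def forward(self, x_available):
        z = self.encoder(x_available)       # latent representation of the available view
        return self.decoder(z), z           # reconstruction of the missing view

model = CrossModalAE()
audio = torch.randn(16, 74)                 # available modality (batch of features)
visual = torch.randn(16, 35)                # paired target modality, seen only during training
recon, _ = model(audio)
loss = nn.functional.mse_loss(recon, visual)
loss.backward()
```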

Meta-learning is a learning-to-learn approach that learns from multiple tasks on training data and processes new tasks during testing, enabling knowledge transfer for task-agnostic few-shot learning. SMIL (Ma et al. 2017) is the first work to study missing data in both the training and testing phases; it jittered the latent feature space through Bayesian meta-learning, making single-modality embeddings approximate full-modality embeddings. The meta-learning based spatio-temporal network (Yao et al. 2019a) addresses the unbalanced spatial distribution of collected data by transferring knowledge from multiple cities to target cities.

2.2.5 Misalignment of asynchronous multi-view

Sequential data from different views are usually not strictly aligned. One issue is that the lengths of the view sequences are not equal; for example, affected by the sampling frequency, the number of images and the number of words in a video are not equal. The other is semantic misalignment: there is no complete correspondence between images and text in video data, each image does not correspond to a single word, and one word can be associated with multiple images. Figure 8a shows the ideal condition, in which the sequence length and semantics of multiple views are completely aligned. However, more realistic scenarios are shown in Fig. 8b and c, where Fig. 8b shows length aligned but semantics unaligned, and Fig. 8c shows both length and semantics misaligned.

Fig. 8 Misalignment of asynchronous multi-view sequences

Most existing work, such as (Zadeh et al. 2017; Wang et al. 2019c; Liang et al. 2018a), is based on aligned multi-view sequences. Some recent work focuses on multi-view sequence misalignment (Tsai et al. 2019; Yang et al. 2020; Aytar et al. 2017; Le et al. 2018). Several studies adopt attention-based structures to achieve multi-view sequence alignment (Tsai et al. 2019; Yang et al. 2020; Le et al. 2018). The Multimodal Transformer (Tsai et al. 2019) focused on the interactions between multi-view sequences at different time steps through cross-modal attention, transforming one modality to another without explicitly aligning the data. Furthermore, multi-view sequence alignment is also achieved by using pre-trained networks. Aytar et al. (2017) trained a deep convolutional network on a large amount of aligned data (including language, visual and acoustic) for aligned cross-modal representation. Other works use multi-instance learning without explicit data alignment. Tian et al. (2020) formulated weakly supervised video parsing as multi-modal multi-instance learning (MMIL) and proposed MMIL pooling to aggregate multi-modal information.

3 Multi-view sequential representation

In Sect. 2.1, we introduce various data forms. In order to feed these data into networks for feature learning, we need to choose appropriate methods to represent them. In this paper, we mainly consider three representations for network input: rasterized representation, sequential embedding and graph representation.

3.1 Rasterized representation

As shown in Fig. 9, rasterized data quantifies the points in each grid cell (such as events, track points, traffic flow, meteorological data, etc.), and each cell in the raster grid is regarded as the statistics of a region. For example, in autonomous driving, points in the scene are divided into 3D rectangular grids with a given resolution (Zhou and Tuzel 2018; Yang et al. 2018; Laddha et al. 2021; Fadadu et al. 2022) to obtain voxel grids or BEVs, to which 3D CNNs can be naturally applied. In traffic flow forecasting, the whole city is split into \(n\times n\) grids according to latitude and longitude, each region is indexed by row and column, and each grid is aggregated according to time intervals (Zhou et al. 2020; Yuan et al. 2018; Wu et al. 2020; Liao et al. 2018; Wang et al. 2020f; Zhang et al. 2019, 2021a). Through rasterization, CNNs extract the spatial features of different regions, and using recurrent networks to model raster data across multiple time slices enables analysis of the dynamic relationships between regions.

Fig. 9 Illustration of rasterized representation

3.2 Sequential embedding

For sequence data, such as text, trajectory data, and time-series data, feature transformation and embedding methods are needed to process the raw inputs. For time-series data, a multi-layer perceptron (MLP) is usually used to map it into latent vectors. Yi et al. (2018) transformed the raw features of each domain into a low-dimensional space through embedding methods. Cheng et al. (2018) applied fully connected (FC) layers to extract features of points of interest (POIs) and meteorological features (such as weather, temperature, humidity, etc.). DAQFF (Du et al. 2019) applied 1\(\times\)1 convolutions to transform multiple time series. For language data, pre-trained models are usually used to perform feature transformation on the word sequence; for example, Word2Vec (Mikolov et al. 2013), GloVe (Pennington et al. 2014), and BERT (Devlin et al. 2018) are commonly used in natural language processing. Some works use pre-trained 300-dimensional GloVe word embeddings to encode a sequence of transcribed words into a sequence of word vectors (Zadeh et al. 2017; Wang et al. 2019c; Liu et al. 2018; Zadeh et al. 2019, 2018b), while others use the pre-trained Transformer model BERT to extract utterance-level textual features (Ismail et al. 2020; Rahman et al. 2020; Yu et al. 2021b; Sun et al. 2020e).
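
The snippet below sketches the two embedding routes mentioned above: an MLP that maps a time-series window into a latent vector, and a lookup table that maps word indices to vectors (a stand-in for pre-trained GloVe or BERT features). The dimensions and vocabulary size are assumptions for illustration.

```python
import torch
import torch.nn as nn

# (1) Time-series embedding: map a flattened observation window to a latent vector.
ts_embed = nn.Sequential(nn.Linear(24 * 3, 64), nn.ReLU(), nn.Linear(64, 32))
window = torch.randn(8, 24 * 3)            # batch of 24 time steps x 3 sensor channels
ts_vec = ts_embed(window)                  # (8, 32)

# (2) Word embedding: map token indices to dense vectors. In practice the weight
#     matrix would be initialized from pre-trained GloVe vectors or replaced by BERT.
word_embed = nn.Embedding(num_embeddings=10000, embedding_dim=300)
tokens = torch.randint(0, 10000, (8, 20))  # batch of 20-token sentences
word_vecs = word_embed(tokens)             # (8, 20, 300)
```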

3.3 Graph representation

A graph models a set of objects and their correlations in sentences, images, and space. The graph is embedded into a vector representation and then seamlessly connected with GCNs in subsequent processing. The graph representation reflects the associations between different areas of a spatial map (Yao et al. 2018; Li and Moura 2019; Bai et al. 2019; Huang et al. 2020b; Yu et al. 2017; Wu et al. 2020). Forecaster (Li and Moura 2019) learnt the weights of the non-zero entries of the adjacency matrix by introducing a sparse linear layer, taking into account that different locations may have different dependency strengths. In the spatial map, DMVST-Net (Yao et al. 2018) employed a CNN to extract the local feature representation of each region and its surrounding neighbors, and embedded the local representation into a low-dimensional vector through FC layers to supplement the information of the graph nodes. In spatial event prediction, Wu et al. (2020) designed an embedding component to generate an embedding vector for each event time step in each time slot. The graph representation also reflects the relationship between sentences and images (Wang et al. 2020a), as well as activity on social media (Islam and Goldwasser 2021).
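
A common way to build the adjacency matrix mentioned above is a thresholded Gaussian kernel over pairwise sensor distances; the sketch below follows that generic recipe, with the threshold and bandwidth as illustrative assumptions.

```python
import numpy as np

def distance_to_adjacency(dist, sigma=None, eps=0.5):
    """Build a weighted adjacency matrix from a pairwise distance matrix.

    dist: (N, N) road-network distances between sensors.
    Edge weight w_ij = exp(-dist_ij^2 / sigma^2), set to 0 below the threshold eps.
    """
    sigma = sigma if sigma is not None else dist.std()
    adj = np.exp(-(dist ** 2) / (sigma ** 2))
    adj[adj < eps] = 0.0              # sparsify weak connections
    return adj

dist = np.random.rand(5, 5) * 10.0
dist = (dist + dist.T) / 2.0          # make the distance matrix symmetric
np.fill_diagonal(dist, 0.0)
A = distance_to_adjacency(dist)
```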

4 Deep neural networks

In this section, we mainly review deep network techniques commonly used to extract features from multi-view sequences. Different data representations require suitable network models.

4.1 CNN-based networks

Figure 10 shows the CNN structure, which usually consists of convolutional layers, pooling layers, and fully connected layers. BatchNorm (Ioffe and Szegedy 2015) can be appended after convolutional layers. CNNs perform excellently on regular grid data. By stacking multiple convolutional layers, learning from low-level information to high-level semantic features is realized in a bottom-up manner. A CNN moves fixed-size filters (e.g., 3\(\times\)3, 5\(\times\)5) over the input grid from left to right and from top to bottom; the filters perform inner product operations at the corresponding positions and generate high-dimensional features.

Fig. 10 Structure of CNN
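
A minimal PyTorch sketch of the conv-batchnorm-pool-fully-connected pipeline in Fig. 10, sized for small raster inputs; the channel counts and input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv -> BatchNorm -> ReLU -> Pool blocks followed by a fully connected head."""
    def __init__(self, in_channels=2, num_outputs=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 8 * 8, num_outputs)

    def forward(self, x):                       # x: (batch, channels, 32, 32) raster
        h = self.features(x)
        return self.head(h.flatten(1))

out = SimpleCNN()(torch.randn(4, 2, 32, 32))    # e.g. inflow/outflow raster maps
```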

Some works convert traffic networks at different time intervals into raster images and use CNNs to extract spatial correlations (Zhang et al. 2017, 2019). ST-ResNet (Zhang et al. 2017) designed a CNN to capture the spatial dependencies of closeness, period, and trend. MDL (Zhang et al. 2019) converted the directed graph at each time slot into a tensor representation, which was used by a CNN to extract spatial relationships, and then a fully convolutional network (FCN) (Long et al. 2015) was introduced to obtain the temporal dependencies.

4.2 RNN-based networks

RNNs and their variants are models for processing sequence data, which use the previous output state of the sequence to predict the next state, as shown in Fig. 11a. LSTM is an extension of the RNN with a specially designed memory unit that remembers longer input history, as shown in Fig. 11b. They are widely used in speech recognition, natural language processing and time-series analysis.

Fig. 11 Structures of RNN, LSTM and Seq2Seq

A series of RNN-based methods are used in traffic forecasting (Bai et al. 2019; Yuan et al. 2018; Wang et al. 2020f). In passenger demand prediction, three LSTMs are used to model spatio-temporal maps, external meteorological data, and temporal metadata, respectively (Bai et al. 2019). Hetero-ConvLSTM (Yuan et al. 2018) extracted spatial features through ConvLSTM (Xingjian et al. 2015) and then fed the obtained features to an LSTM to model temporal dynamics. MT-ASTN (Wang et al. 2020f) modeled the temporal features of dynamic graph sequences at different scales to capture crowd flow at different scales. In the sentiment analysis task, Refs. (Zadeh et al. 2017; Verma et al. 2020; Ismail et al. 2020) utilized three independent LSTMs to construct modality embedding subnetworks for language, visual, and acoustic data respectively to model intra-modality dynamics.

Figure 11c shows the structure of Seq2Seq (Sutskever et al. 2014). The Seq2Seq model is designed for sequence data and adopts an encoder-decoder structure, where the encoder encodes the input sequence to obtain a hidden state, and the decoder generates a variable-length output according to the hidden state. Pham et al. (2018) performed unsupervised learning of joint multimodal representations via Seq2Seq. Liao et al. (2018) proposed a hybrid Seq2Seq model, which integrated auxiliary information into the encoder-decoder sequence learning framework.
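
A minimal GRU-based encoder-decoder in the spirit of Fig. 11c: the encoder compresses the input sequence into a hidden state and the decoder rolls out future steps from it. The layer sizes and the greedy autoregressive rollout are illustrative choices, not a re-implementation of any cited model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encode a source sequence and decode a fixed number of future steps."""
    def __init__(self, in_dim=1, hidden=64, out_dim=1, horizon=12):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(out_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)
        self.horizon = horizon

    def forward(self, src):                     # src: (batch, src_len, in_dim)
        _, h = self.encoder(src)                # h: (1, batch, hidden) summary state
        step = torch.zeros(src.size(0), 1, self.proj.out_features, device=src.device)
        outputs = []
        for _ in range(self.horizon):           # greedy autoregressive rollout
            dec_out, h = self.decoder(step, h)
            step = self.proj(dec_out)
            outputs.append(step)
        return torch.cat(outputs, dim=1)        # (batch, horizon, out_dim)

pred = Seq2Seq()(torch.randn(8, 24, 1))          # 24 observed steps -> 12 predicted steps
```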

4.3 Graph-based networks

GCNs are often used to model non-Euclidean structured data and are usually divided into two categories, namely spectral-based graph networks and spatial-based graph networks. Spectral-based methods define the convolution operation in the Fourier domain (Kipf and Welling 2016), while spatial-based methods directly apply convolution on the graph to aggregate information from neighbors. We categorize recent literature according to these two methods.

Spectral-based GCNs introduce filters from a signal processing perspective to define graph convolution. Song et al. (2020) proposed a spatio-temporal synchronous graph convolutional network, which effectively captures complex local spatio-temporal correlations through a spatio-temporal synchronization modeling mechanism. In order to meet the requirements of medium- and long-term prediction tasks, Yu et al. (2017) introduced a spatio-temporal graph convolution network, which modeled the traffic network as a graph and used spectral convolution to extract spatial features. Geng et al. (2019) encoded the correlations of different regions into multiple graphs, which were then used for correlation modeling based on ChebNet-based multi-graph convolution.

Spatial-based GCNs simulate the convolution operation of traditional CNNs, where graph convolution is based on the spatial relationships of nodes. Wang et al. (2020f) converted the flow origin-destination (OD) matrix into a semantic graph and then performed convolution on the semantic graph. Li et al. (2017) treated traffic flow as a diffusion process on a directed graph and captured spatial dependencies through a diffusion convolutional recurrent neural network. The graph attention network (GAT) (Veličković et al. 2017) is another graph neural network, which calculates the weights of neighbor nodes through an attention mechanism without requiring the full structure of the graph. Pan et al. (2019) proposed a meta-graph attention network, which used an attention mechanism to capture the dynamic spatial correlations between nodes, with the attention weights generated from meta-knowledge.
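
A minimal sketch of a single graph convolution layer in the style of Kipf and Welling (2016): features are propagated through the symmetrically normalized adjacency matrix with self-loops and then linearly transformed. The feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                  # x: (N, in_dim), adj: (N, N)
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                  # symmetric normalization
        return torch.relu(a_norm @ self.linear(x))

nodes = torch.randn(5, 16)                      # 5 graph nodes with 16-d features
adj = (torch.rand(5, 5) > 0.5).float()
adj = torch.max(adj, adj.t())                   # keep the toy graph undirected
h = GCNLayer(16, 32)(nodes, adj)                # (5, 32)
```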

4.4 Attention-based networks

Attention was originally proposed for natural language processing and is now widely used in sequence-based tasks, where it models the relevant parts of the information and amplifies the most important parts. Attention can be used to focus not only on spatial dependencies but also on spatio-temporal correlations.

Shi et al. (2020) proposed an attention mechanism to model spatial, short-term and long-term cyclic dependencies. Huang et al. (2020b) developed a graph attention network integrated with a GCN into a spatially gated block to capture spatio-temporal features. GMAN (Zheng et al. 2020) introduced a transform attention layer between the encoder and decoder to model the relationship between historical and future time steps. Refs. (Tian et al. 2020; Wu et al. 2021) developed hybrid attention networks to jointly model temporal recurrence, co-occurrence, and asynchrony; specifically, temporal recurrence was handled by self-attention, while co-occurrence and asynchrony were addressed by cross-modal attention. Zadeh et al. (2018b) studied a delta-memory attention network, which focused on cross-view interactions and aggregated interactions over time with a multi-view gated memory.

The attention-based Transformer has been successfully applied in this area (Tsai et al. 2019; Zadeh et al. 2019; Hasan et al. 2021; Wang et al. 2020g). The structure of the Transformer is shown in Fig. 12. Tsai et al. (2019) developed a multi-modal Transformer to model unaligned multi-modal language sequences, integrating multi-modal time series from multiple pairs of cross-modal Transformers. Zadeh et al. (2019) designed a multimodal Transformer layer to capture the decomposed dynamics of multi-modal data and aligned temporally asynchronous information within and across modalities. Hasan et al. (2021) encoded multi-modal sequences separately via Transformers and learnt to represent the punchline according to the background context.

Fig. 12 Structure of Transformer (Vaswani et al. 2017)
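
A minimal sketch of the scaled dot-product cross-modal attention used by the Transformer-style models above: queries come from a target view and keys/values from a source view, so the two sequences need not have the same length or be pre-aligned. The dimensions are illustrative assumptions, and the learned projection matrices of a full attention layer are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(query_seq, source_seq, d_k=64):
    """Attend from a target view (e.g. text) to a source view (e.g. video frames).

    query_seq:  (batch, len_q, d_k)  target-view features
    source_seq: (batch, len_s, d_k)  source-view features; len_q and len_s may differ
    """
    scores = query_seq @ source_seq.transpose(1, 2) / d_k ** 0.5   # (batch, len_q, len_s)
    weights = F.softmax(scores, dim=-1)        # soft alignment over source positions
    return weights @ source_seq                # source information re-timed to the target

text = torch.randn(2, 20, 64)                  # 20 word-level features
video = torch.randn(2, 50, 64)                 # 50 frame-level features (unaligned length)
fused = cross_modal_attention(text, video)     # (2, 20, 64)
```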

4.5 Hybrid networks

Some works are dedicated to combining multiple modules, such as combining CNN and LSTM (Xingjian et al. 2015; Zhao et al. 2019; Yao et al. 2019b). Chen et al. (2019) combined GCN with GRU, where the GCN was used for spatial feature extraction and the GRU captured temporal dynamics. GC-LSTM (Chen et al. 2018) introduced LSTM to model the characteristics of dynamic graph sequences and captured the temporal characteristics of the graph sequence. Poria et al. (2017) proposed a contextual attention-based LSTM network that focused on contextual relations, with an attention-based fusion mechanism that amplified higher-quality and more informative modalities. Huang et al. (2020b) adopted a GCN to extract spatial features and a graph attention network to extract road similarity, and finally integrated these two modules through a gate structure. Pan et al. (2019) designed a combination of a meta-graph attention network and a meta-recurrent network, where the meta-graph attention network captured spatial correlations and the meta-recurrent network modeled temporal correlations.

Zadeh et al. (2018c) proposed a multi-attention recurrent network, composed of a long-short term hybrid memory and a multi-attention block, to discover the interactions between modalities and store them in the hybrid memory. Wang et al. (2019c) combined LSTM and a gating mechanism, where the LSTM was used to model different view sequences, and the gated modality-mixing network was used to infer the nonverbal shift vector. Xu et al. (2019) adopted a combination of bi-LSTM and an attention mechanism, where the bi-LSTM encoded text sequences and the attention mechanism learnt the alignment weights between speech and text.
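
The sketch below illustrates the general hybrid recipe of applying a graph convolution per time step for spatial features and then a GRU over time for temporal dynamics; it is a generic composition under our own assumptions, not a re-implementation of any specific model above.

```python
import torch
import torch.nn as nn

class GraphGRU(nn.Module):
    """Per-step graph propagation for space, followed by a GRU over time."""
    def __init__(self, in_dim=2, gcn_dim=16, hidden=32):
        super().__init__()
        self.gcn = nn.Linear(in_dim, gcn_dim)            # node feature transform W
        self.gru = nn.GRU(gcn_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, adj):                            # x: (T, N, in_dim), adj: (N, N)
        spatial = torch.relu(adj @ self.gcn(x))           # neighbor aggregation per step
        # Treat each node as one sequence of length T for the recurrent part.
        seq = spatial.permute(1, 0, 2)                    # (N, T, gcn_dim)
        out, _ = self.gru(seq)
        return self.head(out[:, -1])                      # one prediction per node

T, N = 12, 20
x = torch.randn(T, N, 2)                                  # e.g. inflow/outflow per node per step
adj = torch.softmax(torch.randn(N, N), dim=-1)            # toy row-normalized adjacency
pred = GraphGRU()(x, adj)                                 # (20, 1)
```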

4.6 Discussion

In this section, we analyze the above-mentioned deep models, as shown in Table 1. We list popular deep models and describe them in terms of model architecture, data form, model characteristics, and application areas.

Table 1 Comparison of deep models

It can be seen that for sequence data processing, whether text, traffic flow or biological signals, RNN-based networks are general frameworks, because they can model the dependencies within such temporal information. For rasterized data, such as images and spatial maps, CNN-based models are usually used for spatial modeling. In traffic prediction, the nodes of some grid maps are turned into graph networks, which are suitable for GCNs, spatial attention networks, etc. For multimedia data, such as text streams and audio, candidate feature extractors such as CNNs and Transformers are available. Furthermore, combining CNNs and recurrent networks has become a paradigm for spatio-temporal modeling. For multi-view sequence modeling, the common paradigm uses view-specific recurrent networks for feature extraction, followed by cross-view interaction.

5 Applications

In this section, we summarize related work in different domains, including intelligent transportation, crime analysis, sentiment analysis, climate science and health care. We discuss these application areas separately and provide an overview of recent related techniques. In Table 2, we enumerate the application areas along with the aforementioned models.

Table 2 Deep models for different applications of MvSD

5.1 Intelligent transportation

With the development of mobile communication and daily human travel, a large amount of traffic data is generated, usually including traffic volume, speed, accidents, trajectories, spatial maps, road networks, etc. Mining and analyzing traffic data has become an urgent problem. Traffic forecasting plays an important role in smart cities, providing constructive guidance for urban planning, intelligent management and public safety, thereby promoting urban construction and avoiding waste of resources. Table 3 summarizes the performance of spatio-temporal models on several publicly available benchmark datasets. On TaxiBJ and NYC Bike, we report the root mean square error (RMSE) for both flow and demand. For the two graph-structured datasets, PeMSD4 and METR-LA, we mainly summarize RMSE and mean absolute error (MAE).

Table 3 Performance of various traffic prediction models on TaxiBJ, NYC Bike, PeMSD4, and METR-LA

As mentioned in Sect. 2.1, traffic data is typically represented as trajectories, events, and raster data. Among them, trajectories and events are usually converted to raster data for processing, and this type of data is then consumed by networks such as CNNs (Lv et al. 2019; Chen et al. 2020; Guo et al. 2019a; Sun et al. 2020a). Traffic forecasting for a single road segment can be regarded as sequence data, which is fed to an RNN or LSTM (Huang et al. 2014; Yang et al. 2016). The road network is represented as a graph, and road network associations are modeled by GCNs (Yu et al. 2017; Geng et al. 2019; Fang et al. 2019; Sun et al. 2020b; Bai et al. 2021). In addition to the diversity of data forms, complex spatial-temporal correlations hinder traffic prediction. The spatial correlation is mainly reflected in different regions and different road segments: traffic in adjacent areas is spatially causal, i.e., it flows from one area to another. The temporal correlation is reflected in the fact that the flow of an area is affected by different time intervals, such as short-term or long-term changes. Therefore, spatial-temporal models are used to deal with this problem (Bai et al. 2019; Zhao et al. 2019; Guo et al. 2019b).

Further, traffic data is affected by external factors (such as weather and accidents), and the data comes in various forms. Some studies (Zhou et al. 2020; Wang et al. 2020f; Zhang et al. 2019) model the temporal dynamics with three views: closeness, period, and trend. In traffic flow forecasting, Guo et al. (2019b) combined spatio-temporal attention and spatio-temporal convolution to construct three basic components modeling the recent, daily-periodic and weekly-periodic dependencies of traffic flow, respectively. In demand forecasting, Bai et al. (2019) took the structured city as one view and external meteorological data and time metadata as the other two views; for the spatial view, the spatial features of each time slot were extracted by a GCN and the spatio-temporal correlations were captured by an LSTM, while the other two views were used to model passenger demand. Yao et al. (2018) built three views: a spatial view, a temporal view, and a semantic view. Geng et al. (2019) studied multiple graphs to model the relationships between different regions, including neighborhood, functional similarity, and connectivity.

Example 1

Zhou et al. (2020) proposed a deep flexible structured spatial-temporal model (DFSSTM) for taxi capacity prediction. The structure of DFSSTM is shown in Fig. 13. First, the traffic data was rasterized into an \(n\times {n}\) grid, and the flow relationships of vehicles were characterized by inflows and outflows. Due to the short-term and long-term dependencies of traffic data, DFSSTM divided temporal dependencies into three views: period, trend and closeness. Subsequently, DFSSTM tailored a siamese spatio-temporal network (SSTN), which took both inflows and outflows as inputs, to model spatio-temporal dependencies; three SSTNs were used to model the three temporal dynamics (period, trend and closeness). Finally, DFSSTM designed a fusion layer that automatically adjusted the weights to integrate the different views.

Fig. 13 Structure of DFSSTM (Zhou et al. 2020)

5.2 Health care

In the medical system, clinicians diagnose patients through comprehensive consideration of multiple factors (for example, previous medical history, various biological indicators, patient physique, etc.). This process is complicated and time-consuming. In order to better assist clinicians in diagnosis, many works apply deep learning technology in the field of medical intelligence (Yuan et al. 2019; Jia et al. 2020; Akman et al. 2021; Phan et al. 2022; Olesen et al. 2021; Piriyajitakonkij et al. 2020; Feng et al. 2021; Torres et al. 2016). Such intelligent systems discover disease patterns by learning from historical data and various clinical indicators, supplemented by expert knowledge. Table 4 summarizes the performance of automatic sleep staging models on the SleepEDF-20, MASS, and SHHS datasets. We report overall accuracy (Acc), Cohen's kappa (\(\kappa\)), and macro F1-score (MF1).

Clinical data usually comes from multiple views and is heterogeneous (for example, verbal descriptions, medical imaging, polysomnography, etc.). To handle asynchronous view sequences, Le et al. (2018) investigated memory techniques to establish cross-view interactions and dependencies, encoding the input views separately through two encoders and saving them to two external memories. In sleep monitoring, Phan et al. (2021) took the raw signal and time-frequency information as input; to address the different over-fitting rates of different views, it dynamically adjusted the learning steps of different modalities and derived view-specific weights for fusing the views. Jia et al. (2021b) proposed a temporal fully convolutional network based on U\(^2\)-Net (Qin et al. 2020) for multimodal salient wave detection, which converted the time-series classification problem into a saliency detection problem; at the same time, the multi-scale features of sleep stages were captured by a multi-scale extraction module. In recent years, the coronavirus (COVID-19) has spread worldwide, causing a large number of human casualties and economic losses, and some works use coughing and breathing audio to determine whether a case is COVID positive or negative (Akman et al. 2021; Nessiem et al. 2021; Coppock et al. 2021).

Table 4 Performance of sleep staging models on SleepEDF-20, MASS, and SHHS

5.3 Crime analysis

Crime prediction plays a vital role in crime prevention. Recent works (Huang et al. 2018; Okawa et al. 2019; Stec and Klabjan 2018; Vomfell et al. 2018) mine the spatio-temporal patterns and trends of crimes from historical data combined with information such as crime incidents, time and location. This contributes to early warning, allowing police to conduct inspections in high-risk regions and reducing the impact on society. However, unlike traffic data, the spatial and temporal distribution of crime incidents is sparse, and there are fewer spatio-temporal associations between different crime incidents.

In order to handle the high spatial heterogeneity of crime distribution, the city is usually converted to raster data to extract spatial attributes, and a recurrent network is then used to capture the temporal dynamics. Wang et al. (2019a) selected crime incidents in a specific area and divided the area into 16\(\times\)16 raster images; motivated by ST-ResNet (Zhang et al. 2017), Wang et al. (2019a) constructed three views to extract nearby, periodic, and trend features separately. To model the association between crime and different regions, DeepCrime (Huang et al. 2018) proposed a category-interactive encoder, which embedded information such as spatial and event categories into latent vectors for representation; in addition, it adopted three GRUs to separately encode crime sequences, abnormal sequences, and interdependent sequences to model the temporal dynamics. In order to make the model interpretable, Rayhan and Hashem (2020) considered an attention-based spatio-temporal network, which captured the dynamic spatio-temporal correlations of crimes based on past criminal events, external features and recurring trends; specifically, two GAT variants were used to embed spatial hierarchical information and specific category features, respectively. CASTNet (Ertugrul et al. 2019) designed a community-attentive spatio-temporal model to capture the spatio-temporal patterns of criminal events, which was used to predict opioid overdose; it extracted opioid overdose patterns at different locations through a multi-head attention network and introduced hierarchical attention to allow interpretation of the contribution of features from different communities to local incident prediction. Table 5 summarizes the performance of deep crime models on the New York City and Chicago datasets. Note that most of these models are derived from spatio-temporal models.

Table 5 Summary of deep crime model performance on New York City and Chicago

5.4 Sentiment analysis

The opinions expressed by humans in daily communication are usually complex and multi-modal, and it is of great significance for computer intelligence to understand these data. Sentiment analysis is also an important branch of future human-computer interaction.

In multi-modal sentiment analysis, the multi-view data are usually represented by visual, acoustic, and language modalities. There are two challenges to be overcome in this task: intra-modality dynamics and inter-modality dynamics. Sun et al. (2020c) proposed a multi-view CRF model, which captured the correlations between features within a single view and considered the relationships between different views. Refs. (Zadeh et al. 2017; Liu et al. 2018) modeled the dynamics of specific modality sequences through three LSTMs and captured the interactions between modalities through a three-fold Cartesian product. Zadeh et al. (2018b) proposed a memory-enhanced network, which modeled the interactions of multiple view sequences over time by introducing a gated memory. Since different modalities are heterogeneous, MISA (Hazarika et al. 2020) mapped different modalities to a common subspace to learn shared feature representations. Mai et al. (2020) designed an adversarial encoder-decoder structure to embed different modalities into a common space and learn modality-invariant representations. In order to deal with unaligned view sequences, Xu et al. (2019) introduced an attention mechanism to align speech and text and realized their integration at the word level. Table 6 summarizes the model performance on MOSI, MOSEI, and SIMS, and the evaluation metrics follow previous work (Rahman et al. 2020).

Table 6 Performance of multimodal sentiment analysis models on MOSI, MOSEI, and SIMS
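The three-fold Cartesian product used by the tensor-fusion line of work above can be implemented as a batched outer product over per-modality embeddings padded with a constant 1, so that unimodal and bimodal terms survive in the fused tensor. The sketch below is a simplified illustration; the embedding sizes and the prediction head are assumptions rather than the settings of the cited models.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Fuse language, acoustic, and visual vectors via a 3-way outer product
    (dimensions are illustrative)."""
    def __init__(self, d_l=32, d_a=16, d_v=16, d_out=64):
        super().__init__()
        fused_dim = (d_l + 1) * (d_a + 1) * (d_v + 1)
        self.post = nn.Sequential(nn.Linear(fused_dim, d_out), nn.ReLU(),
                                  nn.Linear(d_out, 1))        # sentiment score

    def forward(self, z_l, z_a, z_v):                          # each (B, d_*)
        one = z_l.new_ones(z_l.size(0), 1)
        z_l = torch.cat([z_l, one], dim=1)                     # append constant 1
        z_a = torch.cat([z_a, one], dim=1)
        z_v = torch.cat([z_v, one], dim=1)
        # batched outer products: (B, d_l+1, d_a+1) then (B, d_l+1, d_a+1, d_v+1)
        la = torch.einsum('bi,bj->bij', z_l, z_a)
        lav = torch.einsum('bij,bk->bijk', la, z_v)
        return self.post(lav.flatten(1))
```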

In audio-visual event recognition, Brousmiche et al. (2021) studied a multi-level attention fusion network that dynamically integrates visual and audio information for event recognition. Tian et al. (2020) presented a new task, audio-visual video parsing (AVVP), which detects video events (labeling them as audible, visible, or both) and localizes their duration with weak supervision. In this task, events may recur in different views, which raises three challenges: unimodal temporal recurrence, cross-modal co-occurrence, and cross-modal asynchrony. Tian et al. (2020) employed a hybrid attention network to address all three simultaneously: a self-attention mechanism models unimodal temporal recurrence, while a cross-modal attention mechanism handles cross-modal co-occurrence and asynchrony.
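One generic way to realize such a hybrid attention scheme is to give each modality a self-attention branch (for within-modality temporal recurrence) and a cross-attention branch over the other modality (for co-occurrence and asynchrony). The block below is a hedged sketch of this idea, not the exact layer of the cited work; the model width and head count are assumptions.

```python
import torch.nn as nn

class HybridAttentionBlock(nn.Module):
    """Self-attention within a modality plus cross-attention to the other modality."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, other):
        # x, other: (B, T, d_model) segment features of the two modalities
        s, _ = self.self_attn(x, x, x)           # within-modality temporal recurrence
        x = self.norm1(x + s)
        c, _ = self.cross_attn(x, other, other)  # cross-modal co-occurrence / asynchrony
        return self.norm2(x + c)

# each modality gets its own block, e.g.
# audio_out = audio_block(audio, video); video_out = video_block(video, audio)
```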

5.5 Climate science

Weather data is usually collected by various sensor devices and includes temperature, humidity, wind speed, pressure, air quality, etc. Studying the interrelationships among these meteorological data helps humans better understand the earth's environment and take precautions against natural disasters.

Recently, some deep learning methods have been successfully applied to air quality prediction (Yi et al. 2018; Zhong et al. 2020; Sasaki et al. 2021; Ouyang et al. 2021; Lin et al. 2020). Refs. (Cheng et al. 2018; Du et al. 2021; Han et al. 2021c) adopted fully connected (FC) layers to extract local spatial features and LSTMs to capture temporal dynamics. Yi et al. (2018) proposed a spatial transformation component to aggregate spatial monitoring data at different scales; in addition, to model dynamic changes between cross-modal data, these features are fused in a distributed manner. Zhong et al. (2020) introduced reinforcement learning to predict air quality. The model mainly consists of two components, a site selector and an air quality regressor: the site selector adaptively selects relevant sites, and the regressor receives the selected sites for air quality estimation. In water quality prediction, Liu et al. (2016b) studied a multi-task multi-view method to fuse data from different domains and predict the water quality of a site. In extreme weather forecasting, Civitarese et al. (2021) proposed a temporal fusion transformer, which takes multiple variables (static, historical, and future) as input. To address the spatially imbalanced distribution of the collected data, Yao et al. (2019a) constructed a spatio-temporal network based on meta-learning, which transfers knowledge from multiple cities to help the target city make spatio-temporal predictions. Table 7 summarizes the performance of deep air quality models on the Beijing and London datasets; we focus on the overall performance and the performance on PM2.5.

Table 7 Air quality model performance on Beijing and London

Example 2

Du et al. (2019) proposed a deep air quality forecasting framework (DAQFF) for PM2.5 prediction. Figure 14 illustrates the network architecture of DAQFF. First, DAQFF customized multiple 1D convolutional neural networks for the multivariate time series to model local features. Unlike some spatio-temporal models, DAQFF concatenated the temporal features of multiple stations, which captures local features and spatial relationships across stations simultaneously. Subsequently, DAQFF introduced a bi-LSTM to capture long-term temporal dependencies. Finally, the obtained shared features were concatenated and fused for prediction.

Fig. 14 Structure of DAQFF (Du et al. 2019)
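The overall pipeline in Fig. 14 can be approximated by the sketch below: per-station 1D convolutions over the multivariate series, concatenation across stations, a bi-LSTM for long-range dependencies, and a fully connected prediction head. This is a rough reconstruction under assumed channel sizes, station count, and forecast horizon, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DAQFFStyle(nn.Module):
    """1D-CNN + bi-LSTM pipeline in the spirit of DAQFF (sizes are illustrative)."""
    def __init__(self, n_stations=3, n_vars=8, conv_ch=32, lstm_hidden=64, horizon=6):
        super().__init__()
        # one 1D convolutional branch per station over its multivariate series
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(n_vars, conv_ch, kernel_size=3, padding=1),
                          nn.ReLU(),
                          nn.Conv1d(conv_ch, conv_ch, kernel_size=3, padding=1),
                          nn.ReLU())
            for _ in range(n_stations)])
        self.bilstm = nn.LSTM(conv_ch * n_stations, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, horizon)   # future PM2.5 values

    def forward(self, xs):
        # xs: list of n_stations tensors, each (B, n_vars, T)
        feats = [branch(x) for branch, x in zip(self.branches, xs)]  # (B, conv_ch, T)
        fused = torch.cat(feats, dim=1).transpose(1, 2)              # (B, T, conv_ch*S)
        h, _ = self.bilstm(fused)
        return self.head(h[:, -1])                                   # (B, horizon)
```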

6 Future directions

In this section, we discuss some remaining challenges of deep learning on MvSD and point out potential future research directions.

6.1 Interpretable model research

Despite the impressive achievements of deep learning, models operate as black boxes and it is difficult to establish a reasonable basis for their decisions. Interpretable models have therefore become a research hotspot: they build trust with users by explaining why decisions are made in domains such as medical diagnosis, autonomous driving, and recommender systems.

Multi-view data tends to introduce bias because the views are heterogeneous and each view may carry its own deviations, so model errors can accumulate and amplify. Some works focus on feature-level interpretability (Rayhan and Hashem 2020; Ertugrul et al. 2019; Khanehzar et al. 2021), achieving global interpretation by modeling local feature relationships. Attention mechanisms are also used to provide reliable explanations behind the predictions (Jia et al. 2021a; Zheng et al. 2021; Ma et al. 2018; Agyemang et al. 2020). Furthermore, interpretability can be pursued by designing network architectures that conform to human cognition (Choe et al. 2021). At present, there are no mature techniques or standards for evaluating interpretability, so it is difficult to compare the pros and cons of interpretable methods. In the future, interpretability will be further refined to guide the behavior of agents or multi-view fusion decisions.

6.2 Multi-modal architecture research

As network structures become more and more complex, the cost and time of manually designing them become unbearable. This is especially true for MvSD: in the feature extraction stage, a feature extractor must be designed for each view, and there are many alternative network structures for different views; in the multi-view fusion stage, strategies for aggregating multiple views must be considered. Automating this process with neural architecture search (NAS) can speed up research on MvSD.

Auto-MVCN (Li et al. 2020) tailored a multi-view architecture for 3D shape recognition, which explores correlations between view features by automatically searching the fusion cells. In electronic health records, MUFASA (Xu et al. 2021) simultaneously searched modality-specific networks and feature fusion strategies. BM-NAS (Yin et al. 2021b) designed a bilevel search scheme that selects feature pairs from pre-trained unimodal backbones and searches for a feature fusion strategy. To make the model applicable to various multimodal tasks, MMnas (Yu et al. 2020a) defined a general search framework that designs task-specific heads for different tasks on a unified backbone. MFAS (Pérez-Rúa et al. 2019) found reasonable network structures for multimodal fusion by constraining the search space and employing sequential model-based exploration. Although many recent works try to design fusion strategies, how to fuse multi-view sequences is not well studied and needs more research; multi-modal architecture search is a promising way to obtain better models and fusion methods.
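As a toy illustration of what a searchable fusion strategy can look like, the choice among a few candidate fusion operators can be relaxed into a softmax over learnable architecture weights, in the spirit of differentiable NAS. This generic sketch is not the search space of any of the cited methods; the candidate operators and feature size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableFusionCell(nn.Module):
    """Soft selection among candidate fusion operators (differentiable relaxation)."""
    def __init__(self, d=128):
        super().__init__()
        self.cat_proj = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.sum_proj = nn.Sequential(nn.Linear(d, d), nn.Tanh())
        self.alpha = nn.Parameter(torch.zeros(3))          # architecture weights

    def forward(self, a, b):                                # a, b: (B, d) view features
        candidates = [
            self.cat_proj(torch.cat([a, b], dim=1)),        # concatenation + projection
            a + b,                                          # element-wise sum
            self.sum_proj(a + b),                           # transformed sum
        ]
        w = F.softmax(self.alpha, dim=0)                    # relaxed operator choice
        return sum(w[i] * c for i, c in enumerate(candidates))
        # after search, argmax over alpha picks the discrete fusion operator
```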

6.3 Data annotations

Deep learning benefits from massive amounts of data; however, large-scale data annotation brings prohibitive costs, and the problem becomes more severe when annotating MvSD. Therefore, techniques based on unsupervised and semi-supervised learning, among others, have been introduced to facilitate MvSD research.

Unsupervised learning uses unlabeled data. In multi-view representation learning, DUA-Nets (Geng et al. 2021) combined inverse networks with unsupervised learning to automatically evaluate the quality of different views. Through unsupervised training, contrastive learning has achieved great success in computer vision (He et al. 2020a). In multi-modal sentiment analysis, Mai et al. (2021) performed intra-/inter-modal contrastive learning and semi-contrastive learning simultaneously to ensure that intra- and inter-modal dynamics are fully learned. In semi-supervised learning, a small amount of labeled data is combined with a large amount of unlabeled data (Khanehzar et al. 2021; Chen et al. 2021a, b). ASM2TV (Chen et al. 2021a) designed a semi-supervised learning algorithm for fragmented time series that utilizes large amounts of unlabeled data to improve model performance. In weakly supervised learning, labels are usually of low quality; in the AVVP task, for example, only video-level labels are used for training, while precise labels are used at test time (Tian et al. 2020; Wu and Yang 2021; Yu et al. 2021a). Unsupervised and weakly supervised learning, among others, will continue to be researched to reduce the burden of manual multi-view data annotation.
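A common building block behind the contrastive objectives mentioned above is an InfoNCE-style loss between paired views, where the representations of the same sample in two modalities are positives and all other pairs in the batch are negatives. The function below is a standard sketch of this loss, not the specific objective of the cited works.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Cross-modal InfoNCE: matching (z_a[i], z_b[i]) pairs are positives."""
    z_a = F.normalize(z_a, dim=1)          # (B, d) modality-A embeddings
    z_b = F.normalize(z_b, dim=1)          # (B, d) modality-B embeddings
    logits = z_a @ z_b.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # symmetrize over the A->B and B->A retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```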

6.4 Unaligned multi-view sequence learning

As mentioned in Sect. 2.2.5, asynchrony among multi-view sequences (unequal sequence lengths or semantic misalignment) is common in real applications. Ignoring this asynchrony hinders subsequent tasks.

One approach performs multi-view alignment with pre-trained models. Aytar et al. (2017) used a large amount of synchronized data to learn aligned, modality-robust deep representations of three modalities (vision, sound, and language) and provided the model for downstream tasks. In addition, the attention mechanism offers a feasible solution for sequence alignment and cross-view alignment (Tian et al. 2020; Le et al. 2018). Le et al. (2018) designed a memory-augmented network to model the interaction between two unaligned sequences. Some works implement asynchronous sequence alignment with Transformers (Tsai et al. 2019; Delbrouck et al. 2020). Delbrouck et al. (2020) investigated a Transformer-based joint encoding method that encodes one or more modalities jointly and establishes global dependencies between input and output through an attention mechanism. Research on unaligned multi-view sequences frees models from prior manual data alignment and will be further applied in practical scenarios.
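For intuition, cross-modal attention can soft-align an acoustic stream of one length onto a word sequence of a different length without any manual pre-alignment: each word queries the audio and receives a word-level acoustic summary plus an attention map that acts as a soft alignment. The module below is a generic sketch; the feature dimensions are illustrative assumptions, not those of the cited models.

```python
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Soft-align an audio stream of length T_a onto a word sequence of length T_w."""
    def __init__(self, d_word=300, d_audio=74, d_model=128, n_heads=4):
        super().__init__()
        self.q = nn.Linear(d_word, d_model)
        self.kv = nn.Linear(d_audio, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, words, audio):
        # words: (B, T_w, d_word), audio: (B, T_a, d_audio), with T_w != T_a in general
        q = self.q(words)
        kv = self.kv(audio)
        aligned, weights = self.attn(q, kv, kv)   # (B, T_w, d_model), (B, T_w, T_a)
        return aligned, weights                   # word-level acoustic summaries + soft alignment
```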

6.5 Trusted multi-view learning

MvSD is collected from different data sources, and various sensors or environmental factors may degrade the quality of the views and introduce noise. Analyzing such low-quality data, especially when unreliable views are present, seriously hinders multi-view tasks. In addition, for a specific task the amount of information contributed by each view differs, so the weight of each view should not be fixed. Therefore, uncertainty estimation for MvSD helps improve the robustness of multi-view models.

Han et al. (2021d) proposed a unified trusted multi-view classification framework, which applies a Dirichlet distribution to model the class probabilities and parameterizes the evidence from different views to estimate the uncertainty of each view; Dempster-Shafer theory is then used to integrate the multi-view opinions. Geng et al. (2021) designed an unsupervised multi-view learning method that estimates view quality online through uncertainty modeling and integrates the intrinsic information from multiple views to obtain a noise-free representation, thereby reducing the impact of quality imbalance among views. Wang et al. (2019b) studied a negative log-likelihood loss that achieves point prediction and uncertainty quantification simultaneously by predicting the mean and variance of a parameterized Gaussian distribution at each time step. Through uncertainty estimation, a model can exploit valuable information as much as possible while reducing the impact of low-quality views.
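In subjective-logic style evidential modeling, non-negative per-class evidence from one view induces class beliefs and an explicit uncertainty mass, and opinions from two views can then be merged with a reduced Dempster-style combination rule. The sketch below follows this general recipe and should be read as an illustration of the idea rather than a faithful reimplementation of the cited framework.

```python
import torch

def evidence_to_opinion(evidence):
    """evidence: (B, K) non-negative per-class evidence (e.g. softplus of logits).
    Returns class beliefs b of shape (B, K) and uncertainty mass u of shape (B, 1)."""
    K = evidence.size(1)
    alpha = evidence + 1.0               # Dirichlet parameters
    S = alpha.sum(dim=1, keepdim=True)   # Dirichlet strength
    b = evidence / S
    u = K / S
    return b, u

def combine_two_views(b1, u1, b2, u2):
    """Reduced Dempster-style combination of two view-level opinions."""
    # mass assigned to conflicting (disagreeing) class pairs
    conflict = (b1.unsqueeze(2) * b2.unsqueeze(1)).sum(dim=(1, 2)) - (b1 * b2).sum(dim=1)
    scale = 1.0 / (1.0 - conflict + 1e-8)
    b = scale.unsqueeze(1) * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale.unsqueeze(1) * (u1 * u2)
    return b, u
```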

7 Conclusions

In this paper, we review the latest deep learning techniques for MvSD. We introduce four common data types that make up MvSD: point data, sequence data, graph data, and raster data. We also enumerate the technical challenges of MvSD: temporal dynamics, heterogeneity, cross-view dynamics, missing data, and misalignment of asynchronous views. In addition, we summarize the representation methods of different data types in neural networks and review the latest deep learning technologies applied to MvSD. We also summarize some application areas of MvSD, and finally give several potential future research directions.