1 Introduction

Digital image/video coding has developed rapidly alongside the digitalization of information since the late 1950s, because the volume of raw digitized image and video data grows dramatically and quickly exceeds the capacity of storage and transmission. During the early stages of image coding, removing spatial statistical redundancy was the main means of image compression, for example with Huffman coding [1] and run-length coding [2]. The concept of transform coding, which maps the spatial domain into the frequency domain for compression, was first proposed in the late 1960s, including the Fourier transform [3] and Hadamard transform [4]. Later, the discrete cosine transform (DCT) was designed for image coding in 1974 by Ahmed et al. [5]. In the case of video, there is significant temporal redundancy in addition to spatial redundancy, which can be reduced by applying temporal prediction. Several early prediction-based coding techniques were introduced during the 1970s, including differential pulse-code modulation (DPCM) [6], frame difference coding [7], and block-based motion prediction [8]. A prototype of a hybrid prediction/transform coding scheme [9] was first proposed in 1979 by Netravali and Stuller, who combined motion compensation with transform coding techniques; this line of work is commonly referred to as “the first generation” of coding schemes. An overview of the historical development of the first-generation methods is provided in [10].

After several decades of development, hybrid prediction/transform coding methods have achieved great success. Various coding standards have been developed and are widely used in a variety of applications, such as MPEG-1/2/4 (Moving Picture Experts Group), H.261/2/3, and H.264/AVC (Advanced Video Coding) [11], as well as AVS (Audio and Video Coding Standard in China) [12–15], H.265/HEVC (High Efficiency Video Coding) [16], and H.266/VVC (Versatile Video Coding) [17]. In [11, 16–26], the traditional hybrid coding methods have been well reviewed, from the historical pulse-code modulation (PCM) and DPCM coding to HEVC, three-dimensional video (3DV) coding, and VVC.

With the huge number of mobile devices, surveillance cameras, and other video capture devices, the volume of video data is increasing significantly. In the coming era of big data, image and video processing will require more efficient and effective coding techniques. Nevertheless, researchers in this field have also acknowledged the difficulty of further improving performance under the traditional hybrid coding framework. One reason for this limitation is that traditional coding methods only consider the signal properties of images and videos, and the room left for improvement is increasingly squeezed under the constraint of objective quality measurement, e.g. the peak signal-to-noise ratio (PSNR). As such, many novel coding methods that incorporate the properties of the human visual system (HVS), referred to as the second-generation coding methods [27–30], have demonstrated a higher compression ratio than traditional coding methods while maintaining comparable subjective image quality. Compared to the first-generation coding methods, these methods depend more on structural object-related models than on the source signal. From Musmann’s viewpoint [31], model-based coding (MBC) comprises the first-generation and second-generation methods, which are based on signal source or structural object-related models. MBC attracted the interest of researchers, research in this field has advanced greatly, and some exciting results have been achieved. For example, in [32], a background picture model-based surveillance video coding method achieves at least twice the compression ratio of the AVC high profile on surveillance videos. Moreover, other model-based coding methods display great potential nowadays and achieve clear improvements over the traditional hybrid coding methods, such as geometric partition video coding [33] and segmentation-based coding [34]. Some MBC methods were also introduced into various coding standards, such as MPEG-4/7, AVS2, HEVC, and VVC. The developments in MBC have been well reviewed in [35–42].

Although MBC aims to improve coding efficiency, many challenging problems still limit the effectiveness of the coding process, such as manually designed coding paradigms based on expert knowledge. During the last few years, neural networks, such as convolutional neural networks (CNNs), have demonstrated considerable potential in a variety of fields, including image and video understanding, processing, and compression. In terms of the compression task, neural networks perform transform coding by first mapping pixel data into quantized latent representations and then converting them back into pixels. Such a nonlinear transform holds the potential to map pixels to a more compact latent representation than the transforms of preceding codecs. Moreover, the parameters in neural networks can be well trained on massive image and video samples, which allows the model to alleviate its reliance on manually designed modules. Considering these excellent characteristics, learning-based coding (LBC) has been recognized as a promising solution for image and video coding.

In this paper, we will present an overview of intelligent video coding (IVC) development from MBC to LBC, in which the two technologies encode videos leveraging knowledge in different manners. The technical roadmap of IVC methods is summarized in Fig. 1. The similarity between MBC and LBC is that similar components, such as transform, quantization, and entropy coding, are adopted to construct the framework to exploit the correlation of textural content and remove redundancy. The difference lies in that the former relies on manually designed modules, while the latter relies on a data-driven strategy or components using machine learning. The rest of the paper is organized as follows. In Sect. 2, a brief introduction to the history of MBC is provided. Section 3 provides an overview of recent advancements in learning-based approaches for visual signal compression, including learned image compression and learning-based video coding. Section 4 introduces our previous attempts and understanding of IVC. In Sect. 5, we discuss the future directions of IVC, specifically from the perspectives of standardized potentials, data security, and generalization. Section 6 concludes this paper.

Figure 1 The technical roadmap of intelligent video coding methods including model-based and learning-based compression algorithms

2 Model-based coding

MBC focuses on modeling and coding the structural visual information in images and videos. The history of MBC can be traced back to the 1950s [43]. In [43], Schreiber et al. proposed a Synthetic Highs coding scheme, in which the image content is divided into textures and edges that are coded by different approaches, e.g. statistical coding methods for textures and visual model-based coding methods for edges; this scheme was the predecessor of current HVS model-based perceptual coding methods. In [38], Pearson clarified the term “model” in MBC, explaining it as object-related models developed from the source model in signal processing, as shown in Fig. 2. A video sequence containing one or more moving objects is analyzed to yield information about the size, location, and motion of the objects, which is employed to synthesize a model of each object as animation data. The animation data are coded and transmitted to the decoder. Moreover, the residual pixel data, comprising the difference between the original video sequence and the sequence derived from the animated model, are also transmitted to the decoder. The decoder adopts the animation data to synthesize the model, which is subsequently combined with the residual pixel data to reconstruct the image sequence. From Musmann’s viewpoint [31], MBC includes pixel MBC, block motion MBC, and object MBC, i.e. the first-generation and second-generation methods. In this paper, we follow Musmann’s viewpoint and provide an MBC classification that presents the historical development of the model from the signal source to the object and to the content understanding of the objects, as summarized in Table 1. Table 1 shows the evolution of MBC from statistical pixels and blocks to geometric partitions and structural segmentation, and from content-aware objects to the understanding of content, including knowledge, semantics, and the knowledge of the HVS. Moreover, many coding standards based on MBC have been developed, such as MPEG-4/7. In this section, we give a brief introduction to the methods and standards based on MBC.

Figure 2 Principle of model-based coding (MBC) from [38]

Table 1 Classification of MBC approaches

2.1 Model-based coding methods

In the historical evolution of MBC, pixel model-based video coding, e.g. PCM [44], was once used in early applications with limited memory and computational resources, and was later replaced with block-based motion model coding [45–48]. However, the rectangular partition of block-based coding is rigid and inefficient for modeling irregular visual signals. As a variation of the block-based motion model, more flexible geometric partitions were proposed for motion compensation, including deformable blocks [49], meshes [50, 51] and triangles [52], and they were also studied for H.264/AVC [53, 54], HEVC [55] and VVC [33]. Although geometric partitions are flexible, they are still constrained by their fixed patterns. Therefore, a more flexible and finer-grained partition is based on the input signal itself, such as contours and segmentation, rather than pre-defined geometric partitions. Graham proposed a two-dimensional contour coding method in [56], which can be viewed as a predecessor of segmentation coding, and Biggar first formally utilized a segmented image coder that outperformed the transform coder in [57]. Since then, a variety of studies on segmentation-based coding have been performed, covering segmentation-based coding schemes [34, 58–62] and segmentation methods [63, 64].

The MBC methods mentioned above explore flexible and fine-grained partitions without considering knowledge of the objects or scenes in the world. Since different classes of objects or scenes always exhibit different kinds of appearance and motion patterns, modeling such patterns as knowledge and incorporating them into coding can further improve the compression ratio for particular image classes. This higher performance comes at a cost: modeling and incorporating knowledge require considerable manpower for manual design, and knowledge of one object or scene cannot always be transferred to others, which limits applicability in wild scenarios. In the following part of our paper, we review the development of MBC methods using knowledge. Accompanying the emergence of segmentation-based coding, object-based coding is a further prolongation of segmentation coding, where a segment may represent one identified object [65–67]. In [65], three parameter sets were used to define the motion, shape, and color of an object, which can be used to reconstruct an image by the model-based image synthesis method. In [66], a generic object-based coding algorithm was proposed relying on the definition of a spatial and temporal segmentation of the sequences. Moreover, object-based coding was further applied to special videos, such as surveillance video or 3D video [68, 69], and to motion compensation for codecs [70]. Based on the knowledge of known objects, knowledge and semantic-based coding methods were developed, such as parameterized modeling for facial animation [71–76]. Modeling the scene or image content directly is difficult and restricted in wild scenarios; in contrast, perceptual coding [77–86] attempts to incorporate the vision model into the coder by using knowledge of the HVS [87]. In [87], a nonlinear mathematical HVS model was proposed for image compression, which was developed from the psycho-visual and physiological characteristics of the HVS, and a reduced achromatic model was developed as a nonlinear filter followed by a bandpass spatial filter. Texture analysis and synthesis coding, as described in Fig. 3, lies at the intersection of perceptual coding and segmentation coding [79, 88–92]; it places a texture analyzer in the encoder and a texture synthesizer in the decoder to bring texture information into the coding process.

Figure 3 Diagram of the texture analysis and synthesis (TAS) encoder, and texture synthesis (TS) decoder

To achieve higher-efficiency compression of audio-visual information at a relatively low bit rate, significant efforts have been devoted by standardization organizations. MPEG started to develop the international standard MPEG-4 in 1993 [93]. MPEG-4 is based on object-based coding, which concentrates on analyzing and synthesizing the objects in an image [66] and has several advantages over block-oriented schemes, e.g. adaptation to local image characteristics and object motion compensation as opposed to blockwise motion compensation. In MPEG-4, each picture is considered to consist of temporal instances of objects that undergo a variety of changes. Therefore, the concepts of video objects, as well as their temporal instances of video object planes, are introduced in MPEG-4. Specifically, in MPEG-4, each video object is encoded separately and multiplexed into a single bitstream that can be accessed by the users. The encoder sends the video objects and the information about the scene composition for storage and transmission. On the decoder side, the coded data are de-multiplexed and decoded separately, and the reconstructed objects are then fused into the final decoded frame.

MPEG-7 [94], released in 2001, is another standardized attempt at content description representation: a multimedia content description standard. It differs from the previous formats MPEG-1/2/4 in that it does not deal with the coding of moving pictures and audio. MPEG-7 addresses how humans expect to interact with computer systems, since it develops rich descriptions that reflect those expectations. It uses XML Schema as the language of choice for content description, allowing fast and efficient searching for material of interest to the user.

Beyond MPEG-4/7, some novel MBC methods have been explored in other video coding standards, such as screen video coding in HEVC [95] and scene video coding in AVS2 [96]. Screen video refers to the consecutive images generated or rendered by computers or other electronic devices, and such video may contain computer-generated screen content as well as natural images/videos. In [95], two new coding tools, residual scalar quantization (RSQ) and base colors and index map (BCIM), were proposed for screen video coding. RSQ directly quantizes the intra-prediction residual without applying a transform, since screen content often has high contrast and sharp edges. In BCIM, a base color table is created first by clustering. Then, each sample in the block is quantized to the nearest base color and recorded in the index map. Scene video is captured in specific scenes, such as surveillance video and videos from classrooms, homes, and courts, which are characterized by temporally stable backgrounds. Regarding scene video coding, background modeling schemes [32, 97] were proposed to achieve more accurate prediction without dependence on foreground segmentation. Based on these methods, AVS2 proposed a background-picture-model-based coding method to achieve higher compression performance [96].
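To make the BCIM idea more concrete, the following sketch builds a base color table for a single screen-content block with a small k-means-style clustering and records each sample as an index into that table; the number of base colors, the clustering routine, and the data layout are illustrative assumptions rather than the exact design of [95].

```python
import numpy as np

def bcim_encode(block, num_base_colors=8, iters=10):
    """Toy base-colors-and-index-map (BCIM) encoder for one screen-content block.

    A small k-means-style clustering builds the base color table; every sample
    is then quantized to its nearest base color and stored as an index."""
    pixels = block.reshape(-1, block.shape[-1]).astype(np.float64)   # (N, 3) for RGB
    rng = np.random.default_rng(0)
    # Initialize the base color table with random samples from the block.
    table = pixels[rng.choice(len(pixels), num_base_colors, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest base color (this is the index map).
        dists = np.linalg.norm(pixels[:, None, :] - table[None, :, :], axis=2)
        index_map = dists.argmin(axis=1)
        # Update each base color as the mean of the samples assigned to it.
        for k in range(num_base_colors):
            members = pixels[index_map == k]
            if len(members):
                table[k] = members.mean(axis=0)
    return table.round().astype(np.uint8), index_map.reshape(block.shape[:2])

def bcim_decode(table, index_map):
    """Reconstruct the block by looking up each index in the base color table."""
    return table[index_map]
```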

3 Learning-based coding

MBC relies on manually designed modules whose components are heavily engineered to fit together. Such a design means that the structure of the signal is manually engineered, and thus the capability of MBC to eliminate redundancy is limited. The motivation of LBC is that, with components similar to those of MBC, LBC models are trained on massive image and video samples to determine the coding strategy automatically and alleviate the dependence on manually designed coding paradigms based on expert knowledge. With an automatic coding strategy, LBC enables the structure to be discovered automatically so that redundancy is eliminated more efficiently, which shows great potential to achieve better coding performance. In general, the similarity between MBC and LBC is that they share similar components to remove the redundancy in the signal, and the difference is that the former relies on manually designed modules while the latter relies on a data-driven strategy or modules obtained by machine learning. In the literature, numerous LBC approaches have been proposed. LBC can be grouped into three categories, namely statistical learning, sparse representation, and deep learning-based methods.

Statistical learning has been incorporated into image/video compression to reduce coding complexity or improve compression performance, using tools such as the support vector machine (SVM) [98], Bayesian decision [99], random forest [100], decision tree [101], and AdaBoost [102]. An SVM was used as a classifier to determine the early splitting or pruning of a coding unit (CU) [103]. In [104], the Bayesian decision rule was employed with skip states to early terminate the binary-tree (BT) and extended quad-tree (EQT) partition. In [105], a random forest classifier was used to determine the most likely partition modes. A fast intra-coding scheme was proposed in [106], where a low-complexity coding tree unit (CTU) structure was derived with a decision tree, and the optimal intra mode was decided with the gradient descent principle. AdaBoost is incorporated in [107] as a classifier for CU partition determination. Although these methods are data-driven to discover the best strategy for compression, they are adopted as classifiers operating on manually designed features within coding standards, and thus their generalization is limited by those manually designed features.
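As an illustration of how such a classifier can shortcut the rate-distortion (RD) search, the sketch below trains an SVM on a few hand-crafted CU features and uses its confidence to prune or terminate the split decision early; the feature set, the confidence threshold, and the training-data files are hypothetical and are not taken from [103].

```python
import numpy as np
from sklearn.svm import SVC

def cu_features(block, qp):
    # Simple hand-crafted features: sample variance, mean gradient energy, QP.
    gy, gx = np.gradient(block.astype(np.float64))
    return [block.var(), np.abs(gx).mean() + np.abs(gy).mean(), qp]

# X: features of previously encoded CUs; y: 1 if the full RD search chose to split.
X_train = np.load("cu_features.npy")        # hypothetical offline training data
y_train = np.load("cu_split_labels.npy")
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

def early_split_decision(block, qp, threshold=0.9):
    """Skip part of the RD search when the classifier is confident enough."""
    p_split = clf.predict_proba([cu_features(block, qp)])[0, 1]
    if p_split > threshold:
        return "split"            # prune the non-split RD evaluation
    if p_split < 1 - threshold:
        return "no_split"         # terminate splitting early
    return "full_rd_search"       # fall back to exhaustive RD optimization
```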

A sparse representation of a signal consists of a linear combination of relatively few base elements in a basis or an overcomplete dictionary. Signals that can be represented sparsely are termed compressible under the learnable dictionary. Some research efforts were dedicated to learning dictionaries that adapt to a signal class for image compression [108–110]. Bryt and Elad [108] proposed a K-SVD (singular value decomposition) dictionary-based facial image codec, training K-SVD dictionaries for predefined image patches. The encoding is based on sparse coding of each image patch with the trained dictionary, and the decoding is a simple reconstruction of the patches by a linear combination of atoms. Sezer et al. [109] adopted a concatenation of orthogonal bases as the dictionary, where each basis is selected to encode any given image block of fixed size. Zepeda et al. [110] proposed an iteration-tuned and aligned dictionary (ITAD)-based image codec [111] for particular image classes, such as facial images. ITAD is used as a transform to code image blocks taken over a regular grid. Although some encouraging results were achieved, sparse representation-based coding is designed for particular image classes due to the nature of sparse representation, and thus it is hard to generalize to wild images encountered in practical scenarios.
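The following sketch illustrates the general dictionary-based patch coding idea behind such codecs: an overcomplete dictionary is learned from training patches, and each patch is then represented by a handful of atom indices and coefficients, which is what would subsequently be quantized and entropy coded. The patch size, dictionary size, sparsity level, and training file are illustrative assumptions rather than the settings of [108–110].

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

patch_size, n_atoms, sparsity = 8, 256, 4

# Training patches: each row is a vectorized 8x8 patch from a face image set.
train_patches = np.load("face_patches.npy")            # hypothetical data file
dico = MiniBatchDictionaryLearning(n_components=n_atoms).fit(train_patches)
D = dico.components_.T                                  # (64, 256) overcomplete dictionary

def encode_patch(patch):
    """Sparse-code one patch: keep only the indices and values of the few
    nonzero coefficients selected by orthogonal matching pursuit."""
    coefs = orthogonal_mp(D, patch.reshape(-1, 1), n_nonzero_coefs=sparsity)
    nz = np.flatnonzero(coefs)
    return nz, coefs[nz, 0]

def decode_patch(indices, values):
    """Reconstruct the patch as a linear combination of the selected atoms."""
    return (D[:, indices] @ values).reshape(patch_size, patch_size)
```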

Recently, neural networks have been widely explored in image/video coding, which is termed deep learning-based coding. Deep learning-based coding has some advantages over statistical learning and sparse representation-based coding. First, neural networks can mine the underlying characteristics of data, exploit the spatial correlation of textural content, and learn features adaptively rather than relying on manually designed features. Second, with massive training data, deep learning-based coding can generalize to wild images and videos. In the following part of this article, we introduce the history of deep learning-based image and video coding methods, which originated mainly in the late 1980s with neural network techniques. Some representative works are listed in Table 2. Interested readers may refer to existing reviews for related literature [112–114].

Table 2 Representative works of deep learning-based image and video coding

3.1 Learning to compress still images

A multilayer perceptron (MLP) [140] includes an input layer of neurons, several hidden layers of neurons, and an output layer of neurons. This structure lends itself to scenarios such as dimension reduction and data compression. Chua et al. [115] proposed an end-to-end image compression framework based on the compact representation of the neural network and leveraging high parallelism. The following work [116] trained a fully connected network with backpropagation to compress each 8 × 8 patch of the input image. Sonehara et al. [117] proposed a dimension-reduction network to compress the image; in addition, the framework used quantization and entropy coding as individual modules. Furthermore, an MLP-based predictive image coding algorithm [141] was used to exploit spatial context information. To reduce training time, the nested training algorithm (NTA) was proposed for image compression [142] with an MLP-based hierarchical neural network. A new class of random neural networks [143] was introduced in 1989. Different from the MLP, signals in random neural network methods are in the spatial domain. Some researchers have considered combining the random neural network with image compression. Gelenbe et al. [144] applied a random neural network to the image compression task, which was further improved in [145] by integrating the wavelet domain of images.
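A minimal sketch of this classic MLP patch autoencoder is given below: each 8 × 8 patch is flattened, squeezed through a narrow bottleneck whose activations act as the transmitted code, and expanded back to 64 pixels; the bottleneck width and training loop are illustrative choices rather than the configurations of [115–117].

```python
import torch
import torch.nn as nn

class PatchMLPCoder(nn.Module):
    def __init__(self, patch_dim=64, code_dim=16):
        super().__init__()
        # Narrow hidden layer: its activations are the compact code to transmit.
        self.encoder = nn.Sequential(nn.Linear(patch_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, patch_dim), nn.Sigmoid())

    def forward(self, patches):            # patches: (batch, 64), values in [0, 1]
        code = self.encoder(patches)       # compact representation (to be quantized)
        return self.decoder(code)          # reconstructed 8x8 patch, flattened

model = PatchMLPCoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.rand(256, 64)              # stand-in for real training patches
for _ in range(100):                       # plain reconstruction training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(patches), patches)
    loss.backward()
    optimizer.step()
```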

The recurrent neural network (RNN) includes a class of neural networks with memory modules to store recent information. Toderici et al. [118] proposed an RNN-based image compression framework by utilizing a scaled-additive module for coding. Minnen et al. [119] presented a spatially adaptive image compression framework that divided the image into tiles for better coding efficiency.

With the development of CNNs, many deep learning-based frameworks outperform traditional algorithms in both low-level and high-level computer vision tasks [146]. Under the scalar quantization assumption, Ballé et al. [120, 121] introduced an end-to-end optimized neural framework for image compression based on CNNs in 2016. A typical end-to-end learned image coding framework is illustrated in Fig. 4 (a). During training, Ballé et al. added i.i.d. uniform noise to approximate the quantization operation, so that stochastic gradient descent remains applicable and the zero derivatives of hard rounding are avoided. The joint rate-distortion optimization problem can be cast in the context of variational auto-encoders (VAE) [147]. Follow-up work extended the compression model by using scale hyperpriors for entropy estimation [122], which achieved better performance than HEVC. Minnen et al. [123] enhanced the context model of entropy coding for end-to-end optimized image compression. Cheng et al. [125] proposed discretized Gaussian mixture likelihoods and attention modules to further improve the performance.
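The core training trick can be summarized in a few lines: hard rounding has zero gradients almost everywhere, so training perturbs the latent with i.i.d. uniform noise while inference uses true rounding, and the loss combines an estimated rate with a distortion term. The sketch below assumes a generic learned entropy model `p_y` that returns per-element likelihoods and is intentionally left abstract; it is not the exact formulation of [120, 121].

```python
import torch

def quantize(y, training):
    """Quantization proxy: additive uniform noise during training, rounding at test time."""
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)   # differentiable surrogate
    return torch.round(y)                                     # actual integer quantization

def rate_distortion_loss(x, x_hat, y_hat, p_y, lam=0.01):
    """Generic R + lambda*D objective; p_y is a learned entropy model giving
    per-element likelihoods of the (noisy or rounded) latent y_hat."""
    rate = -torch.log2(p_y(y_hat)).sum() / x.numel()   # estimated bits per pixel
    distortion = torch.mean((x - x_hat) ** 2)          # MSE between input and reconstruction
    return rate + lam * distortion                     # lam trades rate against distortion
```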

Figure 4 The frameworks of the typical end-to-end learned image coding and conceptual coding, as well as the exemplar texture modeling and image synthesis process of conceptual coding

Generative adversarial networks (GANs) are developing rapidly among deep neural network applications. Rippel and Bourdev [126] proposed an integrated and well-optimized GAN-based image compression system. Inspired by the advances in GAN-based view synthesis, light field (LF) image compression can achieve significant coding gains by generating the missing views from the sampled context views in the LF [119]. In addition, Gregor et al. [148] introduced a homogeneous deep generative model, DRAW, into their coding framework. Different from previous works, Gregor et al. aimed at conceptual compression by generating as much of the image semantic information as possible [128]. Agustsson et al. [129] built an extreme image compression system using unconditional and conditional GANs, outperforming all other codecs under low bit-rate conditions. Agustsson et al. [149] proposed using the learned perceptual image patch similarity (LPIPS) [150] as the metric for generator training, which further improves the subjective quality of the reconstructed image.

3.2 Learning-based video coding

In this section, we review the development of learning-based video coding. First, we introduce pure learning-based video coding methods. Second, a combination of deep learning and the hybrid video coding framework is presented. Third, we compare these two coding architectures.

Similar to learning-based image coding frameworks, many novel video coding frameworks are built on neural network models to reduce temporal redundancy. As a natural extension of learning-based image coding methods, 3D auto-encoders have been proposed to encode quantized spatiotemporal features with an embedded temporal conditional entropy model. Chen et al. [130] proposed DeepCoder, which combines several CNNs with a low-profile x264 encoder for video compression. Wu et al. [151] later applied an RNN-based video interpolation module and combined it with a residual coding module for inter-frame coding. Inspired by generative models that predict future frames [152], Srivastava et al. [132] proposed utilizing a long short-term memory (LSTM) encoder-decoder framework to learn video representations, which can be utilized to predict future video frames. Different from Ranzato’s work [152], which predicts one future frame, this model can predict a long sequence into the future. Agustsson et al. [153] further presented a scale-space flow generation and trilinear warping method for motion compensation. Habibian et al. [133] utilized a rate-distortion auto-encoder to directly exploit spatiotemporal redundancy in a group of pictures (GoP) with a temporal conditional entropy model. Lombardo et al. [134] followed the VAE-based image compression framework and encoded the representation according to predictions of a sequential network. With the emergence of GANs, combining an auto-encoder with adversarial training has been regarded as a promising direction. Wang et al. [138] demonstrated a novel subject-agnostic face reenactment method for video conferencing, achieving an order of magnitude bandwidth savings over the H.264 standard. Benefiting from adversarial training, GAN-based coding models reconstruct video with pleasing perceptual quality at low bitrates, unlike VAE-based video coding methods, which tend to reconstruct blurry videos.

Following hybrid video coding systems, recent studies have demonstrated the effectiveness of deep learning models in five main modules, i.e. intra-prediction, inter-prediction, quantization, entropy coding, and loop filtering. For intra-prediction, Cui et al. [154] proposed an intra-prediction convolutional neural network (IPCNN) to improve intra-prediction efficiency. Instead of using a CNN, Li et al. [155] proposed a fully connected network (IPFCN) for intra-prediction. In [156], Li et al. explored CNN-based down/up-sampling techniques as a new intra-prediction mode for HEVC. To alleviate the effects of compression noise on the upsampling CNN, Feng et al. [157] designed a dual-network-based super-resolution strategy that bridges the low-resolution image and the upsampling network with an enhancement network. In hybrid video coding, inter-prediction is realized by motion estimation between previously coded frames and the current frame. Huo et al. [158] utilized a variable-filter-size residue-learning CNN (VRCNN) to refine motion compensation for inter-prediction improvement [159]. Yan et al. [160] proposed a fractional-pixel reference generation CNN (FRCNN) to predict the fractional pixels for fractional-pixel motion compensation in inter-prediction. Instead of dealing with fractional pixels, some works [161, 162] have directly explored inter-prediction block generation using CNN-based frame rate up-conversion (FRUC). In addition to FRUC, the two nearest bi-directional reference frames in the reference list are utilized as the network input in [163]. Regarding the limitation of traditional bidirectional prediction, which uses a simple average of two prediction hypotheses, [161, 164] further improved its efficiency by leveraging a six-layer CNN with a 13 × 13 receptive field to infer the inter-prediction block in a nonlinear fashion. Utilizing compressed optical flows to directly specify motion is also effective for inter-frame prediction. In addition, bi-directional motion was studied in [165, 166] by additionally exploring future frames. Both long-term and bi-directional predictions attempt to better characterize complex motion to improve coding efficiency. Liu et al. [135] used a pyramid optical flow decoder for multi-scale compressed optical flow estimation and applied a progressive refinement strategy with joint feature- and pixel-domain motion compensation. Zhao et al. [167] adopted previously reconstructed frames, optical flow-based prediction, and a background reference frame to infer the foreground objects of the frame to be coded. In video coding, quantization and entropy coding are the lossy and lossless compression procedures, respectively. In [168], Alam et al. proposed a two-step quantization strategy using neural networks. After quantization, the syntax elements, including coding modes and transform coefficients, are fed into the entropy coding engine to further remove their statistical redundancy. Song et al. [169] improved the performance of context-adaptive binary arithmetic coding (CABAC), which HEVC adopts for entropy coding, on compressing the syntax elements of the 35 intra-prediction modes by leveraging a CNN to directly predict the probability distribution of intra modes instead of using manually designed context models. Loop filtering was proposed to remove compression artifacts. Zhang et al. [170] established a residual highway convolutional neural network (RHCNN) for loop filtering in HEVC.
By leveraging the coherence of spatial and temporal adaptations, Jia et al. [171] improved the performance of CNN-based loop filtering by designing a spatial-temporal residue network (STResNet)-based loop filter. Moreover, Jia et al. further improved the filtering performance by introducing a content-aware CNN-based loop filter in [172]. More in-loop filters based on neural networks can be found in [173, 174]. Beyond in-loop filters, some post-filtering algorithms [175, 176] have been proposed to improve the quality of decoded video and images by reducing compression artifacts.
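A minimal sketch of the residual-learning idea shared by these CNN loop filters is shown below: the network predicts the compression artifact (the residue between the reconstructed and the original frame) and adds it back, rather than regressing clean pixels directly. The depth, width, and single-channel input are illustrative choices, not the architectures of RHCNN or STResNet.

```python
import torch
import torch.nn as nn

class ResidualLoopFilter(nn.Module):
    def __init__(self, channels=32, num_layers=6):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, reconstructed):
        # Global residual connection: output = decoded frame + predicted correction.
        return reconstructed + self.body(reconstructed)

# Training target: minimize the MSE between filtered frames and pristine originals.
filt = ResidualLoopFilter()
decoded = torch.rand(4, 1, 64, 64)     # stand-in for codec-reconstructed luma blocks
original = torch.rand(4, 1, 64, 64)    # corresponding uncompressed blocks
loss = nn.functional.mse_loss(filt(decoded), original)
```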

Pure learning-based video coding methods and combined deep learning and hybrid video coding methods each have their advantages and disadvantages. Pure learning-based video coding is developing rapidly and its performance is becoming competitive with traditional video coding, with room left for further improvement. However, its decoding complexity is relatively high, the different models are relatively independent, and their bitstreams are not interoperable. Combined deep learning and hybrid video coding methods are built upon traditional hybrid video coding, which has been refined over several decades, and thus their performance starting point is higher than that of pure learning-based video coding, which is trained from scratch. However, such combined coding only replaces some modules of the hybrid framework with deep learning, so the different modules cannot be optimized jointly to achieve higher performance.

3.3 Learning-based coding standards

To enable interoperability between devices manufactured and services provided by different companies, a series of standards targeting intelligent visual data coding have been investigated in the past several years. Several standardization organizations, including ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), JPEG (Joint Photographic Experts Group)/MPEG, ITU-T (International Telecommunication Union Telecommunication Standardization Sector), VCEG (Video Coding Experts Group), JVET (Joint Video Experts Team), AVS, IEEE DCSC (Data Compression Standard Committee), MPAI (Moving Picture, Audio and Data Coding by Artificial Intelligence), and others, have been creating these standards with many contributions from academia and industry. While most of these visual coding standards have been very successfully deployed in many applications, there are currently many challenges, especially in accommodating the large volume of visual data within limited storage and limited-bandwidth transmission links. Compression efficiency improvements are still needed, especially considering emerging data representation formats, from 8K/HDR (high dynamic range) image/video to rich plenoptic formats.

To improve compression efficiency, machine learning technologies, such as deep neural network-based technologies, have shown great potential for many types of visual data. Thus, new standardization activities that exploit this potential are ongoing, some more mature than others, such as learning-based image and video coding, learning-based point cloud coding, and learning-based light-field coding. These standardization efforts have attracted significant attention in the aforementioned standardization organizations. The IEEE 1857.11 and JPEG AI groups have been preparing neural image coding standards in recent years. The MPAI end-to-end video project and enhanced video coding project are also exploring neural network-based video coding solutions. The JVET NNVC (neural network-based video coding) and AVS intelligent coding ad-hoc groups have released reference models that integrate neural networks into the conventional hybrid framework. All of the above-mentioned efforts are advancing neural network-based video coding for future use cases.

4 Our attempts at intelligent coding

LBC compresses the signal into a compact latent representation that carries knowledge in a non-interpretable form. Moreover, such a mechanism is not analysis-friendly enough to assist downstream machine analysis tasks. A novel LBC paradigm that incorporates a more interpretable representation with powerful neural networks may achieve better coding performance, and the interpretable representation may also benefit machine analysis. In this section, we introduce our attempts at such a paradigm, including conceptual image coding, generative video coding, and cross-modal coding.

Inspired by the human visual system (HVS) [177], which perceives visual content by processing and integrating manifold information into abstract high-level concepts (e.g., structure, texture, and semantics) that form the basis of subsequent cognitive processes [178], conceptual compression has been an active research area in recent years [128, 179–182], following the insights of Marr [183] and Guo et al. [184]. Conceptual coding aims to encode images into compact, high-level interpretable representations for high visual-quality reconstruction, allowing a more efficient and analysis-friendly compression architecture. At present, multi-layer decoded representations are integrated to synthesize target images in a deep generative fashion. The main challenges for conceptual coding are how to achieve efficient representation disentanglement and how to devise effective generative models for high visual-quality reconstruction. Gregor et al. [128] introduced the convolutional deep recurrent attentive writer (DRAW) [148], which extends the VAE [147] by using RNNs as encoder and decoder, to transform an image into a series of increasingly detailed representations. However, the interpretability of the learned representations is still insufficient, and the models in [128] only worked on datasets of small resolution. Neural video compression suffers from similar constraints. Typical video compression methods [134] share the same VAE architecture with image compression methods [128] and transform the original sequence into a lower-dimensional representation. However, the interpretability of the learned representations for video still lacks exploration. Therefore, building on the conventional neural network-based image/video compression of Sect. 3, in this section we introduce interpretable representations, such as structure information or high-level semantic information, into the compression process to enhance the interpretability of the representations for both images and videos.

4.1 Conceptual image coding

We propose encoding images into two complementary visual components [179, 180] as a milestone for conceptual coding of images. The structure and texture representations are disentangled, as demonstrated in Fig. 4 (b), where a typical texture modeling process is illustrated in Fig. 4 (c) and the typical image synthesis process is depicted in Fig. 4 (d). A stylized illustration of disentangled structure and texture representations in domain spaces, from our earlier study, is shown in Fig. 5. In our dual-layered model of [179, 180], the structure layer is represented by edge maps, and the texture layer is extracted with a variational auto-encoder in the form of low-dimensional latent variables. To reconstruct the original image from the compressed layered features, we integrate the texture layer and structure layer with adaptive instance normalization in a hierarchical fusion GAN [180]. The benefits of the proposed conceptual compression framework in [180] have been demonstrated through extensive experiments, showing extremely low bitrates (<0.1 bpp) and high visual reconstruction quality, as well as support for content manipulation and analysis tasks. Nevertheless, it is very challenging to model complex textures of the whole image using only a set of variables. In addition, how to build effective entropy models for visual representations had not been explored for joint rate-distortion optimization. In our recent study [181], semantic prior modeling for conceptual coding was proposed: effective texture representation modeling and compression at semantic granularity are explored for high-quality image synthesis and promising coding efficiency. Moreover, we developed a cross-channel entropy model in [181] for joint texture representation compression and reconstruction optimization. Structural modeling was further introduced in our work [185], which proposed a consistency-contrast learning method that optimizes the texture representation space by aligning it with the source pixel space, resulting in higher compression performance. Our proposed models in [181, 185] have achieved superior visual reconstruction quality at ultra-low bitrates (<0.1 bpp) compared to the state-of-the-art VVC in the specific application domain.
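To illustrate the fusion step, the sketch below applies a generic adaptive instance normalization (AdaIN): the structure-branch feature map is normalized per channel and then re-scaled and shifted with statistics derived from the texture code. The tensor sizes and the small mapping network are hypothetical and do not reproduce the exact fusion module of [180].

```python
import torch
import torch.nn as nn

def adain(structure_feat, texture_stats, eps=1e-5):
    """structure_feat: (B, C, H, W); texture_stats: (B, 2*C) -> per-channel scale and bias."""
    b, c, _, _ = structure_feat.shape
    mean = structure_feat.mean(dim=(2, 3), keepdim=True)
    std = structure_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (structure_feat - mean) / std             # wash out the original statistics
    scale, bias = texture_stats.view(b, 2, c, 1, 1).unbind(dim=1)
    return normalized * scale + bias                        # inject texture statistics

# The texture latent is mapped to AdaIN parameters by a small MLP (illustrative sizes).
texture_code = torch.randn(1, 64)                # low-dimensional texture latent
to_stats = nn.Linear(64, 2 * 128)                # 128 channels in this fusion layer
fused = adain(torch.randn(1, 128, 32, 32), to_stats(texture_code))
```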

Figure 5 Stylized illustration of the typical conceptual coding

Since conceptual coding methods pursue visually convincing reconstruction results with minimal bitrate consumption, the LPIPS metric [150] is usually selected as the quantitative perceptual distortion measure in addition to user studies. In our previously established benchmark [186], this metric has been shown to correlate highly with human visual perception rather than signal fidelity. For performance comparison, the rate-distortion performance in terms of LPIPS of VVC, the typical end-to-end learned image coding method [123] (E2E), and our proposed conceptual coding methods LCIC [180] and SPM [181] in the low bit-rate range over the FFHQ [187] and ADE20K [188] outdoor testing sets is displayed in Table 3. The results demonstrate that conceptual coding methods are capable of achieving higher visual reconstruction quality in specific domains than signal-based compression methods at extremely low bitrates. Moreover, as observed, LCIC behaves less effectively on the more challenging content of ADE20K than on FFHQ, which consists of regular facial semantic regions. In contrast, SPM achieves remarkable improvements in reconstruction quality on challenging scenes with diverse semantic regions and textures, verifying the effectiveness of the proposed semantic prior modeling mechanism. Moreover, in terms of LPIPS over the ADE20K outdoor testing set, the rate-distortion curves of VVC, SPM [181], and the most recent work CCL [185] are shown in Fig. 6. The comparison results verify the improvement in reconstruction quality brought by the proposed consistency-contrast learning method. Compared to previous works, the proposed conceptual image coding demonstrates its superiority in efficient visual representation learning, high-efficiency image compression (<0.1 bpp), better visual reconstruction quality, and intelligent visual applications (e.g., manipulation and analysis).

Figure 6 The rate-distortion curves of SPM [181], CCL [185] and VVC. A lower LPIPS indicates better quality

Table 3 The quantitative results of VVC, E2E [123], and our proposed conceptual coding methods LCIC [180] and SPM [181] on the FFHQ and ADE20K outdoor testing sets. LPIPS is selected as the distortion metric

4.2 Generative video coding

Due to the powerful capability of deep generative models, many approaches [134] map video sequences into latent representations and formulate the framework through generative networks to achieve low-bitrate compression. Based on image animation models such as FOMM [189], Konuko et al. [190] developed a generative compression framework for video conferencing. Wang et al. [138] also proposed a neural talking-head video synthesis model for video conferencing that adaptively extracts 3D keypoints from the input videos, achieving the same visual quality as H.264/AVC [191] with only one-tenth of the bandwidth. Nevertheless, designing a video compression framework targeting high visual quality under extreme compression ratios (e.g., 1000 times) remains an open problem.

Motivated by recent attempts at layered conceptual image compression, we made the first attempt to utilize disentangled visual representations for extreme human body video compression, DHVC [139]. On the encoder side, the input video sequence is disentangled into structure and texture representations for efficient compression. A pre-trained structure encoder is adopted to estimate the human pose keypoints of each frame. Similar to motion vectors in traditional video codecs, the displacements of the keypoint coordinates are computed as a feature representing the motion information between two frames. For bitrate saving, only the structure code of the first frame and the motion codes of subsequent frames are transmitted during encoding. In parallel, a texture encoder maps the first frame into a semantic-level texture code that represents the texture information of the input video sequence. To ensure texture consistency across all frames, we introduce contrastive learning [192] to align the texture representations. On the decoder side, the structure codes are reconstructed iteratively, while the generator restores the video from the texture codes and structure codes. Finally, entropy estimation of the texture codes is introduced to establish rate-distortion optimization together with contrastive learning for end-to-end training of the framework, promoting bitrate saving and better reconstruction.
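A minimal sketch of the structure/motion side of this encoder is given below: pose keypoints are extracted per frame, the first frame's keypoints serve as the structure code, and subsequent frames are represented only by keypoint displacements, analogous to motion vectors. The `pose_estimator` callable stands in for a pre-trained keypoint network and the data layout is an assumption, not the exact interface of DHVC [139].

```python
import numpy as np

def encode_structure_stream(frames, pose_estimator):
    """Send the first frame's keypoints once, then only per-frame displacements."""
    keypoints = [pose_estimator(f) for f in frames]        # each: (K, 2) array of (x, y)
    structure_code = keypoints[0]                           # transmitted once per sequence
    motion_codes = [keypoints[t] - keypoints[t - 1]         # displacement = motion code
                    for t in range(1, len(keypoints))]
    return structure_code, motion_codes

def decode_structure_stream(structure_code, motion_codes):
    """Iteratively re-accumulate displacements to recover per-frame keypoints,
    which the generator then combines with the texture code."""
    keypoints = [structure_code]
    for delta in motion_codes:
        keypoints.append(keypoints[-1] + delta)
    return keypoints
```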

As depicted in Fig. 7, the main structure information of the human body can be efficiently represented by human pose keypoints. A pre-trained pose estimator [193] is employed as the structure encoder \(E_{s}\) to extract the structure information of each frame as the compact structure code. The texture encoder \(E_{t}\) aims to map image frames into texture representations. To better capture the texture details of each frame, we adopt the decomposed component encoding (DCE) module [194] for semantic-aware texture code embedding.

Figure 7 The proposed pipeline using disentangled visual representation for video compression

To ensure the texture consistency of all frames in the same video, contrastive learning [192] is introduced for training the texture encoder \(E_{t}\). Instead of using augmentations to build positive samples, frames from the same video naturally serve as positive samples, while frames from different videos are regarded as negative samples. Moreover, the framework performs contrastive learning at the semantic level and computes the semantic-wise infoNCE loss [195] with Eq. (1),

$$ \mathcal{L}_{cst} = -\sum_{i=1}^{L}{ \log} \frac{\exp(t_{i}\cdot{t_{i}^{+}}/\tau )}{\sum_{j=1}^{Q}{\exp(t_{i}\cdot{t_{ij}^{-}}/\tau )}} , $$
(1)

where \(t_{i}\), \(t_{i}^{+}\), \(t_{ij}^{-}\), τ, L, and Q denote a semantic-wise texture part of the input frame, the corresponding part of another frame in the same video, the parts of frames from different videos, a temperature parameter, the number of semantic regions of the image, and the length of the negative set, respectively. This technique enables the encoder to utilize both the similarity of the positive pair \((t_{i}, t_{i}^{+})\) and the dissimilarity of the negative pairs \((t_{i}, t_{ij}^{-})\). Following MoCo [192], a queue is used to store the negative samples \(t_{ij}^{-}\) of previous input frames. In this way, the module conducts contrastive learning efficiently with small batch sizes.
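The loss in Eq. (1) can be computed as in the sketch below, assuming the texture codes are already split into L semantic parts and a MoCo-style queue of negatives is maintained; following common InfoNCE implementations, the positive logit is also included in the denominator and the loss is averaged rather than summed over semantic parts, and the tensor shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_infonce(t, t_pos, neg_queue, tau=0.07):
    """t, t_pos: (L, D) texture parts of two frames from the same video;
    neg_queue: (L, Q, D) queued parts of frames from other videos."""
    t = F.normalize(t, dim=-1)
    t_pos = F.normalize(t_pos, dim=-1)
    neg_queue = F.normalize(neg_queue, dim=-1)
    pos_logits = (t * t_pos).sum(dim=-1, keepdim=True) / tau       # (L, 1)
    neg_logits = torch.einsum("ld,lqd->lq", t, neg_queue) / tau    # (L, Q)
    logits = torch.cat([pos_logits, neg_logits], dim=1)            # (L, 1 + Q)
    labels = torch.zeros(t.shape[0], dtype=torch.long)             # positive is index 0
    # Cross-entropy over (positive, negatives) = InfoNCE, averaged over the L parts.
    return F.cross_entropy(logits, labels)
```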

For compression comparisons, the average LPIPS and DISTS results on the Fashion and TaichiHD datasets are shown in Table 4. Note that the bitrate of the compared methods is adjusted to be slightly higher than that of our method. Nevertheless, the proposed framework outperforms all other compression frameworks, with the lowest LPIPS and DISTS scores at ultra-low bitrates. Moreover, the quantitative results in Table 4 further validate that integrating contrastive learning yields better visual quality. In general, our method achieves superior visual quality compared to previous methods due to its disentangled texture and structure representations, producing sharper results with more details retained, such as facial features and intricate backgrounds.

Table 4 Comparisons with state-of-the-art video compression methods. Lower scores represent better visual quality. “w/o c.” denotes the proposed model without the proposed contrastive learning techniques

4.3 Cross-modal coding

Conceptual compression frameworks encode images into representations, such as latent variables extracted from deep neural networks, that are not human-comprehensible. Human-comprehensible representations, such as text, sketches, semantic maps, and attributes, are significant for various applications, such as semantic monitoring and human-centered applications. Semantic monitoring aims to monitor semantic information, such as identities, human traffic, or car traffic, rather than the raw signal or latent variables. Human-centered applications aim to directly convey the human-comprehensible information of visual data to human users. Therefore, we proposed cross-modal compression (CMC) [197], which takes a step forward by transforming highly redundant visual data into a compact, human-comprehensible representation with ultra-high compression ratios.

We proposed a CMC framework, as illustrated in Fig. 8, which consists of four submodules: CMC encoder, CMC decoder, compression domain encoder, and compression domain decoder. The compression procedure likewise consists of four steps. First, the CMC encoder compresses the raw signal into a compact and human-comprehensible representation. Second, the compression domain encoder encodes the representation into a bitstream in a lossless way. Third, the compression domain decoder reconstructs the representation from the bitstream in a lossless way. Finally, the CMC decoder reconstructs the signal from the representation with semantic consistency. The bitrate is optimized by finding a compact compression domain, while the distortion is optimized by preserving the semantics in the CMC encoder and decoder.

Figure 8 Illustration of the cross-modal compression (CMC) framework

Under such a framework, we further introduce a paradigm. With the recent advances in image captioning [198] and text-guided image generation [199], generating high-quality text from images and generating high-quality images from text have become more feasible. Therefore, we built an efficient image-text-image CMC paradigm, where images are compressed into the text domain, which is compact, common, and human-comprehensible. Specifically, a classical CNN-RNN model [198] is adopted as the CMC encoder to compress the image into text, where the image feature is extracted by a CNN with the image as input and fed to an RNN to generate the text in an autoregressive way. Huffman coding [1] can be used as the compression domain encoder/decoder to reduce the statistical redundancy of the text in a lossless way. AttnGAN [199] is used as the CMC decoder to reconstruct images from the text due to its promising performance on text-to-image generation. The effectiveness of CMC is verified via various experiments on several datasets, and the model has achieved encouraging reconstruction results with an ultra-high compression ratio (4000–7000 times), showing better compression performance than the widely used JPEG baseline [200].
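As a concrete view of the lossless compression-domain step in this paradigm, the sketch below Huffman-codes a caption character by character, treating the captioning and text-to-image models as black boxes; the example caption is hypothetical.

```python
import heapq
from collections import Counter

def build_huffman_code(text):
    """Build a character-level Huffman code table for the given text."""
    heap = [[freq, i, {ch: ""}] for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes with 0/1.
        merged = {ch: "0" + code for ch, code in lo[2].items()}
        merged.update({ch: "1" + code for ch, code in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], counter, merged])
        counter += 1
    return heap[0][2]                        # char -> bitstring table

caption = "a man riding a wave on top of a surfboard"   # hypothetical CMC encoder output
code = build_huffman_code(caption)
bitstream = "".join(code[ch] for ch in caption)
print(len(bitstream), "bits for", len(caption), "characters")
```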

5 Open discussion

Considering the rapid growth of intelligent video coding, it is expected that a more advanced and insightful model will be developed in the near future, further facilitating the coding and representation efficiency of visual signals. Nevertheless, the field of intelligent video coding poses many new research challenges. Below are a few evolving and significant challenges that need to be addressed.

Domain and profiling

There is considerable discussion in the video coding standards community regarding the definition of interoperability and conformance testing. To enable intelligent-video-coding-compliant terminals and systems to decode latent representations without ambiguity, it is necessary to standardize them by defining the appropriate rules and assigning them to syntax elements. At the system level, structural, semantic, and textural representations should be parsed correctly by compatible structural, semantic, or textural decoders. Meanwhile, intelligent-video-coding-compliant networks should be able to understand and process the meanings of the latent representations at the intelligent model level. However, visualizing or analyzing bitstreams of highly compact latent representations poses a considerable challenge in assessing the semantic conformance of existing intelligent video codecs. As such, the introduction of profiles may contribute to defining unambiguous conformance procedures and ensuring interoperability for intelligent video coding. Video coding standards have used profiles and levels to define tool sets with a restricted level of complexity suitable for specific applications. Similarly, intelligent video coding requires different subsets of latent representations for different applications. Some specialized applications may also need restrictions or extensions of the latents. In this regard, how the standard should support extensions and specialization in specific domains while ensuring unambiguous conformance validation is a critical issue that requires nontrivial effort.

Data security

In the context of intelligent video coding, latent representations derived from networks involving signal information can be used to reconstruct the entire video stream. Such representations, however, are not encrypted, and therefore pose the risk of sensitive information leakage. As such, trustworthy and robust coding network design plays a central role in real-world applications.

Representation interpretability

To enhance the supporting ability for downstream tasks using the compressed data, it is important to develop latent representations that are highly interpretable. By using such representations, it becomes possible to apply interactive coding techniques, which can enable a range of novel applications such as content editing and immersive interaction. This opens up new opportunities for compression-based approaches to provide versatile features and functionalities beyond traditional video compression methods.

Generalization ability

When standardized coding methods and technologies are ready for implementation and deployment, it becomes crucial to identify the path that intelligent video coding should follow to gain entry into practical application domains while meeting the objective that such codecs satisfy versatile requirements. For example, some intelligent video codecs trained for outdoor scenes might not be an ideal choice for coding facial images, and it is not practical to employ multiple models for scene adaptation. Furthermore, active efforts to harmonize the intelligent video coding standard with other media data standards will facilitate and expedite its adoption in practical domains (e.g., short video on mobile devices and immersive media applications).

6 Conclusion

Intelligent video compression provides a comprehensive suite of tools for compactly representing visual media with the capability of describing intrinsic semantics, which also has the potential to revolutionize current and future multimedia coding applications. In particular, such methods include latent codes describing the structure, semantics, or motion of the visual data, which facilitate efficient editing, analysis, reconstruction of, and access to the decoded data. In addition, the extracted latent codes can also describe content preferences and support on-the-fly manipulation and transfer of customized content and styles. In this review, the development roadmap of intelligent video coding has been revisited, along with the methodology for describing the structure and semantics of video data. Furthermore, the paper presents three potential research directions (conceptual coding, generative coding, and cross-modal coding) that could provide promising solutions for future visual media coding utility and application scenarios. As a final point, a few evolving and significant challenges regarding the deployment of intelligent video coding in practical real-world scenarios have been discussed.