Efficient Visual Recognition with Deep Neural Networks: A Survey on Recent Advances and New Directions

Visual recognition is currently one of the most important and active research areas in computer vision, pattern recognition, and even the general field of artificial intelligence. It has great fundamental importance and strong industrial needs. Deep neural networks (DNNs) have largely boosted their performances on many concrete tasks, with the help of large amounts of training data and new powerful computation resources. Though recognition accuracy is usually the first concern for new progresses, efficiency is actually rather important and sometimes critical for both academic research and industrial applications. Moreover, insightful views on the opportunities and challenges of efficiency are also highly required for the entire community. While general surveys on the efficiency issue of DNNs have been done from various perspectives, as far as we are aware, scarcely any of them focused on visual recognition systematically, and thus it is unclear which progresses are applicable to it and what else should be concerned. In this paper, we present the review of the recent advances with our suggestions on the new possible directions towards improving the efficiency of DNN-related visual recognition approaches. We investigate not only from the model but also the data point of view (which is not the case in existing surveys), and focus on three most studied data types (images, videos and points). This paper attempts to provide a systematic summary via a comprehensive survey which can serve as a valuable reference and inspire both researchers and practitioners who work on visual recognition problems.


I. INTRODUCTION
Deep neural networks (DNNs) have achieved great successes on many visual recognition tasks.They have largely improved the performances of long-lasting problems such as handwritten digit recognition [1], face recognition [2], image categorization [3], etc.They are also enabling exploring new boundaries including the studies on image and video captioning [4]- [6], body pose estimation [7], and many others.Y. Wu is with Institute for Research Initiatives, Nara Institute of Science and Technology, Takayama-cho, Ikoma, Nara 630-0192, Japan.D. Wang is with School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
However, such successes are generally conditioned on huge amounts of high-quality hand labelled training data and the recently greatly advanced computational resources.Obviously, these two conditions are usually too expensive to be satisfied in most applications which are cost-sensitive.Even for the cases that people do have enough high-quality training data thanks to the massive efforts of many annotators, it is usually a great challenge to figure out how to train an effective model with limited resources and within acceptable time.Assuming somehow the model can be properly trained (no matter how much efforts it costs), it is still not easy to have the model properly deployed for real applications at end users' side, as the run-time inference has to fit the available or affordable resources, and the running speed has to meet the actual needs which can be real-time or even more than that.Therefore, besides accuracy which is usually the biggest concern in the academia, efficiency is another important issue and in most cases an indispensable demand for real applications.
Though most of the works on using DNNs for visual recognition tasks are focusing on accuracy, there are still a lot of encouraging progresses on the efficiency side, especially in the recent few years.In the past two years, many survey papers have been published on efficiency issue for DNNs, as detailed in the following subsection I-A.However, none of them pays a major attention to visual recognition tasks, especially missing the coverage of special efforts for efficiently dealing with visual data, which has its own properties.In practice, efficient visual recognition has to be a systematic solution which takes into account of not only compact/compressed networks and hardware acceleration, but also proper handling of visual data, which may be of various types (such as images, videos, and points) with quite different properties.That can be an important reason for the lack of a survey on this topic.Therefore, as far as we are aware, this paper provides the first survey on efficient visual recognition with DNNs.It targets at a systematic overview of recent advances and trends from various aspects, based on our expertise and experiences on major types of visual data, their various recognition models, and network compression algorithms.

A. Related Surveys
There are some related surveys published recently, but their scopes and contents are significantly different from ours.
1) General introduction of DNNs and efficiency strategies for both model compression and hardware design.Sze et al. (2017) [8] provides one of the earliest surveys on general DNN models and their compression and hardware acceleration.Cheng et al. (2018) [9] covers more advances on both network compression models (mainly CNNs) and hardware acceleration, but pays significantly more attention to the former.Though its scope is rather broad, the contents may be not detailed enough for many readers.Differently, Zhang et al. (2019) [10] has a much narrower coverage which can be good for the detailed directions if focuses on, but its distinctive categorization may confuse certain audience.Deng et al. (2020) [11], one of the latest surveys, is comprehensive and professional in both DNN compression and hardware design.Conversely, understanding this survey completely is naturally a tough task.
2) General model compression and acceleration strategies (algorithm part).This group include Cheng et al. (2018) [12] which covers all major aspects of efficient DNNs but lacks the advanced contents, Lebedev et al. (2018) [13] which is mainly about CNN-based models, and Elsken et al. (2019) [14] which focuses on the specific area of automatic network architecture search (NAS).
3) Hardware-based/Distributed acceleration.Guo et al. (2019) [15] covers all aspects of FPGA-based DNN acceleration, especially on hardware-oriented model compression and efficient architectures for hardware design.The discussion is mainly on quantization and pruning, leaving tensorial methods and NAS out of its scope.Wang et al. (2019) [16] provides a more comprehensive survey of custom hardware oriented accelerators, including approximation methods for both convolutional and recurrent neural networks, but some promising approaches (e.g., neural architecture search) are still uncovered.
4) Task-specific DNN models.Recently, a few surveys are focusing on the progresses on specific tasks, such as 3D data representation (Ahmed et al., 2019) [17], texture representation (Liu et al., 2019) [18], generic object detection (Liu et al., 2019) [19].Though they have conducted comprehensive reviews on the existing models for such specific tasks, which is very valuable for understanding the progresses on the model development side, the efficiency issue is unfortunately not their focus and thus lacks sufficient coverage and in-depth analysis.

B. Contributions and Organization
By contrast, this survey mainly focuses on the global efficiency of the production line from the raw visual data to the final recognition results, and it is expected to help the readers who are interested in the modern visual recognition tasks and their efficient DNN-based solutions.This paper contributes in the following aspects, which are also its novelties to the best of our knowledge.
1) A systematic survey of existing advances on efficient visual recognition approaches with DNNs, which is the first of its kind, as far as we are aware.
2) The first summary on data related issues for efficient visual recognition, including data compression, data selection, and data representation.
3) A new investigation of network compression models in the perspective of benefiting visual recognition tasks.4) A review of acceleration approaches for run-time inference and model generalization in the scope of efficient visual recognition.5) Insightful discussions on challenges, opportunities and new directions in efficient visual recognition with DNNs.
For clearly making the vein of this survey, Fig. 1 is drawn here as the blueprint of the organization.Specifically, in Section II, we introduce the three main data types commonly concerned in visual recognition problems and discuss about their properties and the challenges related to each of them.Section III reviews the efforts on three aspects before the actual recognition part: data compression, data selection, and data representation.Section IV briefly introduces and analyzes the widely studied directions for network compression, within the scope of visual recognition.Section V provides a summary of the recent progresses on efficient model generalization and fast inference at the testing phase, which are very important for real deployment of DNN-based visual recognition systems.Finally, Section VI overviews all efforts to generate a clear overall mapping, and discusses about some important uncovered aspects and new research directions.

II. VISUAL RECOGNITION: DATA TYPES, CHALLENGES AND PRELIMINARIES
Visual recognition covers several data types and a large number of detailed tasks, meanwhile, the recognition methods also include variant specific approaches.This section first introduces three most commonly concerned visual data types, i.e., image, video, and point, with their properties and recognition challenges.Then the related efforts before and on the recognition discussed in the whole paper are also simply listed as brief preliminaries for readers.

A. Data Types
Images are the most widely studied visual data, probably due to their wide existence and relative simplicity of acquisition, storage, transmission, and processing.In many cases they are both efficient and sufficient for sharing information visually.By contrast, as both the capturing devices and other related infrastructures and devices get greatly advanced and popularized, videos seem to be the most informative media.They have increased dramatically and will probably be even able to replace images in most scenarios soon.Videos appear very natural to humans, so they generally contain huge redundant information in the spatio-temporal domain.Points are relatively special since the radar and depth cameras appeared late, and they are usually sparse compared with images, but one dimension of them is higher and may have very large ranges.There are mainly two types of visual point data: point clouds and 2D/3D skeletons.The most valuable advantage of points is that they contain geometry information so the shape is the main information for recognition tasks.

B. Challenges
It is clear that different data types have different characteristics, e.g., images: classical and efficient; videos: abundant informative and spatio-temporal information; points: sparse, high dimension and shape information.Thus the challenges they must deal with in recognition tasks should also be discussed.
1) Images: Larger scale and deeper understanding: There are two clear trends on image-based recognition with DNNs.The first is that the scale of processed data increases quickly.As shown in Fig. 2, there is a clear history and trend that the benchmark dataset for developing new models have been shifting to larger scales and wilder contents (from MNIST [1] to CIFAR-10/CIFAR-100 [20], and to ImageNet [21]).With larger and more diverse training data, the trained DNN models can do more challenging recognition tasks (sometimes with the help of transfer learning), but it also brings greater challenges in efficient computation.The second is that the recognition is going toward deeper understanding and richer results.Traditionally, classification or categorization is most commonly concerned, but recently a lot of efforts and progresses have gone far beyond that, spreading to many tasks including detection, attribute extraction, keypoint/pose estimation, semantic segmentation, image captioning, visual question answering, and even to the visual genome extraction, as shown in Fig. 3.Such a trend greatly extends the research area and has attracted wider and stronger interests from both academia and industry.While new performance records are made on different tasks in much shorter time, the demands on exploring proper acceleration approaches become greater than ever, especially from the industry side which is eager to apply the latest models to various real scenarios.
2) Videos: Information Richness and Sparsity: Information richness is the natural challenge of videos compared with images.In contrast, some of the visual challenges for images, such as occlusions and static background clutters, may get alleviated in videos when the motion information in the videos is properly treated.However, videos have their own particular challenges.A major one is the sparsity of relevant information, as shown in Fig. 4. How to extract such sparse information from large amount of data without getting confused is a great challenge.Meanwhile, the bigger data size (compared with images) and in many cases the needs for realtime or even faster processing have made the efficiency issue even more important.Note that, most of the new recognition tasks for images also exist for videos, which require extra efforts for specific strategies on acceleration.
3) Points: Geometry and Efficiency: Though not as popular as images and videos, point data also play an important role in visual recognition.Except for the point clouds obtained from radar or depth camera (depth images can be easily turned into point clouds) for 3D object recognition [22], [23], and 2D/3D skeletons used for action recognition [24]- [28], 3D CAD models have also been used for visual recognition, such as the ShapeNet dataset [29].In a general sense, they can also be regarded as point data, as the CAD polygonal models can be tuned into voxels and point clouds when needed.Therefore, they are also included in the data samples as shown in Fig. 5. Since the geometry information in points is usually very informative for recognition, how to maintain such 3D geometry information whilst improving the efficiency of models is critical and is not easy as the dimensional range might be very large.

C. Preliminaries
In view of that varieties of methods will be introduced and discussed in the following content of this survey, it is necessary to make brief preliminaries to show most of the representative methods with their characteristics.Hence, Table I is designed to afford a clear glance from data processing (Section III) to real deployment (Section V), and the location of each method is also given.Please note that some trivial practices can not be exhibited here due to space limitation.

III. BEFORE RECOGNITION: EFFORTS ON THE DATA
Visual data have their own properties, which can be made use of when efficient recognition models are designed.Much efforts can be made to change/map the data to a more compact form before applying the recognition models.These efforts may be grouped into three categories based on their functionalities: data compression (reducing redundancy and irrelevance), data selection (reducing irrelevance), and data representation (increasing compactness).In this section, we overview and summarize existing progresses in all these three aspects, and organize the contents for each of them according to their related data types.By doing so, their motivations and strategies can be easily understood and searched for.

A. Data Compression
As datasets and networks grow in size, the great memory consumption and high computational complexity have made the training of deep neural networks a challenge and hindered the popularity of AI.Specifically, for deep learning, there are four types of consumption (as shown in Fig. 6) due to large data sets: a) storage consumption, most of the commonly used datasets are now hundreds of gigabytes in size, the situation of which puts a lot of pressure on storage hardware; b) transmission consumption, the transmission of large amounts of image/video data can be a challenge to network bandwidth; c) memory consumption, when training a network, usually the larger the batch size (the amount of data fed into the network each time) is, the better the performance is, and thus a larger memory is always desired; d) computing consumption, DNNs usually need to rely on powerful computing resources (e.g. a GPU), especially when large datasets have to be handled, and such a demand has a rising trend.Increasing data size may lead to a better recognition performance, but it can also increase all these four types of consumption.
To reduce the consumption needs, the commonly used and also very important and effective strategy is to compress the data, which can be done before the recognition task and it is relatively less task-dependent.

1) Image compression:
Image compression and image recognition can be linked together to save the consumption on image decompression, Images SIFT [30], DCT domain [31], Wavelet transform [32] Greatly optimized and easy to be used directly Section III-A1 Encoder-decoder networks [33]- [36] Ensure both the relevant information and speed Section III-A1 Videos H.265 [37] and H.264 [38] Can not be optimized end-to-end Section III-A2 Coding by CNN [39], Kalman filtering network [40], frames reconstructed by CNN+LSTM [41] Must restore the compressed data to a raw video format and bring lots of extra cost Section III-A2 Deep feature flow [42], direct training [43] End-to-end and real-time process Section III-A2 Points Auto-encoder-based geometry codec [44]- [46] Aiming at geometric characteristics Section III-A3

Points
Sampling with modeling attention [61]- [63] Superiority over traditional or neural methods Section III-B3

Generalization via transfer learning
Fine-tuning Transfer the learned knowledge of the source task to a related task with domain adaptation [118]- [122] Pre-trained models with fine-tuning is generally better than training-from-scratch for most target tasks with smaller-scale datasets Section V-B Fast transfer Adapt the pre-trained model to the target task together with compression [123]- [125] Learning fewer parameters during transferring is more efficient than naive full model fine-tuning Section V-B Fig. 6: Major types of consumption in deep learning which has been proved to be more efficient and in many cases can be made to be more effective as well.A general framework for decompression-free joint compression and recognition is shown in Fig. 7, which illustrates both training phase and test/inference phase.While the recognition module in the framework is generally DNN-based in the scope of this survey, the compression module can be either hand-crafted or learning-based.Though many researchers may expect a purely learning-based (DNN-based) framework, the current reality is that the sophisticated hand-crafted image compression models are still more popular.It is not easy to invent a new DNNbased compression model that outperforms the greatly optimized hand-crafted models.For researchers who know much more about recognition than compression, usually they tend to keep the compression module as it is and put more efforts on how to make their recognition modules better to receive the compressed data.Therefore, in this subsection, this survey still starts from introducing the hand-crafted compression models before going to the learning-based models, but putting more efforts on the later to encourage more future researches on inventing better purely learning-based solutions.
Hand-crafted Lossy Compression.Although several more complicated compression standards have been developed including JPEG, WebP and BPG, JPEG remains the most widely used one for lossy image compression.Recently, a lot of interesting works [30]- [32], [126]- [128] have been developed to learn the feature distribution directly from compressed data.Wu et al. [30] formulate the popular SIFT feature extraction in JPEG's DCT domain, which indicates that effective visual features can be directly extracted from the compressed data.Ehrlich et al. [126] introduce a general method to learn a Residual Network in the JPEG transform domain.Liu et al. [127] combine compression and recognition based on JPEG's transformation, and it has been proved by experiments that it could achieve 3.5× compression rate improvement, while its consumption is only 30% of the conventional JPEG without classification accuracy degradation.In [31], it is proposed to learn DNN parameters directly from the DCT coefficient of JPEG compression, which is 1.77× faster than ResNet-50 at the same accuracy.Javed et al. [128] propose a method for reviewing document images, which is directly based on the compressed documents data.In [32], Haar wavelet transform is utilized to compress and decompress high-resolution iris images, enabling fast and accurate iris recognition.
Learning Based Lossy Compression.Although traditional compression algorithms are carefully constructed, there is still room for the improvement in compression efficiency.For example, the conversions are fixed and cannot be adaptive to fit different inputs.In addition, pre-defined quantization implementations can result in data redundancy.Moreover, limited by manual design, the algorithms are usually hard to be optimized for a specific metric, even if the metric is a perfect assessment of image reconstruction quality.Therefore, more attentions have been paid to learning based compression methods in recent years.Different from traditional methods, in a learning-based approach, the parameters of the neural network are automatically learned from a large amount of data by definite optimization objectives to deal with specific situations.A general pipeline is that the input image x is firstly processed by the analysis network (encoder) to generate compressed feature-maps, which are then converted into a set of bit-streams by quantization and lossless arithmetic coding.
After that, they are used to generate a recovered image x by a re-factoring network (decoder), and the entire network is trained end-to-end until convergence.On the basis of this pipeline, many excellent image compression methods [33]- [36] have been proposed.Joint Compression and Recognition.Stimulated by the tremendous memory reduction caused by compression algorithms, some methods [129]- [132] combine compression with recognition to improve effectiveness and efficiency.Specifically, in order to save the decompression consumption, a well learned analysis network can be directly cascaded with a downstream recognition network as shown in Fig. 8.Such a joint optimization not only tries to retain as much classification relevant information as possible during compression, but also accelerates the speed of inference and optimizes the consumption of training compared with models directly trained on the input images.Detailed models can be found in [129]- [132].
2) Video compression: As the most common media, video was said to take more than 70% of all Internet traffics according to the white paper of "Cisco Visual Networking Index: Forecast and Methodology, 2016-2021", and now the percentage is probably even greater.In the past few years, many representative progresses [133]- [137] have been made in video-based recognition.These methods, however, all focus on designing a special neural network for analyzing frames, ignoring the fact that videos are in a compressed format during transmission and storage.Therefore, extra time and storage are needed for decompression before the analysis.In order to improve the effectiveness and efficiency of video recognition, it is necessary to apply efficient compression and decompression beforehand or directly use compressed data for recognition.
Compression Algorithms.As the most popular video compression algorithms, some highly efficient compression standards such as HEVC(H.265)[37] and AVC(H.264)[38] have been in use for a long time.Taking the H.264 algorithm as an example, three kinds of frames are defined in the encoding protocol.The fully encoded frame is called I-frame (key frame), and the frame containing only the difference partial encoding generated by the I-frame is called P-frame.The highly compressed frame obtained by using both previous and forward frames for data reference is called B-frame.There are two core algorithms used by H.264: intra-frame compression and inter-frame compression.Among them, intraframe compression is an algorithm for generating I-frames, and inter-frame compression can generate highly compressed B-frames and P-frames.
To achieve inter-frame compression, H.264 relies on many handcrafted modules, such as DCT transform module, blockbased motion estimation module, and motion compensation module.Although these modules are well designed, they are not optimized end-to-end.
Recently, several DNN-based video compression methods Fig. 8: The input image is represented as a series of compressed feature maps, and subsequent recognition network learns classification information from these feature maps.
have been proposed for intra prediction & residual coding [39], post-processing for predicted frames [40], inter-frame interpolation [41], and full network-based video compression [138], [139].Chen et al. [39] design two convolutional neural networks to encode predicted images and residual images, respectively, and the reliable experimental results prove that deep neural networks can achieve better results than handcrafted modules.Lu et al. [40] model the video artifact reduction task as a Kalman filtering procedure and restore decoded frames through a deep Kalman filtering network.By constructing a recursive filtering scheme based on the Kalman model, more accurate time information can be used to obtain better reconstruction quality.Wu et al. [41] regard the video compression challenge as a repeated image interpolation challenge, so that the remaining frames can be reconstructed from the key frames through an interpolation reconstruction network.In addition to this, their algorithm provides a compressible code to disambiguate different interpolations and encode key frames as faithfully as possible.However, although the interpolation network is end-to-end optimized, motion information still requires additional calculations, which depend in part on other algorithms.In [138], an end-toend video compression deep model that jointly optimizes all the components for video compression has been proposed.Specifically, the optical flow estimation network is used to obtain motion information, and the compressed network is used to compress both motion information and residuals.These two different networks are jointly learned through end-to-end optimization.Habibian et al. [139] present a depth generation model for lossy video compression, which consists of a threedimensional automatic encoder with discrete potential space and an autoregressive prior for entropy coding.Self-encoder and transcendental encoder are trained jointly to achieve the best rate-distortion curve.This method is superior to the latest learning video compression network based on motion compensation or interpolation.In addition, three extension directions are proposed: semantic compression, adaptive compression and multimodal compression.These directions mentioned above will undoubtedly lead to new video compression applications, which may be realized by DNNs but not classical codecs.
Utilizing Compressed Data.Due to the time redundancy and the enormous size of data streams, there is a large amount of redundant or irrelevant information in the video data, which makes learning neural networks difficult and slow.Therefore, most video recognition algorithms use a video compression algorithm such as H.264 as pre-processing, which can reduce superfluous information by two orders of magnitude.In recognition, the mainstream method is to restore the compressed data to a raw video format and then handle each frame as a RGB image.Considering that neural networks can learn features from data, the process of decoding compressed data may be skipped.Zhu et al. [42] present a fast and accurate video recognition framework namely deep feature flow.It extracts depth features on key frames (I-frames) through a convolutional network and maps them to other frames for auxiliary prediction through the flow field.In this process, significant efficiency gains can be achieved by reducing the amount of computation.In [43], a network trained directly on compressed videos is proposed, as shown in Fig. 9.The benefits of this design are threefold.Firstly, the compressed video representation removes large amounts of redundant information and preserves useful motion vectors.Secondly, compressed video representations are more convenient for exploring video correlation than individual images.Finally, such an approach is more efficient because only informative signals are processed rather than near-duplicates, and the efficiency can also be improved by skipping the steps of decoding the video as the video is stored in a compressed version.
In general, video compression technology has already played a significant role in video-based recognition tasks, no matter whether it is about a traditional video compression standard or a network-based learning algorithm.In most useroriented application scenarios, such as public security monitoring and real-time scene replacement, real-time processing is usually a must and stability of the algorithm is a desire.In those cases, video compression has been proved to have great importance and application potential.
3) Point Cloud Compression: Different from image/video data, point cloud contains more complex structures, which make the compression much more difficult.The most efficient way of compressing the point cloud is to utilize its geometric characteristics.Quach et al. [140] present a novel data-driven geometry compression method for static point clouds based on learned convolutional transforms and uniform quantization.The first auto-encoderbased geometry compression codec is proposed in [44], where Fig. 9: A recently proposed video recognition network, which was directly trained on compressed videos [43].
the point cloud is treated as input rather than voxel grids or collections of images.Inspired by the great success of variational auto-encoders (VAE) in image/video compression, Wang et al. [45] present a DNN based end-to-end point cloud geometry compression framework, in which the point cloud geometry is first voxelized, scaled and partitioned into non-overlapped 3D cubes which is then fed into a 3D convolutional networks for generating the latent representation.Furthermore, Yang et al. [46] present a novel and interesting end-to-end deep auto-encoder to achieve the state-of-the-art performance for supervised learning tasks on point clouds.In this work, a graph-based enhancement is firstly enforced (in the encoder) to promote local structures and form a low dimensional codeword, which can be used to deform/fold a canonical 2D grid and then reconstruct the 3D object surface of the input point cloud.The learned encoder can be treated as a compression module, which has been proved to have great generalization abilities: experiments on major datasets showing that the folding can achieve higher classification accuracy than other unsupervised methods with significantly fewer parameters (7% parameters of a decoder with fullyconnected neural networks).The research on learning point cloud compression has just started and got tested on recognition tasks, and this new topic remains largely unexplored.Encouraged by the successes of these pioneering works, more and bigger progresses can be expected in the coming near future.

B. Data Selection
Selecting only relevant and informative parts from the raw data for recognition can significantly reduce the computational cost and may also lead to a better recognition accuracy (as disturbance by irrelevant information/noise is reduced).Due to the great differences between the data types in terms of data structure and information characteristics, the researches focus and method on them are also quite different from each other.For image data, sample selection for training is the main concern.In the case of video data, frame selection inside each video (for both training and testing) matters most as continuous video frames contain a lot of redundant information.Similarly, point data also have much redundancy and subset sampling is the main stream.Details on the motivations, strategies and methods for them are given below.
1) Image Data: Currently, DNNs are generally data hungry, namely, the more labeled training data the better performance they can achieve.However, more data also means more costs, which include the efforts for data acquisition and labeling and all the consumption (storage, transmission, memory, and computation) for learning.Given a fixed amount of training data, directly scaling up the computation (by increasing the number of parameters and/or doing more iterations) usually has a performance upper bound and more computation after that goes more towards overfitting.Such an overfitting is believed to be the result of the inherent noise (or certain redundant samples in a softer tone) in the training data [47].Therefore, reducing such noise or redundant samples shall not only saves consumption for model learning, but also have the potential to even boost the model's performance.
Subsampling the training data has recently been studied under such a motivation, and it has already shown quite promising results in the past couple of years.Since so far such researches have only concerned about image recognition and subsampling can be regarded as data selection, we introduce them here.In a more general sense, these approaches shall also be applicable or at least have the potential to be made applicable for other data types.Since DNNs prefer more data, naive random subsampling likely ends up with inferior performance.Therefore, some efforts have focused on how to get subsets better than randomly sampled ones or revealing the inequality of training samples.Core-set selection [48] and representative subset finding [49] are good examples in this direction.However, doing better than random subsampling does not guarantee no drop in performance.More recent works improve over these by identifying redundant samples for removing them without sacrificing the recognition performance.Clustering in the DNN feature space [50] shows a successful removal of 10% semantically redundant samples from CIFAR-10 and ImageNet datasets, while a slightly later work on "select via proxy" is able to push this reduction to 40% on CIFAR-10 with no performance loss with the help of three uncertainty metrics.The very recent work on Active Dataset Subsampling (ADS) [47] presents even more encouraging results: removing 50% of CIFAR-10 training samples yet even perform slightly better than training on the full dataset.Similarly, on ImageNet dataset they are able to outperform training with all data by using only 80% of it.
As shown in [47], advantages of subsampling are not limited to its abilities to maintain or even improve recognition performance whilst saving all learning consumption.The sampled subset can be effective for other model architectures which are not trained on.Moreover, the subsampling method itself (e.g.ADS) may be applicable to many tasks and various data size settings.The research in this direction has just begun and there is still large room for exploring.An important and highly valuable problem is how to do that with noisy or weak labels as collecting high-quality labels is usually a great challenge.
2) Video Data: Since the huge redundant information in the spatio-temporal domain is prime for videos, instead of performing expensive processing on every frame to approach target tasks, such as object detection and action recognition, selecting key frames and performing the major processing sparsely is a more efficient choice (see Fig. 10 for the motivation).However, how to efficiently select proper key frames remains an open issue.Since relevant and discriminative video information could be unequally located in its temporal domain, obtaining few highquality key frames without loosing those information could cost significant extra computation.How to do that efficiently is an important issue for video-based recognition tasks.
Generally, existing approaches on selecting key frames could be divided into two types.The first one is to randomly pickup key frames by predefined intervals.It is commonly applied in video recognition tasks, which may also include temporal boundary detection [51]- [54].Although this method takes the minimum computation cost, picking up frames by a large interval may miss critical information, while using a small interval could increase the post-hoc computation costs.Alternatively, in another type of approaches, the information from processed frames can be used to predict the possible key frames in the future, which could avoid to process many redundant frames and select high-quality key frames [55]- [57], [59], [60].The sampling and early stopping can be learned by an end-to-end deep reinforcement learning model as studied in [58].
3) Point Data: Generally, directly working on all the point data can be heavy and expensive for all the types of consumption mentioned earlier.Contrarily, shape or skeleton is the main object for recognition tasks.Usually, the same shape may be still recognizable with just a subset of points of what we get from the sensors, and in many cases only a small portion of the data are relevant to the recognition task.Therefore, recently quite a few researches have tried to embed data selection components/concepts in their DNN models for handling point data, to achieve a better performance in terms of effectiveness and efficiency.
The selection of point data are usually done by sampling or modeling attention.A simple strategy for sampling is called Furthest Point Sampling (FPS), which can be done efficiently [141], but FPS has significant weaknesses of being task-dependent, low-level (cannot handle semantically highlevel representations), permutation-variant, and sensitive to outliers.Recently, quite a few works have explored new sampling models for point clouds with the help of DNNs.Within them, the work "Learning to Sample" [142] introduces a neural network termed S-NET, which takes a point cloud and produces a smaller one that is optimized for a particular task.The simplified point cloud is not guaranteed to be a subset of the original point cloud, but a post-processing step is adopted to match it to a subset of the original point set.S-NET has a space consumption linearly proportional to its output point set size, and it offers a trade-off between space and inference time.As an example mentioned in the paper, cascading S-NET that samples a point cloud of 1024 to 16 points with a following PointNet [143] reduces inference time by over 90% compared to running PointNet on the complete point cloud, with only 5% increase in space and 4% decrease in recognition accuracy (much better than FPS and random sampling).Another more sophisticated model [61] proposes to do data selection from two perspectives together: a parameter-efficient Group Shuffle Attention (GSA) that does attention based embedding, and an end-to-end learnable and task-agnostic sampling operation called Gumbel Subset Sampling (GSS) for representative subset sampling (from a highdimention embedding space).Its superiority in effectiveness and efficiency has been shown on several datasets.There are also several other works on attention modeling, such as the Attentional PointNet [62] for 3D-Object detection and the Point Attention Network [63] for gesture recognition, etc. Due to the space limitation, we don't introduce more details here.Since both the research and applications on point data are now growing very rapidly, it is expected that where will be an explosive growth of interests and efforts in designing more efficient and effective recognition models.Data selection for point data are a promising area to look into.

C. Data Representation 1) Image Data:
Image representation for recognition in the DNN era is usually part of the representation learning network, not as a separate pre-recognition operation.However, when there is not enough training data (i.e., small training set) or only a small part of the training data get labeled (i.e., a semi-supervised setting), it can be an effective and also efficient choice to borrow image representation from some suitable pre-trained model (trained on some source data) or from a model trained under an unsupervised setting on the target data before the main task related training with the limited labeled data.
There are some representative references for these two cases.For the first one about borrowing representation from pre-trained models, a typical scenario is medical image analysis or recognition, where people have to face the reality of insufficient training data due to many reasons including the privacy issue.A highly cited work in the field [64] discussed about full training vs. fine-tuning (with pre-trained models) for medical imaging applications in three specialties (radiology, cardiology, and gastroenterology) on three vision tasks including classification, detection, and segmentation with extensive experiments, and concluded that the use of a pretrained CNN with adequate fine-tuning outperformed or, in the worst case, performed as well as a CNN trained from scratch, while at the same time enjoying the benefit of being more robust to the size of training sets (with the help of a layer-wise fine-tuning scheme).A similar conclusion was researched in a study on the effectiveness of using pre-trained CNNs as feature extractors for tuberculosis detection [65], where three different ways of utilizing pre-trained CNNs (with three different CNN structures) are discussed and in some cases directly using the pre-trained CNNs can even beat their fine-tuned versions.For the second case where data are sufficient but available labels are scarce, pre-training a network under the unsupervised setting, e.g. using a spatial prediction task with "Contrastive Predictive Coding", has been shown to be effective for fast further training on recognition tasks with little labeled data [66], which is called "data-efficient" as it allows task-related training on only a small account of data.Computationally, the task-related training shall be also very efficient, as it can inherit the weights from a pre-trained model which is task independent and can be obtained beforehand.Actually, besides the models trained under unsupervised setting, models trained on a large general dataset (e.g.ImageNet) were also found to be very effective for image representation for many visual tasks, superior than the well-designed hand-crafted features [67], and such direct adoption of pre-trained models has been served as a good baseline for exploring new DNN models.
Besides building image representation from a pre-trained model, changing the input image data before training a model from scratch can also be an efficient way.There is interesting finding reported by fast.AI in their MOOC "Practical Deep Learning for Coder" and also their write-up 1 on how they managed to train Imagenet to 93% accuracy in just 18 minutes (using 128 NVIDIA V100 GPUs): "progressive resizing", namely, using small images for initializing training and then gradually increasing the image size as training progresses, seems to be good for making faster progress on the training and thus saving training time.This raw image presentation strategy seems to be reasonable and effective, and it has been recommended by several other websites, though there might be still no any formal academic publication introducing it or justifying it.Further study in this direction can be valuable for both research and applications.
2) Video Data: Data representation for efficient video-based recognition has two representative trends in recent years.Both of them are very new and look rather promising.
One is decomposing video sequences into individual frames for frame-based representation and then represent the motion information by exploring temporal correlations among the 1 https://www.fast.ai/2018/08/10/fastai-diu-imagenet/high-level features of frames.This is contradictory to the slightly earlier work (Carreira and Zisserman, CVPR 2017 [52]) on the Inception 3D (I3D) architecture which is about 3DCNN.Wu et.al. [68] proposed to apply 2DCNN on individual video frames and then do computationally highly efficient 1D temporal convolution on the extracted 2DCNN features, which is both more effective and much more efficient than 3DCNN models for the task of video-based person reidentification.Later, Xie et.al. [53] replaced the 3D convolutions at the bottom of the 3D CNN network by low-cost 2D convolutions and temporal convolution on the high-level "semantic" features (outputs of those 2D convolutions), and also got both better performance and faster speed for action classification.Zhao et.al. [144] further found that making the temporal convolution along the feature trajectories so that the representation can be robust to deformations, and thus they got improved accuracy on action recognition.
The other representative trend is using pooling to generate a super compact and sparse representation for videos before feeding into recognition models.A very successful example in this direction is "dynamic image" [69], [70], which is generated by an efficient and effective approximate rank pooling operator, turning a whole video into just one single highly informative image.The dynamic image can simultaneously capture foreground appearance and temporal evolution information, while at the same time excluding irrelevant background appearance information.Therefore, even a simple 2DCNN model built on top of it can generate superior RGB video recognition results.Soon, the model got extended for RGB-D video based activity recognition and showed good results [145], and very recently it has also been used for generating multi-view dynamic images for the task of depthvideo based action recognition [146].
3) Point Data: To perform visual recognition tasks on point data, traditional works apply CNN with the same dimension as point data [147]- [149], which come with huge computation costs, as shown in Fig. 11(a).Compared with RGB data, however, the point data might inherently be sparse.Motivated by such a property, computational-efficient networks are developed.A general approach is to align the coordinates dimension to the CNN channel dimension so that one dimension is reduced compared with the original point data, as shown in Fig. 11(b).Nonetheless, such an approach arises a problem: the indexadjacent points could be locally uncorrelated in spatial domain no matter how to assign the point indices.Since the RNN/CNN inherently assumes local correlation exists, it is inappropriate to directly process locally uncorrelated features.Therefore, it is preferred to model all points simultaneously in the network, and let the network automatically learn the proper relationship between them [71].
HCN [72] and PointNet [143] are among the most efficient and superior networks in skeleton-based action recognition and point-cloud-based object recognition, respectively.Although they may look different at the first glance, they all align the coordinates dimension to the CNN channel dimension to make the computation efficient and simultaneously model all points to avoid the influence of point orders.IV.ON RECOGNITION: NETWORK COMPRESSION DNNs are always redundant in most cases, thus the great potential ability of compression is inherent in this topic [150].Because of some internal high dimensions of raw visual data, corresponding neural networks are redundant consequentially.In this section, four approaches of network compression in the field of visual recognition are introduced: 1) compact networks; 2) tensor decomposition; 3) data quantization; and 4) pruning.These approaches can make huge visual recognition networks efficient.Furthermore, some joint compression practices which synthesize multiple specific approaches are also presented.

A. Compact Networks 1) Compact CNNs:
In fact, CNNs are born to deal with visual recognition, and the key natural feature of CNNs is the weights sharing which can be regarded as the earliest practice of compact networks to match the data structure of images.Thus, based on the characteristics of visual recognition tasks, further efforts have been proposed to make CNNs more compact to reduce the ever growing network size [76], [151], [152].In general, most compact design for CNNs can be concluded to two perspectives, one of which is based on the receptive field of filters, and the other one is based on the topology within single convolutional layer (intra-layer) or between convolutional layers (inter-layer).Besides, some more crazy ideas, which invent alternative building blocks for reducing the parameters to learn, appear to be another novel aspect.Aspect of Receptive Field.It is clear that designing effective receptive field [153] is crucial to the representation capability of convolutions, which is jointly determined by the filter size and filter pattern.In the aspect of filter size, VGG-Net [154] proposes a stack of two 3 × 3 convolutional layers to replace a 5 × 5 convolutional layer because their receptive fileds are equivalent.In contrast, more practices are in the aspect of filter pattern.For instance, an n × n convolutional layer can be split into two chained layers which have an n × 1 layer ahead and a 1 × n layer behind [73], [155], atrous or dilated convolution [74], [156], [157] uses irregular filters with holes, and deformable convolution [158] generalizes the atrous convolution to learn the offsets of sampling directly from the target tasks.Aspect of Topology.The topology art of changing connections in CNNs can bring more ability of expression or lighter convolutional units.Since the well known NIN [159] was proposed, it has been aware that vanilla convolutional kernel may be redundant especially for the situation of multiple stacked convolutional kernels.Besides, Inception [160] and ResNet [161], [162] are proposed to enhance the performance of CNNs, which inspired many researchers to think about efficient convolutional units under the well designed topology, e.g., SqueezeNet [75] uses amounts of 1 × 1 convolutions to replace 3 × 3 convolutions and reduce the counts of channels in the rest 3 × 3 convolutions.The residual unit has led to relatively more results of efficient structures, e.g., bottleneck architecture [73] and a similar one with depthwise convolution [163] in ShuffleNet [76], MobileNetV2 [164] and MobileFaceNet [165].that are fixed to replace the traditional convolutional filters whose weights need to be learned.A set (which can be just a small number) of learnable linear weights are used to integrate the feature maps after convolution.Later, a new network type called Perturbative Neural Networks (PNNs) [77] replace their each convolutional layer by a so called perturbation layer, which computes its response as a weighted linear combination of non-linearly activated additive noise perturbed inputs.Both the LBC and perturbation layer are shown to be a good approximation of the convolutional layer but with much fewer parameters to learn, and both LBCNNs and PNNs share the same idea of replacing weighted convolution by a linear weighted combination of feature maps.
2) Compact RNNs: RNNs are often used for video recognition to extra the temporal features, but designing a compact RNN is harder than CNN because the inner structures of gated units are complex such as those in LSTM [167], [168] and GRU [169].That is to say, any single whole RNN contains several layers in which some units have elaborate internal structures.This situation is the most prominent characteristic which does not exist in neurons of other kinds of DNNs.Therefore, compact RNNs can be designed at the level of units or the level of whole networks.For discussing expediently and concisely, we give a simple equation which can abstract any gated connection, such as forget gate, input gate, output gate, etc.
where y is the data passed the gate; W is the weight matrix corresponding to the input data of current time x(t); U is the weight matrix corresponding to the status of previous time h(t − 1); b is the bias vector and the σ(•) is the activation function.
Units Level.The topology of gated units in RNNs is complex, however, some approximate reconstruction may be considered to remove some subordinate connections.In fact, GRU is actually a compact design based on LSTM.In detail, there are 4 gated connections described by Equation (1) in LSTM, whereas GRU has only 3 ones.Thus, space and computation consumption of GRU can be less than that of LSTM.Other than GRU, S-LSTM [79] and JANET [80] contain forget gates only in their units, minimal gated unit (MGU) [170] integrates the reset and update gates.On the contrary to the elaborate gates, FastGRNN [171] uses a shared matrix which connects both input and state so that the number of gated connections is kept but parameters are cut down.Quasi-recurrent neural network (QRNN) [172] replaces the previous output with previous input in the gate calculation, and accordingly gated connections are transformed from Equation (1) to Networks Level.One can also simplify the architectures of stacked layers through multiple basic units, just like compact CNNs.Sak et al. [81] introduce a linear recurrent projection layer to reduce the dimensions of the output of LSTM layers.Wu et al. [82] introduce residual connections into their 8layer translation LSTM.Some other analogous compact RNNs include skip-connected RNN [173], Grid LSTM [174] and SRNN [175].However, compact design is comparatively hard to implement and has a lack of uniform principles.Even worse, compact method is not easy to achieve high compression ratio, which makes it devoid of significance to deal with largescaled visual recognition neural networks.Thus, this method is often used to combine with other compression methods or sometimes relied on expensive extra disposing such as the transfer learning based distillation [176].
3) Neural Architecture Search: Unlike the rigid human designed or hand-crafted network architectures, neural architecture search (NAS) is a promising domain that can automatically construct a compact DNN.For the conventional practice [83], NAS is the process to train an RNN as the controller to generate a good architecture gradually.In detial, this RNN should train repeatedly to gain the produced architecture, and update the RNN controller itself based on evaluating the architecture by the environment or evaluator.Obviously, this process leads to massive training cost, since the search space will enlarge exponentially when the target DNN is deeper.More broadly, according to [14], there are three aspects of challenges lie in the field of NAS, i.e., search space, search algorithm and evaluation strategy, and many researchers aim their attentions to solve these problems.It is interesting that even the controller is usually an RNN, most of the target DNNs are oriented to CNNs, and only a few researches are considered for RNNs [177]- [180], thus the review of NAS is mainly focused on CNNs here.concrete elements to represent local structures, to construct the target DNN sequentially and and efficiently [178], [188].
Moreover, the search complexity can be further reduced in the range of the cell, e.g., progressively increasing the number of blocks within one layer as shown in Figure 13(a), and searching hierarchical topology within a cell as illustrated in Figure 13(b).Search Algorithm.Distinctly, appropriate search algorithm is critical since its target is to efficiently combine basic network elements together to be a whole network that fits a specific task or dataset most.Roughly speaking, except naive random search, these algorithms can be classified into four approaches, i.e., a) reinforcement learning (RL) including Q-learning [189], [190], REINFORCE [83], [186], [191] and proximal policy optimization (PPO) [178]; b) neural evolutionary algorithm (EA) [84], [184], [192]; c) Bayesian optimization (BO) [85], [193]; and d) differentiable gradient for architecture (DGA) [86], [179].It is hard to say which algorithm is better, however, classical RL approach still reveals the higher boundary of performance [187].Nowadays, some other minor improvements are also used for improving the searching efficiency, e.g., exploring the search space with the network transformation and weight reusing [186], [191], [194], and sequential model based optimization (SMBO) [185] which can choose the more likely direction of optimization rather than blindly searching with wide range.Evaluation Strategy.It is clear that the searched architecture should be evaluated to verify whether it can fit the task.The first strategy is simple, i.e., finitely train and evaluate the target DNN to observe its trends, e.g., just train on subset of dataset [195], build an accuracy predictor trained on the limited search space to guide the following search process [185], [196], predict the search direction based on partial training curves of current searched architectures [180], [197], etc.The second way is to yield the searched architecture to inherit the weights which are produced in previous searched architectures [186], [191], [198].The third aspect is termed as one-shot NAS, that only train the one-shot model, from which the searched architectures can directly share the one-shot weights without extra training [86], [179].
To fully show its promising developing tendency, we make Table II to compare some well-known human designed networks with NAS models nowadays.On the whole, NAS models can surpass human designs in both accuracy and efficiency, especially the EfficientNet [187] with the highest accuracy and several lightweight models for mobile format [179], [186], [187] with tiny number of parameters and operations (GFLOPs).Further, the searching time of NAS algorithms has also decreased significantly from 22,400 [83] to 4 [179] GPU days, that makes it possible to apply NAS models on the real embedded devices.Nevertheless, there are still some problems to be solved in this direction, e.g., a lack of practical report of deploying NAS models in fickle real data, enhancing the NAS models for mobile platforms to find the limitation of compact architecture, handling the trade off between accuracy and size, etc.Hence, all the signs indicate that this kind of approach is promising but still needs further studies.

B. Tensor Decomposition
Traditionally, since the parameters in DNNs are mostly stored as matrices, matrix decomposition is widely used in network compression, especially the SVD [199]- [206].However, restriction of orders makes it limited to use matrix decomposition to compress larger and larger visual recognition neural networks.Besides, tensors may have some inherent connections with neural networks [207].Hence, discussing tensor decomposition here is sufficient because a matrix is actually a 2nd-order tensor.

like [212]
( where K ∈ R r1×r2×•••×r d denotes the kernel tensor, any one ) is the factor matrix, and the operation × i means the mode-i contracted product [212].If every r i equals to a positive integer r C and the kernel tensor K presents like a superdiagonal tensor, which means all elements in K are 0 except 3) will become CP (Canonical Polyadic) decomposition [212].If all the factor matrices F (i) are orthogonal and the kernel tensor K is so called all-orthogonal [209], [210], then Equation (3) will become HOSVD.
There are many researchers who apply CP and Tucker to compress the weights in neural networks in recent years, especially CNNs for visual recognition [87], [88], [213]- [221].In general, to utilize Equation (3) efficiently, any single weight matrix W ∈ R M ×N should be mapped into a dth-order [90].Thus, the greater value of d can bring the more effective compression which decreases the complexity from O((mn) d ) to O(dmnr + r d ).However, the curse of dimensionality [222], [223] has not been solved completely by Tucker because of O(r d ).In addition, the relatively new block term decomposition (BTD) [224], which is in actual the sum of multiple Tucker blocks, can ease this curse to some extent, since the size of each kernel tensor may be smaller.In [89], BTD-LSTM shows superior ability of keeping information for visual recognition, but corresponding computation process is relatively complex.
2) Tensor Network: Tensor network [228], [229], which represents a tensor with a linked of matrices or low order tensors with contracted products, is promising to avoid the curse of dimensionality by eliminating the high order kernel tensor with r d elements according to Equation (3).Commonly, there are three types of tensor networks are applied, i.e., tensor train (TT) [230], [231], tensor chain (TC) [232], [233] or tensor ring (TR) [225], [234], and hierarchical Tucker (HT) [235], [236].Tensor Train.According to [237], a dth-order tensor A ∈ R n1×n2×•••×n d can be represented as where the operation × 1 is the mode-(N, 1) contracted product [238], and the core tensors Apparently, according to the equation above, the spatial complexity of TT format is O(dnr 2 ) where n is the maximum value of modes and r is the maximum value of TT ranks.It is obvious that the O(r d ) in classical decomposition, i.e., Equation (3), is avoided and the curse of dimensionality is solved.
Tensor Chain.TC format, which can be regarded as a variant of the TT format, is first proposed by B. N. Khoromskij [232] and proved having similar approximation capability as TT by Espig et al. [233].Nowadays, the TC format is introduced in the field of DNNs as the TR format by Zhao et al. [225], [234].
Comparing with TT, the TC format of a dth-order tensor A ∈ R n1×n2×•••×n d has only one difference which is r 0 = r d = 1.Thus, differ from Equation (4), there are two pairs of equal modes must be contracted at the end like where , and × 3,1 1,d+1 means the paired contracted modes are, the 1st of the former vs. the 3rd of the latter while the (d + 1)th of the former vs. the 1st of the latter.Fig. 14 shows TT and TC in tensor network graphs, where every node represents a tensor and each edge is a mode of its connected tensor.Hierarchical Tucker.In fact HT is the fountainhead of TT, i.e., TT is a special form of HT [239], since HT has extremely flexible organizational structure.Particularly, for a dth-order tensor A ∈ R n1×n2×•••×n d , their modes could be divided into two sets as t = {t 1 , t 2 , ..., t k } and s = {s 1 , s 2 , ..., s d−k }, then we have where is termed as transfer matrix, and ⊗ is Kronecker product.One can continuously using Equation ( 6) until all the truncated matrices become U i ∈ R ni×ri , thus the HT format of A will be done.Obviously, there are multiple ways to split the modes of A, however, even the simplest format with the binary tree is still formidable to write in formulation [96], [97].Fortunately, tensor network graph in Fig. 14(c) can afford us a convenient description.
3) Typical Practices: On the whole, TT format represents to be the most vibrant method of tensor decomposition.We list current applications of TT compressed neural networks including both CNNs and RNNs for visual recognition in Table III, where a handful of TC and HT are introduced as well.We find an interesting phenomenon that the tensor decomposed CNNs for image tasks are hard to avoid the accuracy loss, while RNNs for video tasks are easy to achieve higher accuracy in compressed forms.We guess that RNNs have larger scale than CNNs in general which situation verifies that a larger network is easier to compress [240], and the practice on 3DCNNs further verified this since 3D convolutional kernels have heavier redundancy [92].However, how to deal with the accuracy loss still needs to be learnt.

C. Data Quantization
Data quantization can project concrete data, including weight matrices, gradients, nonlinear activation functions, etc., from high-precision value space to low-precision value space, e.g., from real number field R to integer field N.This approach can not only reduce the space complexity of network parameters, but also accelerate the running time of neural networks because bit operations are much faster than float operations.As raw visual data are generally represented by integers, quantized neural networks may be more suitable for visual recognition tasks.
1) Method: Problem Formulation.In general, there are two formulations to describe how to transform floating point weights, gradients, or activations, etc., to quantized data type.One category, which is the most widely used, projects the original high-precision value to the space of quantized data as where x is the original high-precision value in the continuous space, Q(x) is the quantized data in a discrete space, round(•) is the rounding operation, and ∆ is the quantization step length if the discrete states have uniform distribution.If Kbit quantization is used, we have ∆ = 1 2 K−1 to discretize x ∈ [0, 1] to 2 K states.This category, as described in Equation (7), is straightforward and easy to consider, thus most practices have followed this direction, which can be observed in Table IV.
The other category regards the quantization as an optimization problem and tries to solve it approximately.The classic model can be generally governed as min where ) is a set which has n discrete states for quantization.The earliest and the most well-known quantization in the category of optimization is the XNOR-NET [100].An obvious motivation is that the optimization problem pays more attention to the whole network rather than local quantization, so Equation ( 8) may be more appropriate for large scale neural networks.Quantized Objects.As mentioned above, except weight (W), there are also several other different objects in neural networks can be quantized, such as activation (A), error (E), gradient (G), and weight update (U).Fig. 15 illustrates the data of these quantized objects in forward pass, backward pass and weight update processes.Parameter (W) is the most straightforward to be dealt with.Propagation data (A,E) correlate highly to the data flow during forward and backward passes which influence accelerating a lot.Quantized gradient and update have a great help for training the full quantized networks, but are harder to implement, thus the corresponding practices are fewer as shown in Table IV.
Algorithm Description.Generally, if the bit-width K is given, Equation ( 7) can be rewritten as where x Q is the quantized data with K-bits, i.e., quantized objects W Q , A Q , G Q , E Q and U Q described above.Data flow among these quantized data can generally be computed by bit-wise operation which is much faster than the full precision computation.Besides, performance of quantized networks with appropriate bit-width will not degenerate that can be learnt in Table IV.The overall algorithm is briefly described in Fig. 15, and the more comprehensive algorithm description of quantization can consult [99].
State Distribution and Projection.The data in quantization set always have concrete discrete states, which may contain different distributions.For example, uniform distribution is the most widely used one, logarithmic distribution has exponential variance on step length that has obvious benefits to convert multiplication to addition, and adaptive distribution often occurs in the situation when formulating the quantization as an optimization problem such as TTQ [270], ADMM [242], etc.On the other hand, the primary mission in quantization is projecting the original high-precision data to the discrete state space, of which deterministic and stochastic projections are  the two main approaches used widely.The former projects the high-precision data to the nearest discrete state, while the latter projects the data to one of the two adjacent states with probability which is determined by the distance from the original data to the discrete states.Generally, deterministic projection is easier to handle, so that most references listed in Table IV select it.
2) Typical Practices: In the aspect of CNNs, we list a number of typical and latest practices with their best performance on CIFAR10 or Ima-geNet in Table IV.Note that compression ratio of quantization is not considered because it relates directly to the bit-width, thus the storage saving is limited.As can be clearly observed, all the recent practices quantize W, and most works quantize both W and A, a small number of works quantize G,E, or U.It is worth to mention that the practice which quantizes W, A, G and E [98] inspires some practices on semantic segmentation tasks in terms of accelerating corresponding encoder-decoder CNNs [271], [272].We emphasize quantized objects here rather than other aspects of quantization, e.g., state distribution and state projection, is due to that the choice on which objects are quantized or not can influence the training and inference processes significantly.Particularly, quantized G and U can simplify the training very much because the data flow of back propagation can be all in integers.Moreover, Banner et al. [265] propose the range Batch Normalization (BN) to greatly reduce the numerical instability and arithmetic overflow caused by the popular standard deviation-based BN, hence the BN can also be quantized.We optimistically believe that some entirely quantized DNNs will come soon and then mature and efficient integer neural networks should become the mainstream especially for embedded surroundings.
The amount of practices of quantized RNNs for visual tasks is few to the best of our knowledge, in comparison with natural language processing tasks [100], [249], [273]- [275].Unlike CNNs, RNNs are dynamic systems that reuse the weight and accumulate the activation error in temporal dimension, and furthermore, there are two dimensions of back propagation in RNNs, which are spatial (layer-by-layer) and temporal (step-by-step).These situations make it harder to clarify the training dataflow and quantization sensitivity.In fact, many quantization methods can be shared by both CNNs and RNNs [249], [276], however, more applications of quantized RNNs for visual tasks should be established in consideration of the complexity of RNNs discussed above.

D. Pruning
Pruning can reduce the amount of weights or neurons, thus the memory and calculation costs are retrenched.However, the additional indices for indicating the location of non-zero elements, and the irregular access or execution pattern, become the two major drawbacks.
1) Basic Method: Problem Formulation.Similar to quantization, there are also two formulations which can describe pruning.The first one is direct and naive, i.e., utilizing some search algorithms for trained DNNs to find those "unimportant" weights or neurons to prune, like S(X) = sparse(X) (10) where X is the weight matrix and S(X) is the new weight matrix after pruning.The function sparse(•) is the search algorithm to pre-select the unimportant weights and neurons to be pruned.Such search approaches could be low-precision estimation [102], [103], negative activation prediction [104], etc. Besides, hashing trick [277] may also help to make weights in DNNs sparse [278] even it seems to be weights sharing rather than a pruning approach in concept.Further, some other hashing methods may reveal similar abstract thought to normal pruning, e.g., using locality sensitive hashing (LSH) to collect neural nodes in active set and other neural nodes which are not in this set will not be computed during forward and backward calculations [279].However, it is obvious that search algorithm may consume vast computing time for DNNs with large scale.Further, as the same reason for quantization, regarding the problem of pruning as optimization may be a more adaptive formulation to large DNNs because of the layer coupling.A general formulation of pruning in optimization can be described as [105] min where W = {W (1) , W (2) , • • • , W (G) } is the set of all weights in G different layers or parts; λ is a penalty parameter that affects the sparsity, and L 0 (W ) is the normal loss function of DNNs.Pruning Objects.There are two typical pruning objects: weight pruning and neuron pruning, which are illustrated in Fig. 16.The former one reduces the number of edges that can make weight matrices sparser, while the latter one reduces the number of nodes to make weight matrices smaller.Evidently, the latter method may cause more accuracy loss than pruning weights only.Using ResNet50 as an example, according to Table V, weight pruning networks [280]- [284] can exceed neuron pruning networks [106], [285], [286] in terms of the performance of top-1 accuracy on ImageNet in most cases.Nevertheless, this gap has been reducing recently.Pruning Structure.Generally, the compute acceleration has a great deal to do with the sparse pattern, which is termed as pruning structure in this paper.It is well-known that the operation in one neural layer can be abstracted as matrix multiplication, thus the pruning structure can be described as the amount of zeros in matrix.Besides, the convolutional computation is commonly converted to the modality of GEneral Matrix Multiplication (GEMM) by lowering features and weight tensors to matrices [287].
Fig. 17 illustrates different pruning structures: element-wise, vector-wise, and block-wise.Note that different pruning grains produce different pruning structures.For instance, in weight pruning, kernel (discrete), fiber, or filter pruning produces vector sparsity, while channel or kernel (group) pruning produces block sparsity.

2) Typical Practices:
Here we let W and N denote weight pruning and neuron pruning respectively.In the aspect of CNNs, according to Table V, the works that prune W and those prune N are close in terms of quantity, and just a few practices have considered both of them (W and N) [288].We emphasize pruning objects here is due to that, as mentioned before, pruning only weight or neuron may influence the performance more obviously than other aspects, i.e., formulation and structure.From Table V, in terms of performance, it cannot be deemed absolutely that optimization is better than searching for large DNNs until now.One certain advantage of optimization is efficient training and avoiding some time-consuming searching procedures.For pruning, element structure can be helpful for achieving higher compression ratio, especially the case of DGC [280].Relatively, other structures may bring significant acceleration of computing, particularly filter (vector) and channel (block) pruning in convolutional kernels.On the whole, pruning on convolutional kernels is much more difficult than that on fully connected layers.
In the aspect of RNNs, there are two typical image captioning tasks, among which one prunes W [303] and the other prunes N [304].The former has achieved 13.12× compression ratio on MS COCO dataset with 2.3 improvement of CIDEr score.The latter has also used MS COCO dataset and got only 0.4 reduction of BLUE-4 score with 50% sparsity.Comparatively speaking, it might imply that there are still lots of potentials for visual RNNs with pruning to get further advanced.

E. Discussions and the Derived Joint Compression
In fact, each compression method has its own characteristics which other compression methods do not have.Before listing recent practices of joint compression, we present here our observations and thoughts about what features make each compression method worthy and unique based on the foregoing content in this section.
1) Main Characteristic of Each Method: Compact Networks.As discussed before, there are two levels of compact designing, one is delicate cell and the other is NAS algorithm.According to Table II, NAS shows superior performance compared with human designed networks in both accuracy and storage saving.The most critical point is that, executing time of NAS algorithm has been shortened a lot nowadays [179], such situation makes NAS promising and generic for embedded applications with various or changing surroundings though there are still a lot of detailed works to do towards real applications.While other compression methods, i.e., decomposition, quantization and pruning, are still lack of enough flexibility, because a single specific network architecture is hard to be applied to all kinds of datasets.Tensor Decomposition.In early studies of network compression, tensor decomposition was the only approach that supports the so called in situ training [305], which means training a new model from scratch.Meanwhile, quantization and pruning needed pre-training in most cases to discover the distribution of weights to quantize or prune further.However, new studies of quantization [306] and pruning [307] have made up this short slab.Moreover, both of these two reports conclude that in situ training could perform better than fine-tuned models.Even so, tensor decomposition still appears to be the most powerful in the aspect of compression ratio particularly for RNNs according to Table III.Currently, the most unique characteristic of tensor decomposition is that various tensor network formats may have some inner linkages to DNN architectures, thus theoretical explanation of efficient DNNs may be explored in this kind of compression.Chien et al. [88] regard the whole Tucker decomposition process in Equation (3) as a neural network connection.Cohen et al. [239], [308] use HT decomposition to explain the depth efficiency of DNNs.Chen et al. [309] find BTD can describe various bottleneck architectures in ResNet and ResNeXt.Li et al. [310] propose a new DNN architecture based on MERA tensor network, which in actual is a kind of renormalization group (RG) transformation [311].In this sense, it is expected that the studies of compressing DNNs with tensor decomposition can build a bridge between the experiments and the system of theories about DNNs.Data Quantization.Although data quantization is not very good at reducing model complexity (in terms of the number of parameters) of DNNs compared with decomposition and pruning, it still has a significant advantage of computing acceleration and friendly deployment of embedded hardware, which are guaranteed by the low bit data flow in quantized DNNs.The authors implemented WAGE [98] on FPGA platform, further found 8 bit models perform 3× faster in speed, 10× lower in power consuming, and 9× smaller in circuit area than 32 float point models [99].Although pruning with appropriate structure may also help to reduce computation complexity, bit operations in quantized DNN are still more adaptive for hardware environments.
Pruning.According to Table III and Table V, pruning has a better capability of accuracy maintenance or even improvement, though seemingly other methods like tensor decomposition has more potential power to gain higher compression ratio.However, accuracy loss of compressed DNNs under decomposition is hard to avoid especially for CNNs at present.For example, the performance of VGG16 on CIFAR10 with pruning can achieve amost 0.15% accuracy lift in the new study [307], while the accuracy loss is hard to make up even the ranks are set very high based on TT convolutions [92].Additionally, two main advantages of decomposition, i.e., higher compression ratio and in situ training, can be separately obtained by [280] and [307] in the aspect of pruning.However, similar to NAS, data structure of pruned DNNs may present complex and chaos, which is unfriendly to embedded applications and corresponding theory explanation.

2) Joint Compression:
Considering different characteristics of each compression approach, recently some researchers have made their applications involve more than one class of method besides using individual compression method, termed as joint compression in this paper.Corresponding visual recognition works are illustrated in Table VI.The joint compression has great potential for higher compression ratio.For instance, compression ratio of 89× of AlexNet on ImageNet (58.69% accuracy) [245], 28.7× of ResNet18 on ImageNet (66.6% accuracy) [111], and 1910× of LeNet5 on Mnist (98% accuracy) [312].
However, researchers should pay more attention to maintaining the model accuracy in joint compression through sufficient analyses and comprehensions about respective characteristic of each compression component.For example, since there might be a projection between the structure of tensor decomposition and the architecture of DNN as discussed above, the decomposition method appears to be perpendicular to the other methods to some extent, so combining it with quantization or pruning is a promising direction which still needs further researching according to Table VI.

V. ON RECOGNITION: INFERENCE AND GENERALIZATION
Efficient training can allow learning from more data, with more parameter tuning or more complete architecture search, which can all lead to a better trained model.However, it is also critical to make the model run efficiently on affordable or existing devices and easily transfer to new data/tasks.In many cases, training may be done just once or only at one place (e.g. on the cloud) with powerful computational resources, but run-time inference has to be done at cost-sensitive and thus resource-limited edge computing devices.Therefore, in some sense, efficient run-time inference is a more serious issue when real applications are concerned.In this section, we focus on the fast run-time inference for model deployment at testing stage and transfer learning efforts for making the models easily generalized to new scenarios/tasks without costly retraining.

A. Fast Run-time Inference
Efficiency is not only a big concern for training deep learning models, but also rather important and sometime critical for the deployment at users' end in real applications.Therefore, how to accelerate run-time inference with limited resources is an indispensable issue of great importance for industrial applications.This is even more critical for applications that requires real-time or even faster responses, such as automatic recognition for autonomous driving.
There have already been a rich literature on accelerating run-time inference with DNNs.We roughly categorize them into two groups based on the difference of focus: data-aware acceleration and network-centric compression, and detail their recent advances and trends as follows.
1) Data-aware acceleration: During run-time inference, in many cases there is no need to check all the input data for generating the final recognition results, especially for the data which may include much irrelevant or redundant information, such as videos.Therefore, efforts on reducing or avoiding the computation on the irrelevant/redundant parts of the data are very important for fast inference.
A simple yet very helpful direction is exploring the similarity of the intermediate feature maps of two consecutive video frames for reducing redundant computation.A representative earlier work is the one called deep feature flow [42].It runs full expensive convolutions only on sparse key frames and then propagates their deep feature maps to other frames via a flow field.Significant speedup was achieved as flow computation is relatively faster than full convolution.This work got extended to a more unified framework [316] which is proved to be faster, more accurate and more flexible.The framework contains three main components: sparsely recursive feature aggregation for ensuring both efficiency and feature quality, spatially-adaptive partial feature updating for improving the quality of features from non-key frames, and temporally-adaptive key frame scheduling for more efficient and better quality key frame usage.However, these two works are still computationally expense, as they rely on per-pixel flow computation, which is a heavy task.
Recently, Pan et.al. [112] have explored another direction.They proposed a novel recurrent residual module (RRM) which only do dense convolution on the first frame and have the following frames fed into a sparse convolution module which only extract information from the difference images of neighboring frames.The sparse convolution has no bias term and it shares the same filter banks and weights with dense convolution.After enhancing the sparsity of the data (by difference images), a general and powerful inference model which is called EIE (Efficient Inference Engine) [113] is adopted to do the inference efficiently according to the dynamic sparsity of the input.In usage, the video is split into several chunks which can get processed with RRM-equipped CNN in parallel.Good results (speedup) have been observed on object detection and pose estimation in videos, and the model shall also be applicable to other visual recognition task for videos.Since it only explores the natural sparsity of data, it can ensure no accuracy loss during the speedup.
Besides information redundancy in consecutive frames, irrelevant information may largely exist for object detection in videos, as objects often occupy only a small fraction of each video frame.Therefore, an intuitive acceleration strategy is to do dense full processing on only a few frames and make use of the spatio-temporal correlation among nearby frames to save computation on the other frames.A representative work reallocates the computation over a scale-time space called Scale-Time Lattice [114].It performs expensive detection sparsely on key frames and propagates the results across both scales and time with substantially cheaper networks, by making use of the strong corrections among object scales and time.Together with some other minor novel components (e.g. a network for temporal propagation and an adaptive scheme for keyframe selection), the work achieved better speed-accuracy tradeoff than previous works.Another representative work called Spatiotemporal Sampling Network (STSN) focuses on feature level propagation across adjacent frames.STSN is mainly about deformable convolutions across time, which is optimized directly with respect to video object detection per- formance.The approach has a natural robustness to occlusion and motion blur, which are two key challenges in detecting objects in videos.

2) Network-centric compression:
The main contents in the last section focus more on the theoretical methodology and training, meanwhile, for the practical scenarios, a lot of efforts have been made on general network-centric compression for fast inference.A comprehensive survey may have to cover several dozens of publications.Due to the scope and page limits of this paper, only a few representative ones for showing the recent advances and trends are covered here, as briefly introduced and compared in Table VII.
Besides pointing out their key ideas, main approaches, advantages and disadvantages, we provide a group of Key Property Indicator (KPI) for briefly evaluating them in the perspective of important factors for real applications.There are totally five indicators, with the following detailed explanations.
I "General or Specific?":whether the model is a general one that can be applied to the compression of any networks, or it is specific to some data types, tasks, or network structures.II "Static or Dynamic?": "static" means that the compression has to be done together with the training of original models and any changes to the compressed model has to be accompanied with a retraining of the original model, so that once a compression model is optimized, it is static; "dynamic" means that the compression model can be changed dynamically on demand without requiring a retraining of the original model.In the last two years, there is a clear trend towards shifting to dynamic compression, so most of the references in Table VII are about dynamic models and only one representative of static model is listed.III "Easy User Control?": whether the compression can be easily controlled by the users or not, such as using one or very few hyper-parameters for an intuitive controlling of the compression/speedup rate.IV "Input Adaptive?":whether the compression is made adaptive to the input data or not.When it is input adaptive, the network can be changed for each specific input so that the inference time may be greatly reduced for easier inputs.V "Resource-Aware?": whether the compression is aware of the available resources (including resources for storage, transmission and computation) or not.Fewer and weaker resources shall led to higher compression rate, and vice versa.Resource types shall also matter in detailed control of the compression.
There are a few findings in our study, which are worth mentioning.
• Like the channel-wise sparsity work proposed by Liu et.
al. [293] or the slimmable neural networks proposed by Yu et.al. [320], there is a clear trend that the network pruning focuses more and more on whole channels or blocks [115] as such structural pruning is GPU-friendly (can be exploited by GPUs) and allows the acceleration to work on dense operations of fewer components, instead of many sparse individual weights.• Models which can be easily controlled by users ("Easy User Control") are usually also "Resource-Aware", as there is a common assumption that user can control the compression according to the actual situation of resources.However, in real cases, the state of resources can be dynamic even for a specific device instance (e.g.someone's smartphone), as all the factors of storage, transmission and computation can be changing over time.
There can be multiple processes/applications running at the same time and the internet connection speed is hard to be consistent.Therefore, instead of asking user to specify the compression rate, having a model to dynamically decide it is a more practical and also more promising choice, which is widely ignored and far from being explored.Fang et.al. [116] present an inspiring work in this direction.• Being "Resource-Aware" and being "Input-Adaptive" seem hard to be achieved at the same time, as the former is about overall compression which can be controlled by users, while the latter is about automatically adjusting of compression based on each individual input data instance.However, these two can be solved from different aspects and there is no conflict between them.Therefore, integrating both factors ("Resource-Aware" and "Input-Adaptive") is a promising direction worth more investigation, and there are also a few good examples [117], [317], [318].

B. Generalization via Transfer Learning
It is well-known that transfer learning has been widely utilized for DNN-based visual recognition.Leveraging weights of a good model pre-trained on a large well-annotated dataset like ImageNet [21] has been a standard way for a great range of visual tasks, motivated by the encouraging results of even the most naive way of transferring [67].Transferring the learned knowledge of the source task to a related task could significantly reduce the training time while at the same improving the recognition performance, especially for fewshot recognition problems.
Transfer learning has been playing a critical role in reducing annotation burdens for DNNs-based applications.Since deep learning's performance heavily depends on the quantity and quality of data, in a deep learning project pipeline, a huge amount of time could be spent on data collection and annotation.Therefore, a neural network model is more efficient when it can use less manually annotated data to reach good results (e.g.comparable with those consume much larger amount of labeled data).
To reduce manual annotation efforts on the target task, usually either synthetic data or data from similar tasks are employed.Compared with the target data domain, however, the available source data may have different feature domains.Ignoring the domain gap and directly applying the model trained on the source data or performing co-training on both data domains could compromise the neural network's performance.Therefore, domain adaptation, an important and widely-studied branch of transfer learning, has been utilized to bridge the gap between different data domains [323].The successful applications cover many visual tasks including image classification [118], pose estimation [119], image segmentation [120], object detection [121], and image captioning [122], etc.
Recently, there have been increasing interests in deeper understandings of transfer learning for DNNs.A theoretical study on the transferability of deep representations is given in [324], and a comprehensive empirical study is done in [325] with many valuable findings.It is great to see that better pretrained models generally perform better and transferring with fine-tuning is better than training-from-scratch for most of the studied diverse image recognition tasks with many smallerscale datasets.There is an important message on the efficiency issue related to transfer learning: for achieving the same accuracy ("90% of the maximum odds of correct classification achieved at any number of steps"), fine-tuning a pre-trained model is averagely 17 times faster than training from scratch (the reason for that is discussed in [324].).Since the dataset (e.g.ImageNet) is usually more than 20 times larger than the target dataset, if the pre-training time is also taken into account, the total time for training from scratch is likely to be shorter.However, when both the DNN model and the dataset for pre-training is public (e.g.ImageNet), which is often the case, directly borrowing a pre-trained model is a better choice for efficiency, as the time for pre-training is already a sunk cost paid by some others.Reusing it has no time cost and it is also a good choice as these models are usually carefully optimized when they were published online.
For an even faster transferring (with possibly a better performance) than fine-tuning, recently some pioneering studies have proposed to adapt the pre-trained model to a specific target task/domain together with network structure compression so that both the full model adaption and fewer parameter learning can be achieved at the same time.Learning fewer parameters during transferring is believed to be more efficient than naive full model fine-tuning.In [123] be linear combinations of existing ones and force significant parameter size reduction (typically 13%, dependent on network architecture), and when coupled with standard network quantization techniques, they are said to be able to further reduce the parameter cost to around 3% of the original full pretrained model.A similar idea is proposed in [124], based on the parametrization of universal parametric families of neural networks covering specialized problem-specific models, with just small differences among them.Besides directly changing the pre-trained model, there is also a new idea of fixing it during the transferring and adding extra adapter modules to it for adapting to different target tasks [125].During transferring, only the adapter module is learned and the module is made to be much lighter than the original pre-trained model so that the additional learning can be very efficient.In the paper [125], extensive experiments show that with only 3.6% parameters (of the original pre-trained model), the performance can go beyond 99.6% of fine-tuning the whole pre-trained model.Though the last idea on appending has only been tested on NLP, the model looks general and therefore similar idea may work for visual recognition tasks too.

A. Summary of Recent Advances
As detailed in the former sections and also summarized in Fig. 18, to design an efficient visual recognition model or system, one may put efforts on data processing (compression, selection and representation) which is usually data type specific, network compression for training, fast run-time inference and efficient transfer learning, as already explored by the rich literature in the past few years.
An important message is that a lot of things can be done on the data side or having it taken into account when people design new efficient models, which is not only important for the recognition performance, but also critical for ensuring that the network compression fits the data well.
B. An Important New Frontier: Data and Network Cooptimization Recently, some methods actually have collaboratively optimized the data and networks together, although we reviewed them separately as 'Before Recognition" and "On Recognition" above.For example, in a previously cited paper [130], the network is cascaded by a data compression network and a classification network and trained in an end-to-end way, thus the compression network can compress images most efficiently, retain classification information, and also improve classification's efficiency.Similarly, in [43], the compressed video is used as the input of the action recognition network, and the decoding function is assigned to the recognition network through training, therefore the network learns the abilities for decoding and recognition at the same time through an end-to-end training.
In fact, training end-to-end has been a common desire for DNN-based solutions, however, even though joint compression and recognition is already made possible, so far the existing solutions have still treated them as two separate yet linked modules rather than a single fused model.Comparing with linking them for co-optimization, fusing them as a whole model may be more helpful for developing more computationally efficient DNN models with minimum redundant computations for real applications, even though that may lead to more space consumption due to the uncompressed data which is nowadays relatively easier to handle.Since we have not found any corresponding practices to the best of our knowledge, here we propose two possible directions for such a unified model design.
One direction is to explore quantization: linking data quantization and network quantization.As discussed earlier in subsection IV-C (esp.Fig. 15 and Table IV) and subsection IV-E, quantization can be done for the whole data flow inside DNNs and it is quite superior for adapting to various hardware platform, thus it is closer to real applications.Meanwhile, currently most visual data are originally represented by low bit integers thanks to the advancement of digital visual sensors, which make data quantization convenient and straight-forward.The main barrier for linking these two is that currently the raw visual data format may not directly match to quantized DNN data types.Therefore, we think necessary and highly valuable future efforts should be either on transforming the representation of raw data towards that of the quantized DNN models or going to the extreme to make the new visual sensors being able to produce various or more easily adjustable raw data formats for the integration with DNN quantization.
The other direction is to extend tensor network for covering the input data.Tensor networks can inherently describe linear transformations, e.g., matrix production and tensor contraction.Meanwhile, a DNN architecture is generally similar to a tensor network except all the nonlinear activation functions.Therefore, tensor network can at least inspire us to develop some new DNN architectures by analysing probable relationship between tensor networks and DNNs, even though the strict mapping from tensor network to DNN may be difficult.Thus, compressed input data and a compressed DNN architecture  [326] and (b) MERA [311].
can be both put into a system which is similar to a tensor network.For example, if the input data and the weights in a DNN are both approximated in TT format, the whole system can be expressed like a projected entangled-pair states (PEPS) tensor network [326] as drawn in Fig. 19(a) where red and green nodes illustrate input and output data respectively, and existing efficient computing algorithms, i.e., Algorithm 5 in [231], may help to accelerate this kind of DNNs.Further, if one layer of the DNN is designed like multi-scale entanglement renormalization ansatz (MERA) [311], higher ability of expression may be obtained as presented in Fig. 19(b).It is easy to observe that DNN in the MERA architecture has stronger local correlation than PEPS, e.g., the content of O 2 in Fig. 19(b) comprises the information from G 1 , G 2 , G 3 and G 4 .Through such efforts, the efficiency of data representation and network may be both optimized via the tensor network models with theoretical supports.

C. Unexplored Yet Promising New Directions
Though important and attractive recently-emerged directions have already been discussed in the former sections when individual topics are introduced, there might be still some unexplored yet promising directions, which may be worth mentioning.Within them, the following two are believed by us to be most valuable, though it is still challenging to work on.

1) Efficient Network Compression:
Network compression is now a vibrant research subject, especially orienting to visual recognition neural networks which are always large-scaled due to high-dimensional raw visual data.However, according to our investigation, we summarize several promising directions of compression methods below which may promote the miniaturization of DNNs further.
• NAS may become a promising or even requisite approach especially for embedded visual applications.Such a particular approach is very suited to the application scenarios with variable surroundings, e.g., fault diagnosis based on visual information, ground object identification based on aerial photography, etc. • Some tensor decomposition methods have not been studied adequately due to their complexity, e.g., HT, BTD, Kronecker tensor decomposition (KTD) [327], [328], PEPS [326], MERA [310], etc. Researching these tensor formats may imply potential unrevealed prospects on neural network compression.• Quantization and pruning are usually combined to compress DNNs which can be observed in Table VI.However, in the field of visual recognition, corresponding tasks on RNNs are few.We deem that RNNs have inherent adaptation to sequential visual data such as videos.Therefore, we suggest researchers to aim at this direction to work out more compressed RNNs based on quantization and pruning.• As different compression methods have different characteristics, it is natural to expect a super-synthetical compression method, which contains the flexibility of NAS, regular architecture of tensor decomposition, efficient computing of quantization in embedded surroundings, and high accuracy of pruning, could be proposed in the future.In other words, there are still a lot of works to be done to reach the culmination of network compression.
• The miniaturization of DNNs has a great practical significance for embedded devices which may work in manufacturing field, medical facility, aerospace equipment, etc.Hence, how to land any of compression methods to various hardware has a broad prospect to study.

2) Efficiently Learn Efficient Transferable Models:
Though transfer learning with DNNs has been widely studied due to the lack of labeled data for many visual recognition tasks as stated in subsection V-B, the existing transfer learning models have so far only been optimized for recognition accuracy on target tasks, with the efficiency issue widely disregarded.
As far as we are aware, there is only one exception, the NASNet work [178] proposed by Zoph et.al. at CVPR 2018.It designs a search space for NAS that decouples the complexity of an architecture from the depth of a network.The search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classification datasets/tasks across a range of data and computational scales (including the large ImageNet dataset).The NASNet actually finds the best convolutional layer (or "cell") on the small dataset and then applys this cell to another dataset by stacking together more copies of this cell, each with their own parameters to composite a convolutional architec-ture.Therefore, the architecture can be flexible, general and light-weight.The resulting architectures approach or exceed state-of-the-art performance in both CIFAR-10 and ImageNet with less computational demand than human designed architectures [178].In greater details, the learned NASNet is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS which is a reduction of 28% in computational demand from the previous state-of-the-art model.When used with the Faster-RCNN framework, the features learned by NASNet also surpass stateof-the-art at that time by 4.0% achieving 43.1% mAP on the COCO dataset.The work shows a great example on approaching a more efficient NAS model to learn efficient transferable network architectures.Further exploration in this direction will be highly valuable and should be paid more attention.
Besides NAS, adaptive fine-tuning with dynamic routing for transfer learning also has the potential to be able to learn an input-adaptive efficient transferable model.The very recent work of SpotTune [329] has shown how to learn such an input-adaptive model under a transfer learning setting.However, the dynamic routing component is designed only for improving accuracy but not efficiency.Since such a strategy has been widely used for fast run-time inference as detailed in the former subsection V-A, it is likely that a unified model may be designed for enhancing both accuracy and efficiency.In any case, it will be great if efficiency can be taken into account when new DNN-based transfer learning models are designed, so that they can get closer to real applications.

Fig. 7 :
Fig. 7: A general framework for joint image compression and recognition.The upper row in black shows the inference (test) phase, while the whole figure (including the blue parts) describes the training phase.Note that the compression module has to be optimized for minimizing both the reconstruction loss and the classification loss, different from the recognition module which is only optimized for the later one.

Fig. 10 :
Fig. 10: Video key frames which are informative for recognition could be sparse and unequally located.

Fig. 11 :
Fig. 11: Two kinds of point data representation, and the corresponding neural networks for their processing.
Fig. 12 illustrates several typical compact network designs in the aspect of topology.New Compact Building Blocks.The Local Binary Convolutional Neural Networks (LBCNNs) introduced in [78] propose using an alternative of the traditional convolutional layer called local binary convolution (LBC) layer inspired by the design of traditional local binary patterns (LBP), which uses a set of filters with sparse, binary and randomly generated weights

Fig. 15 :
Fig. 15: Data of quantized objects including WQ , A Q , G Q , E Q and U Q .

•Fig. 18 :
Fig. 18: Overview of the recent advances and new directions.
and W. Dong are with School of Artificial Intelligence, Xidian University, China.F. Yang is with Division of Information Science, Nara Institute of Science and Technology, Japan.G. Li is with Department of Precision Instrumentation, Center for Brain Inspired Computing Research and Beijing Innovation Center for Future Chip, Tsinghua University, Beijing 100084, China.
J. Shi is with Department of Computer and Information Science, University of Pennsylvania, United States.

TABLE I :
Preliminaries of the efficient methods mentioned in this survey.

TABLE II :
Comparison between state-of-the-art human designed networks (the upper part) and NAS models (the lower part) proposed in the recent years based on ImageNet.
Search Space.Search space contains various basic network elements, e.g., normal convolution, asymmetric convolution, depthwise convolution, any kind of pooling, different activation functions, etc., and the laws to connect these elements together.Generally, by restricting the search space, the controller can search the basic network cells, which consist of some

TABLE III :
Applications of neural networks compressed by tensor networks for visual recognition tasks.

TABLE IV :
Typical practices of quantization for CNNs.

TABLE V :
Typical practices of pruning for CNNs.

TABLE VI :
Extant works of joint compression for visual recognition tasks.

TABLE VII :
Recent representative network compression works for fast run-time inference.Please refer to the text for details about the Key Property Indicators.