1 Introduction

The advancement of effective deep learning-based object detectors has been influenced by Internet of Things (IoT)-based technologies. Although many object detectors attain outstanding accuracy and perform inference in real time, the majority of deep object detection models demand too much Central Processing Unit (CPU) power to run on edge devices (Wang et al. 2021a, 2021b, 2021c, 2022). Exciting outcomes have already been achieved using a variety of strategies. Strategies for deploying deep learning-based applications onto edge devices include (Wang et al. 2020a, 2020b, 2020c, 2021a, 2021b, 2021c; Véstias et al. 2020; Li and Ye 2023; Subedi et al. 2021):

  • Using a partitioning technique, since different layers may have different execution times: divide the processing graph of, for example, a fully connected or convolutional network into offloadable tasks so that the execution times of the composite task units are balanced (see the sketch after this list).

  • Large-scale analytics platforms require intermediate resource standardisation for data manageability and low latency, as opposed to standalone applications on mobile devices. With intermediate resources provisioned, a deep learning-based analytics platform can determine the proportion of local processing, provided there is a mechanism to divide the load between buffering and memory loading. Offloaded execution through efficient partitioning can reduce cost, latency, or any other objective of interest.
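As a sketch of the first strategy, the snippet below greedily splits profiled per-layer execution times of a purely sequential network into contiguous offloadable task units of roughly equal total latency; the timings and function names are illustrative assumptions, not drawn from the cited works.

```python
# A minimal sketch of latency-balanced model partitioning, assuming a purely
# sequential network whose per-layer execution times have been profiled.
def partition_layers(layer_times, num_partitions):
    """Greedily split per-layer latencies (ms) into contiguous partitions
    whose total execution times are roughly equal."""
    target = sum(layer_times) / num_partitions  # ideal time per partition
    partitions, current, current_time = [], [], 0.0
    for i, t in enumerate(layer_times):
        current.append(i)
        current_time += t
        remaining = len(layer_times) - i - 1
        # Close the partition once it reaches the target, keeping enough
        # layers in reserve for the partitions still to be filled.
        if (current_time >= target
                and remaining >= num_partitions - len(partitions) - 1
                and len(partitions) < num_partitions - 1):
            partitions.append(current)
            current, current_time = [], 0.0
    partitions.append(current)
    return partitions

# Illustrative profiled latencies (ms) for conv / fully connected layers.
times = [4.1, 8.7, 8.9, 3.2, 12.5, 6.8, 2.4]
print(partition_layers(times, 3))  # three offloadable task units
```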

Moreover, a detailed study is provided in Sect. 4.6 of the manuscript. In recent years, a new field of study, lightweight object detection, has emerged with the goal of developing compact, effective networks for IoT deployments, which frequently take place in low-compute or resource-constrained settings. The research community has long worked to identify the most accurate detection models through advanced architectural searches, since developing a deep learning-based lightweight network architecture is a difficult procedure. When these models are used on edge devices, such as high-performance embedded processors, the question arises of how to run high-end innovative applications with fewer resources. It is still not entirely possible to perform detection on a smartphone or edge device. Although existing models are capable of this task, their precision is simply insufficient and undesirable for real-time use.

Edge computing, according to Gartner, is a component of a distributed computing architecture in which data processing resides near the edge, where devices or individuals generate or consume that data (Hua et al. 2023). Because of the constant growth in data created by the IoT, edge computing was first deployed to reduce bandwidth costs for data travelling long distances. The emergence of real-time applications that require processing at the edge, however, is driving the current technological advancements. Among many other benefits, data minimization at the network level can prevent bottlenecks and significantly reduce energy, bandwidth, and storage expenses. While a single device can send data across a network without difficulty, problems occur when hundreds of devices send data at once: quality degrades due to delay, bandwidth expenses rise, and bottlenecks form that can cause cost spikes. By acting as a local source for data processing and storage, edge computing services and offerings help fix this problem. An edge gateway likewise minimizes bandwidth requirements by processing data from an edge device and sending only the pertinent data back through the cloud (Jin et al. 2021). A key element in modern integrated real-world Artificial Intelligence (AI) systems is the edge device. Initially, IoT devices could only gather data and send it to the cloud for processing. By putting services closer to a network’s edge, edge computing expands the possibilities of cloud computing and enables a wider range of AI services and machine learning applications. IoT computing devices, mobile devices, embedded computers, smart TVs, and other connected gadgets can all be termed edge devices. Real-time application development and deployment can be accelerated by edge computing devices through high-speed networking technologies such as 5G. Robotics, image and video processing, intelligent video analytics, self-driving cars, medical imaging, machine vision, and industrial inspection are among examples of such applications (Véstias et al. 2020).

Edge computing can be applied to devices directly connected to sensors, to routers or gateways that transfer data, or to small servers installed locally in a closet. There is an increasing number of edge computing use cases, as well as smart devices capable of performing various activities at the edge. The range of applications for edge computing is expanding in tandem with the development of AI capabilities, and applications spanning a wide range of domains utilise edge computing (Xu et al. 2020). Additionally, there is a good deal of overlap among the various use cases for edge computing. In particular, edge computing functionality in traffic management systems is closely related to that of autonomous vehicles, as briefly discussed below:

  (a) Industrial infrastructure

    Predictive maintenance and failure-detection management in industry are supported by edge computing. The capability detects when a machine or component is about to break down, enabling factory workers to fix the issue or replace the part in advance and save money by preventing lost output. The architecture of edge computing can handle large amounts of data from sensors and programmable logic controllers, as well as facilitate effective communications across extremely complex supervisory control and data acquisition systems.

  (b) Retail

    Huge amounts of data are produced by retail applications from different point-of-sale systems, item-stocking procedures, and other company operations. Edge computing can assist in analysing this vast quantity of data and locating problems that require quick resolution. Additionally, edge computing provides a way to handle consumer data locally, preventing it from leaving the client’s premises, an increasingly pressing privacy-regulation concern.

  (c) Healthcare

    In order to give medical practitioners precise, timely information about a patient’s status, the healthcare and medical industries gather patient data from sensors, monitors, wearable technology, and other devices. Edge computing solutions can provide dashboards with such data so users can see all the key indicators in one convenient place. AI-enabled edge computing solutions can recognise anomalous data, allowing medical personnel to respond to patient requirements quickly and with minimal false alarms. Furthermore, edge computing devices can aid in addressing concerns related to patient confidentiality and data privacy by processing data locally.

  (d) Global energy

    Cities and smart-grid systems can monitor public buildings and facilities for improved energy efficiency in areas like lighting, heating, and clean-energy use by using edge computing devices. For illustration: intelligent lighting controls utilise edge computing devices to regulate individual lights for optimal efficiency and public-space safety; embedded edge computing devices in solar fields detect changes in the weather and adjust their position; and wind farms use edge computing to send sensor data to substations and link to cell towers.

  (e) Public transit systems

    Edge computing systems deployed in buses, passenger rail systems, and paratransit vehicles can collect and transmit only the data necessary to support in-vehicle activities and dispatcher insights in public transportation applications.

  (f) Travel transport utilities

    In order to increase convenience and safety, edge computing can control when traffic signals turn on and off, open and close additional lanes of traffic, make sure that communications are maintained in the event of a public emergency, and do other real-time tasks. The adoption of autonomous vehicles will be significantly influenced by sophisticated traffic management systems, as was previously indicated.

  (g) Advanced industries

    In advanced industries, vehicle sensors and cameras can feed data to edge computing devices, which make decisions in milliseconds without latency. This fast decision-making is necessary in autonomous vehicles for safety reasons. Self-parking apps and lane-departure warning are two examples of edge computing services that are already readily accessible. Furthermore, as more cars communicate with their surroundings, a quick and responsive network will be required. Electric vehicles require constant monitoring to support predictive maintenance, and edge computing can be used to manage data in this regard: it supports data aggregation and reports actionable data for maintenance and performance.

    A multitude of industries are investing in the applicability of edge devices, including travel, transport and logistics, cross-vertical, retail, public-sector utilities, global energy and materials, banking and insurance, infrastructure, and agriculture. Their share representation with respect to deployment across various edge computing devices is shown in Fig. 1a (Chabas et al. 2018): travel, transport and logistics holds the maximum share at 24.2%, followed by 13.1% for global energy markets, 10.1% for retail and advanced industries, and smaller shares for other industries. We have also compared hardware costs, in terms of minimum and maximum cost, of edge computing devices for the mentioned industries. The hardware value includes the opportunity across the tech stack on the basis of sensors, on-device firmware, storage, and processors. By 2025, edge computing-based devices represent a potential hardware value of $175 to $215 billion: approximately $35 to $43 billion for travel, transport and logistics, $32 to $40 billion for cross-vertical, $20 to $28 billion in the retail sector, $16 to $24 billion in public-sector utilities, $9 to $17 billion in global energy and materials, and $4 to $11 billion in infrastructure and agriculture, as depicted in Fig. 1b (Chabas et al. 2018).

    There is a dire need to focus on advancing the development of lightweight object detection models to boost their deployability in heterogeneous edge devices. This survey analyses state-of-the-art deep learning-based lightweight object detection models in order to attain excellent performance on edge devices. With equivalent accuracy, powerful lightweight object detection models offer these advantages (Kim et al. 2023; Huang et al. 2022):

    (1) Lightweight object detection models based on deep learning require less communication during distributed training across edge devices.

    (2) Less bandwidth will be needed to export a cutting-edge detection model from the cloud to a particular application.

    (3) Deploying lightweight detectors on Field Programmable Gate Arrays (FPGAs) and other hardware with limited resources is more practical.

    Fig. 1

    a Share representation of various industries embedded in edge computing devices. b Comparison of hardware costs in case of edge computing devices

1.1 Motivation

Object detection is the core concept in deploying innovative edge device-based applications such as face detection (Li et al. 2015), object tracking (Nguyen et al. 2019), video surveillance (Yang and Rothkrantz 2011), pedestrian detection (Brunetti et al. 2018), etc. The powerful capabilities of deep learning boost the performance of object detection in these applications. Generic deep learning-based object detection models have computational complexities such as extensive use of platform resources, high bandwidth consumption, and large data processing pipelines (Jiao et al. 2019; Zhao et al. 2019). A detection network might use three orders of magnitude more Floating Point Operations (FLOPs) than a classification network, making its deployment on an edge device much more difficult (Ren et al. 2018). Generic deep object detectors often use more network layers, which eventually require extensive parameter tuning. Deeper networks also find it harder to detect small targets because position and feature information is lost through successive layers. Finally, excessively large network parameters can damage a model’s effectiveness and make it challenging to implement on smart mobile terminals.

For the development of lightweight object detection on edge devices, a comprehensive assessment of the research directions related to this topic is necessary, particularly for researchers interested in pursuing this line of inquiry. Assessing the usefulness of deep learning-based lightweight object detection on edge devices requires more than a basic review of the literature, and the proposed survey can meet these objectives by offering a comprehensive examination of the literature. To the best of our knowledge, no recent evaluation of deep learning-based lightweight detection exists in the literature. There are generic and application-specific surveys dedicated to deep learning-based object detectors (Jiao et al. 2019; Zou et al. 2023; Liu et al. 2020a, 2020b, 2020c, 2020d; Mittal et al. 2020; Han et al. 2018; Zhou et al. 2021a, 2021b), but no consolidated study specifically targets lightweight detectors for edge devices, as shown in Table 1. To raise readers’ understanding of this developing subject, deep learning-based lightweight object detectors on edge devices are investigated in this work. The release of this study will advance research on deep learning-based lightweight object detection models with regard to various backbone architectures and diverse applications on edge devices. The key objectives of the survey are as follows:

  • To provide taxonomy of deep learning-based lightweight object detection algorithms on edge devices

  • To provide an analysis of deep learning-based lightweight backbone architectures for edge devices

  • To present literature findings on applications deployed through lightweight object detectors

  • To compare lightweight object detectors by analyzing results on leading detection datasets

Table 1 Comparison of existing object detection related publications with proposed work

The organization of the research paper is as follows: Sect. 2 elaborates on work related to the development of deep learning-based object detectors, which are further categorized into two-stage, one-stage, and advanced-stage detectors. Section 3 describes materials and methods for deep learning-based lightweight detection models on edge devices, including architectural details related to training and inference of lightweight models, as well as crucial properties and performance milestones of lightweight object detection methods. Section 4 discusses backbone architectures commonly utilized in deep learning-based lightweight object detection models, presents applications of lightweight object detection models, and provides recommendations for designing powerful deep learning-based lightweight models. The final section brings the entire study to a close and outlines some crucial implications for further research.

2 Background

Recent developments in deep learning-based object detectors have mostly concentrated on raising state-of-the-art accuracy on benchmark datasets, which has caused an explosion in model size and parameters. Researchers, on the other hand, have shown interest in proposing lighter, smaller, and smarter networks that minimise parameters while keeping cutting-edge performance (Nguyen et al. 2023). The next section provides a brief summary of the categorization of generic object detection models.

2.1 Taxonomy of deep learning-based object detectors

In recent years, there has been rapid and successful expansion of the lightweight object detection research domain, which has grown by adopting the latest machine learning and deep learning methods and by developing new representations. Generic deep learning-based object detection models are classified into two-stage, one-stage, and advanced-stage models, each based on different concepts.

2.1.1 Two-stage object detection models

Two-stage algorithms have two distinct stages: region proposal and detection head. The first stage calculates Region of Interest (RoI) proposals using anchors in external region proposal techniques such as Edge Box (Zitnick and Dollár 2014) or Selective Search (Uijlings et al. 2013). The second stage processes the extracted RoIs into final bounding boxes, coordinate values, and class labels. Examples of two-stage algorithms include Faster RCNN (Ren et al. 2015), Cascade RCNN (Cai and Vasconcelos 2018), R-FCN (Dai et al. 2016), etc. The advantages of two-stage object detectors include better analysis of objects through the given stages, a multi-stage architecture that regresses bounding box values efficiently, and better handling of class imbalance in datasets. Two-stage detectors adopt a deep neural Region Proposal Network (RPN) and a detection head. Even though Light-Head R-CNN (Li et al. 2017) used a lightweight detection head, the backbone and detection part become imbalanced when the detection part is combined with a small backbone; this mismatch increases the danger of overfitting and causes redundant calculation.
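As a concrete illustration of the two-stage pipeline, the sketch below runs inference with torchvision's pretrained Faster R-CNN, where the RPN and detection head execute within a single forward call; it assumes a recent torchvision (0.13 or later) with downloadable weights, and is illustrative rather than a lightweight configuration.

```python
# A minimal sketch of two-stage detection inference with torchvision's
# pretrained Faster R-CNN; assumes torchvision >= 0.13 with bundled weights.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # RPN + RoI detection head
model.eval()

image = torch.rand(3, 480, 640)  # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    # Stage 1: the RPN proposes RoIs; stage 2: the head classifies each RoI
    # and regresses final boxes. Both stages run inside this single call.
    predictions = model([image])[0]

print(predictions["boxes"].shape, predictions["labels"], predictions["scores"])
```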

2.1.2 One-stage object detection models

Two-stage detectors helped deep learning-based object detection get off to a good start, but these systems struggled with speed. Owing to their flexibility in satisfying demands such as high speed and minimal memory needs, one-stage detectors were ultimately adopted by researchers. One-stage algorithms eliminated the region proposal stage of two-stage detectors by treating object detection as a regression problem. Instead of sending portions of the image to the CNN, the entire image is sent at once over a fixed grid, with anchors assisting in identifying candidate regions, and bounding box coordinates are regressed to localize the detected area in the image. Examples of one-stage detectors include YOLO (Redmon et al. 2016), SSD (Liu et al. 2016), RetinaNet (Lin et al. 2017a, 2017b), etc. The YOLO series outperforms two-stage models in terms of efficiency and accuracy.

2.1.3 Advanced-stage object detection models

The recently emerged advanced-stage object detectors removed the anchor concept of one-stage detectors. The advanced detector CornerNet (Law and Deng 2018) detected objects as paired keypoints, introducing a new corner pooling layer to better localize corners. CenterNet (Duan et al. 2019) detected each object as a triplet, rather than a pair, of keypoints. FoveaBox (Kong et al. 2020a, 2020b) predicted category-sensitive semantic maps and a category-agnostic bounding box for each object. Advanced-stage detectors still struggle to locate multiple small targets against complex backgrounds and can suffer from slow detection speed. One-stage methods (Bochkovskiy et al. 2020; Qin et al. 2019) utilized predefined anchor boxes, while anchor-free concepts (Duan et al. 2019) predict bounding boxes without them.

2.1.4 Light-weight object detection models

Lightweight object detectors are those with low computation in terms of bandwidth and resource utilization; a few examples include ThunderNet (Qin et al. 2019), PP-YOLO (Long et al. 2020a, 2020b), YOLObile (Cai et al. 2021), Trident-YOLO (Wang et al. 2022a, 2022b, 2022c, 2022d), YOLOv4-tiny (Jiang et al. 2020), Trident FPN (Picron and Tuytelaars 2021), etc.

The deep learning-based object detection algorithms categorized into two-stage, one-stage, advanced-stage, and lightweight detectors are highlighted in Fig. 2. Algorithms such as Faster RCNN (Ren et al. 2015), Mask RCNN (He et al. 2017), Cascade RCNN (Cai and Vasconcelos 2018), FPN (Lin et al. 2017a, 2017b), and R-FCN (Dai et al. 2016) fall under two-stage detectors, whereas YOLO (Redmon and Farhadi 2018), SSD (Liu et al. 2016), RefineDet (Zhang et al. 2018a, 2018b), and RetinaNet (Lin et al. 2017a, 2017b) fall under one-stage detectors. Advanced object detectors such as CornerNet (Law and Deng 2018), Objects as Points (Zhou et al. 2019a), and FoveaBox (Kong et al. 2020a, 2020b) are listed in Fig. 2. However, the algorithms listed above often include a large number of channels and convolutional layers, which demand substantial computing power and hinder deployment on edge devices. The deep learning-based lightweight object detectors presented in Fig. 2 are specifically designed for contexts with limited resources. Due to their efficiency and compactness, the one- and advanced-stage detector pipeline is the industry standard for designing lightweight object detectors.

Fig. 2

Taxonomy of recent deep learning-based object detection algorithms

3 Deep learning-based lightweight object detection models for edge devices

Numerous computer vision tasks, such as autonomous driving, robot vision, intelligent transportation, industrial quality inspection, and object tracking, have used deep learning-based object detection to a large extent. Deep models typically improve performance, but their resource-intensive networks constrain the deployment of real-world applications onto edge devices. Lightweight mobile object detectors have drawn growing research interest as a solution to this issue, with the goal of creating highly effective object detection. Deep learning-based lightweight object detectors have recently been developed for situations with limited computing resources, such as mobile devices.

The necessity to execute backbone designs on edge devices with constrained memory and processing power stimulates research and development of deep learning-based lightweight object detection models. A number of efficient lightweight backbone architectures have been proposed in recent years, for example, MobileNet (Howard et al. 2017), ShuffleNet (Zhang et al. 2018a, 2018b), and DetNAS (Chen et al. 2019). However, all these architectures depend heavily on widely deployed depth-wise separable convolution-based methodologies (Ding et al. 2019). In the following sections, we describe the methodology and each component of deep learning-based lightweight object detection models in depth; these lightweight models were heavily influenced by existing simple and complex object detection models. We give an architectural breakdown of deep learning-based lightweight object detection models in the following section.

3.1 Architecture methodology of lightweight object detection models

The building blocks of deep learning-based lightweight object detection algorithms on edge devices consist of four components: input, backbone, neck, and detector head. The definition and details of each component are tabulated in Table 2. The input to a lightweight object detector is an image, patch, or pyramid, initially fed into a lightweight backbone architecture such as CSPDarkNet (Redmon and Farhadi 2018), ShuffleNet (Zhang et al. 2018a, 2018b), MobileNet (Qian et al. 2021), or PeleeNet (Wang et al. 2018) for the calculation of feature maps. The backbone is the part of the architecture that converts an image into feature maps; it may be a pre-trained network or a neural network built from scratch for feature extraction. The neck then transforms the resulting feature maps, connecting the backbone to the detector head and producing the feature vector required for the detection challenges of the target application. The lightweight detector head can be visualized as a deep neural network focusing on the extraction of RoIs. A pooling layer fixes the size of the calculated RoIs to compute the final features of the detected objects, which are then passed to classification and regression loss functions to assign class labels and regress the coordinate values of bounding boxes. This whole process is repeated until the final regressed bounding box values and class labels are obtained.

As presented in Fig. 3, a deep learning-based lightweight object detector consists of three parts: backbone architecture, neck components, and lightweight prediction head (see the structural sketch below). Input images are fed to the backbone, whose architecture converts them into feature maps; for lightweight models, the backbone should be chosen from the categories given in Table 2. The backbone is built from a fundamental convolutional module of Conv2D + batch normalization + ReLU activation. By eliminating redundant gradient information from CNN optimization and integrating gradient modifications into the feature map, it lowers input parameters and model size (Wang et al. 2020a, 2020b, 2020c). In the bottleneck cross stage partial darknet model, for instance, a 640 × 640-pixel image is divided into four 320 × 320-pixel images, which are then combined to form a 320 × 320-pixel feature map; applying 32 convolutional kernels yields a 320 × 320 × 32 feature map. Additionally, an SPP module is included to aggregate features of various sizes and increase the network’s receptive field. The neck alters the feature maps by enhancing the information flow between the backbone architecture and the detection head. The neck PANet is built on an FPN topology, utilized to provide strong semantic features from top to bottom (Wang et al. 2019); FPN layers from bottom to top also convey important positional features.
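The structural sketch below mirrors this backbone, neck, and head data flow in code; the layer widths and module choices are illustrative placeholders, not a specific published lightweight detector.

```python
# A minimal structural sketch of the backbone -> neck -> head data flow
# described above; module sizes are illustrative placeholders.
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    """The fundamental Conv2D + batch normalization + ReLU module."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

class TinyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        # Backbone: converts the input image into feature maps.
        self.backbone = nn.Sequential(
            ConvBNReLU(3, 32, s=2), ConvBNReLU(32, 64, s=2),
            ConvBNReLU(64, 128, s=2),
        )
        # Neck: transforms feature maps for the detection head.
        self.neck = ConvBNReLU(128, 128, k=1)
        # Head: per-location class scores plus 4 box offsets per anchor.
        self.head = nn.Conv2d(128, num_anchors * (num_classes + 4), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.rand(1, 3, 320, 320))
print(out.shape)  # [1, anchors*(classes+4), 40, 40]
```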

Table 2 Building blocks of deep learning-based lightweight object detectors
Fig. 3

Methodology of deep learning-based lightweight object detection model

Furthermore, PANet encourages the transmission of low-level characteristics and the use of precise localization signals in the bottom layers, which improves the positional accuracy of the target object. The prediction layer, sometimes referred to as the detection layer, creates multiple feature maps to accomplish multiscale prediction, enabling the model to classify and detect objects of various sizes. Each feature map accordingly yields regression bounding boxes at each position. The predicted output of the model, with bounding boxes, is then shown as the detection result. The three steps mentioned above combine to form the training pipeline of the lightweight object detection model. After model training, test data is passed through to obtain the fine-tuned lightweight model, as shown in Fig. 3. Parameters in the context of deep learning-based lightweight models are discussed below:

  (a) Training

    To train an edge-cloud-based deep learning model, edge devices and cloud servers must share model parameters and other data, and the larger the training model, the more data must be transferred between them. A number of methods have been put forth to lower the cost of communication during training, including Edge Stochastic Gradient Descent (eSGD), which can reduce a CNN model’s gradient size by up to 90% by communicating only the most important gradients (see the sparsification sketch after this list), and intermediate edge aggregation prior to federated learning server aggregation. The two main components of training deep learning-based lightweight detection models are the ability to exit before the input data completes a full forward pass through each layer of a neural network distributed over heterogeneous nodes, and the use of binarized neural networks to reduce memory and compute load on resource-constrained end devices (Koubaa et al. 2021; Dey and Mukherjee 2018).

    Researchers have created a novel architecture known as Agile Condor that carries out real-time computer vision tasks using machine learning methods; at the network edge, close to the data sources, Agile Condor can be utilised for autonomous target detection (Isereau et al. 2017). Precog is a new method that lowers latency for mobile applications through prefetching and caching: it anticipates the subsequent classification request and uses end-device caching to store essential portions of a trained classifier. As a result, fewer offloads to the cloud occur, and edge servers calculate the likelihood that linked end devices will make a request in the future. These pre-fetched modules function as smaller models that minimise network traffic and cloud processing while accelerating inference on the end devices (Drolia et al. 2017). Another example is ECHO, a feature-rich, thoroughly tested framework for implementing data analytics in a distributed hybrid Edge-Fog-Cloud configuration. ECHO offers services such as virtualized application status monitoring, resource discovery, deployment, and interfaces to data analytics runtime engines (Ogden and Guo 2019).

  (b) Inference

    When feasible, distributed deep network designs enable deployment on edge-cloud infrastructure to support local inference on edge devices. A distributed neural network model’s ability to function effectively depends on minimising inter-device communication costs. Inference on the end-edge-cloud architecture is a dynamic problem because of evolving network conditions (Subedi et al. 2021); static methods such as remote-only or on-device-only inference are not optimal either. Ogden and Guo have created a distributed architecture that provides a flexible answer to this problem for mobile deep inference. A centralised model manager houses many deep learning models, and the inference environment (memory, bandwidth, and power) is used to dynamically determine which model should run on which device. If resources are scarce in the inference environment, one of the compressed models may be employed; if not, an uncompressed model with higher accuracy is used. Edge servers handle remote inference when networks are sluggish.

  (c) Privacy and security

    Edge devices can be used to filter personally identifiable information prior to data transfer in order to enhance user privacy and security when processing data remotely (Xu et al. 2020; Hu et al. 2023a, 2023b). Since data generated by end devices is not collected in a central location, training deep learning models across several edge devices in a distributed way yields more privacy. Personally identifiable information in photographs and videos can be removed at the edge before upload to an external server, further enhancing user privacy. The privacy of critical training data becomes an issue when training is conducted remotely. To ensure local and global privacy, it is imperative to watch for any decline in accuracy, keep computing overheads low, and provide resilience to communication errors and delays (Abou et al. 2023; Makkar et al. 2021).
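As referenced in the training discussion above, the sketch below illustrates eSGD-style gradient sparsification under the simplifying assumption that "most important" means largest magnitude; keeping 10% of entries corresponds to the roughly 90% reduction in communicated gradient size mentioned earlier. The function name and shapes are illustrative.

```python
# A minimal sketch of magnitude-based (eSGD-style) gradient sparsification:
# only the top-k gradient entries are communicated to the server.
import torch

def sparsify_topk(grad, keep_ratio=0.1):
    """Return (indices, values) of the largest-magnitude gradient entries."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    _, idx = torch.topk(flat.abs(), k)
    return idx, flat[idx]

grad = torch.randn(1000)              # a layer's gradient on the edge device
idx, vals = sparsify_topk(grad, 0.1)  # ship only 100 of 1000 entries
dense = torch.zeros_like(grad)        # server-side reconstruction
dense[idx] = vals
```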

3.2 Comprehensive analysis of lightweight object detection models

Small, portable object detectors capable of highly effective detection have garnered increasing scientific attention. With the use of efficient components and compression techniques such as pruning, quantization, and hashing, the effectiveness of deep learning-based lightweight object detection models has grown. Distillation, in which a large network is used to train smaller models, has produced some surprising results as well. A comprehensive list containing details of deep learning-based lightweight object detection models of recent years is presented in Tables 3 and 4, where lightweight detectors are categorized as anchor-based or anchor-free. Anchor-based methods are the mechanism for extracting RoIs employed in object detection models such as Fast R-CNN (Girshick 2015): anchor boxes of various scales, which can be viewed as a priori RoIs, are used when regressing bounding box coordinate values. Detectors including YOLOv2 (Redmon and Farhadi 2017), YOLOv3 (Redmon and Farhadi 2018), YOLOv4 (Bochkovskiy et al. 2020), RetinaNet (Lin et al. 2017a, 2017b), RefineDet (Zhang et al. 2018a, 2018b), EfficientDet (Tan et al. 2020), Faster R-CNN (Ren et al. 2015), Cascade R-CNN (Cai and Vasconcelos 2018), and TridentNet (Li et al. 2019), belonging to the one- and two-stage families, have an anchor mechanism to elevate detection performance (see the anchor-generation sketch below). Besides, anchor-free detectors have recently received more attention in academia and research, with a large number of new anchor-free methods being proposed. Earlier works such as YOLOv1 (Redmon et al. 2016), DenseBox (Huang et al. 2015), and UnitBox (Yu et al. 2016) can be considered early anchor-free detectors. In anchor-free methods, anchor points and keypoints are utilized to perform detection. The former approach performs object bounding box regression based on anchor points instead of anchor boxes, as in FCOS (Detector 2022) and FoveaBox (Kong et al. 2020a, 2020b), whereas the latter reformulates object detection as a keypoint localization problem, as in CornerNet (Law and Deng 2018; Law et al. 2019), CenterNet (Duan et al. 2019), ExtremeNet (Zhou et al. 2019b), and RepPoints (Yang et al. 2019). By eliminating the restrictions of handcrafted anchors, anchor-free techniques show great promise for handling extremely large and small objects. The anchor-based detectors shown in Table 3 can compete with some newly proposed anchor-free lightweight object detectors in terms of performance. Further, input image type, code link, and published sources are also mentioned in Table 3, while Table 4 reports crucial milestones such as AP, description, and loss function for each deep learning-based lightweight detector.
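To make the anchor mechanism concrete, the sketch below generates a priori anchor boxes over a feature-map grid in the generic style used by anchor-based detectors; the stride, scales, and aspect ratios are illustrative assumptions, not those of any specific detector in Table 3.

```python
# A minimal sketch of anchor-box generation: each feature-map cell receives
# a priori boxes at several scales and aspect ratios, which the detection
# head then regresses into final coordinates.
import itertools
import torch

def make_anchors(fm_h, fm_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y, x in itertools.product(range(fm_h), range(fm_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center (pixels)
        for s, r in itertools.product(scales, ratios):
            w, h = s * r ** 0.5, s / r ** 0.5  # area ~ s*s, aspect ratio r
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)  # (H*W*scales*ratios, 4) in xyxy format

print(make_anchors(40, 40, stride=8).shape)  # 40*40*6 = 9600 anchors
```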

Table 3 A comprehensive list containing details of deep learning-based lightweight object detection models
Table 4 A list of crucial properties and performance milestone of lightweight object detection methods

Tiny-DSOD (Li et al. 2018), a lightweight object detector inspired by the deeply supervised object detection framework, has been proposed for resource-constrained applications. With only 0.95 M parameters and 1.06B FLOPs, it uses a depth-wise dense block as the backbone architecture and a depth-wise FPN in the neck, which is by far the most advanced result at such a small resource demand. ThunderNet (Qin et al. 2019), a lightweight two-stage detector, uses a context enhancement module and a spatial attention module as backbone architectural blocks to produce more discriminative feature representations, together with an efficient RPN in a compact detection head. ThunderNet outperforms earlier lightweight one-stage detectors, operating at 24.1 frames per second with 19.2 AP on COCO on an ARM-based smartphone. One of the most recent cutting-edge lightweight object detection algorithms, PP-YOLO (Long et al. 2020a, 2020b), employs MobileNetV3 (Qian et al. 2021), a practical backbone architecture for edge devices. The depth-wise separable convolutions used by PP-YOLO tiny’s detection head make it better suited for mobile devices; PP-YOLO tiny adopts the optimisation techniques of the PP-YOLO algorithms but omits techniques that have a large impact on model size and performance. Block-punched pruning and a mobile acceleration unit with a mobile GPU-CPU collaboration approach are provided by YOLObile (Cai et al. 2021). Trident-YOLO (Wang et al. 2022a, 2022b, 2022c, 2022d) is an upgrade to YOLOv4-tiny (Jiang et al. 2020), designed for mobile devices with limited computing power. In the neck, Trident FPN (Picron and Tuytelaars 2021) improves the recall and accuracy of basic object recognition methods by reorganising the network topology of the neck components. Trident-YOLO proposes fewer cross-stage partial RFBs and smaller cross-stage partial SPPs, enlarging the receptive field of the network with the fewest FLOPs; conversely, Trident-FPN significantly enhances lightweight object detection performance by adding a limited number of FLOPs and producing a multi-scale feature map. To simplify computation, YOLOv4-tiny (Jiang et al. 2020) uses two ResBlock-D modules in place of two CSPBlock modules of the ResNet-D network. To extract more feature information about the object, such as global features, channel, and spatial attention, it also creates an auxiliary residual network block with consecutive 3 × 3 convolutions, used to obtain 5 × 5 receptive fields with the goal of reducing detection error. Optimizing the original YOLOv4 (Bochkovskiy et al. 2020), Slim YOLOv4 (Ding et al. 2022) changes the backbone architecture from CSPDarknet53 to MobileNetv2 (Sandler et al. 2018); separable convolutions and depth-wise over-parameterized convolutional layers were chosen to minimize computation and enhance the performance of the detection network. Based on YOLOv2 (Redmon and Farhadi 2017; Wang et al. 2022a), YOLO-LITE (Huang et al. 2018; Wang et al. 2021a) offers a quicker, more efficient lightweight variant for mobile devices: with only 7 layers and 482 million FLOPS, it runs at roughly 21 frames per second on a PC without a GPU and 10 frames per second when used on a website. Fully Convolutional One-Stage (FCOS) object detection (Detector 2022) addresses the issue of label overlap within the ground-truth data.
Unlike previous anchor-free detectors, FCOS requires no complex hyper-parameter adjustment. Large-scale server detectors constitute the majority of anchor-free detectors in general; the two notable anchor-free mobile device detectors are NanoDet (Li et al. 2020a, 2020b) and YOLOX-Nano (Ge et al. 2021). The issue is that compact anchor-free detectors typically struggle to strike a good balance between efficiency and accuracy. To choose positive and negative samples, the FCOS-style method NanoDet employs Adaptive Training Sample Selection (ATSS) (Zhang et al. 2020a, 2020b, 2020c) and uses generalised focal loss as the loss function for classification and bounding box regression. This loss function eliminates the centerness branch of FCOS and the numerous convolutions on that branch, which lowers the computational cost of the detection head. A lightweight detector dubbed L-DETR (Li et al. 2022a), based on DETR and PP-LCNet, balances efficiency and accuracy. With the new backbone, L-DETR has fewer parameters than DETR. It is utilised to compute the overall data and arrive at the final prediction; its normalisation and FFN are enhanced, raising the precision of frame detection. In Table 5, some well-known metrics for evaluating the performance of lightweight object detection models are highlighted. FLOPs are frequently used to determine how computationally complex deep learning models are: they provide a quick and simple method of figuring out how many arithmetic operations are needed to complete a particular computation, and thus offer helpful insight into computational cost and energy consumption, which is particularly relevant for edge computing. As highlighted, YOLOv7-x has the highest FLOPs, 189.9G, among the mentioned detectors. Network latency, or inference time, is one of the more important aspects of deploying a deep network architecture: the majority of real-world applications need inference times that are quick, from a few milliseconds to a second. Measuring a neural network’s inference time accurately requires in-depth knowledge. The inference time is the time a deep learning algorithm takes to process fresh input and produce a prediction; the number of layers, the complexity of the network, and the number of neurons per layer can all impact this time, and inference times typically rise with network complexity and scale. In our analysis, YOLOv3-Tiny has the lowest inference time at 4.5 ms. Frames Per Second (FPS) measures how rapidly a deep learning model can handle frames, specifying how quickly an object detection model processes photos and videos to produce the desired results; YOLOv4-Tiny has the highest FPS among those presented in Table 5. Weights and biases are the model parameters in deep learning, characteristics of the training data learned throughout the training process; the total number of parameters, a common indicator of model size, is the sum of all weights and biases in the neural network. YOLOX-Nano has the fewest learnable parameters of those compared.
Moreover, for each lightweight object detector, a prediction regarding its deployability in real-time applications has been made on the basis of the AP values highlighted in Table 4. MobileNet-SSD, MobileNetV2-SSDLite, Tiny-DSOD, Pelee, YOLO-Lite, MnasNet-A1 + SSDLite, YOLOv3-Tiny, NanoDet, and Mini YOLO are not efficient when deployed.
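As a practical companion to these metrics, the sketch below measures average latency, derives FPS, and counts parameters for a PyTorch model; a torchvision MobileNetv3 classifier stands in for a detector here, and the warm-up plus CUDA synchronization steps reflect the care the text notes is needed for accurate inference-time measurement.

```python
# A minimal sketch of latency/FPS/parameter measurement for a PyTorch model.
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small().eval()
x = torch.rand(1, 3, 320, 320)

with torch.no_grad():
    for _ in range(10):                  # warm-up (allocator, autotuning)
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # flush queued asynchronous kernels
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"latency: {latency_ms:.1f} ms   FPS: {1000 / latency_ms:.1f}")
params = sum(p.numel() for p in model.parameters())
print(f"parameters: {params / 1e6:.2f} M")
```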

Table 5 Performance parameters of deep learning-based lightweight object detectors

Additionally, in recent years, one-stage YOLO-based lightweight object detectors have been developed, as listed in Table 6. In 2024, DSP-YOLO (Zhang et al. 2024) and YOLO-NL (Zhou 2024) emerged but are not yet ready to be deployed in real-life applications. On the contrary, EL-YOLO (Hu et al. 2023a, 2023b), YOLO-S (Betti and Tucci 2023), GCL-YOLO (Cao et al. 2023), Light YOLO (Yin et al. 2023), Edge YOLO (Li and Ye 2023), GL-YOLO-Lite (Dai and Liu 2023), and LC-YOLO (Cui et al. 2023) can be integrated into real-life applications. Further, we have added performance parameters in terms of FLOPs, inference time, FPS, and number of parameters for each of the latest YOLO-based lightweight object detectors. YOLO-S uses the fewest FLOPs (34.59B), Light YOLO has the highest FPS at 102.04, and GCL-YOLO has the fewest parameters, as depicted in Table 6.

Table 6 Performance parameters of latest YOLO-based lightweight object detectors

3.3 Backbone architecture for deep learning-based lightweight object detection models

Deep learning-based models for image processing advanced rapidly and effectively outperformed more conventional methods in object classification (Krizhevsky et al. 2012). The most effective deep learning architectures for object categorization have been Convolutional Neural Networks (CNNs), which function similarly to human brains and include neurons that react to their surroundings in real time (Makantasis et al. 2015; Fernández-Delgado et al. 2014). Well-known deep learning-based CNN architectures have been used as object classification feature extractors to fine-tune classifiers, with training performed by forward propagation from random seeds for the filters and parameters. However, due to severely resource-constrained conditions, notably in memory bandwidth, the development of specialised CNN architectures for lightweight object detection models has received less attention than expected. In this section, we summarize backbones, i.e., feature extractors, for deep learning-based lightweight object detection models. Backbone architectures extract the features for lightweight object detection tasks: an image is provided as input and a feature map is produced as output. Most backbone architectures for detection tasks are essentially networks for classification problems with the final fully connected layers removed. The DetNAS convolutional neural network is shown block-by-block in Fig. 4 to help understand how backbone architectures function in the context of lightweight object detection models. The blue and green blocks are ShuffleNetv2 5 × 5 and 7 × 7 blocks, the pink blocks are ShuffleNetv2 3 × 3 blocks, and the peach-coloured blocks are Xception ShuffleNetv2 blocks (Ma et al. 2018). Each stage has eight blocks, and the total number of blocks is forty. In the lightweight DetNAS architecture, large-kernel blocks are found in low-level layers while deep blocks are found in high-level layers: blocks with large kernels (5 × 5, 7 × 7) are present in the low-level layers of stages 1 and 2, whereas stages 3 and 4 comprise peach and pink blocks, as shown in the centre of Fig. 4. Six of these eight blocks, the Xception ShuffleNetv2 blocks, are deeper than standard 3 × 3 ShuffleNetv2 blocks. These observations lead to the conclusion that lightweight object detection networks differ visually from conventional detection and classification networks. In the next section, a brief introduction to deep learning-based lightweight backbone architectures is given:

Fig. 4

Architectural details of backbone architecture DetNaS (Chen et al. 2019)

3.3.1 MobileNet (Howard et al. 2017)

MobileNet created an efficient network architecture made up of 28 layers based on depth-wise separable convolutions, factorising a standard convolution into a depth-wise convolution and a 1 × 1 point-wise convolution. By applying separate kernels per channel, isolating the filtering, and merging the features with a point-wise convolution, depth-wise separable convolution reduces computing cost and model size (see the sketch below). Two additional model-shrinking hyperparameters, the width and resolution multipliers, were introduced to improve performance and reduce the size of the model. The model’s oversimplification and linearity, which resulted in fewer channels for gradient flow, were corrected in later versions.
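The sketch below gives a minimal version of this factorisation and compares the multiply-accumulate cost per output position with that of a standard convolution; the channel counts are illustrative.

```python
# A minimal sketch of a depth-wise separable convolution: a per-channel 3x3
# depth-wise convolution followed by a 1x1 point-wise convolution.
import torch.nn as nn

def separable_conv(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in),  # depth-wise
        nn.Conv2d(c_in, c_out, 1),                              # point-wise
    )

# Multiply-accumulate cost per output position:
#   standard conv:  k*k*c_in*c_out
#   separable conv: k*k*c_in + c_in*c_out
c_in, c_out, k = 128, 256, 3
print((k * k * c_in + c_in * c_out) / (k * k * c_in * c_out))  # ~0.115x cost
```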

3.3.2 MobileNetV2 (Sandler et al. 2018)

The inverted residual with linear bottleneck, a novel module, was added in MobileNetv2 to speed up calculations and improve accuracy. MobileNetv2 consists of two convolutional layers followed by 19 bottleneck modules. The computationally efficient MobileNetv2 feature extractor was adopted by the SSD authors for object detection; the resulting detector, known as SSDLite, touted an 8 × reduction in parameters with respect to the original SSD. It is simple to construct, generalises well to different datasets, and consequently garnered positive feedback from the community.
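Below is a minimal sketch of the block under the usual expansion factor of 6; the channel count and feature-map size are illustrative, and the normalization details follow common practice rather than the exact published configuration.

```python
# A minimal sketch of an inverted residual with linear bottleneck: a 1x1
# expansion, a 3x3 depth-wise convolution, and a linear (no activation)
# 1x1 projection, with a skip connection over the narrow representation.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c, expand=6):
        super().__init__()
        hidden = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c, 1, bias=False),
            nn.BatchNorm2d(c),  # linear bottleneck: no activation here
        )

    def forward(self, x):
        return x + self.block(x)

print(InvertedResidual(32)(torch.rand(1, 32, 56, 56)).shape)
```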

3.3.3 MobileNetv3 (Qian et al. 2021)

In MobileNetv3, unneeded portions of the network were iteratively removed during an automated platform-aware search in a factorised hierarchical search space, and the resulting design proposal was then modified to improve the desired metrics. Since the architecture’s filters regularly mirror one another, accuracy can be maintained even if half of them are discarded, which reduces the need for further processing. MobileNetv3 combined the hard-swish and ReLU activation functions, hard-swish being a computationally more efficient approximation of swish that preserves accuracy.

3.3.4 ShuffleNet (Zhang et al. 2018a, 2018b)

According to the authors, many efficient networks lose their effectiveness as they scale down because of expensive 1 × 1 convolutions. ShuffleNet is a computationally effective neural network design created especially for mobile devices. To overcome the issue of restricted information flow, it uses group convolution along with channel shuffling (see the sketch below). The ShuffleNet unit, like the ResNet block, substitutes a point-wise group convolution for the 1 × 1 layer and a depth-wise convolution for the 3 × 3 layer.
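The channel shuffle operation itself is a small tensor permutation; the sketch below shows it on a tiny tensor so the cross-group interleaving is visible.

```python
# A minimal sketch of channel shuffle, which lets information cross between
# the channel groups produced by group convolutions.
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    # (N, groups, C/groups, H, W) -> swap the group and channel axes.
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())  # 0, 4, 1, 5, 2, 6, 3, 7
```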

3.3.5 ShuffleNetv2 (Ma et al. 2018)

ShuffleNetv2 advocated using speed or latency as a direct measure of computational complexity rather than FLOPs or other indirect metrics. Four guiding principles served as its foundation: equal channel width to lower memory access cost, group convolution chosen with the target platform in mind, restraint in multi-path designs (which boost accuracy but reduce parallelism), and restraint in element-wise operations. In this model, a channel split layer divides the input in half; one half passes through three convolutional layers, is concatenated with the residual link, and is then sent through a channel shuffle layer. ShuffleNetv2 outperformed other cutting-edge models of comparable complexity.

3.3.6 PeleeNet (Wang et al. 2018)

PeleeNet is an inventive and effective architecture based on conventional convolution, created using a number of computation-saving strategies. Its design comprises four iterations of modified dense and transition layers, followed by the classification layer. The two-way dense layer helps obtain distinct receptive-field scales, which makes it simpler to identify larger objects (see the sketch below), and a stem block minimises the loss of information. While PeleeNet’s performance did not match that of modern object detectors on mobile and edge devices in every respect, it demonstrated how even seemingly small design decisions can have a substantial impact on total performance.
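A minimal sketch of a two-way dense layer consistent with this description follows: one branch uses a single 3 × 3 convolution, the other stacks two 3 × 3 convolutions for a 5 × 5-equivalent receptive field, and both outputs are densely concatenated with the input. The bottleneck widths are illustrative assumptions rather than PeleeNet's exact configuration.

```python
# A minimal sketch of a two-way dense layer with distinct receptive fields.
import torch
import torch.nn as nn

class TwoWayDenseLayer(nn.Module):
    def __init__(self, c_in, growth=32):
        super().__init__()
        mid = growth // 2
        self.branch_a = nn.Sequential(      # 3x3 receptive field
            nn.Conv2d(c_in, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.branch_b = nn.Sequential(      # two 3x3 convs ~= 5x5 field
            nn.Conv2d(c_in, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Dense connectivity: keep the input and append new features.
        return torch.cat([x, self.branch_a(x), self.branch_b(x)], dim=1)

print(TwoWayDenseLayer(64)(torch.rand(1, 64, 32, 32)).shape)  # 96 channels
```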

3.3.7 mNASNet (Tan et al. 2019)

mNASNet was created using NAS (Neural Architecture Search) automation. It formulated the search problem as a multi-objective optimisation problem, with a dual focus on latency and accuracy. Unlike previous models that stacked identical blocks, it allowed individual blocks to be designed by factorising the search space: the CNN is divided into distinct blocks, and operations and connections are searched in each block separately. mNASNet was roughly twice as fast as MobileNetv2 and more accurate.

3.3.8 Once for all (OFA) (Cai et al. 2019)

In recent years, modern models have been constructed using NAS for architecture design; nonetheless, training each sampled model results in costly computation. The OFA network, in contrast, needs to be trained only once, after which sub-networks can be extracted from it based on the requirements. Such sub-networks can vary in the four key dimensions of a convolutional neural network: depth, width, kernel size, and input dimension. To make this possible, training uses a progressive shrinking scheme, in which the full OFA network is trained first and progressively smaller sub-networks are then fine-tuned.

3.3.9 MobileViT (Mehta and Rastegari 2021)

Combining the benefits of CNNs and Vision Transformers (ViT), MobileViT is a transformer-based backbone that is lightweight, portable, and compatible with edge devices. It successfully captures both short- and long-range dependencies by utilising a novel MobileViT block. Alongside the MobileViT block, MobileNetv2 modules (Sandler et al. 2018) are arranged in series. Unlike previous transformer-based networks, it treats transformers as convolutions, which implicitly incorporates spatial bias, so positional encoding is not necessary. MobileViT performed well on complex problems, supporting its claim to be a general-purpose backbone for various vision applications. Despite the constraints transformers face on mobile devices, it attained better accuracy with a smaller parameter budget.

3.3.10 SqueezeNet (Iandola et al. 2016)

SqueezeNet attempts to maintain network accuracy using techniques with fewer parameters. Its design strategies were replacing 3 × 3 filters with smaller 1 × 1 filters, decreasing the number of input channels to the remaining 3 × 3 filters, and placing the down-sampling layers late in the network. SqueezeNet’s core module, the Fire module, consists of a squeeze layer and an expand layer, each with a ReLU activation (see the sketch below). Eight Fire modules are stacked between the convolution layers to form the SqueezeNet architecture. SqueezeNet with residual connections, inspired by ResNet (He et al. 2016), was also developed and increased accuracy over the base model. SqueezeNet stands out as a serious contender for boosting the hardware efficiency of neural network topologies.
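A minimal sketch of the Fire module with illustrative channel counts: a 1 × 1 squeeze layer feeds parallel 1 × 1 and 3 × 3 expand filters whose outputs are concatenated.

```python
# A minimal sketch of a SqueezeNet-style Fire module.
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, c_in, squeeze, expand):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(c_in, squeeze, 1), nn.ReLU(inplace=True))
        self.expand1 = nn.Sequential(
            nn.Conv2d(squeeze, expand, 1), nn.ReLU(inplace=True))
        self.expand3 = nn.Sequential(
            nn.Conv2d(squeeze, expand, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)  # fewer channels -> cheaper 3x3 expand filters
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)

print(Fire(96, squeeze=16, expand=64)(torch.rand(1, 96, 55, 55)).shape)
```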

Table 7 elaborates the year and initial usage of each backbone architecture, along with the number of parameters, merits, and top-1 accuracy. According to research on deep learning-based backbone architectures, SqueezeNet (Iandola et al. 2016) and ShuffleNetv2 (Ma et al. 2018) are the most widely used lightweight backbones in edge devices today. The MobileNet series (Howard et al. 2017; Qian et al. 2021; Sandler et al. 2018) gradually enhances performance through depth-wise separable convolutions, inverted residual topologies with linear bottlenecks, and automatic complementary search structures.

Table 7 Commonly utilized backbone architectures in deep learning-based lightweight object detection methods

4 Performance analysis of deep learning-based lightweight object detectors

In this section, a comprehensive analysis is made of the above-discussed lightweight object detectors and their related backbone architectures. Deep learning-based lightweight object detectors strike a balance between accuracy and efficiency: although the lightweight detectors discussed in the previous sections have fast inference rates, their accuracy is not always up to par for some tasks. As shown in Fig. 5, which evaluates deep learning-based lightweight object detectors in terms of mAP on the MS-COCO dataset, YOLOv7-x performs best among the mentioned detectors. The backbone architectures in deep learning-based lightweight object detectors play a vital role in determining model accuracy, and convolutional architectures specifically designed for edge devices with limited bandwidth usage are an ideal choice for embedding in detection models. The top-1 accuracy comparison of deep learning-based lightweight backbone architectures in detection models is presented in Fig. 6. The backbone architecture ShuffleNetv2 attains 70.9% top-1 accuracy, a large jump from the SqueezeNet (Iandola et al. 2016) results. A marginal accuracy increase can be seen in architectures such as PeleeNet (Wang et al. 2018), DetNAS (Chen et al. 2019), mNASNet (Tan et al. 2019), and GhostNet (Han et al. 2020a, 2020b), but the recently emerged transformer-based architecture MobileViT (Mehta and Rastegari 2021) achieves the best state-of-the-art results. Moreover, for the years 2017 to 2023, we show a literature summary in terms of the number of publications for deep learning-based lightweight backbone architectures in Fig. 7. SqueezeNet, the most popular architecture, has been utilized in lightweight detectors over the years, as shown in Fig. 7, while the GhostNet (Paoletti et al. 2021) and MobileViT (Mehta and Rastegari 2021) backbone architectures have more literature in 2022 and 2023. As mentioned above, state-of-the-art object detection works are either accuracy-oriented, using a large model size (Ren et al. 2015; Liu et al. 2016; Bochkovskiy et al. 2020), or speed-oriented, using a lightweight model but sacrificing accuracy (Wang et al. 2018; Sandler et al. 2018; Li et al. 2018; Liu and Huang 2018). It is difficult for any existing lightweight detector to meet the accuracy and latency requirements of real-world applications on mobile and edge devices at the same time. Therefore, a mobile device solution that accomplishes both high accuracy and low latency is required to deploy lightweight object detection models.

Fig. 5

mAP Performance evaluation of major deep learning-based lightweight object detectors

Fig. 6 Accuracy comparison of deep learning-based lightweight backbone architectures in detection models

Fig. 7 Year-wise literature summary of backbone architectures for lightweight detection models

4.1 Benchmark detection databases for lightweight object detection models

In this section, the most popular datasets for deep learning-based lightweight object detectors are discussed. Datasets are essential for lightweight object detection because they enable standard comparisons of competing algorithms and the establishment of objectives for solutions.

4.1.1 PASCAL VOC (Everingham et al. 2010)

This is the most well-known object detection dataset. The PASCAL-VOC versions VOC2007 and VOC2012 are frequently used in papers. VOC2007 comprises 2501 training, 2510 validation, and 4952 testing images, whereas VOC2012 comprises 5717 training, 5823 validation, and 10,991 testing images. The PASCAL VOC datasets include around 11,000 images spread across 20 visual object classes, which can be divided into four broad categories: animals, vehicles, people, and domestic objects. Additionally, classes of objects with semantic similarities, such as trucks and buses, raise the complexity level for detection. Visit http://host.robots.ox.ac.uk/pascal/VOC/ to get the dataset.

4.1.2 MS-COCO (Lin et al. 2014)

MS-COCO (Microsoft Common Objects in Context) is a sizable image dataset containing 328,000 photographs of everyday objects and people. It is now one of the most popular and challenging object detection datasets. It has 897,000 tagged objects in 164,000 images across 80 categories; the training, validation, and testing sets contain 118,287, 5000, and 40,670 images, respectively. The distribution of objects in MS-COCO is more in line with real-world circumstances. No annotations are publicly available for the MS-COCO testing set. MS-COCO offers annotations for captioning, keypoints, panoptic segmentation, dense pose, and object detection. The dataset provides a wide range of realistic images, showing disorganized scenes with varied backgrounds, overlapping objects, and so on. The URL of the dataset is http://cocodataset.org.

4.1.3 KITTI (Geiger et al. 2013)

KITTI is a well-known dataset for traffic scene analysis and includes 7481 labelled photos for training and 7518 for testing. There are 100,000 pedestrian cases, 6000 IDs, and an average of one person per photograph. The human class in KITTI comprises two subclasses: pedestrians and cyclists. Based on how much the objects are occluded and truncated, the object labels are divided into easy, moderate, and hard levels. Models trained on KITTI are assessed using three criteria that differ in minimum bounding-box height and maximum occlusion level. Visit http://www.cvlibs.net/datasets/kitti/index.php to download the dataset.

We present the performance of deep learning-based lightweight detection models on the above datasets in Fig. 8. The lightweight object detector YOLOv4-dense achieves an mAP of 84.30 on KITTI and 71.60 on PASCAL VOC. The L4Net detector attains an mAP of 71.68 on KITTI, 82.30 on PASCAL VOC, and 42.90 on MS-COCO. The RefineDet-lite detector achieves an mAP of 26.80 on MS-COCO. Comparing performances further, FE-YOLO performs best on KITTI as presented in Fig. 8, whereas L4Net performs best on MS-COCO, and the lightweight YOLO-Compact detector outperforms the other detectors on PASCAL VOC.

Fig. 8 Performance evaluation of deep learning-based lightweight models on leading datasets

4.2 Evaluation parameters

Deep learning-based lightweight object detection models use the same evaluation criteria as generic object detection models. Accuracy is the proportion of objects predicted correctly out of all predictions made. When dealing with class-imbalanced data, where the number of instances is not equal for each class, the accuracy result can be quite deceptive, because it places more emphasis on learning the majority classes than the minority classes. Therefore, mean Average Precision (mAP), Frames Per Second (FPS), and the size of the model weight file serve as the primary evaluation indices for the effectiveness of a lightweight object detection model. The ground-truth labelling of each image provides the precise number of objects of each category in the image. Intersection over Union (IoU) quantifies the similarity between the ground-truth and predicted bounding boxes to evaluate how good the predicted bounding box is, as represented in Eq. (1):

$$IoU_{pred}^{truth} = \frac{truth \cap pred}{truth \cup pred}$$
(1)

The IoU value is calculated between each prediction box and the ground-truth data. Taking the largest IoU value and applying the IoU threshold, we can calculate the number of True Positives (TP) and False Positives (FP) for each object category in an image. From this, the Precision of each category is calculated according to Eq. (2):

$$Precision=\frac{TP}{TP+FP}$$
(2)

Once the correct number of TP is obtained, the number of False Negatives (FN) is accounted for through Recall, as in Eq. (3):

$$Recall = \frac{TP}{TP+FN}$$
(3)

By computing various recall rates and the associated precision rates for each category, a PR curve can be plotted for each. Under the PASCAL VOC 2010 object detection evaluation criteria, the value of AP is identical to the area enclosed by the PR curve. Precision, recall, and average precision are thus three metrics that can be used to assess a model's accuracy on detection tasks. MS-COCO averages mAP with a step of 0.05 over a range of IoU thresholds (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95). The main metric used to judge competitors, called "mAP", averages AP over all 80 COCO dataset categories and all 10 thresholds. A higher AP score according to the COCO evaluation criteria denotes accurate bounding-box localization of the detected objects. The typical COCO-style AP metric averages APs over IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. Performance is also measured using AP50 and AP75 at the corresponding IoU thresholds, and APs, APm, and APl on small, medium, and large objects. The primary metric, AP(IoU) = 0.50:0.05:0.95, is determined by averaging over all 10 IoU thresholds across all categories with a uniform step size of 0.05.
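To make Eqs. (1)–(3) concrete, the following minimal Python sketch computes IoU for axis-aligned boxes and derives precision and recall for one category at a fixed IoU threshold. The box format, function names, and greedy matching rule are illustrative assumptions rather than a prescribed evaluation protocol; official toolkits such as the COCO API implement the full multi-threshold AP averaging described above.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates (Eq. 1)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, iou_thr=0.5):
    """Precision and recall for one category (Eqs. 2 and 3).
    preds: predicted boxes sorted by descending confidence; gts: ground-truth boxes.
    Each prediction greedily matches the unmatched ground truth with highest IoU."""
    matched, tp = set(), 0
    for p in preds:
        best_iou, best_gt = 0.0, None
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_iou, best_gt = v, i
        if best_iou >= iou_thr:  # a match above the threshold counts as a TP
            tp += 1
            matched.add(best_gt)
    fp = len(preds) - tp   # unmatched predictions
    fn = len(gts) - tp     # missed ground-truth objects
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```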

4.3 A summary of edge devices-based platforms for lightweight object detectors

In the upcoming years, a ton of data will be produced by mobile users and IoT devices, and this growth will bring new problems such as latency. Traditional methods cannot be relied upon for long if intelligence is to be derived from deep learning-based object detection and recognition algorithms in real-time. Edge computing devices have drawn much interest as a result of prominent firms' efforts to make supercomputing affordable. As the IoT, 5G, and portable-processing eras approach, it is vital to enable developers to swiftly design and deploy edge applications from lightweight detection models. Following advancements in deep learning, numerous enhancements to object detection models aimed at edge-device applications have been presented. DeepSense, TinyML, DeepThings, and DeepIoT are a few of the frameworks published in recent years with the intention of compressing deep models for IoT edge devices. To satisfy the processing demands of deep learning-based lightweight object detectors, a model must overcome several constraints, such as a limited battery, high energy consumption, limited computational capability, and constrained memory, while maintaining accuracy. The primary goal should be a framework that allows machine learning models to be quickly implemented on Internet of Things devices. The well-known TinyML frameworks TensorFlow Lite from Google, ELL from Microsoft, ARM-NN and CMSIS-NN from ARM, STM32Cube-AI from STMicroelectronics, and AIfES from Fraunhofer IMS enable the use of deep learning at the edge. When combined with other microcontroller-based tasks, low-latency, low-power, and low-bandwidth AI algorithms can function as part of an intelligent system at low cost thanks to TinyML on a microcontroller. The DeepIoT framework compresses neural network designs into less dense matrices while preserving the performance of sensing applications, by determining how few non-redundant hidden components, such as filters and dimensions, each layer needs. TensorFlow Lite (TFLite) is another well-known framework providing deep learning-based lightweight object recognition: a fast, lightweight, cross-platform framework for mobile and IoT that scales down massive models. The majority of lightweight models employ TensorFlow Lite quantization, which is easy to deploy on edge devices.
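As an illustration of this last workflow, the sketch below uses the standard TensorFlow Lite converter with post-training dynamic-range quantization. The SavedModel path and output file name are placeholders, and a real detector would typically also supply a representative dataset for full integer quantization.

```python
import tensorflow as tf

# Convert a trained SavedModel to TensorFlow Lite with post-training
# dynamic-range quantization; "detector_saved_model" is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model("detector_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

# The resulting flatbuffer is small enough to ship to an edge device.
with open("detector.tflite", "wb") as f:
    f.write(tflite_model)
```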

4.3.1 Mobile phones

The limitations imposed by mobile devices may be the reason why less research addresses deploying object detectors on mobile phones than on other embedded platforms. Smartphone complexity and capability are rising quickly, but their size and weight are also likely to decrease. Few literature studies have attempted implementations on smartphone-based devices (Lan et al. 2019; Liu et al. 2020a, 2020b, 2020c, 2020d; Liu et al. 2021a, 2021b, 2021c; Li et al. 2021a, 2021b, 2021c; Paluru et al. 2021). This places a heavy burden on creating models that are small, light, and require a minimal number of computations. It is advisable to test novel ideas for deep learning inference optimization on portable models that are regularly utilized with cellphones (Xu et al. 2019). Either the spatial or the temporal complexity of deep learning models can be reduced to the point where they can be fully implemented on mobile devices, but many security issues may remain to be fixed (Steimle et al. 2017). Although deep learning for smartphone object detection appears to be a promising field of study, success will require many more contributions (Wang et al. 2022a, 2022b, 2022c, 2022d).

4.3.2 IoT edge devices

One way to enable deep learning on IoT edge devices is to transfer model inference to a cloud server; another way to boost the power of these inexpensive devices is to add an accelerator, although the price of such accelerators is a major drawback. Some edge devices, like the Raspberry Pi, may require an extra accelerator, while others, like the Coral Dev Board, already have an edge TPU accelerator built in. Deep learning can more easily run locally or remotely using a distributed design that links computationally weak front-end devices with more potent back-end devices, such as a cloud server or accelerator (Ran et al. 2017).

4.3.3 Embedded boards

To provide the finest design options, processor-FPGA combinations and FPGAs with hard processor cores embedded in their fabric are widely used; Lattice Semiconductor, Xilinx, Microchip, and Intel (Altera) are the well-known manufacturers. The literature suggests that the Xilinx board family is the one most frequently utilized for deep learning-based applications. An additional accelerator is often needed when employing FPGA devices (Saidi et al. 2021) to achieve acceptable performance. Thanks to Integrated Development Environment (IDE) and high-level language support, the Arduino and Spark-based boards at the top of the device family allow for more software-level programming (Kondaveeti et al. 2021).

4.4 Applications specific to deep learning-based lightweight object detectors

In the above sections, we have discussed the architectural details and leading datasets of deep learning-based lightweight object detection models. These models serve a multitude of applications, such as remote sensing (Xu and Wu 2021; Ma et al. 2023), aerial images (Xu and Wu 2021; Zhou et al. 2022), traffic monitoring (Jiang et al. 2023; Zheng et al. 2023), fire detection (Chen et al. 2023), indoor robots (Jiang et al. 2022), and pedestrian detection (Jian et al. 2023). A summary of literature findings supporting the applications of deep learning-based lightweight object detection models is listed in Table 6.

In (Zhou et al. 2019), a lightweight detection network called YOLO-RD has been proposed for Range Doppler (RD) radar images, together with a new lightweight mini-RD dataset for effective network training. On the mini-RD dataset, YOLO-RD produced effective results with a smaller memory budget and a detection accuracy of 97.54%. Addressing both the algorithmic and hardware-resource aspects of object detection, (Ding et al. 2019) introduced REQ-YOLO, a resource-aware, systematic weight quantization framework. It applied the block-circulant matrix approach to non-convex optimization problems on FPGAs and proposed a heterogeneous weight quantization. The outcomes demonstrated that the REQ-YOLO framework can greatly reduce the size of the YOLO model with only a slight reduction in accuracy. For autonomous vehicles, the L4Net of (Wu et al. 2021) locates object proposals by integrating a keypoint detection backbone with a co-attention strategy, attaining lower computation costs with improved detection accuracy under a variety of resource constraints. To generate more precise prediction boxes, the backbone captures context-wise information while the co-attention method combines the strengths of class-agnostic and semantic attention. With a 13.7 M model size, L4Net achieved 71.68% mAP at speeds of 149 FPS on an NVIDIA TX and 30.7 FPS on a Qualcomm-based device. The development of effective object detectors for CPU-only hardware is also pressing, given the huge data-processing demands and resource-constrained scenarios on GPUs. Using three orthogonal training strategies, an IoU-guided loss, a classes-aware weighting method, and a balanced multi-task training approach, (Chen et al. 2020a, 2020b) proposed a lightweight backbone and a light-head detection component. On a single-thread CPU, the suggested RefineDetLite obtained 26.8 mAP at a pace of 130 ms/pic. LiraNet, a compact CNN, was suggested by (Long et al. 2020a, 2020b) for the recognition of marine ship objects in radar images. LiraNet was mounted on the existing Darknet detection framework to create Lira-YOLO, a compact model that is simple to deploy on mobile devices, and a lightweight dataset of distant Doppler-domain radar images known as mini-RD was created to test the proposed model's performance. Studies reveal that Lira-YOLO's network complexity is a minimal 2.980 BFLOPs and its parameter size a reduced 4.3 MB, with a high detection accuracy of 83.21%. (Lu et al. 2020) developed an efficient YOLO-compact network for real-time object detection in the single-person category. The downsampling layer was separated in this network, which facilitated the modular design by enhancing the residual bottleneck block. YOLO-compact's AP result is 86.85% and its model size is 9 MB, making it smaller than tiny-yolov3, tiny-yolov2, and YOLOv3. Focusing on small targets and background complexity, (Xu and Wu 2021) presented FE-YOLO for deep learning-based target detection from remote sensing images; analyses on remote sensing datasets demonstrate that FE-YOLO outperformed existing cutting-edge target detection methods. A new YOLOv4-dense model was put forth by (Jiang et al. 2023) for real-time object recognition on edge devices, in which a dense block addresses the loss of small objects and further minimizes computing complexity. With 20.3 M parameters, YOLOv4-dense obtained 84.3% mAP at 22.6 FPS. To improve the detection of small and medium-sized objects in aerial photos, (Zhou et al. 2022) developed the Dense Feature Fusion Path Aggregation Network (DFF-PANet); trials on the HRSC2016 and DOTA datasets yielded 71.5% mAP with a lightweight 9.2 M model. To help an indoor mobile robot solve the problem of object detection and recognition, (Jiang et al. 2022) presented ShuffleNet-SSD, built from depth-wise separable convolution, point-by-point grouping convolution, and channel rearrangement, along with a dataset created for the mobile robot in indoor scenes. For the detection of dead trees, the timely handling of which supports regeneration and allows the ecosystem to remain stable and withstand catastrophic disasters, (Wang et al. 2022a, 2022b, 2022c, 2022d) suggested a novel lightweight architecture called LDS-YOLO based on the YOLO framework. With the addition of the SoftPool approach in Spatial Pyramid Pooling (SPP), a novel feature extraction module reuses features from earlier layers to ensure that small targets are not ignored. Evaluated on UAV-captured images, the LDS-YOLO architecture performs well, with an AP of 89.11% and a parameter size of 7.6 MB. Table 8 categorizes applications of lightweight object detectors with respect to image type, such as remote-sensing, aerial, medical, and video streams, and application type, such as healthcare, medical, military, and industrial use.

Table 8 A list of type-based applications with respect to deep learning lightweight object detection models

4.5 Discussion and contributions

According to the above analysis of deep learning-based lightweight object detectors, focused effort is needed to develop detectors for edge devices that strike a good balance between speed and accuracy. Furthermore, real-time deployment of these detectors on edge devices is needed while achieving the accuracy of lightweight detectors without compromising precision. In 2022, the lightweight backbone architectures ShuffleNet and SqueezeNet had the highest numbers of publications with respect to lightweight object detectors. In 2023, the transformer-based MobileViT began attracting researchers' attention, achieving a top-1 accuracy of 78.4, and MobileNet backbone architectures were the most heavily employed compared with others. As shown in Table 8, video streams are the input type with maximum employability in deep learning-based lightweight object detectors. Across the diverse applications, traffic- and pedestrian-related detection problems, obstacles, and driving assistance have the most studies, whereas all other existing applications have few lightweight detectors on edge devices. As we have witnessed, the majority of the presented lightweight models are from the YOLO family, with many deep network layers and increasing parameter counts accounting for the improved accuracy. Therefore, the most important question when a model migrates from a cloud device to an edge device is how to lower the parameters of a deep learning-based lightweight model. Numerous approaches address this, as described in the next section.

4.6 Recommendations for designing powerful deep learning-based lightweight models

Beyond specialized hardware for training deep learning models at the network edge, researchers have created new training methods that decrease the memory footprint on the edge device and speed up training on low-resource devices. The techniques discussed in this section, pruning, quantization, knowledge distillation, and low-rank decomposition, are the four key categories used to compress pre-trained networks (Kamath and Renuka 2023) and are listed in the following (Koubaa et al. 2021; Makkar et al. 2021; Wang et al. 2020a, 2020b, 2020c):

4.6.1 Pruning

Network pruning is a useful technique for reducing the size of an object detection model and speeding up model inference. By cutting out connections between neurons that are irrelevant to the application, this method lowers the number of computations needed to analyse fresh input. Beyond eliminating connections, it can also eliminate neurons deemed irrelevant when the majority of their weights are low in relation to the deep neural network's overall context. With this method, a deep neural network with reduced size, greater speed, and improved memory efficiency can be used on low-resource devices such as edge devices.
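As a concrete illustration, the following minimal PyTorch sketch applies magnitude-based unstructured pruning to a single convolutional layer; the layer shape and pruning ratio are illustrative assumptions, not values from the surveyed works.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy conv layer standing in for one layer of a detector backbone.
layer = nn.Conv2d(32, 64, kernel_size=3)

# Zero out the 30% of weights with the smallest L1 magnitude: the
# connections judged least relevant to the task are removed.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Layer sparsity after pruning: {sparsity:.2f}")
```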

4.6.2 Weights quantization

The weight quantization approach, which trades precision for speed, shrinks the model's storage footprint by reducing the precision of its floating-point parameters. Beyond eliminating pointless associations, every weight is otherwise stored as a separate value; the weight quantization technique compresses these values to integers, or to numbers occupying as few bits as possible, by clustering related weight values into a single value. Consequently the weights are re-adjusted, implying a modification of the precision as well. This results in a cyclical implementation where the weights are re-quantized after each round of training.
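The following NumPy sketch shows the core arithmetic of per-tensor symmetric weight quantization to int8, in the spirit of the clustering-to-few-values idea described above; the helper names and the toy weight matrix are assumptions for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantization: map float weights onto 255 integer levels."""
    scale = np.abs(w).max() / 127.0  # one scale factor per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference or re-training."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)  # a toy weight matrix
q, s = quantize_int8(w)
print("storage: 4 bytes -> 1 byte per weight; max abs error:",
      np.abs(w - dequantize(q, s)).max())
```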

4.6.3 Knowledge distillation

Knowledge distillation presents itself as a form of transfer learning. This technique extracts knowledge from a big, well-trained deep neural network, dubbed the teacher, into a reduced deep network, called the student. The student network thereby learns to achieve much the same outcomes as the teacher network while shrinking in size and increasing processing speed. Through knowledge distillation, information is transferred from a large, thoroughly trained end-to-end detection network to numerous, quicker sub-models.
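A common concrete form of this teacher-student transfer is a temperature-softened KL term blended with the ordinary hard-label loss, sketched below in PyTorch; the temperature and blending weight are illustrative defaults, not values fixed by the surveyed detectors.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend soft teacher targets with the ordinary hard-label loss."""
    # Soften both distributions with temperature T and match them via KL
    # divergence; the T*T factor keeps gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```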

4.6.4 Training tiny networks

In the low-rank decomposition method, the deep neural network's convolution kernels are broken down using matrix decomposition, although this can cause a noticeable loss of accuracy in the results. Directly training tiny networks instead can drastically reduce network accuracy loss and speed up inference.
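The sketch below illustrates the matrix-decomposition idea with a truncated SVD of a single weight matrix; the layer size and rank are arbitrary assumptions, and convolution kernels must first be reshaped to a matrix before the same factorization applies.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Approximate an (m x n) weight matrix by two rank-r factors,
    reducing parameters from m*n to r*(m + n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (m x r), singular values folded in
    b = vt[:rank, :]            # (r x n)
    return a, b

w = np.random.randn(512, 512).astype(np.float32)  # a toy dense layer
a, b = low_rank_factorize(w, rank=64)
print("params:", w.size, "->", a.size + b.size)   # 262144 -> 65536
```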

4.6.5 Federated learning and model partition

Distributed or federated learning are possible training approaches for dealing with complicated tasks or training periods involving a lot of data. The data is broken into smaller groups distributed among the nodes of the edge network; each node trains on the data it receives as part of the final deep neural network, enabling active learning capabilities at the network edge. Model partitioning is a strategy that applies the same methodology in the inference phase: to divide the burden, a separate node computes each layer of the deep neural network. This approach also makes scaling simple.
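A minimal sketch of the aggregation step used in federated averaging (FedAvg) follows; the per-node models and sample counts are toy assumptions, and real deployments add communication, scheduling, and privacy machinery on top.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg: weighted average of per-client model parameters.
    client_weights: one list of ndarrays (layer tensors) per edge node.
    client_sizes: number of local training samples at each node."""
    total = sum(client_sizes)
    averaged = []
    for layer in zip(*client_weights):  # corresponding layers across nodes
        averaged.append(sum(w * (n / total) for w, n in zip(layer, client_sizes)))
    return averaged

# Two edge nodes, each holding a tiny one-layer model trained locally.
node_a = [np.ones((4, 4)), np.zeros(4)]
node_b = [3 * np.ones((4, 4)), np.ones(4)]
global_model = federated_average([node_a, node_b], client_sizes=[100, 300])
print(global_model[0][0, 0])  # 2.5 = 1 * 0.25 + 3 * 0.75
```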

Moreover, to boost the flow of information in a constrained amount of time, multi-scale feature learning in lightweight detection models, comprising single feature maps, pyramidal feature hierarchies, and integrated features, may be used. Feature pyramid networks, as well as their variations such as feature fusion, feature pyramid generation, and multi-scaled fusion modules, help overcome object detection difficulties. Additionally, to boost the effectiveness of lightweight object detection models, researchers work to advance activation functions and normalization in various applications. The above-mentioned techniques accelerate the adoption of deep learning models on edge devices. Deep learning-based lightweight object detection models have not yet achieved results comparable with generic object detection; to mitigate these differences, powerful and innovative lightweight detectors must be designed. Some recommendations for designing powerful lightweight deep learning-based detectors follow.

1. Incorporation of FPNs: The bidirectional FPN can be utilized to improve semantic information while incorporating feature fusion operations (Wang et al. 2023a, 2023b). To collect bottom-up and top-down features more successfully than a plain FPN, an effective feature-preserving and refining module can be introduced (Tang et al. 2020a, 2020b). Deep learning-based lightweight detectors can be designed with cross-layer connections and the extraction of features at various scales while using depth-wise separable convolution. A multi-scale FPN architecture with a lightweight backbone can be exploited to extract features from the input image.

2. Transformer-based Solutions: To increase the precision of transformer-based lightweight detectors, group normalization can be implemented in the encoder-decoder module and an h-sigmoid activation function in the multi-layer perceptron (Li, Wang and Zhang 2022).

3. Receptive Fields Enlargement: The capacity of single-scale features to express themselves, and to be detected on a single scale, are both improved by a multi-branch block involving various receptive fields. Several network branches may increase network width and slightly enhance performance (Liu et al. 2022).

4. Feature Fusion Operation: To combine several feature maps of the backbone, and the collection of multi-scale features into a feature pyramid, the fusion operation offers a concatenation model (Mao et al. 2019). To improve information extraction in a lightweight model, the weights of the feature maps' various channels can be reassigned; performance may further improve by integrating an attention module and a data augmentation technique (Li et al. 2022a, 2022b). Implementing an FPN in the lightweight detector architecture enables the smooth fusion of semantic data from the low-resolution scale to the neighbouring high-resolution scale (Li et al. 2018).

5. Effect of Depth-wise Separable Convolution: The optimal design principle for lightweight object detection models consists of fewer channels with more convolutional layers (Kim et al. 2016). Researchers can concentrate on network-scaling approaches that modify width, resolution, and network structure to reduce or balance the size of the feature maps, keep the number of channels constant after convolution, and minimise convolutional input and output (Wang et al. 2021b). The typical convolution in the network structure can be replaced with an over-parameterized depth-wise convolutional layer, which significantly reduces computation and boosts network performance. To increase numerical resolution, ReLU6 can be used in place of the Leaky ReLU activation function (Ding et al. 2022). A minimal sketch of such a block is given after this list.

6. Increase in Semantic Information: To keep semantic features and high-level feature maps in a deep lightweight detection network, smaller cross-stage partial SPPs and RFBs facilitate the integration of high-level semantic information with low-level feature maps (Wang et al. 2022a, 2022b, 2022c, 2022d). Architectural additions such as context enhancement and spatial attention modules can be employed to generate more discriminative feature representations (Qin et al. 2019).

7. Pruning Strategy: Block-punched pruning uses a fine-grained structured pruning method to maximise structural flexibility and minimise accuracy loss. High hardware parallelism can be achieved with block-punched pruning if the block size is suitable and compiler-level code generation is used (Cai et al. 2021).

8. Assignment Strategy: To improve the training of deep learning-based lightweight object detectors, the SimOTA dynamic label assignment method can be used. When creating lightweight detection models, combining an FCOS-based regression method, dynamic and learnable sample assignment, and varifocal loss to handle class imbalance works better (Yu et al. 2021). Designing lightweight object detectors with the anchor-free approach has been successful when combined with other cutting-edge detection methods such as decoupled heads and the top label assignment strategy SimOTA (Ge et al. 2021).
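As referenced in item 5, the following PyTorch sketch shows a depth-wise separable convolution block with ReLU6, in the style popularized by MobileNet; the channel sizes and input shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel spatial filter
    followed by a 1x1 point-wise projection across channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each 3x3 filter act on a single channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)  # ReLU6 as suggested in item 5

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
block = DepthwiseSeparableConv(32, 64)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```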

There are two ways to deploy deep learning-based lightweight models on edge devices. In the first, a lightweight model or compressed data is employed to match the compute capabilities of the constrained edge device; this is the case for on-board object detection, and the trade-off between compression ratio and detection accuracy is the drawback of this method. In the second, the model is distributed and data is exchanged: computations are spread over several devices, and a cloud server may handle part of the workload. Here privacy and security are the primary issues (Zhang et al. 2020a, 2020b, 2020c), and device coordination must be established with care, since the collaborative learning algorithm can impose extra overhead and overwork the edge devices. Whatever the plan, all of these deployment methods rely on edge devices and must contend with the problems those devices present. The primary causes of difficulty are data disparity in real-world scenarios and the need to manage real-time sensor data while performing numerous deep learning tasks. The need for powerful processing units, the high computing requirements of deep learning models, and short battery life make validating lightweight models tough. In the future, we will strive to create such standards-compliant lightweight detection deployment models.

5 Conclusion

This study asserted that deep learning-based lightweight object detection models are a good candidate for improving the hardware efficiency of neural network architectures. This survey has examined the most recent models for lightweight edge devices. The backbone architectures commonly utilized in deep learning-based lightweight object detection methods have been presented, among which ShuffleNet and MobileNetV2 are employed the most. Critical aspects of current state-of-the-art deep learning-based lightweight object detection models on edge devices have been discussed, and emerging lightweight object detection models have been compared on the basis of COCO-based mAP scores. A summary of heterogeneous applications of lightweight object detection models has been presented, covering diverse image types and application categories, along with information on edge platforms for deploying portable detector models. A few recommendations are also given for creating a potent deep learning-based lightweight model, including multi-scale and multi-branch FPNs, federated learning, partitioning strategies, pruning, knowledge distillation, and label assignment algorithms. Although lightweight detectors have demonstrated significant potential by approaching the classification errors of the thorough models, they still fall more than 50% short of delivering such outcomes.