1 Introduction

Vision systems based on deep learning have demonstrated astounding performance in human activity recognition. YOLO is the most popular of these systems because of its simplicity, detection speed, and precision [7, 33]. Shinde et al. proposed recognizing human activity with the YOLO framework [40]. In [40], YOLO was used for activity identification from a security camera, and the findings showed that it outperformed other models with an F-score of 88.4%. Liu et al. investigated the problem of intention and interaction recognition [25]. Their work combined YOLO and LSTM to identify human intent and interaction: YOLO was first used to detect hand-held items, and an LSTM then encoded the temporal dependencies of the sequence. The problem of action localization was addressed by Hammam et al. [7], who exploited YOLO for person localization in a succession of video frames. Mutis et al. [25] described the use of YOLO for human occupancy forecasting and activity analysis.

Recently, YOLOv7 has been applied effectively in a range of real-world applications, including the recognition of objects in marine scenes using photos acquired by a drone camera [19]. The reported tests compared YOLOv7 against YOLOv3, YOLOv4, and YOLOv5, and the findings indicated that YOLOv7 achieved the highest precision and recall for detecting airplanes and oil tanks, owing to the redesigned backbone of the network. Li et al. [41] studied another use of YOLOv7 for fault identification in transmission line insulators using UAV photos. They developed an image-based inspection system employing several YOLO models, including YOLOv5, YOLOv6, and YOLOv7-X [41]. Using 1593 photos of the electrical grid, the investigation determined that YOLOv5 and YOLOv7-X delivered the best results. In [55], another application of UAV-based object detection employing the YOLOv7-sea system was examined. The proposed YOLOv7-sea was used to identify swimmers, boats, jet skis, buoys, and life-saving equipment from the SeaDronesSee dataset [46]. The use of YOLOv7 for counting ducks is shown in [12], where an attention mechanism was added to YOLOv7 to capture key information. A total of 1300 images were used for training the YOLOv7 model and 200 photos for testing, and the study reported that the proposed attention mechanism slightly improved mAP by 1.15%. Yang et al. [26] presented a YOLOv7 ship identification model using satellite images.

Despite YOLO’s success, it employs computationally costly pre-trained backbone feature extraction networks such as AlexNet [17], ResNet [8], and Inception [42]. For example, the ResNet network described in [8] has 152 layers, and each layer comprises hundreds of convolutional filters. To alleviate such complexity concerns, researchers designed SqueezeNet, a compressed network [11]. SqueezeNet is far smaller than AlexNet, with 50× fewer parameters, yet it attains the same recognition accuracy as AlexNet. Inspired by its compact size, several researchers have deployed SqueezeNet to solve a variety of real-world computer vision problems, including vehicle identification in thermal infrared images [15], COVID-19 detection from CT scans [29], and industrial defect localization [51]. All of these studies demonstrated SqueezeNet’s high accuracy at a reasonable computational cost. Ren et al. [34] integrated SqueezeNet with YOLO for real-time people counting and reported further gains in speed from using the SqueezeNet network.

Nevertheless, the SqueezeNet used in [15, 29, 34, 51] still consists of hundreds of convolutional filters that absorb the majority of the computing effort during detection. Using optimization techniques such as PSO [16], TLPSO [37], and RLMPSO [36] to conduct an iterative search and pruning of SqueezeNet filters is a viable way to bridge this gap and reduce SqueezeNet’s complexity. Among these optimizers, TLPSO has proven the most effective at tackling large-scale optimization problems. This work therefore integrates TLPSO into YOLO to prune and remove less-contributing SqueezeNet filters for human action identification. Consequently, the goals of this research are threefold:

  1. To reduce the complexity of SqueezeNet by eliminating less-contributing convolutional filters.

  2. To incorporate an efficient optimizer (i.e., TLPSO) that performs SqueezeNet filter elimination guided by YOLO detection accuracy and system complexity.

  3. To investigate the optimized YOLO vision system on the problem of human activity recognition (walking vs. running) using drone-captured images, as shown in Fig. 1.

Fig. 1 Two different human actions: walking (yellow box) and running (green box)

The originality of this work can be summed up in two parts: (i) it introduces a compact vision system using an optimized and fine-tuned SqueezeNet-based feature extraction network, and (ii) it employs an effective TLPSO method that can handle high-dimensional SqueezeNet convolutional filter selection. The remainder of this paper is organized as follows. Related work is reviewed in Section 2. In Section 3, the proposed optimized YOLO vision system is explained. A series of experiments evaluating the effectiveness of the proposed vision system is reported in Section 4, followed by the conclusions in Section 5.

2 Related work

Human detection and action recognition have been extensively studied in the literature. Hung et al. [10] presented a Faster-RCNN-based technique for people identification in drone-based photos, reporting an F1 measure of 98%. In [10], both Faster-RCNN and YOLO were built and exhibited comparable detection accuracy; however, YOLO requires shorter execution time owing to its one-stage architecture [33]. CNN-based deep learning models have been described in [1, 21, 38, 54], and [14]. The combination of CNN and LSTM for activity identification was studied, for example, in [21, 54]. Maitre et al. [21] suggested a CNN-LSTM model for recognizing everyday human activity. In their study, a CNN was used to encode the characteristics of a 1-D radar signal, which were then passed to an LSTM model for activity classification. The hybrid CNN-LSTM model described in [21] was verified using 10 subjects performing a variety of everyday activities such as drinking, sleeping, and walking. The results showed that the CNN-LSTM classification accuracy approached 90%.

Ullah et al. [45] investigated activity detection from security video streams. Their system, described in [45], comprises three phases: object localization using YOLO [33], feature extraction with a CNN, and sequence recognition with a Gated Recurrent Unit (GRU). The technique was used to identify a variety of human activities, including playing sports, playing musical instruments, and human-to-human interaction, and it achieved an average accuracy rate of 80%. Beyond human detection and action recognition, several studies have addressed image classification. For instance, a comprehensive analysis comparing several feature descriptors, including SIFT, SURF, and ORB, for image classification was conducted in [2]. In [3], transfer learning with a pre-trained VGG19 for image classification was presented. In [39], the approaches, problems, and opportunities in finger vein recognition are thoroughly reviewed.

Recently, YOLO has been adopted for various agricultural detection applications such as plant organ detection [44], tea chrysanthemum detection [30], date palm tree detection from drone imagery [13], tea leaf detection [4], tomato detection [53], pinecone detection [6], and fruit detection [43]. In [44], ResNet and DenseNet backbone feature extractor networks were utilized to train YOLO-v3 for plant organ identification. The final classification layer of YOLO-v3 generated four distinct classes: leaves, flowers, fruits, and buds. In their trial, 4000 photos were used, and YOLO-v3 reported a recall rate of 95.5% and a precision of 94.2%. The study revealed that data augmentation boosted the recall rate by 5.7% and the precision rate by 4.2%. The application of YOLO for tea detection at the flowering stage was investigated by Qi et al. [30], who proposed modifications to the backbone and neck of YOLO; detection accuracy reached 92.49%. The YOLO-based face mask detection problem was addressed by Wu et al. [49], whose system attained a 92.0% precision rate when tested on publicly available Kaggle datasets. Nevertheless, these earlier approaches [30, 44, 49] utilized the ResNet backbone network, which has a sophisticated architecture and requires more processing effort than SqueezeNet. Intasuttisak et al. [13] recently adopted YOLO-v5 for the recognition of date palm trees in drone footage. They collected 125 photos and split them into three groups: 60% for training, 20% for validation, and 20% for testing. The mean precision rate attained on the test data was 92.34%. Further applications of YOLO include road crack detection [32], semi-supervised YOLO for generic object detection [56], head detection [48], defect detection [28], a YOLO-based few-shot model [50], detecting small targets in infrared remote sensing [18], vehicle detection [5], traffic sign detection [52], colon cancer detection [24], and cattle body detection [31].

In spite of this, only a small number of studies, such as [22, 27], have addressed the problem of human action recognition from UAVs. Mliki et al. [22] concurrently localized and identified individuals’ behaviours with the assistance of a CNN. By extending their method to UAV-based video sequences, they demonstrated that the CNN outperforms more traditional detection and identification techniques. Peng et al. [27] accelerated the process by employing a region-based technique [35] with a 3D CNN. The relevant literature is summarised in Table 1.

Table 1 Related work

3 The proposed optimized deep learning vision system

Figure 2 depicts the primary architecture of the proposed optimized vision system. As noted, it employs TLPSO to identify and eliminate SqueezeNet filters that contribute little during the detection process. Before embedding TLPSO, YOLO [33] is first trained with the SqueezeNet backbone network for human action recognition (i.e., walking vs. running); it then undergoes a pruning process (i.e., SqueezeNet filter selection/elimination). The main steps of the implemented process, depicted in Fig. 2, are SqueezeNet filter selection/elimination, fitness evaluation according to YOLO detection accuracy and the percentage of filter reduction, and the TLPSO particle updating phase.

Fig. 2 The proposed optimized vision system

Figure 3 depicts the TLPSO method, which consists of one layer for global search operations and a second layer for local search operations. The selection between these two processes is controlled by a Q-learning method [47]. Our prior work [37] contains additional details on TLPSO. In summary, TLPSO begins with a random micro swarm of three particles, and each particle in the swarm is associated with two vectors, namely the velocity (V) and position (X) vectors, as shown below.

$${X}_i=\left[{d}_i^1,{d}_i^2,{d}_i^3,\dots, {d}_i^D\right]$$
(1)
$${V}_i=\left[{v}_i^1,{v}_i^2,{v}_i^3,\dots, {v}_i^D\right]$$
(2)

where D represents the dimension of the optimization problem (i.e., the total number of SqueezeNet filters), and i represents the particle number in the swarm. These particles are evolved, and their positions are updated according to the following equations:

$${V}_i\leftarrow \omega \ast {V}_i+{c}_1\ast {rand}_{uniform}\ast \left( pBest-{X}_i\right)+{c}_2\ast {rand}_{uniform}\ast \left( gBest-{X}_i\right)$$
(3)
$${X}_i\leftarrow {X}_i+{V}_i$$
(4)

where ω is the inertia weight, c1 is the cognitive acceleration coefficient, c2 is the social acceleration coefficient, rand_uniform is a uniformly distributed random number within [0, 1], pBest is the best position achieved by the particle itself, and gBest is the global best position achieved by the whole population. As mentioned in [37], the local search operation is identical to the global search, except that it modifies only randomly chosen dimensions.
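For illustration, a minimal NumPy sketch of the velocity and position updates in Eqs. (3) and (4) is given below; the parameter values and the toy problem size are illustrative rather than the exact TLPSO settings (the actual system was implemented in MATLAB).

```python
import numpy as np

def pso_update(X, V, pBest, gBest, w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update per Eqs. (3)-(4); parameter values are illustrative."""
    r1 = np.random.uniform(0.0, 1.0, size=X.shape)   # rand_uniform for the cognitive term
    r2 = np.random.uniform(0.0, 1.0, size=X.shape)   # rand_uniform for the social term
    V = w * V + c1 * r1 * (pBest - X) + c2 * r2 * (gBest - X)
    X = X + V
    return X, V

# Example: a micro swarm of 3 particles in a D-dimensional search space.
D = 8                                   # in practice D = total number of SqueezeNet filters
X = np.random.uniform(0, 1, (3, D))     # position vectors (Eq. 1)
V = np.zeros((3, D))                    # velocity vectors (Eq. 2)
X, V = pso_update(X, V, pBest=X.copy(), gBest=X[0])
```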

Fig. 3 TLPSO [37]

After that, the fitness of each particle is determined according to Eq. (5). Figure 4 depicts the proposed encoding strategy used to convert the particle vector into the SqueezeNet filter selection/elimination format. As can be observed, each convolutional filter in the convolutional layers of SqueezeNet is associated with a binary variable F that can take the value 0 or 1. Thus, this is a binary optimization problem in which a value of one signals that the associated filter is selected and remains active throughout the YOLO feature extraction procedure, whereas a value of zero means that the corresponding SqueezeNet filter is considered to contribute little and is eliminated from SqueezeNet.

Fig. 4 Encoding scheme of SqueezeNet filter selection
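A minimal sketch of this encoding is given below; the 0.5 thresholding rule used to binarize a particle's position is an assumption for illustration, and the variable names are hypothetical.

```python
import numpy as np

def particle_to_mask(position, threshold=0.5):
    """Map a particle position vector to a binary filter-selection vector F.
    1 = keep the corresponding SqueezeNet filter, 0 = eliminate it.
    The 0.5 threshold is an assumed binarization rule for illustration."""
    return (position >= threshold).astype(int)

position = np.random.uniform(0, 1, 10)       # one particle, 10 filters for illustration
mask = particle_to_mask(position)
kept = int(mask.sum())
print(f"selected {kept} of {mask.size} filters, eliminated {mask.size - kept}")
```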

Therefore, the optimizer iteratively performs a selection/elimination process of SqueezeNet filters guided by a fitness function defined as follows.

$$fitness=-1\ast \left(0.9999\ast F\text{-}score+0.0001\ast reduction\right)$$
(5)

where the F-score represents the accuracy of YOLO on the training data, and the reduction variable is the percentage of filters eliminated from SqueezeNet (i.e., the number of eliminated filters divided by the total number of SqueezeNet filters). Note that a much higher weight (0.9999) is given to accuracy than to reduction because accuracy is the primary concern. The optimization process ends once the maximum number of iterations has been reached, at which point the best solution found is returned. The pruned SqueezeNet is then inserted back into YOLO, which is retrained so that the last layers of YOLO (regressor and classifier) can be fine-tuned.
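A minimal sketch of the fitness evaluation in Eq. (5) is shown below; it assumes the F-score of the pruned YOLO on the training data is already available and treats reduction as the fraction of eliminated filters.

```python
def fitness(f_score, mask):
    """Fitness of Eq. (5); lower is better because TLPSO minimizes.
    f_score : F-score of the pruned YOLO on training data, in [0, 1].
    mask    : binary filter-selection vector (1 = kept, 0 = eliminated)."""
    reduction = 1.0 - sum(mask) / len(mask)          # fraction of eliminated filters
    return -1.0 * (0.9999 * f_score + 0.0001 * reduction)

# Example: pruning half the filters with no accuracy loss lowers (improves) the fitness.
print(fitness(0.90, [1] * 10))           # no pruning
print(fitness(0.90, [1] * 5 + [0] * 5))  # 50% of the filters eliminated
```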

3.1 YOLO steps

As can be seen in Fig. 2, the main steps of the YOLO system are (1) input image division, (2) feature extraction, (3) cell classification and regression, (4) bounding box generation, and (5) final output prediction. These steps are explained as follows:

  • Step 1: Divide the input image into cells.

At this step, the input image is divided into S × S cells, and each cell is responsible for predicting several anchor boxes, as shown in Fig. 5. In this work, the input image is divided into 7 × 7 cells with three anchor boxes, as suggested in [40].

Fig. 5 Input image division
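The following sketch illustrates Step 1: dividing an image into 7 × 7 cells with three anchor boxes and forming the corresponding prediction tensor. The per-anchor attribute layout (x, y, w, h, objectness, two class scores) and the input resolution are assumptions for illustration only.

```python
import numpy as np

S, B, num_classes = 7, 3, 2          # 7 x 7 grid, 3 anchor boxes, walking vs. running
attrs_per_anchor = 5 + num_classes   # x, y, w, h, objectness + class scores (assumed layout)

image = np.zeros((448, 448, 3))      # illustrative input resolution
cell_h, cell_w = image.shape[0] // S, image.shape[1] // S
print(f"each cell covers {cell_h} x {cell_w} pixels")

# Each of the S x S cells predicts B anchor boxes.
predictions = np.zeros((S, S, B, attrs_per_anchor))
print("prediction tensor shape:", predictions.shape)   # (7, 7, 3, 7)
```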

3.1.1 Feature extraction using SqueezeNet

At this stage, the entire input image is sent to SqueezeNet for feature extraction. The basic SqueezeNet architecture is depicted in Fig. 6 and Table 2 [11]. Evidently, SqueezeNet employs a number of filters to recognize visual characteristics such as edges and dots. However, the SqueezeNet convolutional filters consume most of the computational time, as each filter is convolved with the entire input of its layer. Table 2 shows that, deeper in SqueezeNet, the complexity of the network increases and more convolutional filters are required. Consequently, the objective of this research is to lower the number of convolutional filters in order to accelerate the feature extraction process of the proposed vision system.

Fig. 6 Convolutional filters of SqueezeNet

Table 2 SqueezeNet layers
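To illustrate where the pruned filters reside, the following is a minimal PyTorch-style sketch of a SqueezeNet fire module (a 1 × 1 squeeze layer followed by parallel 1 × 1 and 3 × 3 expand layers); the channel counts are illustrative and do not reproduce the exact configuration of Table 2.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style fire module: a 1x1 squeeze layer followed by parallel
    1x1 and 3x3 expand layers whose outputs are concatenated.
    Channel counts are illustrative, not the exact values of Table 2."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

fire = Fire(in_ch=96, squeeze_ch=16, expand_ch=64)
out = fire(torch.randn(1, 96, 56, 56))
print(out.shape)   # torch.Size([1, 128, 56, 56]); this module alone holds 16 + 64 + 64 filters
```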

3.1.2 Cell classification/regression

This stage is responsible for producing the output of the YOLO classification and regression layers. As stated previously, YOLO is trained to recognize two human activities, namely walking and running. Thus, throughout the YOLO training process, both the classification and regression layers are tweaked and updated based on the loss function defined in Eq. (6):

$$loss={\lambda}_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}{1}_{ij}^{obj}\left[{\left({x}_i-{\hat{x}}_i\right)}^2+{\left({y}_i-{\hat{y}}_i\right)}^2\right]+{\lambda}_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}{1}_{ij}^{obj}\left[{\left(\sqrt{w_i}-\sqrt{{\hat{w}}_i}\right)}^2+{\left(\sqrt{h_i}-\sqrt{{\hat{h}}_i}\right)}^2\right]+\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}{1}_{ij}^{obj}{\left({C}_i-{\hat{C}}_i\right)}^2+{\lambda}_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}{1}_{ij}^{noobj}{\left({C}_i-{\hat{C}}_i\right)}^2+\sum\limits_{i=0}^{S^2}{1}_i^{obj}\sum\limits_{c\ \epsilon\ class}{\left({p}_i(c)-{\hat{p}}_i(c)\right)}^2$$
(6)

where xi and yi denote the center of the ground-truth bounding box, wi and hi are its width and height, and the hatted quantities are the corresponding predictions. Ci is the objectness measure, which indicates the confidence that the cell contains an object, and pi(c) is the class probability score. S and B are the implemented numbers of cells and anchor boxes, while λcoord and λnoobj weight the coordinate and no-object terms. The indicator \({1}_{ij}^{obj}\) is 1 when the j-th box of cell i is responsible for an object and 0 otherwise, whereas \({1}_{ij}^{noobj}\) is 1 when no object is present in that cell and 0 otherwise.
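A simplified NumPy sketch of the loss terms in Eq. (6) is given below; it assumes one responsible box per cell, omits the anchor-assignment logic, and uses the common YOLO values for λcoord and λnoobj, so it is an illustration rather than the exact training loss implementation.

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified version of Eq. (6) with one box per cell.
    pred, target : arrays of shape (S*S, 5 + C) -> [x, y, w, h, C, class scores]
    obj_mask     : 1 where the cell contains an object, 0 otherwise.
    The lambda values follow the common YOLO defaults and are an assumption here."""
    obj = obj_mask.astype(bool)
    xy = np.sum((pred[obj, 0:2] - target[obj, 0:2]) ** 2)
    wh = np.sum((np.sqrt(pred[obj, 2:4]) - np.sqrt(target[obj, 2:4])) ** 2)
    conf_obj = np.sum((pred[obj, 4] - target[obj, 4]) ** 2)
    conf_noobj = np.sum((pred[~obj, 4] - target[~obj, 4]) ** 2)
    cls = np.sum((pred[obj, 5:] - target[obj, 5:]) ** 2)
    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls

S, C = 7, 2
pred = np.random.uniform(0, 1, (S * S, 5 + C))
target = np.random.uniform(0, 1, (S * S, 5 + C))
obj_mask = np.zeros(S * S)
obj_mask[[10, 24]] = 1                     # two cells contain objects in this example
print(yolo_loss(pred, target, obj_mask))
```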

3.1.3 Bounding boxes generation

This stage is responsible for generating the outcomes of both the classification and regression layers. Many candidate boxes are drawn according to the locations predicted by the regressor, as shown in Fig. 7; green boxes indicate the running action, whereas yellow boxes indicate the walking action.

Fig. 7 YOLO final output

3.1.4 YOLO final output prediction

This is the final phase of YOLO, which generates the final bounding boxes. First, only those bounding boxes with a confidence level greater than 0.5 are kept, while the others are discarded as false alarms. Then, overlapping boxes are merged, as seen in Fig. 7. The green box signifies the running class, while the yellow box shows the walking class.
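A minimal sketch of this final step is shown below: boxes with confidence at or below 0.5 are discarded, and overlapping boxes are merged by non-maximum suppression; the IoU threshold of 0.5 is an assumed value for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def final_boxes(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Keep boxes above the confidence threshold, then suppress overlapping duplicates.
    The IoU threshold is an assumed value for illustration."""
    kept = [(b, s) for b, s in zip(boxes, scores) if s > conf_thresh]
    kept.sort(key=lambda bs: bs[1], reverse=True)
    out = []
    for b, s in kept:
        if all(iou(b, kb) < iou_thresh for kb, _ in out):
            out.append((b, s))
    return out

boxes = [(10, 10, 60, 120), (12, 14, 62, 118), (200, 40, 260, 160)]
scores = [0.92, 0.80, 0.40]
print(final_boxes(boxes, scores))   # the second box is merged away, the third is discarded
```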

4 Experimental results

4.1 Data collection

Figure 8 depicts the S30W drone used to acquire the photographs for this study. The S30W drone has a control range of 400 m and a WiFi transmission range of 50 m. A rotatable 720P HD camera was fitted on the drone. Two volunteers were filmed as they walked and ran at different speeds, distances, and durations. These videos were filmed around the Engineering campus of Universiti Sains Malaysia (USM) from a number of viewpoints and in a variety of weather conditions. We selected 300 still images from these videos to capture the different stages of motion: 200 for walking and 100 for running. After data collection, labelling was carried out using the MATLAB imageLabeler tool, a snapshot of which is shown in Fig. 9.

Fig. 8 S-Series S30W drone

Fig. 9 Snapshot of the MATLAB imageLabeler tool

4.2 Data augmentation

This study uses data augmentation to increase the quantity of training data. This technique generates new images by applying various image processing operations, such as translation, rotation, cropping, and random noise addition; Fig. 10 shows a collection of augmented photos for illustration. The augmentation was executed automatically throughout YOLO’s training, with random operations used to build one augmented image for each image in the training data. After augmentation, the number of training images is therefore doubled from 200 to 400.

Fig. 10 Data augmentation
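A minimal sketch of such augmentation operations is given below; the rotation, translation, and noise ranges are illustrative and do not reproduce the exact settings applied during YOLO training (which also requires shifting the bounding-box labels accordingly).

```python
import numpy as np
from PIL import Image

def augment(img, seed=None):
    """Create one augmented copy via random rotation, translation, and additive noise.
    The operation ranges are illustrative, not the exact ones used during training."""
    rng = np.random.default_rng(seed)
    out = img.rotate(float(rng.uniform(-15, 15)))            # random rotation (degrees)
    arr = np.asarray(out, dtype=np.float32)
    tx, ty = rng.integers(-20, 21, size=2)
    arr = np.roll(arr, shift=(ty, tx), axis=(0, 1))          # random translation in pixels
    arr += rng.normal(0.0, 5.0, arr.shape)                   # random Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Example on a synthetic image; one augmented copy is generated per training image.
dummy = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
augmented = augment(dummy, seed=0)
```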

4.3 Performance measures

For the purpose of evaluating the optimized YOLO vision system, random data partitioning based on three-fold cross-validation was used. In this method, the experiment is performed three times; in each run one fold is used for testing and the other two for training. In addition, the experiment was repeated ten times for each fold to assess the efficacy of the TLPSO algorithm. Table 3 summarizes the experimental conditions. In each fold, YOLO is first trained on the training data with augmentation, and the trained SqueezeNet is then subjected to a filter selection/elimination process using TLPSO [37], as previously mentioned. The optimized YOLO is then retrained to achieve additional tuning and performance enhancement, and the final evaluation of the fine-tuned/optimized YOLO is conducted on the test set. The performance of the proposed YOLO-based vision system was evaluated using the following three metrics.

$$F\text{-}score=2\ast \frac{precision\ast recall}{precision+ recall}$$
(7)
$$recall=\frac{TP}{TP+ FN}$$
(8)
$$precision=\frac{TP}{TP+ FP}$$
(9)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. These measures have been widely used to assess YOLO performance [40].
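For illustration, the three measures of Eqs. (7)–(9) can be computed from detection counts as follows; the counts used in the example are arbitrary.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F-score per Eqs. (7)-(9); the counts are illustrative."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Example: 90 correct detections, 6 false alarms, 10 missed persons.
print(detection_metrics(tp=90, fp=6, fn=10))
```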

Table 3 Experimental settings
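The three-fold partitioning described above can be sketched as follows, assuming scikit-learn's KFold; the 300-image count follows Section 4.1, and the ten repetitions per fold are indicated in a comment.

```python
from sklearn.model_selection import KFold

image_ids = list(range(300))                 # the 300 labelled frames of this study
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(image_ids), start=1):
    # One fold is held out for testing, the remaining two are used for training;
    # in the paper each fold is additionally repeated ten times to assess TLPSO.
    print(f"fold {fold}: {len(train_idx)} training images, {len(test_idx)} test images")
```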

4.4 Performance analysis

Table 4 contains the outcomes of both the conventional YOLO system and the proposed optimized YOLO. The optimized YOLO attained a higher F-score for both actions, i.e., walking and running, with the exception of fold 1. These enhancements result from the better generalization of the pruned/optimized SqueezeNet, which possesses fewer convolutional filters than the regular YOLO backbone. Table 4 also demonstrates that the proposed optimized YOLO achieved a better balance between precision and recall. For the walking action in fold 1, for instance, the conventional YOLO recorded a precision of 94.02% and a recall of 87.41%, whereas the optimized YOLO balanced these metrics with precision and recall of 90.48% and 90.70%, respectively. It should be noted that the mean F-score for the walking action was greater than that for the running action owing to the larger number of training samples for walking. Figure 11 depicts examples of successfully detected cases.

Table 4 Results of YOLO and optimized YOLO
Fig. 11 Successfully recognized cases

Figure 12 illustrates a comparison between the precision-recall curves for the walking and running motions. Owing to the improved generalization capacity of the optimized SqueezeNet feature extractor, the optimized YOLO produces a better curve than YOLO. Figure 12a shows that the optimized YOLO achieved a 100% precision rate at an 85% recall rate for walking, whereas YOLO reported only a 95% precision rate at the same recall. It is important to note that both vision systems reported poorer performance for the running action, as seen in Fig. 12(a-c). This is owing to the difficulty of the recognition problem, since the shape and texture characteristics of the two motions are very similar, and to the larger number of training samples for the walking action.

Fig. 12 The precision-recall curve in three folds

4.5 Time and complexity analysis

An additional investigation was undertaken by incorporating the developed TLPSO method and computing a speedup analysis [37]. Table 5 reports the detection time in seconds and the filter reduction percentage. Both systems were implemented in MATLAB and tested on the same PC with Windows 10, an i7-8700 processor running at 3.2 GHz, and 32 GB of RAM. According to Table 5, the embedded optimizer was able to remove around 52% of the SqueezeNet filters. Importantly, the optimized YOLO required only 0.061 seconds per image, compared with 0.417 seconds for the baseline; in other words, the optimizer accelerated the YOLO vision system by a factor of roughly seven (0.417/0.061 ≈ 6.8). This is due to the removal of less-contributing SqueezeNet filters that incur additional computational cost during the feature extraction procedure. Note that the training/optimization time of TLPSO is not considered here because it is only required during the construction of the YOLO system.

Table 5 Results of filters reduction

4.6 Comparison with other optimizers

Further insights were obtained by comparing the outcomes of TLPSO against those of two related optimizers, namely PSO [16] and RLMPSO [36]. The parameters for these algorithms are listed in Table 6. All algorithms were run ten times with a maximum of 100 iterations, and Table 7 displays the results produced by each method in terms of the mean fitness value and the percentage of filter reduction. In terms of the mean fitness value, the proposed TLPSO optimizer clearly outperformed the other optimizers; as described in [37], this is owing to the advantages of a small population size (3 particles) and a dedicated local search layer. TLPSO also attained the highest filter reduction, reaching 52%. RLMPSO achieved the worst results because of the large cost of the fine-tuning operations performed by its local search [36]. Boxplots were used for additional analysis, as shown in Fig. 13. Based on the range of fitness values reported in each run, TLPSO achieves noticeably better outcomes than the other optimizers.

Table 6 Optimizers parameter settings
Table 7 Results of optimizers comparison
Fig. 13 Box plot of fitness values

The TLPSO convergence curve is compared with those of PSO and RLMPSO in Fig. 14. This curve illustrates TLPSO’s behaviour during operation; it depicts the base-10 logarithm of the mean fitness over ten runs of each algorithm for each fold. As can be observed, TLPSO has a better convergence curve in all folds, owing to its capacity to manage large-scale problems, as described in [37].

Fig. 14 The convergence curve analysis of TLPSO, PSO, and RLMPSO

The Wilcoxon statistical test [9] was used to examine the efficiency of TLPSO from a statistical standpoint. The essential concept of this test is that the null hypothesis H0 assumes all optimizers have comparable performance, with their means drawn from the same distribution, whereas the alternative hypothesis H1 assumes they come from different distributions. The significance level is set to 0.05, so the alternative hypothesis H1 is accepted if the p value is less than 0.05 (95% confidence level). The results of the statistical analysis are presented in Table 8, where it is evident that TLPSO statistically outperforms the other algorithms with p values below 0.05. This is due to the advantages of its small population size and its capacity to handle large-scale optimization problems [37].

Table 8 Statistical test analysis
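For illustration, such a test can be carried out as follows, assuming SciPy's rank-sum variant of the Wilcoxon test; the fitness values shown are invented for the example and are not the values reported in Table 8.

```python
import numpy as np
from scipy.stats import ranksums

# Illustrative fitness values from ten runs of two optimizers (not the paper's numbers).
tlpso_fitness = np.array([-0.912, -0.909, -0.915, -0.910, -0.913,
                          -0.911, -0.914, -0.910, -0.912, -0.913])
pso_fitness   = np.array([-0.885, -0.879, -0.882, -0.880, -0.884,
                          -0.881, -0.883, -0.878, -0.880, -0.882])

stat, p_value = ranksums(tlpso_fitness, pso_fitness)
print(f"p = {p_value:.4f}")
if p_value < 0.05:
    print("reject H0: the optimizers' fitness distributions differ significantly")
```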

4.7 Evaluation using a public dataset

In this section, the optimized YOLO vision system is applied to a publicly available benchmark dataset for identifying people in drone-captured images. The UAV123 dataset [23], which includes 3000 photographs, is used. The results of the proposed optimized YOLO are compared with those of other methods published in [10] and displayed in Table 9. In terms of precision, recall, and F-score, the proposed optimized YOLO clearly produced the best results. Owing to the lower network complexity of SqueezeNet, its generalization performance is superior to that of the standard YOLO and Faster R-CNN.

Table 9 Results of the UAV123 dataset [23]

5 Conclusion, limitations and future directions

In this study, we presented a YOLO-based deep learning vision system that employs the efficient TLPSO algorithm to remove ineffective SqueezeNet filters. The improved YOLO vision system was deployed on a drone to detect human motions, namely walking and running. The presented results confirmed the performance gains achieved by the proposed vision system. With TLPSO, we obtained a 7× improvement in detection speed over the baseline unoptimized YOLO vision system and eliminated 52% of the SqueezeNet filters; the drop in detection time from 0.417 seconds to 0.061 seconds verifies the benefit of the integrated TLPSO algorithm. TLPSO also achieved a higher mean fitness value than PSO and RLMPSO, with a considerably faster rate of convergence. The lightweight architecture of the proposed approach allows it to surpass previously published human recognition results on drone photographs (i.e., the UAV123 dataset).

The key limitations of this study are as follows: (i) the optimization task is time-consuming and requires sufficient hardware support; (ii) the formulated fitness function is complicated because it combines accuracy with complexity; and (iii) SqueezeNet is less efficient than some other pre-trained CNNs such as MobileNet. Several directions could further improve this work, including extending TLPSO into a multi-objective method, optimizing additional pre-trained networks, and implementing parallel processing to speed up the optimization. The proposed optimized YOLO vision system could also be applied to other real-world computer vision problems such as vehicle identification, face recognition, and industrial flaw detection.