1 Introduction

In recent years, the logistics industry in China has experienced rapid growth. However, labor costs and a lack of mechanization are key limitations hindering logistics transformation [1]. To enhance logistics efficiency and reduce operating costs and human resource demands, automation and Internet of Things technologies have found widespread use in logistics. This has enabled the development of automated warehousing, smart logistics vehicles, logistics robots, and other intelligent logistics systems, ultimately improving logistics tracking and management [2, 3]. Robotic arms and automated guided vehicles in intelligent logistics factories can replace manual processes such as picking, packing, and handling of items. Additionally, a robotic arm can work continuously for long periods, reducing the logistics industry's dependence on manpower while also mitigating the risk of manual picking errors and item damage. Moreover, it can provide additional safety guarantees by operating in hazardous and high-temperature logistics factory environments [4, 5]. In logistics warehouses, accurately locating items is crucial for improving the efficiency of transportation and sorting, as well as for tracking and tracing items and enhancing logistics visualization. The normal operation of a Robotic Arm (RA) depends on Visual Grasping (VG) technology, which uses computer vision to identify and manipulate objects [6]. However, the logistics environment is highly complex: haphazard stacking of objects diminishes the accuracy of localizing an individual item, and the changing shape of flexible parcels and the overlapping of multiple items make item localization difficult, which in turn makes it more challenging to determine the optimal grasping position for a single object [7, 8]. Currently, deep learning target detection algorithms and traditional vision methods, such as image segmentation and edge detection, have been combined to study localization in logistics scenarios. However, traditional baseline localization models cannot maintain localization accuracy under the complex backgrounds, simultaneous processing of multiple items, and partial occlusion typical of logistics warehousing.

The study breaks down the task of sorting items in complex logistics scenarios into two components: object localization and RA grasping position generation. An object localization model is developed using the You Only Look Once v3 (YOLOv3) target detection technology and an Example Segmentation Algorithm (ESA, i.e., instance segmentation). The logistics sorting algorithm is then designed on the basis of the localization results and Sampling Evaluation (SE). The study is divided into four parts. First, current research on VG technology and grasp pose generation, in both domestic and international contexts, is reviewed. Second, an object localization model is proposed based on an improved version of YOLOv3 and the ESA, and an item sorting algorithm is designed based on SE. Third, the performance of the localization model and the sorting algorithm is tested and analyzed experimentally. The fourth part provides a summary and analysis of the experimental findings. It is anticipated that this research will enhance the effectiveness of VG in sophisticated logistics environments and stimulate the development of automated logistics systems.

2 Related Works

VG plays a pivotal role in the development of industrial factory automation, with significant implications for intelligent sorting, loading and unloading in logistics warehouses, unmanned vehicles, and mechanical product packaging. As such, it has garnered substantial attention from a diverse range of stakeholders, with numerous researchers undertaking extensive investigations into VG and the related problem of grasp pose generation. Robots play a comprehensive role in agriculture, and their operations depend on visual servoing technology. To solve the problem of visual servoing task failure due to Jacobian matrix singularity, Li et al. designed a 3D target visual servoing method based on feature learning and image moment learning and successfully applied it to an agricultural material handling robot. The effectiveness of the algorithm was verified by experiments in which a six-degree-of-freedom robot system tracked bagged flour, with displacement localization accuracy at the millimeter level and orientation angle accuracy at the 0.1° level [9]. Traditional RAs rely on preplanned routes to grasp objects and have difficulty obtaining information about the external environment, which affects grasping accuracy. Zhao designed a modular RA motion device based on multi-feature video and experimentally verified its practical performance, obtaining a minimum relative error of 1.16% against a laser rangefinder and a grasping success rate of 88.9% [10]. To reduce the dependence of yarn grasping operations on manual labor, Han et al. proposed a vision method based on the 3D somatosensory camera Kinect v2. Kinect v2 generated 3D yarn-spinning point cloud data, which was filtered and processed to complete a rough alignment with the template point cloud. Experiments verified that the working time of the robot under this method was within 10 s and the grasping success rate was above 80%, meeting the requirements of industrial production [11]. Liu et al. researched unmanned sorting and automated warehousing to realize the effective operation of intelligent supply chains and digital logistics. They established an image recognition model for arbitrarily shaped objects using convolutional neural networks, in accordance with the driving and grasping modes of the RA, enabling the RA to automatically pick items of any shape from the picking list [12].

To develop sufficiently dexterous robotic systems for tedious and dangerous experimental work, Li et al. designed a benchtop experimental manipulator control system based on audiovisual information fusion. The system used a motion detection algorithm, an improved two-stream convolutional network, and motion control with grasping gesture recognition to teach and control the execution of specific chemistry experiments [13]. Burguera et al. proposed a new visual simultaneous localization and mapping method for underwater robots that uses robust visual loop detection to reduce the number of false loops entering the pose graph optimizer. The method showed higher robustness in false loop detection and superior computational efficiency, and semi-synthetic data and real datasets obtained from autonomous robots validated its effectiveness and applicability [14]. To solve the multi-target grasp detection problem for robots, Zhao et al. designed a multi-target grasping algorithm based on an attention mechanism and non-uniform spatial pyramid pooling, which enlarged the receptive field with the help of dilated convolutions at several different dilation rates and preserved features over different ranges. The multi-target grasping dataset constructed by the study verified the grasping success rate of the algorithm [15]. To enable arm-free human–machine interface (HMI) control of an RA for grasping various densely placed objects, Zhang et al. studied head-based control through an arm-free HMI with mixed reality feedback; the system can effectively determine grasping areas for dense objects with high point cloud tolerance [16]. Data collection for target detection and grasp location estimation poses significant hurdles for deep convolutional networks. Pattar et al. therefore presented an automated data-gathering approach utilizing mobile robots and invisible markers for object detection and grasp location estimation. In terms of cost, dataset consistency, and time consumption, their solution fared better than human data annotation [17]. The motion state estimation and shape recognition of non-cooperative targets in space missions are very important. Du et al. combined key point detection with the pose estimation algorithm Perspective-n-Point to accomplish recognition of targets and their characteristic points. Real-time semantic segmentation was used to transmit the relative positions of the capture points to the manipulator, and the experimental results showed that the satellite recognition accuracy of this method was 99.48% [18].

Research on deep learning has made significant advances in various fields. The complexity of image backgrounds in remote sensing makes it difficult for traditional target detectors to perform well in terms of detection inference time and model size. To address this issue, Chen et al. proposed a consistency- and dependency-guided knowledge distillation method for image target detection, based on spatial and channel structural discrimination modules that effectively eliminate the effects of noise and complex backgrounds. Validated on a public dataset, the method achieved an average accuracy of 92% with a model size of 3.3 M and a speed of 588.2 frames/s, outperforming existing state-of-the-art target detection methods [19]. Current deep learning models for recognizing and classifying breast cancer histopathology images do not fully utilize the staining properties of these images. He et al. developed a transformer network model based on color deconvolution that uses a self-attention mechanism to match the independent attributes of the HED channel information obtained through color deconvolution, and residual connections to fuse image information across multiple color spaces. The method achieved an average accuracy of 93.02% and an F1-score of 0.9389 on the BreakHis dataset [20]. Current network intrusion detection techniques often overlook the topology of network traffic, which hampers their handling of class-imbalanced and highly dynamic network traffic. Zhong et al. proposed a dynamic multi-scale topology representation method that achieves multi-scale topology awareness and utilizes group shuffling operations to achieve dynamic topology representation. Validated on various publicly available datasets, the method was found to be effective in handling class imbalance and highly dynamic network traffic [21]. Current methods for completing missing traffic data are time-consuming and do not extract sufficient information. Chen et al. proposed a new non-negative tensor completion model that preserves the time dimension, based on a new tensor decomposition method, a sigmoid mapping, time constraints, and the AdamW optimization scheme. It exhibited strong flexibility and convergence speed, outperforming existing state-of-the-art models [22]. Loss of traffic data can significantly impact the accuracy of road traffic management decisions. To address this issue, Xu et al. optimized a low-rank matrix completion model using Hessian-regularized spatial constraints and designed a data completion algorithm that captures temporal evolution through second-order differences with time series constraints. Validated on four different traffic datasets, the method improved the root-mean-square error by over 14% at a data missing rate of 90%, outperforming other existing state-of-the-art models [23].

Overall, the technology and research surrounding VG have been extensive. However, its application in the intricate domain of logistics remains limited, with deficiencies in gripping accuracy and optimal position determination. To further the development of the automated logistics industry, this study delves into object localization and grasp pose generation within VG.

3 Deep Learning-Based Localization and Sorting Model Design in Complex Logistics Environments

Common approaches to logistics sorting typically do not decompose the problem into object localization and grasping. This study categorizes logistics scenarios and adopts different localization methods for objects based on their shape and edges, and improvement measures are designed to enhance the accuracy of the sorting process. The study evaluates candidate poses in a fixed space according to the different logistics scenarios, taking into account factors such as sorting speed, accuracy, and cost. Adjusting the final sorting positions based on actual needs can enhance the adaptability of logistics processing.

3.1 Design of Document Class Object Localization Algorithm Based on YOLOv3

In complex logistics sorting scenes, items are usually placed in a messily stacked state, so logistics robots need to visually understand the stacked scene and complete object localization on the basis of this scene understanding. Commonly used object localization algorithms (LA) in the field of computer vision include feature matching, target detection, and edge detection. Traditional target and edge detection algorithms are usually sensitive to complex scenes and background noise, are easily disturbed by illumination changes, shadows, occlusion, and other factors, cannot handle occlusion and multi-object situations, and have low localization accuracy. Moreover, traditional target detection algorithms are sensitive to changes in the shape and attitude of the object; if the shape, angle, or scale of the object changes, the features extracted by the LA are likely to become ineffective [24, 25]. Therefore, the research completes target detection and segmentation of objects on the basis of deep learning.

The study utilizes YOLOv3-based target detection to obtain the bounding box of the document bag as the localization result for flat, regular documents or envelopes. The YOLOv3 structure is shown in Fig. 1 [26]. The YOLO family is a typical target detection model for fast and reliable recognition and is widely used in the field of computer vision. The framework has undergone several iterations, from YOLOv1 to the latest version, YOLOv7. YOLOv7 includes an extended efficient layer aggregation network, which controls the gradient path length to speed up model learning and convergence compared with previous versions. YOLOv7 is a concatenation-based architecture that seeks an optimal model structure by improving traditional scaling techniques [27, 28]. However, these two technical improvements of YOLOv7 contribute little to the accuracy of document localization in complex environments. Additionally, the YOLOv7 architecture requires more complex, labeled datasets to train the model, which is too time-consuming. Therefore, the classic YOLOv3 framework is chosen for this study, as it is better suited to accuracy-focused training for small targets.

Fig. 1
figure 1

Schematic diagram of YOLOv3 structure

First, an image dataset of envelope-type documents is established. Logistics document bags are photographed from multiple angles in their natural stacking state, and the image size is uniformly adjusted to \(640 \times 480\). To prevent overfitting and ensure that the model learns image features more accurately, the study applies data augmentation, expanding the image dataset using flipping, rotation, scaling, cropping, and shifting in both horizontal and vertical directions. The annotation tool Labelme is used to annotate the dataset, and the gathered dataset is finally split between training and test sets in an 8:2 ratio.
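
As an illustration, a minimal sketch of this augmentation step is given below, using torchvision. The file names and parameter values (rotation range, crop size, expansion factor) are illustrative assumptions, and in a real detection pipeline the Labelme bounding-box annotations would need to be transformed consistently with the images.

```python
# A minimal sketch of the augmentation described above (flip, rotate, scale,
# crop, shift). Parameter values and file names are illustrative assumptions;
# box annotations must be transformed with the same operations (not shown).
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.Resize((480, 640)),                          # unify image size to 640 x 480 (W x H)
    T.RandomHorizontalFlip(p=0.5),                 # horizontal flip
    T.RandomVerticalFlip(p=0.5),                   # vertical flip
    T.RandomAffine(degrees=15,                     # small rotation
                   translate=(0.1, 0.1),           # horizontal / vertical shift
                   scale=(0.8, 1.2)),              # scaling
    T.RandomCrop((448, 608), pad_if_needed=True),  # cropping
])

img = Image.open("document_bag_0001.jpg")          # hypothetical sample image
for k in range(5):                                 # expand the dataset with 5 variants
    augment(img).save(f"aug_{k}_document_bag_0001.jpg")
```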

The target detection algorithm YOLOv3 can quickly and accurately detect image targets, offering real-time, high-precision, multi-scale detection, and can run on images or video streams in real time to detect very small targets [29]. YOLOv3 divides the image into an \(s \times s\) grid of equal-sized cells. The confidence and intersection-over-union of the bounding box are calculated in Eq. (1), where \(\Pr \left( {object} \right)\) denotes the probability that an object is present, \(\widehat{b}\) and \(b\) represent the predicted and true bounding boxes, respectively, \(\Pr \left( {object} \right) * IoU_{pred}^{truth}\) defines the confidence level, and \(IoU\) denotes the intersection-over-union ratio.

$$ \left\{ {\begin{array}{*{20}c} {\Pr \left( {object} \right) * IoU_{pred}^{truth} } \\ {IoU\left( {\widehat{b},b} \right) = \frac{{\left| {\widehat{b} \cap b} \right|}}{{\left| {\widehat{b} \cup b} \right|}}} \\ \end{array} } \right. $$
(1)

When the image contains multiple objects to detect, each grid cell predicts the class-conditional confidence \(\Pr \left( {\left. {Class_{i} } \right|object} \right)\) for the different objects, and the final confidence is calculated as shown in Eq. (2).

$$ \Pr \left( {\left. {Class_{i} } \right|object} \right) * \Pr \left( {object} \right) * IoU_{pred}^{truth} $$
(2)

Considering that the predicted and ground-truth boxes may not intersect, and that different predicted boxes can yield the same \(IoU\), the study adopts the loss function \(GIoU \, Loss\) calculated as shown in Eq. (3), where \(a\) denotes the smallest box enclosing \(\widehat{b}\) and \(b\).

$$ GIoU \, Loss = IoU - \frac{{\left| a \right| - \left| {\widehat{b} \cup b} \right|}}{{\left| a \right|}} $$
(3)

When YOLOv3 detects envelope document bags, mutual occlusion between bags or overly small distances between the center points of adjacent bags makes object feature extraction coarse. To enhance the object detection effect, the study incorporates a fused attention technique into the YOLOv3 network topology. Figure 2 illustrates the principle of the convolutional neural network's attention mechanism, which consists of a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism consists of two components, compression and excitation [30]. The compression part compresses the global information, feature learning is then carried out in the channel dimension, and the excitation part finally assigns a weight to each channel.

Fig. 2
figure 2

Schematic diagram of the principle of channel attention mechanism

Equation (4) gives the calculation of the channel attention mechanism \(M_{c}\). The symbols \(AvgPool\) and \(MaxPool\) stand for average and maximum pooling, respectively; \(\sigma\) represents the activation function, \(MLP\) the multilayer perceptron, and \(F\) the image features.

$$ M_{c} \left( F \right) = \sigma \left( {MLP\left( {AvgPool\left( F \right)} \right) + MLP\left( {MaxPool\left( F \right)} \right)} \right) $$
(4)

The spatial attention mechanism, represented by models such as Spatial Transformer Networks (STNs), can transform various spatial deformations of the data and capture important regional features. The expression for the spatial attention mechanism \(M_{s}\) is shown in Eq. (5), where \(f^{7 \times 7}\) represents a convolution operation with a \(7 \times 7\) kernel.

$$ M_{s} \left( F \right) = \sigma \left( {f^{7 \times 7} \left[ {\left( {AvgPool\left( F \right)} \right);\left( {MaxPool\left( F \right)} \right)} \right]} \right) $$
(5)

The Convolutional Block Attention Module (CBAM), whose structure is depicted in Fig. 3 [31], combines channel attention and spatial attention, each learning features in an independent module. To obtain the adjusted features, the channel attention module and the spatial attention module in CBAM process the input features sequentially. Finally, the study replaces the residual blocks in the YOLOv3 structure with the CBAM module to complete the improvement of the YOLOv3 network structure.

Fig. 3
figure 3

Schematic diagram of the hybrid attention mechanism module
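
As an illustration, a minimal PyTorch sketch of the CBAM module of Eqs. (4)–(5) and Fig. 3 is given below. The reduction ratio of 16 and the example feature-map shape are assumed defaults, not parameters reported by the study.

```python
# Minimal PyTorch sketch of CBAM (Eqs. (4)-(5), Fig. 3): channel attention
# followed by spatial attention. The reduction ratio of 16 is an assumed default.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled features (Eq. (4))
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 convolution over concatenated avg/max maps (Eq. (5))
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention M_c
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention M_s
        s_avg = torch.mean(x, dim=1, keepdim=True)
        s_max, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([s_avg, s_max], dim=1)))

feat = torch.randn(1, 256, 52, 52)   # e.g. a YOLOv3 feature map (illustrative shape)
out = CBAM(256)(feat)                # output has the same shape as the input
```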

3.2 Design of Soft Parcel Localization Algorithm Based on Mask R-CNN

Since soft parcels are prone to deformation during logistics transport, object localization accuracy suffers. In this study, deep learning techniques, namely the ESA and the Mask Region-based Convolutional Neural Network (Mask R-CNN), are utilized to localize soft parcels. Mask R-CNN performs object detection while generating accurate pixel-level segmentation results, producing a binary mask for each segmented target instance and thereby achieving accurate localization and segmentation of target boundaries. Additionally, Mask R-CNN includes the Region Proposal Network (RPN) and Feature Pyramid Network (FPN), which are widely utilized in automatic driving, remote sensing image analysis, and medicine; these networks can process features at different scales. The Mask R-CNN structure and the RPN working principle are shown in Fig. 4 [32, 33].

Fig. 4
figure 4

Structure of Mask R-CNN and RPN working principle

The soft parcel dataset consists of color images and depth images. The pixel data of the two image types are normalized, and the depth values of the depth image are scaled to the color image's pixel interval [0, 255]; the calculation is shown in Eq. (6), where \(X\) denotes the two-dimensional depth value matrix, \(x\) denotes the original depth value, and \(x{\prime}\) denotes the normalized depth value. The normalization is followed by data augmentation and labeling operations to complete the dataset.

$$ x^{\prime } = \frac{{x - \min \left( X \right)}}{{\max \left( X \right) - \min \left( X \right)}} $$
(6)
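
As an illustration, a minimal NumPy sketch of this normalization is shown below. Eq. (6) itself performs min–max normalization to [0, 1]; the final multiplication by 255 to reach the color image's [0, 255] range, and the example depth values, are assumptions of this sketch.

```python
# Minimal sketch of the depth normalization in Eq. (6). Eq. (6) maps depths to
# [0, 1]; the scaling by 255 to match the color image's [0, 255] range (as
# described in the text) is an assumption of this sketch.
import numpy as np

def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Map a 2-D depth matrix X to the color image's 8-bit pixel range."""
    d_min, d_max = depth.min(), depth.max()
    x_norm = (depth - d_min) / (d_max - d_min)   # Eq. (6): x' in [0, 1]
    return (x_norm * 255.0).astype(np.uint8)     # rescale to [0, 255]

depth_image = np.random.uniform(400, 1200, size=(480, 640))   # illustrative depth map (mm)
depth_8bit = normalize_depth(depth_image)
```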

After the shared convolutional layers complete the convolution operations, the RPN extracts candidate boxes containing objects from the image, after which the candidate regions are classified. The RPN works by scanning the image with a sliding window using anchor boxes at different scales, providing foreground and background scores together with the displacement offsets of the localized objects relative to the anchors [34, 35]. After the candidate regions of interest (ROIs) produced by the RPN are mapped onto a fixed-size feature map, the study uses ROI Align to complete feature map selection, with bilinear interpolation used to obtain image values at pixel points whose coordinates are floating-point numbers. This makes the feature aggregation process continuous and avoids the loss of accuracy caused by quantizing the ROI region [36, 37]. The ROI Align back-propagation process is shown in Eq. (7), where \(d\left( \cdot \right)\) denotes the distance between two points, \(i^{*} \left( {r,j} \right)\) denotes the floating-point coordinates, \(\Delta h\) and \(\Delta w\) denote the differences between the horizontal and vertical coordinates of the two points, respectively, and \(\frac{\partial L}{{\partial x_{i} }}\) and \(\frac{\partial L}{{\partial y_{r,j} }}\) denote the gradients with respect to the features before and after pooling, respectively.

$$ \frac{\partial L}{{\partial x_{i} }} = \sum\limits_{r} {\sum\limits_{j} {\left[ {d\left( {i,i^{*} \left( {r,j} \right)} \right) < 1} \right]} } \left( {1 - \Delta h} \right)\left( {1 - \Delta w} \right)\frac{\partial L}{{\partial y_{r,j} }} $$
(7)
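
As an illustration, the following minimal NumPy sketch shows the bilinear interpolation that ROI Align uses to read a feature value at floating-point coordinates (the forward counterpart of the weights \((1-\Delta h)(1-\Delta w)\) in Eq. (7)). The feature map and sampling location are illustrative.

```python
# Minimal sketch of the bilinear interpolation used by ROI Align to read a
# feature value at floating-point coordinates.
import numpy as np

def bilinear_sample(feat: np.ndarray, y: float, x: float) -> float:
    """Sample feat at non-integer (y, x) by weighting the 4 integer neighbours."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dh, dw = y - y0, x - x0                      # the delta-h and delta-w of Eq. (7)
    return ((1 - dh) * (1 - dw) * feat[y0, x0] + (1 - dh) * dw * feat[y0, x1]
            + dh * (1 - dw) * feat[y1, x0] + dh * dw * feat[y1, x1])

fmap = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4x4 feature map
value = bilinear_sample(fmap, 1.3, 2.7)          # value at a floating-point location
```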

The loss function of Mask R-CNN is calculated as shown in Eq. (8), where \(L_{cls}\), \(L_{box}\), and \(L_{mask}\) correspond to the object category, bounding box, and mask outputs of the network, and denote the classification cross-entropy loss, the bounding-box regression loss, and the average binary cross-entropy loss of the masks, respectively.

$$ L = L_{cls} + L_{box} + L_{mask} $$
(8)

In addition, FPN has been widely introduced into the Mask R-CNN network structure to increase its ability to process features at different scales. FPN is a multi-scale feature extraction network that uses bottom-up and top-down pathways to downscale and upscale the feature maps, and lateral connections to fuse multi-scale feature maps after unifying their sizes. The resulting feature maps contain richer spatial and semantic information at all scales [38, 39].

Because the majority of photos in the dataset are color images, the weights and extracted features from network training are better suited to color images. In this study, a shallow network is designed for depth image features, consisting of encoding, decoding, skip connections, and ReLU activation functions. Finally, the study fuses the color and depth features with a convolution kernel of size \(1 \times 1\). The fused features are used for prediction in the segmentation sub-network, and the structure of the Mask R-CNN network with fused features is shown in Fig. 5.

Fig. 5
figure 5

Schematic diagram of the network structure of Mask R-CNN with fused features
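
As an illustration, a minimal PyTorch sketch of this fusion step is given below: a shallow encoder–decoder with a skip connection extracts depth features, which are fused with color features by a \(1 \times 1\) convolution. Channel sizes, layer counts, and feature-map shapes are illustrative assumptions rather than the exact network of Fig. 5.

```python
# Minimal PyTorch sketch of the fusion in Fig. 5: a shallow encoder-decoder with
# a skip connection extracts depth features, which are fused with color features
# by a 1x1 convolution. Channel sizes and layer counts are illustrative.
import torch
import torch.nn as nn

class ShallowDepthNet(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
                                 nn.ReLU(inplace=True))
        self.skip = nn.Conv2d(1, out_channels, 1)   # skip connection from the input

    def forward(self, depth):
        return self.dec(self.enc(depth)) + self.skip(depth)

class ColorDepthFusion(nn.Module):
    def __init__(self, color_channels=256, depth_channels=64):
        super().__init__()
        self.depth_net = ShallowDepthNet(depth_channels)
        self.fuse = nn.Conv2d(color_channels + depth_channels, color_channels, kernel_size=1)

    def forward(self, color_feat, depth):
        depth_feat = self.depth_net(depth)
        # Resize the depth features to the color feature-map resolution before fusing
        depth_feat = nn.functional.interpolate(depth_feat, size=color_feat.shape[-2:])
        return self.fuse(torch.cat([color_feat, depth_feat], dim=1))

color_feat = torch.randn(1, 256, 120, 160)   # e.g. an FPN feature map (illustrative)
depth = torch.randn(1, 1, 480, 640)          # normalized depth image
fused = ColorDepthFusion()(color_feat, depth)
```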

3.3 Design of Logistics Sorting Algorithm Based on Item Localization and Sampling Evaluation

Logistics sorting requires grasping objects from randomly stacked packages, and an appropriate grasping pose is key to sorting accuracy; however, the appropriate grasping position of a given object in a fixed space is not unique [40, 41]. Therefore, the study scores candidate sorting positions with an evaluation model, and the final sorting position is determined from the evaluation results [42].

First, the grasping space of the object is restricted. For parcels in the shape of flat, regular document bags, the best suction location is the center of the bounding box \(\left( {x,y} \right)\), and the search range is limited to a region extending outward from \(\left( {x,y} \right)\) by \(\delta\); that is, the suction point \(\left( {p_{x} ,p_{y} } \right)\) satisfies \(p_{x} \in \left[ {x - \delta ,x + \delta } \right]\) and \(p_{y} \in \left[ {y - \delta ,y + \delta } \right]\). To determine the suction direction, the pixel points on the depth image are fitted to the plane expression \(ax + by + cz + d = 0\), where the \(z\) coordinate represents the depth-image distance at the suction point; the plane is illustrated schematically in Fig. 6.

Fig. 6
figure 6

Normal and plane diagram

The normal vector \(n\) of this plane and the procedure for finding it are shown in Eq. (9), where \(\nabla F\) denotes the value of the gradient at the point.

$$ \left\{ {\begin{array}{*{20}c} {n = \left( {a,b,c} \right)} \\ {n = \nabla F\left( {x,y,z} \right)} \\ \end{array} } \right. $$
(9)
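
As an illustration, the following minimal NumPy sketch follows this suction-point procedure: candidate points are confined to a window of half-width \(\delta\) around the bounding-box center, a plane \(ax + by + cz + d = 0\) is least-squares fitted to the local depth patch, and its normal \(n = (a, b, c)\) gives the suction direction as in Eq. (9). The window size, \(\delta\), and the depth values are illustrative assumptions.

```python
# Minimal sketch of the suction-point step: restrict candidates to a window of
# half-width delta around the box centre (x, y), fit the plane ax+by+cz+d=0 to
# the local depth patch, and take its normal n = (a, b, c) as the suction
# direction (Eq. (9)). delta and the depth values are illustrative.
import numpy as np

def suction_normal(depth: np.ndarray, x: int, y: int, delta: int = 10):
    # Pixels (px, py) with px in [x-delta, x+delta] and py in [y-delta, y+delta]
    xs, ys = np.meshgrid(np.arange(x - delta, x + delta + 1),
                         np.arange(y - delta, y + delta + 1))
    zs = depth[ys, xs]
    # Fit z = alpha*x + beta*y + gamma, i.e. the plane alpha*x + beta*y - z + gamma = 0
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    (alpha, beta, gamma), *_ = np.linalg.lstsq(A, zs.ravel(), rcond=None)
    n = np.array([alpha, beta, -1.0])            # plane normal (a, b, c)
    return n / np.linalg.norm(n)                 # unit suction direction

depth_img = np.random.uniform(500, 520, size=(480, 640))   # illustrative depth image
normal = suction_normal(depth_img, x=320, y=240, delta=10)
```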

For soft and irregularly shaped parcels, the sampling area is limited to the contour of the parcel localization result, which consists of pixel points that follow the shape of the parcel at high resolution [16, 43]. Therefore, the parcel gripping position is set at the edge of the parcel. The friction between the gripping ends of the fixture and the object is modeled as Coulomb friction, and the gripping points must satisfy Eq. (10) so that the grasp satisfies the principle of force closure. In Eq. (10), \(P\left( {u_{1} } \right)\) and \(P\left( {u_{2} } \right)\) denote the two contact positions, \(u_{1}\) and \(u_{2}\) denote the contact parameters during grasping, \(n\left( {u_{1} } \right)\) denotes the contact normal, and \(c_{f} = \cos \left( {\tan^{ - 1} \mu } \right)\), where \(\mu\) denotes the Coulomb friction coefficient.

$$ n\left( {u_{1} } \right) \cdot \frac{{P\left( {u_{2} } \right) - P\left( {u_{1} } \right)}}{{\left\| {P\left( {u_{2} } \right) - P\left( {u_{1} } \right)} \right\|}} > c_{f} $$
(10)

The parcel grasping position should additionally satisfy Eq. (11), where \(\delta_{\min }\), \(\delta_{\max }\), and \(\delta_{mag}\) denote set parameters, \(b_{center}\) denotes the center point of the bounding box, \(p^{\prime }_{center}\) denotes the midpoint of the candidate gripping position, and \(b_{i}\) denotes a corner point of the bounding box. The depth value of a candidate position is taken as the minimum depth value within a window of height \(h\) and width \(w\), and the gripper placement direction is defined by the direction of the vector connecting the gripper endpoints.

$$ \left\{ {\begin{array}{*{20}c} {\delta_{\min } \le {\text{distance}}\left( {p^{\prime }_{{{\text{center}}}} ,b_{{{\text{center}}}} } \right) \le \delta_{\max } } \\ {{\text{distance}}\left( {p^{\prime }_{{{\text{center}}}} ,b_{i} } \right) \le \delta_{mag} } \\ \end{array} } \right. $$
(11)
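
As an illustration, a minimal NumPy sketch of the candidate filtering implied by Eqs. (10) and (11) is given below: an antipodal force-closure check under Coulomb friction, plus the distance constraints of the grasp midpoint relative to the box center and corners. All thresholds, contact points, and normals are illustrative assumptions.

```python
# Minimal sketch of the candidate filtering implied by Eqs. (10) and (11):
# a force-closure (antipodal) check under Coulomb friction, plus distance
# constraints of the grasp midpoint to the box centre/corners. All values
# below are illustrative assumptions.
import numpy as np

def force_closure(p1, p2, n1, n2, mu=0.5):
    """Eq. (10): both contact directions must lie inside the friction cone."""
    c_f = np.cos(np.arctan(mu))
    d = (np.asarray(p2) - np.asarray(p1)).astype(float)
    d /= np.linalg.norm(d)
    return np.dot(n1, d) > c_f and np.dot(n2, -d) > c_f

def within_box_constraints(p_center, b_center, b_corners,
                           d_min=5.0, d_max=40.0, d_mag=80.0):
    """Eq. (11): midpoint-to-centre distance in [d_min, d_max], corners <= d_mag."""
    dist_c = np.linalg.norm(np.asarray(p_center) - np.asarray(b_center))
    dist_corners = [np.linalg.norm(np.asarray(p_center) - np.asarray(c)) for c in b_corners]
    return d_min <= dist_c <= d_max and max(dist_corners) <= d_mag

# Illustrative candidate grasp on a parcel contour
p1, p2 = (100.0, 200.0), (140.0, 200.0)               # two contact points on the edge
n1, n2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])  # inward-facing contact normals
mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
box_center, box_corners = (118.0, 205.0), [(90, 180), (150, 180), (90, 230), (150, 230)]

valid = force_closure(p1, p2, n1, n2) and within_box_constraints(mid, box_center, box_corners)
```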

After candidate grasping positions are determined, the final sorting position must be selected according to the SE results of the candidates. The study adopts the Grasp Quality Convolutional Neural Network (GQ-CNN) module of the PointNet Grasp Pose Detection (PointNet GPD) algorithm to score the candidate positions. The candidate pose grading is performed by GQ-CNN, whose network structure is shown in Fig. 7. GQ-CNN is composed primarily of a sequence of convolutional, pooling, and fully connected layers that map the global feature vector extracted by PointNet GPD to an estimate of grasping pose quality; it specializes in regressing grasping poses from point cloud data. GQ-CNN predicts the quality of a gripper grasp and outputs a confidence level for the grasping quality.

Fig. 7
figure 7

GQ-CNN network framework

The goal of the gripper is to learn the robustness function \(Q_{{\theta^{*} }} \left( {u,y} \right)\) over grasping poses and images, which can classify different grasping poses according to a grasp success metric; its expression is shown in Eq. (12). In Eq. (12), \(\ell\) denotes the cross-entropy function and \(\vartheta\) denotes the GQ-CNN parameter space. \(P\left( {S,u,x,y} \right)\) denotes the joint probability distribution of gripper grasp success, pose, state, and point cloud. \(S\) denotes the binary grasp success metric. \(Q_{\theta } \left( {u,y} \right) = {\rm E}\left[ {S\left| {u,y} \right.} \right]\) denotes the success probability.

$$ \theta^{ * } = \mathop {\arg \min }\limits_{\theta \in \vartheta } {\rm E}_{{p\left( {S,u,x,y} \right)}} \left[ {\ell \left( {S,Q_{\theta } \left( {u,y} \right)} \right)} \right] $$
(12)

The goal of the suction gripper is a suction pose that maximizes robustness, \(\pi^{*} \left( y \right) = \arg \max_{u \in C} Q\left( {y,u} \right)\), for a given point cloud, with the weight parameters of \(Q_{{\theta^{*} }}\) obtained by minimizing the cross-entropy loss shown in Eq. (13). In Eq. (13), \(Q_{\theta } \left( {y_{i} ,u_{i} } \right) = P\left( {R\left| {y,u} \right.} \right)\) denotes the probability of successful suction, and \(\left\{ {\left( {y_{i} ,u_{i} ,R_{i} } \right)} \right\}\) denotes the point cloud, the suction pose, and the reward value.

$$ \theta^{ * } = \mathop {\arg \min }\limits_{\theta \in \vartheta } \sum\limits_{i = 1}^{N} {\ell \left( {R_{i} ,Q_{\theta } \left( {y_{i} ,u_{i} } \right)} \right)} $$
(13)

The GQ-CNN network employs the cross-entropy loss function as its loss function, with Eq. (14) expressing the function. In Eq. (14), the real label is represented by \(y_{i}\) and the network prediction result is represented by \(p_{i}\).

$$ loss_{i} = - \left[ {\left( {1 - y_{i} } \right)\ln \left( {1 - p_{i} } \right) + y_{i} * \ln p_{i} } \right] $$
(14)
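
As an illustration, a minimal PyTorch sketch of a GQ-CNN-style grasp quality scorer is given below: convolutional, pooling, and fully connected layers map a candidate grasp (here a depth crop plus a pose scalar) to a success probability \(Q_{\theta }\), trained with the binary cross-entropy loss of Eq. (14). The layer sizes and input representation are illustrative assumptions rather than the exact network of Fig. 7.

```python
# Minimal sketch of a GQ-CNN-style grasp quality scorer: conv/pool/FC layers map
# a 32x32 depth crop plus a pose scalar to a success probability, trained with
# the binary cross-entropy loss of Eq. (14). Layer sizes are illustrative.
import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 8 * 8 + 1, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1),
        )

    def forward(self, depth_crop, pose_scalar):
        feat = self.conv(depth_crop).flatten(1)
        logit = self.fc(torch.cat([feat, pose_scalar], dim=1))
        return torch.sigmoid(logit)          # Q_theta(u, y): grasp success probability

model = GraspQualityNet()
crops = torch.randn(8, 1, 32, 32)            # depth crops around 8 candidate grasps
poses = torch.randn(8, 1)                    # e.g. grasp depth relative to the camera
labels = torch.randint(0, 2, (8, 1)).float() # binary success labels (S or R)

q = model(crops, poses)
loss = nn.functional.binary_cross_entropy(q, labels)   # Eq. (14)
loss.backward()
```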

4 Performance Test of Logistics Algorithms Based on Deep Learning and Sampling Evaluation

To confirm the efficacy of the sorting model designed by the study, the logistics sorting effect was analyzed and performance tests were designed for the LA and the pose generation algorithms.

4.1 Performance Test of Deep Learning Based Object Localization Algorithm

The logistics parcel dataset collected for this research, together with the 7Scenes, InLoc, and KITTI datasets, is selected for the performance tests. The 7Scenes dataset contains RGB-D images, ground-truth camera poses, and 3D models of seven indoor rooms; its images include texture-less regions, motion blur, and repetitive structures. The InLoc dataset consists of an RGB-D image database and RGB query images captured by handheld devices for indoor localization tasks. The KITTI dataset, with its large collection of real scene photos, is currently the largest international dataset for assessing computer vision algorithms in autonomous driving scenarios. To prevent the class imbalance problem from affecting the localization results, the study adjusted the class distribution of the dataset by using over-sampling and under-sampling to treat minority and majority class samples, respectively. The chosen photos from every dataset are split into training and testing sets at an 8:2 ratio.
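
As an illustration, a minimal Python sketch of this rebalancing step is given below: minority classes are randomly over-sampled and majority classes under-sampled toward a common target count before the 8:2 split. The target count, file names, and class labels are illustrative assumptions.

```python
# Minimal sketch of the class rebalancing described above: minority classes are
# over-sampled and majority classes under-sampled toward a common target count
# before the 8:2 train/test split. The target count is an assumption.
import random
from collections import defaultdict

def rebalance(samples, target=None, seed=0):
    """samples: list of (image_path, class_label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s in samples:
        by_class[s[1]].append(s)
    target = target or int(sum(len(v) for v in by_class.values()) / len(by_class))
    balanced = []
    for items in by_class.values():
        if len(items) >= target:                 # majority class: under-sample
            balanced += rng.sample(items, target)
        else:                                    # minority class: over-sample
            balanced += items + rng.choices(items, k=target - len(items))
    rng.shuffle(balanced)
    return balanced

data = rebalance([("img_%04d.jpg" % i, i % 3) for i in range(300)])   # illustrative data
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]
```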

To validate the performance of the research-improved YOLOv3 target detection algorithm for document bags and examine its localization effect on document-bag-type items, the study compares it with the traditional YOLOv3 detection method, the one-stage target detector RetinaNet, and the deep learning target detection method Region-based Convolutional Neural Networks (RCNN). The average experimental findings over the several datasets are displayed in Fig. 8, with model accuracy and F1 value selected as the evaluation indices. In Fig. 8, the average results on the four datasets indicate that the research-designed Imp-YOLOv3 model achieves the highest level in both F1 value and accuracy, reaching 95.77% and 94.05%, respectively. Both evaluation metrics are significantly better than those of the other target detection algorithms; the one-stage detector RetinaNet has the worst overall performance, with an F1 value of only 77.41%, 18.36 percentage points lower than the research-designed LA, and an accuracy of only 78.29%, 15.76 percentage points lower than the optimal value. The F1 value provides a comprehensive evaluation of the detection algorithm by considering both precision and recall, representing the degree of matching between the localization results and the actual positions. The Imp-YOLOv3 model shows high utility on different types of datasets, with high localization accuracy and good comprehensive evaluation performance, which is conducive to the sorting and grasping operation of the RA.

Fig. 8
figure 8

Performance comparison results of different target detection algorithms

The target detection algorithms' performance is further assessed using the Receiver Operating Characteristic (ROC) curve, with the Area Under the Curve (AUC) serving as the specific evaluation metric. The ROC shows the classification behavior of the model under various thresholds, with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. Generally, the AUC value falls between 0 and 1, with a larger value indicating better model performance. Figure 9 displays the experimental findings of the target detection models across the various datasets. In this figure, the ROC curves of the Imp-YOLOv3 model on the four datasets lie toward the top of the axes, with the largest AUC value of 0.913, close to a perfect prediction, and the overall level of its values is higher than those of the other target detection models. In the same experimental setting, the highest AUC value of the traditional YOLOv3 model is only 0.784, the highest AUC value of RCNN is 0.827, and the highest AUC value of RetinaNet is 0.773. It is evident that in document-type object localization, the Imp-YOLOv3 model developed for the study outperforms the other models in terms of performance and localization effect.

Fig. 9
figure 9

Comparison of ROC curves of different target detection algorithms on four datasets
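
As an illustration, the ROC curve and AUC reported in Fig. 9 can be computed from detection confidence scores with scikit-learn, as in the minimal sketch below; the labels and scores shown are illustrative only.

```python
# Minimal sketch of the ROC/AUC evaluation used in Fig. 9: ground-truth labels
# and detection confidence scores are turned into a ROC curve and its AUC.
# The labels and scores below are illustrative only.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                           # object present or not
y_score = np.array([0.92, 0.40, 0.75, 0.63, 0.30, 0.55, 0.88, 0.20])  # model confidences

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates per threshold
print("AUC =", auc(fpr, tpr))                        # area under the ROC curve
```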

The localization model's performance on soft parcels is assessed by comparing the Mask R-CNN-based LA against Accelerated-KAZE (AKAZE), Oriented FAST and Rotated BRIEF (ORB), and the Scale-Invariant Feature Transform (SIFT). The evaluation criteria chosen are the root mean square error (RMSE), the mean absolute error (MAE), and the solution error. Figure 10 displays the average experimental outcomes across the various datasets. Since RMSE is more affected by outliers and penalizes samples with larger prediction errors, the experiment uses it in addition to the MAE index to quantify the average difference between the true and predicted values. In Fig. 10, as the iterations increase, the MAE and RMSE values of the different target detection algorithms gradually decrease and differ significantly. The two error metrics of the Mask R-CNN with fused features designed in the study are significantly lower than those of the other three models, and the final error stabilizes below 0.60. In Fig. 10a, at the start of the iterations, the MAE value of the Mask R-CNN with fused features is higher than those of the ORB and SIFT methods, and its decreasing trend fluctuates more, but its final error value is smaller. In Fig. 10b, the RMSE value of the Mask R-CNN with fused features consistently remains at the lowest level, with a smooth decreasing trend. In object localization and recognition, the Mask R-CNN with fused features thus shows lower localization error, which facilitates the sorting operation.

Fig. 10
figure 10

Comparison of localization error results for different target detection algorithms

The Mean Average Precision (mAP) and the accuracy over all classes are chosen to evaluate the localization accuracy of the different target detection models on the different datasets. mAP is obtained as a weighted average of the average precision over all detected classes, and a higher value represents better overall performance of the model in the detection task. Figure 11 displays the models' mAP test results. It is evident that the Mask R-CNN model with fused features designed for the study outperforms the other target detection models on the four datasets in terms of accuracy and mAP, with less variation in the obtained values. On the logistics parcel dataset, the mAP and accuracy of the Mask R-CNN model with fused features reach the highest level, both above 0.90, indicating that it is suitable for the localization of logistics parcels. In contrast, the mAP and accuracy of AKAZE, ORB, and SIFT are not consistent across the datasets, with a large range of variation, sometimes exceeding fifteen percentage points.

Fig. 11
figure 11

Comparison of all-class average precision and accuracy results for different target detection algorithms

The efficiency of the two localization methods designed in the study is compared with existing fast target detection methods, including the traditional YOLOv3, the Single Shot MultiBox Detector (SSD), EfficientDet, and the Fully Convolutional One-Stage Detector (FCOS); the accuracy of the different methods is analyzed at the same time. Figure 12a shows that the two methods designed in the study are efficient and comparable in speed to the existing advanced rapid detection methods, with only a small difference from the rapid detection methods FCOS and SSD. Figure 12b demonstrates that the Mask R-CNN and Imp-YOLOv3 models have a significant advantage in accuracy. This indicates that the study's design achieves high accuracy without sacrificing speed.

Fig. 12
figure 12

Comparison of time consumption and accuracy of different detection methods

4.2 Performance Test and Application Effect Analysis of Logistics Sorting Model Based on Sampling Evaluation

The performance of the research-designed SE-based GQ-CNN model is compared with existing state-of-the-art pose generation models, including Perspective-n-Point (PnP), Pose Graph Optimization (PGO), and the Direct Linear Transform (DLT), with the models' Relative Pose Error (RPE) as the evaluation index. The Dex-Net 1.0, Dex-Net 2.0, GraspNet-1Billion, and JACQUARD datasets are selected as the experimental test datasets. Dex-Net 1.0 and Dex-Net 2.0 each contain more than 150 kinds of grasping targets, and the GraspNet-1Billion dataset contains 97,280 RGB-D images whose scenes are densely labeled with the 6D poses and grasping poses of the objects. The JACQUARD dataset contains 54,485 different scenes of 11,619 distinct objects, with a total of 4,967,454 grasping annotations.

RPE represents the difference between the estimated and true relative poses of two frames separated by a fixed time interval, including translation and rotation errors, and is used to evaluate the degree of error in camera pose estimation in image matching or localization tasks. The RPE test results for the four datasets are shown in Fig. 13. In this figure, the RPE of the SE-based GQ-CNN model designed in the study takes the smallest values, all below 0.40, with obvious differences from the other three Pose Generation Algorithms (PGA). The PGO model takes the largest relative pose error values and produces the worst grasping positions.

Fig. 13
figure 13

Comparison of relative position error results for different position generation algorithms

The Absolute Trajectory Error (ATE) is a metric that quantifies the discrepancy between the predicted and true trajectories. It is the direct difference between the estimated and true positions and directly reflects global consistency and algorithm correctness. The experimental results are displayed in Fig. 14: the research-designed SE-based GQ-CNN model achieves better results on the four datasets, with ATE trends consistent with the relative pose error results and ATE values significantly lower than those of the other three models. The joint evaluation of ATE and relative pose error confirms the practicality of the research-designed PGA, which can improve sorting in real-life logistics scenarios.

Fig. 14
figure 14

Comparison of absolute trajectory error results for different position generation algorithms

Finally, the grasping accuracy and the Relative Trajectory Error (RTE), which combines RPE and ATE (taken here as the sum of the ATE and the average rotational error), are compared across the different pose generation models. Figure 15 presents the evaluation results. The highest grasping accuracy of the GQ-CNN model designed in the study reaches 90.27%, much higher than that of the other models. In Fig. 15a, the grasping accuracy of the various pose generation models steadily increases with the number of iterations. In Fig. 15b, the RTE of the different pose generation models gradually decreases as the iterations increase; the RTE of the GQ-CNN model decreases the fastest and converges to a minimum of about 0.100. Overall, the SE-based GQ-CNN model designed in the study achieves the best pose generation.

Fig. 15
figure 15

Comparison of the grasping accuracy of different position generation algorithms

The LA and PGA designed in the study are deployed in a logistics RA sorting operation, and 10 RAs and 10 logistics workers are arranged to carry out comparative sorting operations, evaluated in terms of sorting speed and sorting accuracy. In Fig. 16, the time for the RAs to complete the sorting task is much lower than that of manual sorting: RA operation time grows slowly while manual sorting time grows quickly, and the difference between the two exceeds half an hour. The gripping accuracy of the RAs in completing the sorting task remains almost stable above 90%, with small fluctuations. In contrast, manual sorting accuracy fluctuates considerably, ranging between 65 and 85%, and the operation is less stable.

Fig. 16
figure 16

Comparison of manual sorting and robotic arm sorting

5 Conclusion

The logistics industry is undergoing a shift towards automation and digitization. To investigate the potential use of robots in smart supply chains and digital logistics, this study developed object recognition and RA grasping pose selection strategies for complex logistics environments. The approach created the item LA using an improved YOLOv3 target detection technique together with the ESA, and the RA's PGA based on SE. The test findings indicated that, with scores of 95.77% and 94.05% on the F1 value and accuracy index, respectively, the Imp-YOLOv3 model performed better than the other models; these scores were 18.36 and 15.76 percentage points higher than those of the lowest-performing model. Furthermore, the maximum AUC value of the Imp-YOLOv3 model was 0.913 across the four datasets, indicating better localization accuracy for document object localization. The MAE and RMSE values of the fused-feature Mask R-CNN were all lower than 0.60, significantly surpassing the other three models, and its mAP and accuracy both scored above 0.90. The study's SE-based GQ-CNN model achieved the lowest RPE and ATE values, both under 0.40, in comparison with the other three PGAs, demonstrating its potential to enhance logistics sorting accuracy. The GQ-CNN model exhibited the highest grasping accuracy at 90.27%, and its RTE indicator decreased rapidly, ultimately converging around 0.100. Compared with manual sorting, the research-developed sorting algorithm implemented on the RA yields positive outcomes: it significantly enhances sorting efficiency and accuracy, with the RA's gripping accuracy for sorting duties staying consistently above 90%. The study provides a framework for constructing an intelligent logistics system and introduces a fresh perspective for applying robot arms in logistics and warehousing. However, future research could focus on how the RA picks up soft objects by suction; comparing suction with grasping may further improve logistics sorting speed.