Comparative Analysis of Deep Neural Networks for the Detection and Decoding of Data Matrix Landmarks in Cluttered Indoor Environments

Data Matrix patterns imprinted as passive visual landmarks have proven to be a valid solution for the self-localization of Automated Guided Vehicles (AGVs) on shop floors. However, existing Data Matrix decoding applications take a long time to detect and segment the markers in the input image. Therefore, this paper proposes a pipeline where the detector is a real-time Deep Learning network and the decoder is a conventional method, i.e. the implementation in libdmtx. To this end, several types of Deep Neural Networks (DNNs) for object detection were studied, trained, compared, and assessed. The architectures range from region proposal networks (Faster R-CNN) to single-shot methods (SSD and YOLO). This study focused on detection performance and processing time to select the best Deep Learning (DL) model for the detection of the visual markers. Additionally, a specific data set was created to evaluate those networks. This test set includes demanding situations, such as high illumination gradients in the same scene and Data Matrix markers positioned in skewed planes. The proposed approach outperformed the best-known and most-used Data Matrix decoder available in libraries like libdmtx.


Introduction
Industrial demand and competitiveness foster the development of new, more sophisticated, and effective techniques for daily industrial tasks [1,2]. One important and valuable task for a manufacturing facility is the automatic transportation of components and materials [3-5]. Transportation tasks are commonly performed by AGVs, industrial robots that travel from point to point, usually by following a magnetic wire or stripe on the shop floor [6,7]. Although widely used, these methods present serious disadvantages in terms of performance and logistics, such as fixed path tracks and limited flexibility, plus the fact that their performance may decline over time [8].
Hence, solutions based on landmarks have been proposed to solve or mitigate the major drawbacks of those systems; along that line, visual passive landmarks appear as a simple but very interesting solution [9-11], and one promising approach was tried and exploited in the work of Bergamin [8], as shown in Fig. 1.
The landmarks (markers) are encoded with specific information, such as their world coordinates, which, once detected, is used to compute the robot localization through trilateration techniques. The overall technique showed very interesting results in the localization and navigation procedures, but had limited performance in detecting the markers (Data Matrix) in images of the environment. In this work, Data Matrix landmarks were selected over the other existing 1D/2D markers (barcodes and QR-codes, respectively) since they allow larger unit cells. Furthermore, they are considered an inexpensive, flexible, and robust solution for the problem addressed in this work [8].

Fig. 1 Visual landmarks encoded with Data Matrix labels to perform robot localization using trilateration and related techniques [8]
In that context, the work developed in [12] took on the challenge of locating visually encoded landmarks (Data Matrix) in the environment using a DNN. That work proved that DL architectures for locating Data Matrix labels are capable of overcoming the barriers imposed by traditional techniques, namely in accuracy and processing time during the detection task. Even though the architecture presented in that work shows good performance in most situations, it still has a high processing time (144 ms per image on a single Nvidia RTX 2080 Ti GPU for the detection task alone). Therefore, in this work we develop a solution that balances effectiveness with low latency. To this end, several different Deep Learning approaches for the marker detection task are studied, described, and assessed. Each has its own advantages and disadvantages, and the objective is to select the most suitable one, taking into account both accuracy and latency.
The remainder of the paper is organized as follows. In Section 2, we review and discuss several previous works related to the one presented in this paper. In Section 3, we present our methodology and the different DNNs deployed, together with their variants. Section 4 enumerates and describes the training baselines for each network. Then, we characterize each experiment and the respective results throughout Section 5. Finally, in Section 6, we conclude the paper and outline future research directions based on the current work.

Related Work
The proposed self-localization approach relies on a constellation of markers in the environment (a workshop). There are several works in the literature addressing robotics problems that make use of other techniques and landmarks. In [13,14], the authors used ArUco markers and noted many limitations in the detection of the fiducial markers, which is crucial for the overall application. Recently, in [15], the authors designed a real-time solution for the detection of limited-size landmarks. This technique was later used in [16], where the authors do not present any relevant result regarding localization, but describe an overall architecture that allows computing it. Moreover, in [17], the authors propose an algorithm based on classical techniques to detect ARTag markers. This system was not designed to localize the robot with respect to a world frame, but it yields the distance between the camera and the agent. Finally, in [18], the authors compute the robot pose through the recognition of customized circular landmarks placed on the ceiling.
In the problem addressed in this paper, the detection of Data Matrix labels in wide images is a challenging task, since the labels occur at multiple scales in cluttered and unstructured environments. The process of detecting and decoding these targets is very time-consuming and, in some cases, inaccurate for classical algorithms [19,20], which is why DNNs started being devised for the detection stage [21]. Therefore, in this work, an important additional requirement for the DL architecture is included: low latency. This can be quite cumbersome to obtain, because there is usually a trade-off between the performance of multi-scale object detection and the latency of the network [22]. There are several studies related to the detection of this type of marker [21, 23-25]. However, they all present results for structured environments and limited situations. In [12], a Faster R-CNN architecture was proposed to detect Data Matrix landmarks in unstructured scenarios. This architecture is quite accurate in detecting objects at multiple scales and far outperformed, in processing time, the traditional algorithm provided by the libdmtx Python library. However, those improvements are not sufficient: for a system operating in real time, this architecture is not the best suited. That is the main reason to extend the study to several other types of DL-based models and to investigate their real impact on a full pipeline, i.e. including the decoding stage.
There are many types of DNNs for object detection, but the most well-known and used ones are region proposal networks, currently represented by Faster R-CNN [26], and single-shot approaches, mainly represented by the Single Shot Multibox Detector (SSD) [27] and You Only Look Once (YOLO) [28]. The main differences between the original single-shot methods [27,28] and proposal networks [26] are the accuracy in locating objects (higher for proposal networks) and the overall processing time (lower for single-shot methods). Despite these characteristics, each model can be customized in two ways that influence both processing time and detection performance: the choice of backbone (or feature extractor) [29,30] and the application of multi-scale detection techniques [31-34].
The function of the backbone network is to extract features from the input image [30]. Its importance to the final result follows from the fact that features without semantic meaning leave the network's receptive fields empty of information [35]. Regarding Faster R-CNN and SSD, there are several conventional backbones used to perform the feature extraction task: VGG [36], ResNet [37], DenseNet [38], MobileNet [39], SqueezeNet [40], and ShuffleNet [41] are some of the available options. VGG was conceived in a work that proved that deeper networks could achieve better results than shallower ones; it is difficult to train from scratch due to its plain stacking of layers, which may even cause vanishing-gradient problems. After VGG, ResNet appeared with a novel layout for convolutional layers: residual blocks with skip connections, which allow easier training of deeper networks. DenseNet is characterized by a novel transmission of semantic information between layers, in which every layer's output is passed to all following layers. Finally, the smallest architectures (MobileNet [39], SqueezeNet [40], and ShuffleNet [41]) trade some accuracy for faster predictions.
The multi-scale detection techniques comprise neural network design choices and training details, such as different ways of concatenating strong semantic information from multiple scales and the usage of a particular loss function that emphasizes the detection of more difficult classes. Regarding concatenation techniques, one of the most well-known approaches is the use of Feature Pyramid Networks (FPNs) [31], which upsample semantically stronger feature maps and merge them with semantically weaker activation maps from the downsampling pathway. SSD also uses multi-scale feature maps, but lacks semantic information, which is crucial for small-object detection. To address this, Feature Fused SSD [32] adds semantic information from deeper layers to shallower feature maps through concatenation or element-sum modules: the former reduces the interference of a noisy background, and the latter enhances the contextual information. YOLOv3 [42] also improves the detection of small objects by concatenating global features of multi-scale convolutional layers. One step further in multi-scale object detection is the application of Spatial Pyramid Pooling (SPP) to YOLO [33], which additionally fuses multi-scale local region features from the same convolutional layer. Finally, one training detail that can produce better results is the usage of the focal loss [34], which penalizes more heavily the misclassification of the most challenging classes. Table 1 shows an overview of the main types of architectures relevant to this work and summarizes the main points of the networks implemented in this study (abbreviations: MS multi-scale, FPN feature pyramid network, FM feature maps, FF feature fused, SPP spatial pyramid pooling), which are discussed in detail next.
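As an illustration of the latter point, below is a minimal sketch of the binary focal loss in PyTorch (the framework used in this work). The function name and the default values α = 0.25 and γ = 2 follow the common convention from [34] and are assumptions, not the exact configuration used by the networks in this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch following [34].

    logits:  raw predictions, shape (N,)
    targets: binary labels in {0., 1.}, shape (N,)
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # well-classified examples (p_t near 1) are down-weighted by (1 - p_t)**gamma,
    # focusing training on hard, misclassified examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```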

Methodology and Deep Networks
The motivation behind using a DNN to locate the landmarks in a full frame is that the libdmtx method performs much more slowly on full images than on small patches of the input image; so, the DL-based model locates the Data Matrix bounding box in the input image, and each image patch is then decoded by libdmtx. Accordingly, the proposal in this paper consists of the development of the initial phase of the full pipeline, which can be divided into two stages: the first, where the Data Matrix marker is located, and the second, where the label is actually decoded. The pipeline's workflow is shown in Fig. 2 with a real example.
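A minimal sketch of this two-stage pipeline is given below. Here, detector is a hypothetical placeholder for any of the trained models discussed later, its output format is an assumption, and pylibdmtx is used as one available Python binding to libdmtx; this illustrates the idea rather than the exact implementation used in this work.

```python
from pylibdmtx.pylibdmtx import decode  # Python binding to libdmtx

def detect_and_decode(frame, detector, conf_thresh=0.05):
    """Stage 1: locate candidates with a DNN; stage 2: decode only the patches."""
    decoded = []
    # assumed detector output: list of (x1, y1, x2, y2, score) tuples
    for x1, y1, x2, y2, score in detector(frame):
        if score < conf_thresh:
            continue
        patch = frame[int(y1):int(y2), int(x1):int(x2)]
        # decoding a small patch is much faster than decoding the full frame
        results = decode(patch, max_count=1, timeout=100)   # timeout in ms
        decoded.extend(r.data for r in results)
    return decoded
```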
Indeed, the first stage of the pipeline is the focal point of this paper. Throughout the document, we discuss and identify the DNN best suited to perform the detection of the landmarks, taking into consideration the balance between accuracy and latency. Hence, this study focuses on a comparison of several DL architectures with different backbones and training procedures for the Data Matrix detection task. The study starts with the Faster R-CNN model, comparing the results of training with ResNet50 FPN, ResNet50, and MobileNetV2 backbones under the same data augmentation techniques. Then, faster predictions are sought by studying the influence of different feature extractors (ResNet50 and MobileNetV2) on SSD512. Finally, different versions of YOLO (v3 and v4) are trained, and the application of the Spatial Pyramid Pooling technique to the third version (YOLOv3 SPP) is also carried out.
This work was developed under the PyTorch framework and the code is publicly available¹.

Faster R-CNN
Region proposal architectures usually provide high-quality results at high latency. Faster R-CNN is one example of this type of network and is composed of two stages: the computation of proposals after feature extraction, and the final detection through the Fast R-CNN detector [43]. The extraction of feature maps is performed by the backbone, a common classification DNN. Then, a small network slides over the feature maps, predicting multiple possible boxes for each of their cells through its output, a lower-dimensional feature. This output is fed to two 1 × 1 convolutional layers, which yield the probability and the encoded coordinates of each proposal. Finally, the most semantically valuable features (those with higher objectness, a score measuring how likely a region contains an object) pass through an ROI pooling layer, which crops and re-scales the feature maps into fixed-size feature maps. During inference, the non-maximum suppression (NMS) algorithm filters the detections, keeping only the best-located bounding boxes. This technique is common to all object detection algorithms described and used in this work.
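For reference, NMS is available off the shelf in torchvision; the toy boxes below are illustrative values only.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 100., 100.],
                      [5., 5., 105., 105.],     # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.80, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)    # indices of surviving boxes
print(keep)  # tensor([0, 2]): the redundant, lower-scored box is suppressed
```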

SSD
Single-shot methods, like SSD and YOLO, can process the input faster, since the localization and classification tasks are performed in a single forward pass. SSD, similarly to Faster R-CNN, has a conventional classification network (here we used only ResNet50 and MobileNetV2) that produces feature maps. It then skips the region proposal stage and yields the final predictions at once. To do so, extra layers are attached to the backbone, yielding multi-scale feature maps, and each of these extra layers provides a fixed set of detection predictions using convolutional filters. Finally, the model outputs the score for each category and the location of the boxes that bound the targets.
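These per-layer predictors can be sketched as below; the channel sizes and the number of anchors per cell are illustrative assumptions, not the exact SSD512 configuration.

```python
import torch.nn as nn

num_classes, num_anchors = 2, 6                       # e.g. background + Data Matrix
feature_channels = [512, 1024, 512, 256, 256, 256]    # one entry per feature map

# each multi-scale feature map gets its own 3x3 convolutional predictors:
# one head regresses 4 box offsets per anchor, the other scores each class
loc_heads = nn.ModuleList(
    nn.Conv2d(c, num_anchors * 4, kernel_size=3, padding=1)
    for c in feature_channels)
cls_heads = nn.ModuleList(
    nn.Conv2d(c, num_anchors * num_classes, kernel_size=3, padding=1)
    for c in feature_channels)
```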

YOLOv3, YOLOv3 SPP, and YOLOv4
The YOLOv3 network [42], differently from the architectures presented so far, has a custom feature extractor, DarkNet53. This is also a convolutional neural network and, similarly to SSD, it predicts on three multi-scale feature maps. It has 106 layers and an interesting particularity that improves object detection results: it concatenates feature maps of shallower layers (with low-level features) with upsampled deeper feature maps (the FPN approach). This yields activation maps that are more representative of the global features of different-sized objects. Moreover, the application of SPP adds a block after the input's downsampling, which pools and concatenates multi-scale local region features (through max pooling layers). This enables the usage of both global and local multi-scale features for the object detection task. Finally, detection is performed by applying 1 × 1 convolutional detection filters to the three different feature maps.
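A minimal sketch of such an SPP block is given below, assuming the kernel sizes (5, 9, 13) commonly used in YOLOv3 SPP implementations.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Parallel max pools over the same feature map, concatenated channel-wise."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 and symmetric padding preserve the spatial size,
        # so the pooled maps can be concatenated with the input
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):
        # fuses local region features at multiple scales from a single layer
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```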
On the other hand, YOLOv4 [29] is composed of a Cross Stage Partial (CSP [44]) Darknet53 with an SPP module, a path-aggregation network (PANet [45]), and a YOLOv3 head. CSP networks have a similar basis and purpose to DenseNet, but enhance feature reuse by reducing the amount of repeated gradient information observed in a DenseNet. To do so, the base feature map is split: one part of the channels passes through a partial dense block, while the remaining part goes directly to the final partial transition layer. After the activation maps are produced, the only difference between YOLOv3 and YOLOv4 in terms of architectural layout is the global feature concatenation. In YOLOv4, instead of the FPN technique, a custom PANet approach is used [46]. PANet is an enhanced version of FPN: after the FPN block, composed of a top-down pathway with lateral connections, PANet also propagates low-level features through a bottom-up path augmentation block. This block combines (by concatenation, in YOLOv4) the FPN output features with the result of applying 3 × 3 convolutions to those feature maps, which yields an even better use of the low-level features.

Conventional Backbones
Backbones play a key role in the aforementioned types of architectures, since they produce the activation maps whose semantic value allows identifying an object in the input image. Here, we describe and explain the backbones used in the Faster R-CNN and SSD networks.
Faster R-CNN and SSD were trained with the same ResNet50 and MobileNetV2 backbones. They differ in both feature map quality (in terms of semantic importance) and, consequently, inference time. ResNet50 is a deep convolutional neural network that provides high accuracy by employing residual blocks with skip connections. These blocks fit a residual mapping from the layer's input to its output, instead of directly trying to fit an underlying transformation. MobileNetV2 is a shallower, variable-width neural network based on depthwise separable convolutions. These are not common convolutions, in which the kernel and input depths are the same, but a combination of a depthwise and a pointwise convolution. In a depthwise convolution, the input and the kernel are divided into channels and each kernel is applied separately to its input channel. A pointwise convolution applies a 1 × 1 filter across the input channels. Hence, a depthwise separable convolution is composed of two stages: the depthwise convolution and a final 1 × 1 convolutional operation. Furthermore, ResNet50 FPN was also implemented in Faster R-CNN. The only difference from the original ResNet50 layout is the strengthening of semantically weaker feature maps by concatenating them with semantically stronger ones. As mentioned before, this technique can provide better performance in detecting multi-scale objects.
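The building block just described can be sketched in PyTorch as follows; the layer ordering with batch normalization and ReLU6 follows the usual MobileNet-style convention and is an assumption about the exact variant used.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    return nn.Sequential(
        # depthwise: groups=in_ch applies one 3x3 kernel per input channel
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        # pointwise: a 1x1 convolution mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )
```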

Baselines
In order to compare and evaluate the networks precisely, they were all trained from scratch. The training and validation sets are the same as those conceived in [12], but with the images at a quarter of their original size (1500 × 2000), because the previous input shape did not correspond to what would be used in the final system and hampered the training of low-latency networks like SSD and YOLO. Moreover, data augmentation techniques were applied to increase the variance of the training input. In this way, the models can generalize to more situations, which means a more robust solution.

Faster R-CNN
The first baseline, Faster R-CNN, groups together the ResNet50 FPN, ResNet50, and MobileNetV2 backbones. These models were trained with one of the following geometric transformations: random crops with a final size of 480 × 640 or 960 × 1280, or a simple resize to 750 × 1000 (half the training input size). In addition, random brightness, random contrast, and horizontal flips were also applied. The training ran for 200 epochs with an AdamW optimizer and a cosine annealing scheduler with a warm-up of 100 iterations (this learning rate scheduler is common to all baselines). The learning rate was set to 10⁻³, the weight decay to 10⁻⁴, and the batch size to 4.
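This optimizer and schedule can be sketched as follows; the model and the number of iterations per epoch are placeholders, and the linear shape of the warm-up is an assumption about the implementation details.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1, 1)        # placeholder for the detector being trained
iters_per_epoch = 500                # placeholder for len(train_loader)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

warmup_iters = 100
total_iters = 200 * iters_per_epoch  # 200 epochs

def lr_lambda(it):
    if it < warmup_iters:            # linear warm-up over the first 100 iterations
        return (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine annealing

scheduler = LambdaLR(optimizer, lr_lambda)
# training loop: loss.backward(); optimizer.step(); scheduler.step() per iteration
```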

SSD512
The second baseline comprises the SSD variants with ResNet50 and MobileNetV2 backbones. The augmentation performed here is the same as in [27], for an input size of 512 × 512. These architectures were trained for 300 epochs with the AdamW optimizer. The learning rate was set to 10⁻³, the weight decay to 4 × 10⁻⁵, and the batch size to 16.

YOLO
The YOLO baseline is common to YOLOv3, YOLOv3 SPP, and YOLOv4. Here, two augmentation approaches were performed and compared: with and without mosaic augmentation (introduced in YOLOv4 [29]), as shown in Fig. 3.
These YOLO approaches share the application of random horizontal and vertical flips, an HSV color-space augmentation, and an input size of 672 × 896. The usage of mosaic augmentation (upper images in Fig. 3) allows the model to be more generic, since it learns from 4 different contexts in one single image; the other approach applies only the common transformations above. The two augmentation approaches provide two different training procedures for the YOLO baseline. However, only one of them yields the models that are evaluated on the test set. To decide which, after hyperparameter tuning, the two approaches were compared on the validation set. The average precision (AP) and average recall (AR) results of the YOLO variants are shown in Tables 2 and 3, respectively. There, the numerical subscripts represent the IoU threshold, i.e. 50 means an IoU threshold of 0.50, and the letter subscripts correspond to the scale of the objects, i.e. "S" represents small objects (pixel area < 32²), "M" medium (pixel area ∈ [32², 96²]), and "L" large (pixel area > 96²). Finally, a metric without a numerical subscript is an average over the IoU thresholds in [0.50, 0.95] in steps of 0.05.
In most cases, Tables 2 and 3 show better results with mosaic augmentation for every model; thus, from now on, any YOLO result shown in this document was obtained with this data augmentation procedure.
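For clarity, a simplified version of the mosaic transform is sketched below; full implementations also remap the bounding-box annotations of the four source images, which is omitted here.

```python
import random
import numpy as np

def mosaic(images, out_h, out_w):
    """Combine 4 images into one; `images` are HxWx3 uint8 arrays,
    each at least as large as the output canvas."""
    yc = random.randint(out_h // 4, 3 * out_h // 4)   # random mosaic center
    xc = random.randint(out_w // 4, 3 * out_w // 4)
    canvas = np.zeros((out_h, out_w, 3), dtype=np.uint8)
    quadrants = [(0, 0, yc, xc), (0, xc, yc, out_w),
                 (yc, 0, out_h, xc), (yc, xc, out_h, out_w)]
    for img, (y1, x1, y2, x2) in zip(images, quadrants):
        canvas[y1:y2, x1:x2] = img[:y2 - y1, :x2 - x1]  # crop to fit quadrant
    return canvas
```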

Summary
To sum up, three baselines are set up, which correspond to each type of architecture (Faster R-CNN, SSD, and YOLO). Thus, the models of the same baseline (i.e. same type of architecture with different combinations of backbones) have common training procedures. The main hyperparameters discussed in the previous sections are summarized in Table 4.

Experiments
This work presents a collection of deep networks for Data Matrix detection. Since all models were tuned using the validation set, it could not be used for the numerical comparisons and assessments. Therefore, a test set was created similarly to the sets conceived in [12]. This set of frames was processed only once by each model, which is why the network results presented throughout this section are unbiased and valid. Additionally, in this section, we also report Data Matrix decoding results for the model with the best overall trade-off between detection and processing time on the test set.

A data set of Data Matrix images
The full dataset was manually labeled in an online toolbox² and is composed of three different subsets: the training set, the validation set, and the test set. The first two were designed in [12]: the training set has 156 frames equally distributed over two different scenarios (a lab and a workshop), and the validation set is also divided into two environments, a hallway (158 frames) and a different workshop (66 frames). Both training and validation sets were therefore collected in different scenarios to avoid biased results during neural network training. Additionally, the full test set was conceived during this work; it has 145 frames with 895 annotations. It comprises three scenarios, all different from those used for the training and validation sets. One of them is a neat hall with overshadowed and over-lightened landmarks in different planes, as can be seen in Fig. 4. The second environment is a classroom laboratory with various electronic equipment arranged in an orderly manner (Fig. 5). Since the final robotic system is expected to operate in cluttered workshops, the third scenario was made very challenging, with multiple machinery spread out all over the place and a more diversified range of materials (Fig. 6).

Detection results
The detection algorithms were evaluated on the test set described in Section 5.1. To do so, and analogously to [12] and Section 4.3, AP and AR were calculated. The results obtained for AP are presented in Table 5.
A broad view of the AP results shows that, among all Faster R-CNN variants, the one with the best results is ResNet50 FPN, as expected. Regarding the SSD networks, the ResNet50 backbone is the most accurate. As for the YOLO models, although all results were quite similar, YOLOv4 stood out the most. In addition to AP, Table 6 shows the AR and the average processing time over the test set (in both tables, the highest scores for each metric are highlighted).
The results presented in Table 6 are in line with the AP results. The most relevant result is the 75.3% achieved by Faster R-CNN with ResNet50 FPN, but all YOLO models, especially YOLOv4, performed well above average, with the added advantage of a lower processing time, yielding results at 47.6 fps.
An overview of Tables 5 and 6 shows that shallower backbones like MobileNetV2 struggle to achieve results comparable to the deeper networks, but in return they are much faster. Moreover, FPN/PANet techniques help improve the detection of small objects (as seen by directly comparing the SSD and Faster R-CNN results with the YOLO results).
All YOLO variants provide quite interesting results, namely in terms of recall, which is the most appealing metric since, given the characteristics of the system, false positives can easily be ruled out (they are discarded by the downstream decoding stage, though they add computational overhead). Also, the SPP and PANet modules (YOLOv3 SPP and YOLOv4) appear to increase both the AP and AR results. Overall, YOLOv4 shows the best trade-off between detection quality and latency and, therefore, it should be the solution deployed in the final robotic system.

Qualitative detection results on the test set
Theoretical insights have shown that the generalization of DL-based models is one of the most critical and important points to consider: models biased toward the training/validation sets do not achieve the expected outcome in real-world applications. Therefore, we show some interesting visual results and comparisons between some of the models whose results were presented and discussed in the previous section.
The first visual example (Fig. 7) shows three frames from the test set processed by the models that obtained the best numerical results for each baseline, i.e. ResNet50 FPN for Faster R-CNN, ResNet50 for SSD, and YOLOv4 for YOLO. It is worth mentioning that the confidence threshold used for the YOLO variants in this subsection is not the same as the one used in the test set evaluation: the YOLO minimum object confidence was raised from 0.001 to 0.05, which greatly reduces the false positives and thus presents better overall results. From these images, it is possible to infer that Faster R-CNN with the ResNet50 FPN backbone and YOLOv4 are the most accurate models, with very similar predictions. SSD512, in the "workshop" image, had problems inferring labels in skewed planes and also missed a label in the hall scenario with very poor illumination conditions.

Fig. 7 Visual results from the test set, where each zoom color has a different meaning: red is a true positive, green corresponds to a false negative, blue pertains to false positives, and yellow highlights ineffective NMS outcomes. Each row contains a frame processed by the best variant of each architecture. The first row shows an example from the "workshop" scenario with machinery; the second row is an image from the hall environment, in which the landmarks are under different illumination conditions; finally, the images of the third row are from the classroom, where one partial label is correctly classified.
The second example joins the results provided by the fastest variants of each architecture, i.e. MobileNetV2 for Faster R-CNN and SSD512, and YOLOv4 for YOLO. The results can be seen in Fig. 8, which clearly shows why YOLOv4 is the best model to perform Data Matrix detection in this context: besides performing qualitatively better than the fastest models of the other two baselines, it is also a fast algorithm. Furthermore, we can also note throughout the YOLOv4 results that the NMS threshold might not be optimally defined, since there are several failures in the suppression of similar bounding boxes. Therefore, in the next section we discuss the most suitable value for this parameter, because it could become a bottleneck of the pipeline: the decoder wastes time trying to decode the same Data Matrix twice.

Decoding Results
Indeed, the best model to deploy is YOLOv4, according to the trade-off between processing time and accuracy required by the application. In this section, we compare the standalone libdmtx methodology with our hybrid proposal: YOLOv4 (detection) plus libdmtx (decoding). In Table 7, we show the decoding results over the test set for the two approaches producing the same number of decoded markers (458, the maximum number of decoded landmarks provided by the libdmtx standalone approach); the proposed methodology yields this same number of decoded landmarks 11× faster than the existing standalone methodology, within 1970 detections.

Fig. 8 Visual results from the test set for the fastest variants, where each zoom color has a different meaning: red is a true positive, green corresponds to a false negative, blue pertains to false positives, and yellow zooms in on ineffective NMS results. The first row shows an example from the "workshop" scenario, where long-distance detection is tested; the second row is an image from the hall environment, with several landmarks in different planes; finally, the third row of images is from the classroom, where the targets are on shelves and walls.

The confidence and NMS thresholds of the deep model also influence the effectiveness/speed trade-off of the full pipeline. In other words, if we set a higher confidence threshold, the model returns far fewer candidate markers and, consequently, the decoding stage is invoked less often, making the overall pipeline faster; similarly, a lower NMS threshold reduces the number of inputs to the decoding stage. It is worth mentioning that the impact of the NMS threshold is smaller than that of the confidence limit, since the sole objective of NMS is to suppress overlapping bounding boxes. Therefore, we also studied the influence of these factors by applying a grid search. For instance, the results obtained for the same number of decoded Data Matrix labels, presented in Table 7, correspond to a confidence threshold of 0.003 and an NMS limit of 0.6. This represents another advantage of DNNs in the detection stage of the pipeline: the user may set different thresholds according to the accuracy/processing-time expectation. If more markers are decoded than necessary (e.g. in our localization problem, two decoded Data Matrix labels per time step ensure a fully operational system), the user can increase the detection confidence and decrease the NMS limit to benefit from a faster pipeline. In fact, the settings that allow the most decoded targets, 479 decoded markers within 13065 detections in 13105.4 s, correspond to the minimum confidence and the maximum NMS threshold used in the grid search experiment, i.e. 0.001 and 0.95, respectively.
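The grid search can be sketched as below; run_pipeline, test_set, and the grid values are hypothetical placeholders standing in for the actual evaluation code (only 0.001, 0.003, 0.05, 0.6, and 0.95 are values reported above).

```python
import itertools
import time

# hypothetical grids; the thresholds reported above are among the points explored
conf_grid = [0.001, 0.003, 0.01, 0.05]
nms_grid = [0.3, 0.45, 0.6, 0.8, 0.95]

results = {}
for conf, nms_thr in itertools.product(conf_grid, nms_grid):
    start = time.time()
    # run_pipeline: placeholder that runs YOLOv4 + libdmtx over the test set
    # with the given thresholds and returns the number of decoded markers
    decoded = run_pipeline(test_set, conf_thresh=conf, nms_thresh=nms_thr)
    results[(conf, nms_thr)] = (decoded, time.time() - start)

# e.g. pick the fastest setting that still decodes enough markers per frame
```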
In a nutshell, the inclusion of YOLOv4 in the detection stage of the overall pipeline provides faster and, for most settings (depending on the thresholds discussed above), better results. Finally, both the number of decoded Data Matrix labels and the time results presented so far are merely comparative, since a fraction of the Data Matrix labels placed in each scenario are arranged in such a way that it is impossible to decode them, as can be seen in the examples depicted in Fig. 9. Both time and the number of decoded markers were equally affected, as time was spent attempting to decode Data Matrix labels proposed by YOLOv4 that were undecodable. This means that the results presented here can be considered almost a lower bound relative to the real system, because in a real application the markers would be placed so as to be more likely decodable, and little time would be spent on undecodable landmarks.
Finally, as the developed system needs at least two decoded Data Matrix labels to compute the robot position [8] (more than two would allow an enhanced localization), this number can be used to make a fairer (and faster) comparison between the two decoding methods (the classical one and the one proposed in this document). The average processing speed results of the proposed solution are presented in Fig. 10; the classical algorithm produces the same results at 0.34 fps.
This result shows that, on average, the proposed method provides decoded labels at 13.8 fps when the two highest-scoring predictions are decoded, which occurred in 43% of the test set frames. Moreover, when exactly one of the top-confidence outputs is non-decodable, the pipeline yields results at 3.9 fps (18% of the test set images). When there are two undecodable targets, it outputs the two decoded Data Matrix targets at 3.5 fps (6% of the evaluated test set). The remaining situations are unrepresentative (i.e. scant) in the test set, considering that 27% of the test set frames did not produce two decoded Data Matrix labels. Hence, the most representative situations occur when the two labels are decoded among the first four predictions (0 to 2 undecoded predictions), covering 67% of the test set. Even in the worst-case scenario, the DL-based solution yields results 10× faster than the classical algorithm on average, and can reach up to a 40× speed improvement.

Summary
In this section, we started by presenting the full dataset, comprising heterogeneous subsets for the training, validation, and test procedures. This allows us to ensure fair results and comparisons throughout the remaining parts of this work. Afterwards, we reported both qualitative and numerical results for the detection of the landmarks in the test set. From these, one can infer that the most promising Data Matrix detector is YOLOv4, according to the accuracy-latency trade-off (the prime priority for this problem). Finally, taking YOLOv4 as the DNN in the pipeline described above, we proposed a final comparison with the standalone libdmtx approach. We found that our proposed method, YOLOv4 combined with libdmtx, is faster (and possibly better) than standalone libdmtx when selecting the appropriate hyperparameters, as shown in Section 5.4.

Fig. 9 Two examples where the markers are correctly detected, but their position in the environment will hardly allow them to be decoded with libdmtx

Fig. 10 Number of undecoded detected Data Matrix markers until two of them are successfully decoded, and the respective average processing rate for each case

Conclusion
This work describes, assesses, and compares several DNNs performing the Data Matrix detection task. The model with the best performance is the one whose average precision, recall, and processing speed form the best combination; therefore, YOLOv4 was considered the best network to detect this type of landmark. However, this DL-based model only represents the detection part of the entire decoding pipeline. Thus, the paper evaluates and compares the proposed decoding system (YOLOv4 followed by the classical decoder) with the original full-frame decoder from libdmtx. This comparison proved that the proposed method outperforms the classical algorithm by far in terms of processing time.
Nevertheless, during the decoding system evaluation, bottlenecks were found, such as attempts to decode false positives and non-decodable Data Matrix markers. In order to deploy a very robust robotic self-localization system, four directions should be studied in the future to suppress these bottlenecks or to improve the overall self-localization system:
- During YOLOv4 training, make the objectness confidence yielded by the model become the probability of the prediction being decodable. This custom YOLOv4 would be more confident when the object is decodable and, as the decoder acts from the most confident to the least confident locations, it would decode the two highest-scoring predictions more often than it does now.
- Alternatively, the conception of a DL-based decoding network. The bottleneck of our approach is the decoding stage, so a DL network that locates and decodes Data Matrix labels in a single pass would decrease the latency of the overall system by discarding non-decodable markers and returning only the results of decodable labels.
- The detection model can be modified to output a warped bounding box instead of an axis-aligned one. This output would be used downstream to perform a homographic transformation of the marker image, producing a better input to the decoding stage. Several methods can provide such an output, such as ExtremeNet [47].
- The usage of a tracking system. With this solution, the pipeline studied in this document would run only when one of the labels stops being tracked, decreasing the overall latency of the system.
Finally, as the overall contribution, this document proposes a pipeline for Data Matrix decoding based on the YOLOv4 detector, selected after a study of several different DL models for the detection task. This method is faster and potentially better (depending on the confidence and NMS thresholds), but still has bottlenecks arising from the decoder. These can be mitigated by replacing the entire pipeline with a single DL network that locates and decodes Data Matrix labels; or, more simply, by pairing a tracking system with a new loss function that values decodable objects more; or by deploying a detection network that also yields the homographic transformation of each bounding box.

Consent to Participate
The authors consent to participate in this work.

Competing Interests
The authors declare that they have no conflict of interest.

Availability of Data and Material
Code at github.com/tmralmeida/data-matrix-detection-benchmark

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.