
Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

Conference paper
Pattern Recognition (DAGM GCPR 2023)

Abstract

The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy-label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations, the key factors being (1) dataset complexity, (2) annotator consistency, and (3) annotation budget constraints.


Notes

  1. Computed by dividing the number of instances by the product of the number of images and the number of annotators.
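Expressed as a formula (the notation is ours, not the paper's):

\[ \text{annotation density} = \frac{N_{\text{instances}}}{N_{\text{images}} \cdot N_{\text{annotators}}} \]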

References

  1. Asman, A.J., Landman, B.A.: Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (collate). IEEE Trans. Med. Imaging 30(10), 1779–1794 (2011)


  2. Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2019). https://doi.org/10.1109/tpami.2019.2956516

  3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13


  4. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

  5. Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster R-CNN for object detection in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3339–3348 (2018)


  6. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1290–1299 (2022)


  7. Cheng, Y., et al.: Flow: a dataset and benchmark for floating waste detection in inland waters. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10953–10962 (2021)


  8. Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 28(1), 20–28 (1979)


  9. Feng, D., et al.: Labels are not perfect: inferring spatial uncertainty in object detection. IEEE Trans. Intell. Transp. Syst. 23(8), 9981–9994 (2021)


  10. Gao, J., Wang, J., Dai, S., Li, L.J., Nevatia, R.: Note-RCNN: noise tolerant ensemble RCNN for semi-supervised object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9508–9517 (2019)


  11. Gao, Z., et al.: Learning from multiple annotator noisy labels via sample-wise label fusion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13684, pp. 407–422. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20053-3_24


  12. Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: modeling individual labelers improves classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)


  13. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)


  14. Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)


  15. Khetan, A., Lipton, Z.C., Anandkumar, A.: Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577 (2017)

  16. Khodabandeh, M., Vahdat, A., Ranjbar, M., Macready, W.G.: A robust learning approach to domain adaptive object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 480–490 (2019)


  17. Langerak, T.R., van der Heide, U.A., Kotte, A.N., Viergever, M.A., Van Vulpen, M., Pluim, J.P.: Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (simple). IEEE Trans. Med. Imaging 29(12), 2000–2008 (2010)


  18. Le, K.H., Tran, T.V., Pham, H.H., Nguyen, H.T., Le, T.T., Nguyen, H.Q.: Learning from multiple expert annotators for enhancing anomaly detection in medical image analysis. arXiv preprint arXiv:2203.10611 (2022)

  19. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: a benchmark dataset for document layout analysis. arXiv preprint arXiv:2006.01038 (2020)

  20. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  21. Michaelis, C., et al.: Benchmarking robustness in object detection: autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)

  22. Nguyen, D.B., Nguyen, H.Q., Elliott, J., KeepLearning, Nguyen, N.T., Culliton, P.: VinBigData chest X-ray abnormalities detection (2020). https://kaggle.com/competitions/vinbigdata-chest-xray-abnormalities-detection

  23. Nguyen, H.Q., et al.: VinDr-CXR: an open dataset of chest X-rays with radiologist’s annotations. Sci. Data 9(1), 429 (2022)


  24. Qiao, S., Chen, L.C., Yuille, A.: Detectors: detecting objects with recursive feature pyramid and switchable atrous convolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10213–10224 (2021)


  25. Ramamonjison, R., Banitalebi-Dehkordi, A., Kang, X., Bai, X., Zhang, Y.: SimROD: a simple adaptation method for robust object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 3550–3559. IEEE, October 2021. https://doi.org/10.1109/ICCV48922.2021.00355

  26. Raykar, V.C., et al.: Supervised learning from multiple experts: whom to trust when everyone lies a bit. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 889–896 (2009)


  27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)


  28. Rodrigues, F., Pereira, F.: Deep learning from crowds. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)


  29. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614–622 (2008)


  30. Sheng, V.S., Zhang, J.: Machine learning with crowdsourcing: a brief summary of the past research and future directions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9837–9843 (2019)


  31. Sinha, V.B., Rao, S., Balasubramanian, V.N.: Fast Dawid-Skene: a fast vote aggregation scheme for sentiment classification. arXiv preprint arXiv:1803.02781 (2018)

  32. Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: ensembling boxes from different object detection models. Image Vis. Comput. 107, 104117 (2021)


  33. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. 34, 8135–8153 (2022)


  34. Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D.C., Silberman, N.: Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253 (2019)


  35. Tschirschwitz, D., Klemstein, F., Stein, B., Rodehorst, V.: A dataset for analysing complex document layouts in the digital humanities and its evaluation with Krippendorff’s alpha. In: Andres, B., Bernard, F., Cremers, D., Frintrop, S., Goldlücke, B., Ihrke, I. (eds.) DAGM GCPR 2022. LNCS, vol. 13485, pp. 354–374. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16788-1_22


  36. Wang, X., et al.: Robust object detection via instance-level temporal cycle confusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9143–9152 (2021)


  37. Wang, Z., Li, Y., Guo, Y., Fang, L., Wang, S.: Data-uncertainty guided multi-phase learning for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4568–4577 (2021)


  38. Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23(7), 903–921 (2004)


  39. Whitehill, J., Wu, T.F., Bergsma, J., Movellan, J., Ruvolo, P.: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: Advances in Neural Information Processing Systems 22 (2009)


  40. Wu, Y., et al.: Rethinking classification and localization for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10186–10195 (2020)


  41. Wu, Z., Suresh, K., Narayanan, P., Xu, H., Kwon, H., Wang, Z.: Delving into robust object detection from unmanned aerial vehicles: a deep nuisance disentanglement approach. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1201–1210 (2019)


  42. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.: Adversarial examples for semantic segmentation and object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1369–1378 (2017)


  43. Zhang, H., Wang, J.: Towards adversarially robust object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, pp. 421–430. IEEE, October 2019. https://doi.org/10.1109/ICCV.2019.00051

  44. Zhang, Z., Zhang, H., Arik, S.O., Lee, H., Pfister, T.: Distilling effective supervision from severe label noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9294–9303 (2020)


  45. Zheng, Y., Li, G., Li, Y., Shan, C., Cheng, R.: Truth inference in crowdsourcing: is the problem solved? Proc. VLDB Endow. 10(5), 541–552 (2017)


  46. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022. IEEE (2019)


  47. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study. Artif. Intell. Rev. 22(3), 177 (2004)



Acknowledgment

This work was supported by the Thuringian Ministry for Economy, Science and Digital Society/Thüringer Aufbaubank (TMWWDG/TAB).

Author information

Correspondence to David Tschirschwitz.

Appendices

Appendix 1

This section presents our adaptation of the weighted boxes fusion (WBF) technique, tailored to instance segmentation as weighted mask fusion (WMF).

In their study, [18] propose a method for combining annotations from multiple annotators using the weighted boxes fusion [32] approach. In this method, bounding boxes are matched greedily, and only with boxes of the same class; no annotations are discarded. The WBF algorithm fuses boxes that exceed a specified overlap threshold, producing new boxes that represent the weighted average of the originals. The approach also allows for the inclusion of box confidence scores and prior weights for each annotator; a minimal sketch is given below.
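The following is a minimal sketch of this greedy, class-aware fusion step in the spirit of [32]; it is not the reference implementation, and the function names and the `iou_thr` default are our own choices.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def weighted_box_fusion(boxes, scores, labels, iou_thr=0.55):
    """Greedy, class-aware fusion: every box joins the first cluster of
    its own class that it overlaps above iou_thr, else starts a new
    cluster; nothing is discarded. Fused coordinates are the
    score-weighted average of the cluster members."""
    order = np.argsort(scores)[::-1]  # process high-confidence boxes first
    clusters = []
    for idx in order:
        box = np.asarray(boxes[idx], dtype=float)
        matched = None
        for c in clusters:
            if c["label"] == labels[idx] and iou(box, c["fused"]) > iou_thr:
                matched = c
                break
        if matched is None:
            clusters.append({"label": labels[idx],
                             "members": [(box, scores[idx])],
                             "fused": box.copy()})
        else:
            matched["members"].append((box, scores[idx]))
            w = np.array([s for _, s in matched["members"]])
            m = np.stack([b for b, _ in matched["members"]])
            matched["fused"] = (w[:, None] * m).sum(0) / w.sum()
    # Mean member score as the fused score (a simplification of [32]).
    return [(c["fused"], float(np.mean([s for _, s in c["members"]])),
             c["label"]) for c in clusters]
```

Per-annotator prior weights can be folded in by multiplying each score with the corresponding annotator's weight before fusion.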

To extend the WBF method to instance segmentation, we introduce an option to fuse segmentation masks, which involves four steps: (1) calculating the weighted areas and weighted center points of the different masks, (2) computing the average center point and average area of the selected masks, (3) selecting the original mask whose center point is closest to the weighted center point, and (4) dilating or eroding the chosen mask until its area approximates the averaged area. The resulting mask is used as the aggregated segmentation mask and also serves as the averaging operation during aggregation for LAEM and MJV with uniform weights. A sketch of these four steps follows.
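A minimal sketch of the four steps, assuming binary numpy masks and uniform weights by default; the function name `fuse_masks`, the iteration cap, and the stopping tolerance are our own choices, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, center_of_mass

def fuse_masks(masks, weights=None):
    """Weighted mask fusion sketch: pick the member mask whose centroid
    is closest to the weighted mean centroid, then grow/shrink it until
    its area approximates the weighted mean area."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    if weights is None:
        weights = np.ones(len(masks))
    weights = np.asarray(weights, dtype=float) / np.sum(weights)

    # Steps 1 + 2: weighted average of per-mask areas and centroids.
    areas = np.array([m.sum() for m in masks])
    centers = np.array([center_of_mass(m) for m in masks])
    target_area = float(weights @ areas)
    target_center = weights @ centers

    # Step 3: select the mask with the closest centroid.
    dists = np.linalg.norm(centers - target_center, axis=1)
    fused = masks[int(np.argmin(dists))].copy()

    # Step 4: dilate or erode until the area approximates the target.
    for _ in range(50):  # safety cap on iterations
        area = fused.sum()
        if abs(area - target_area) / max(target_area, 1.0) < 0.02:
            break
        fused = binary_dilation(fused) if area < target_area \
            else binary_erosion(fused)
    return fused
```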

Moreover, we integrate the WBF approach with LAEM, yielding WBF+EM. This integration involves estimating annotator confidence using LAEM and subsequently incorporating it into the WBF method to produce weighted average areas instead of simply averaged areas. While the differences between LAEM and WBF might seem subtle, WBF+EM offers a more thorough approach to annotator fusion. The modification is relatively minor, and its impact is modest, as corroborated by our experiments in Appendix 2.
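Purely as an illustration, a hedged sketch of the glue, reusing the hypothetical `weighted_box_fusion` and `em_annotator_confidence` routines sketched in these appendices; none of these names come from the paper.

```python
# Hypothetical glue (all names are ours): estimate per-annotator
# confidences with an EM routine (see the Appendix 2 sketch), then use
# them as per-box weights in WBF so that fused coordinates become
# confidence-weighted averages. annotator_of maps box index -> annotator.
confidences, _ = em_annotator_confidence(votes, n_classes=NUM_CLASSES)
scores = [confidences[annotator_of[i]] for i in range(len(boxes))]
fused = weighted_box_fusion(boxes, scores, labels, iou_thr=0.55)
```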

Appendix 2

Table 5. Cross-validation of ground truth inference combinations between training and test data, for DetectoRS with a ResNet-50 backbone on the TexBiG dataset. Shown is the mAP@[.5:.95] for instance masks and bounding boxes. Union is represented by \(\cup \), intersection by \(\cap \), and averaging by \(\mu \). RL denotes training conducted on un-aggregated noisy labels. The bottom two rows show the average performance of each training method.

In this experiment, we carried out a comparative analysis of different ground truth inference methods. To do this, we separated the annotations for training and testing and created various train-test combinations using the available ground truth estimation methods; a model was then trained on each combination. The results reveal how aggregation methods can impact the performance of the trained models and how the outcomes vary with the specific combination of training and testing aggregation.

Tables 5 and 6 present the application of various ground truth estimation methods to repeated labels. For the TexBiG dataset, each method is employed to aggregate the labels of both training and test data, and all possible train-test combinations are trained and tested to enable a cross-comparison of the different ground truth inference methods, as shown in Table 5. The hyperparameter for the area combination is denoted as \(\cup \) for union, \(\mu \) for averaging, and \(\cap \) for intersection. Additionally, the plain repeated labels, without any aggregation, are compared with the different aggregated test data. Our findings reveal that weighted boxes fusion does not perform well on a high-agreement dataset. This can be attributed to WBF including most annotations, whereas with high agreement it is more desirable to exclude non-conforming instances. Majority voting and localization-aware expectation maximization perform similarly; however, LAEM provides a more elegant solution for addressing edge cases. Calculating the annotator confidence, as done in LAEM, is highly advantageous. In rare cases, however, spammer annotators could circumvent the confidence estimation by annotating large portions of simple examples correctly while failing at hard cases. Such behavior would give the spammer a high confidence level, potentially outvoting the correct annotators on challenging and crucial cases.
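For intuition, here is a minimal sketch of this kind of EM-based annotator-confidence estimation, in the spirit of Dawid and Skene [8]. It is our simplification (a single symmetric accuracy per annotator rather than a full confusion matrix), not the paper's LAEM; the function name and signature are hypothetical.

```python
import numpy as np

def em_annotator_confidence(votes, n_classes, n_iters=50):
    """Simplified Dawid-Skene-style EM [8]. Assumes every item has at
    least one vote. votes: (n_items, n_annotators) int array of class
    indices, with -1 marking a missing vote. Returns per-annotator
    accuracies and per-item label posteriors."""
    n_items, n_annotators = votes.shape

    # Initialize label posteriors from majority voting.
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for a in range(n_annotators):
            if votes[i, a] >= 0:
                post[i, votes[i, a]] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    accuracy = np.full(n_annotators, 0.8)  # initial guess
    for _ in range(n_iters):
        # M-step: accuracy = expected agreement with current posterior.
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            if seen.any():
                accuracy[a] = post[seen, votes[seen, a]].mean()
        # E-step: label posteriors under a symmetric label-noise model.
        log_lik = np.zeros((n_items, n_classes))
        for a in range(n_annotators):
            p_c = np.clip(accuracy[a], 1e-6, 1 - 1e-6)
            p_w = (1.0 - p_c) / max(n_classes - 1, 1)
            for i in range(n_items):
                v = votes[i, a]
                if v < 0:
                    continue
                log_lik[i] += np.log(p_w)                    # wrong-label mass
                log_lik[i, v] += np.log(p_c) - np.log(p_w)   # correct the voted class
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return accuracy, post
```

Hard aggregated labels would be `post.argmax(axis=1)`; plain majority voting corresponds to stopping right after the initialization.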

Table 6. Comparing results with the private Kaggle leaderboard [22] for the VinDr-CXR dataset using the double headed R-CNN at \(mAP_{40}\). Union is represented by \(\cup \), intersection by \(\cap \) and averaging by \(\mu \). RL denotes training conducted on un-aggregated noisy labels.

The main performance differences between MJV and LAEM arise from the three combination operations: union, averaging, and intersection. Combining areas by taking their union results in larger areas, making it easier for a classifier to identify the respective regions. Analysis of the mean results of the training methods reveals that both MJV+\(\cup \) and LAEM+\(\cup \) exhibit the highest performance across various test configurations. Conversely, methods parameterized with intersection \(\cap \) yield the lowest mean results. Training with repeated labels without any aggregation yields results similar to training with aggregated labels; while it is generally feasible to train with noisy labels, the performance is slightly dampened. Since the test data aggregation method is LAEM+\(\mu \), as described in Sect. 3.2, the best-performing training method, LAEM+\(\cup \), is chosen as the aggregation method for the training data in the experiments shown in Sects. 4.2 and 4.3.
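For matched boxes, the three combination operations have a direct geometric reading. A minimal sketch under our own naming (`combine_boxes` is not from the paper):

```python
import numpy as np

def combine_boxes(boxes, mode="union", weights=None):
    """Combine matched boxes (N, 4) in [x1, y1, x2, y2] format.
    mode: 'union' (smallest enclosing box), 'intersection'
    (common overlap, possibly empty), or 'average'."""
    boxes = np.asarray(boxes, dtype=float)
    if mode == "union":
        return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                         boxes[:, 2].max(), boxes[:, 3].max()])
    if mode == "intersection":
        return np.array([boxes[:, 0].max(), boxes[:, 1].max(),
                         boxes[:, 2].min(), boxes[:, 3].min()])
    if weights is None:
        weights = np.ones(len(boxes))
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return w @ boxes  # per-coordinate weighted average
```

Union yields the largest fused region and intersection the smallest, which matches the observed ranking of \(\cup \) over \(\cap \) in the mean results.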

For the VinDr-CXR dataset, a smaller, similar experiment is performed, as shown in Table 6. As the Kaggle leaderboard already provides an aggregated ground truth and its labels are unavailable, only the training data are aggregated. Our findings indicate that training with plain repeated labels leads to better results. Given the low agreement of the dataset, training with repeated labels may be seen as a form of “label augmentation.” Interestingly, the methods used to aggregate the test data, such as WBF, do not outperform the other methods. However, ground truth estimation methods are not designed to boost performance but to provide a suitable estimate of the targeted outcome. Based on these results, the following experiments on VinDr-CXR are run with repeated labels for training.

Fig. 3. Qualitative results on three test images from VinDr-CXR. Left: the original image with the repeated labels indicated by different line types. Right: the four smaller images are, from top left to bottom right, MJV+\(\cap \), LAEM+\(\mu \), LAEM+\(\cup \), and WBF.

Appendix 3

This section shows three more comparisons between different ground truth aggregation methods, exemplified on the VinDr-CXR dataset [23]. All follow the same structure. Left: the original image with the repeated labels indicated by different line types. Right: the four smaller images are, from top left to bottom right, MJV+\(\cap \), LAEM+\(\mu \), LAEM+\(\cup \), and WBF (Fig. 3).


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tschirschwitz, D., Benz, C., Florek, M., Norderhus, H., Stein, B., Rodehorst, V. (2024). Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_39


  • DOI: https://doi.org/10.1007/978-3-031-54605-1_39


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54604-4

  • Online ISBN: 978-3-031-54605-1

