
Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

Conference paper
Pattern Recognition (DAGM GCPR 2023)

Abstract

The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy-label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations, the key factors being (1) dataset complexity, (2) annotator consistency, and (3) annotation budget constraints.


Notes

  1. Computed by dividing the number of instances by the product of the number of images and the number of annotators.
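Expressed as a formula (the notation is ours, not the paper's):

\[ \text{annotation density} = \frac{N_{\text{instances}}}{N_{\text{images}} \cdot N_{\text{annotators}}} \]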

References

  1. Asman, A.J., Landman, B.A.: Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (collate). IEEE Trans. Med. Imaging 30(10), 1779–1794 (2011)


  2. Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2019). https://doi.org/10.1109/tpami.2019.2956516

  3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13


  4. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

  5. Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster R-CNN for object detection in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3339–3348 (2018)


  6. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1290–1299 (2022)


  7. Cheng, Y., et al.: Flow: a dataset and benchmark for floating waste detection in inland waters. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10953–10962 (2021)


  8. Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 28(1), 20–28 (1979)


  9. Feng, D., et al.: Labels are not perfect: inferring spatial uncertainty in object detection. IEEE Trans. Intell. Transp. Syst. 23(8), 9981–9994 (2021)


  10. Gao, J., Wang, J., Dai, S., Li, L.J., Nevatia, R.: Note-RCNN: noise tolerant ensemble RCNN for semi-supervised object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9508–9517 (2019)


  11. Gao, Z., et al.: Learning from multiple annotator noisy labels via sample-wise label fusion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13684, pp. 407–422. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20053-3_24


  12. Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: modeling individual labelers improves classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)


  13. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)


  14. Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)


  15. Khetan, A., Lipton, Z.C., Anandkumar, A.: Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577 (2017)

  16. Khodabandeh, M., Vahdat, A., Ranjbar, M., Macready, W.G.: A robust learning approach to domain adaptive object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 480–490 (2019)


  17. Langerak, T.R., van der Heide, U.A., Kotte, A.N., Viergever, M.A., Van Vulpen, M., Pluim, J.P.: Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (simple). IEEE Trans. Med. Imaging 29(12), 2000–2008 (2010)


  18. Le, K.H., Tran, T.V., Pham, H.H., Nguyen, H.T., Le, T.T., Nguyen, H.Q.: Learning from multiple expert annotators for enhancing anomaly detection in medical image analysis. arXiv preprint arXiv:2203.10611 (2022)

  19. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: a benchmark dataset for document layout analysis. arXiv preprint arXiv:2006.01038 (2020)

  20. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  21. Michaelis, C., et al.: Benchmarking robustness in object detection: autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)

  22. Nguyen, D.B., Nguyen, H.Q., Elliott, J., KeepLearning, Nguyen, N.T., Culliton, P.: VinBigData chest X-ray abnormalities detection (2020). https://kaggle.com/competitions/vinbigdata-chest-xray-abnormalities-detection

  23. Nguyen, H.Q., et al.: VinDr-CXR: an open dataset of chest X-rays with radiologist’s annotations. Sci. Data 9(1), 429 (2022)


  24. Qiao, S., Chen, L.C., Yuille, A.: Detectors: detecting objects with recursive feature pyramid and switchable atrous convolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10213–10224 (2021)


  25. Ramamonjison, R., Banitalebi-Dehkordi, A., Kang, X., Bai, X., Zhang, Y.: SimROD: a simple adaptation method for robust object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 3550–3559. IEEE, October 2021. https://doi.org/10.1109/ICCV48922.2021.00355

  26. Raykar, V.C., et al.: Supervised learning from multiple experts: whom to trust when everyone lies a bit. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 889–896 (2009)


  27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)


  28. Rodrigues, F., Pereira, F.: Deep learning from crowds. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)


  29. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614–622 (2008)


  30. Sheng, V.S., Zhang, J.: Machine learning with crowdsourcing: a brief summary of the past research and future directions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9837–9843 (2019)


  31. Sinha, V.B., Rao, S., Balasubramanian, V.N.: Fast Dawid-Skene: a fast vote aggregation scheme for sentiment classification. arXiv preprint arXiv:1803.02781 (2018)

  32. Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: ensembling boxes from different object detection models. Image Vis. Comput. 107, 104117 (2021)


  33. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. 34, 8135–8153 (2022)


  34. Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D.C., Silberman, N.: Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253 (2019)


  35. Tschirschwitz, D., Klemstein, F., Stein, B., Rodehorst, V.: A dataset for analysing complex document layouts in the digital humanities and its evaluation with Krippendorff’s alpha. In: Andres, B., Bernard, F., Cremers, D., Frintrop, S., Goldlücke, B., Ihrke, I. (eds.) DAGM GCPR 2022. LNCS, vol. 13485, pp. 354–374. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16788-1_22


  36. Wang, X., et al.: Robust object detection via instance-level temporal cycle confusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9143–9152 (2021)


  37. Wang, Z., Li, Y., Guo, Y., Fang, L., Wang, S.: Data-uncertainty guided multi-phase learning for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4568–4577 (2021)


  38. Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23(7), 903–921 (2004)


  39. Whitehill, J., Wu, T.F., Bergsma, J., Movellan, J., Ruvolo, P.: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: Advances in Neural Information Processing Systems 22 (2009)


  40. Wu, Y., et al.: Rethinking classification and localization for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10186–10195 (2020)


  41. Wu, Z., Suresh, K., Narayanan, P., Xu, H., Kwon, H., Wang, Z.: Delving into robust object detection from unmanned aerial vehicles: a deep nuisance disentanglement approach. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1201–1210 (2019)


  42. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.: Adversarial examples for semantic segmentation and object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1369–1378 (2017)


  43. Zhang, H., Wang, J.: Towards adversarially robust object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, pp. 421–430. IEEE, October 2019. https://doi.org/10.1109/ICCV.2019.00051

  44. Zhang, Z., Zhang, H., Arik, S.O., Lee, H., Pfister, T.: Distilling effective supervision from severe label noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9294–9303 (2020)


  45. Zheng, Y., Li, G., Li, Y., Shan, C., Cheng, R.: Truth inference in crowdsourcing: is the problem solved? Proc. VLDB Endow. 10(5), 541–552 (2017)


  46. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022. IEEE (2019)


  47. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study. Artif. Intell. Rev. 22(3), 177 (2004)



Acknowledgment

This work was supported by the Thuringian Ministry for Economy, Science and Digital Society/Thüringer Aufbaubank (TMWWDG/TAB).

Author information

Correspondence to David Tschirschwitz.

Appendices

Appendix 1

This section presents our adaptation of the weighted boxes fusion (WBF) technique, tailored to instance segmentation as weighted mask fusion (WMF).

In their study, [18] propose a method for combining annotations from multiple annotators using the weighted boxes fusion [32] approach. In this method, bounding boxes are matched greedily, and only with boxes of the same class; no annotations are discarded. The WBF algorithm fuses boxes that exceed a specified overlap threshold, producing new boxes that represent the weighted average of the originals. The approach also allows for the inclusion of box confidence scores and prior weights for each annotator; a minimal sketch is given below.
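The following is a minimal sketch of this greedy, class-aware fusion step in the spirit of [32]; it is not the reference implementation, and the function names and the `iou_thr` default are our own choices.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def weighted_box_fusion(boxes, scores, labels, iou_thr=0.55):
    """Greedy, class-aware fusion: every box joins the first cluster of
    its own class that it overlaps above iou_thr, else starts a new
    cluster; nothing is discarded. Fused coordinates are the
    score-weighted average of the cluster members."""
    order = np.argsort(scores)[::-1]  # process high-confidence boxes first
    clusters = []
    for idx in order:
        box = np.asarray(boxes[idx], dtype=float)
        matched = None
        for c in clusters:
            if c["label"] == labels[idx] and iou(box, c["fused"]) > iou_thr:
                matched = c
                break
        if matched is None:
            clusters.append({"label": labels[idx],
                             "members": [(box, scores[idx])],
                             "fused": box.copy()})
        else:
            matched["members"].append((box, scores[idx]))
            w = np.array([s for _, s in matched["members"]])
            m = np.stack([b for b, _ in matched["members"]])
            matched["fused"] = (w[:, None] * m).sum(0) / w.sum()
    # Mean member score as the fused score (a simplification of [32]).
    return [(c["fused"], float(np.mean([s for _, s in c["members"]])),
             c["label"]) for c in clusters]
```

Per-annotator prior weights can be folded in by multiplying each score with the corresponding annotator's weight before fusion.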

To extend the WBF method to instance segmentation, we introduce an option to fuse segmentation masks, which involves four steps: (1) calculating the weighted areas and weighted center points of the different masks, (2) computing the average center point and average area of the selected masks, (3) selecting the original mask whose center point is closest to the weighted center point, and (4) dilating or eroding the chosen mask until its area approximates the averaged area. The resulting mask is used as the aggregated segmentation mask and also serves as the averaging operation during aggregation for LAEM and MJV with uniform weights. A sketch of these four steps follows.
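A minimal sketch of the four steps, assuming binary numpy masks and uniform weights by default; the function name `fuse_masks`, the iteration cap, and the stopping tolerance are our own choices, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, center_of_mass

def fuse_masks(masks, weights=None):
    """Weighted mask fusion sketch: pick the member mask whose centroid
    is closest to the weighted mean centroid, then grow/shrink it until
    its area approximates the weighted mean area."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    if weights is None:
        weights = np.ones(len(masks))
    weights = np.asarray(weights, dtype=float) / np.sum(weights)

    # Steps 1 + 2: weighted average of per-mask areas and centroids.
    areas = np.array([m.sum() for m in masks])
    centers = np.array([center_of_mass(m) for m in masks])
    target_area = float(weights @ areas)
    target_center = weights @ centers

    # Step 3: select the mask with the closest centroid.
    dists = np.linalg.norm(centers - target_center, axis=1)
    fused = masks[int(np.argmin(dists))].copy()

    # Step 4: dilate or erode until the area approximates the target.
    for _ in range(50):  # safety cap on iterations
        area = fused.sum()
        if abs(area - target_area) / max(target_area, 1.0) < 0.02:
            break
        fused = binary_dilation(fused) if area < target_area \
            else binary_erosion(fused)
    return fused
```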

Moreover, we integrate the WBF approach with LAEM, yielding WBF+EM. This integration involves estimating annotator confidence using LAEM and subsequently incorporating it into the WBF method to produce weighted average areas instead of simply averaged areas. While the differences between LAEM and WBF might seem subtle, WBF+EM offers a more thorough approach to annotator fusion. The modification is relatively minor, and its impact is modest, as corroborated by our experiments in Appendix 2.
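Purely as an illustration, a hedged sketch of the glue, reusing the hypothetical `weighted_box_fusion` and `em_annotator_confidence` routines sketched in these appendices; none of these names come from the paper.

```python
# Hypothetical glue (all names are ours): estimate per-annotator
# confidences with an EM routine (see the Appendix 2 sketch), then use
# them as per-box weights in WBF so that fused coordinates become
# confidence-weighted averages. annotator_of maps box index -> annotator.
confidences, _ = em_annotator_confidence(votes, n_classes=NUM_CLASSES)
scores = [confidences[annotator_of[i]] for i in range(len(boxes))]
fused = weighted_box_fusion(boxes, scores, labels, iou_thr=0.55)
```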

Appendix 2

Table 5. Cross-validation of ground truth inference combinations between training and test data, for DetectoRS with a ResNet-50 backbone on the TexBiG dataset. Shown is the mAP@[.5:.95] for instance masks and bounding boxes. Union is represented by \(\cup \), intersection by \(\cap \), and averaging by \(\mu \). RL denotes training conducted on un-aggregated noisy labels. The bottom two rows show the average performance of each training method.

In this experiment, we carried out a comparative analysis of different ground truth inference methods. To do this, we separated the annotations for training and testing and created various train-test combinations using the available ground truth estimation methods; a model was then trained on each combination. The results reveal how aggregation methods can impact the performance of the trained models and how the outcomes vary with the specific combination of training and testing aggregation.

Tables 5 and 6 present the application of various ground truth estimation methods to repeated labels. For the TexBiG dataset, each method is employed to aggregate the labels of both training and test data, and all possible train-test combinations are trained and tested to enable a cross-comparison of the different ground truth inference methods, as shown in Table 5. The hyperparameter for the area combination is denoted as \(\cup \) for union, \(\mu \) for averaging, and \(\cap \) for intersection. Additionally, the plain repeated labels, without any aggregation, are compared with the different aggregated test data. Our findings reveal that weighted boxes fusion does not perform well on a high-agreement dataset. This can be attributed to WBF including most annotations, whereas with high agreement it is more desirable to exclude non-conforming instances. Majority voting and localization-aware expectation maximization perform similarly; however, LAEM provides a more elegant solution for addressing edge cases. Calculating the annotator confidence, as done in LAEM, is highly advantageous. In rare cases, however, spammer annotators could circumvent the confidence estimation by annotating large portions of simple examples correctly while failing at hard cases. Such behavior would give the spammer a high confidence level, potentially outvoting the correct annotators on challenging and crucial cases.
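For intuition, here is a minimal sketch of this kind of EM-based annotator-confidence estimation, in the spirit of Dawid and Skene [8]. It is our simplification (a single symmetric accuracy per annotator rather than a full confusion matrix), not the paper's LAEM; the function name and signature are hypothetical.

```python
import numpy as np

def em_annotator_confidence(votes, n_classes, n_iters=50):
    """Simplified Dawid-Skene-style EM [8]. Assumes every item has at
    least one vote. votes: (n_items, n_annotators) int array of class
    indices, with -1 marking a missing vote. Returns per-annotator
    accuracies and per-item label posteriors."""
    n_items, n_annotators = votes.shape

    # Initialize label posteriors from majority voting.
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for a in range(n_annotators):
            if votes[i, a] >= 0:
                post[i, votes[i, a]] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    accuracy = np.full(n_annotators, 0.8)  # initial guess
    for _ in range(n_iters):
        # M-step: accuracy = expected agreement with current posterior.
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            if seen.any():
                accuracy[a] = post[seen, votes[seen, a]].mean()
        # E-step: label posteriors under a symmetric label-noise model.
        log_lik = np.zeros((n_items, n_classes))
        for a in range(n_annotators):
            p_c = np.clip(accuracy[a], 1e-6, 1 - 1e-6)
            p_w = (1.0 - p_c) / max(n_classes - 1, 1)
            for i in range(n_items):
                v = votes[i, a]
                if v < 0:
                    continue
                log_lik[i] += np.log(p_w)                    # wrong-label mass
                log_lik[i, v] += np.log(p_c) - np.log(p_w)   # correct the voted class
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return accuracy, post
```

Hard aggregated labels would be `post.argmax(axis=1)`; plain majority voting corresponds to stopping right after the initialization.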

Table 6. Comparing results with the private Kaggle leaderboard [22] for the VinDr-CXR dataset using the double headed R-CNN at \(mAP_{40}\). Union is represented by \(\cup \), intersection by \(\cap \) and averaging by \(\mu \). RL denotes training conducted on un-aggregated noisy labels.

The main performance differences between MJV and LAEM arise from the three combination operations: union, averaging, and intersection. Combining areas by taking their union results in larger areas, making it easier for a classifier to identify the respective regions. Analysis of the mean results of the training methods reveals that both MJV+\(\cup \) and LAEM+\(\cup \) exhibit the highest performance across various test configurations. Conversely, methods parameterized with intersection \(\cap \) yield the lowest mean results. Training with repeated labels without any aggregation yields results similar to training with aggregated labels; while it is generally feasible to train with noisy labels, the performance is slightly dampened. Since the test data aggregation method is LAEM+\(\mu \), as described in Sect. 3.2, the best-performing training method, LAEM+\(\cup \), is chosen as the aggregation method for the training data in the experiments shown in Sects. 4.2 and 4.3.
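For matched boxes, the three combination operations have a direct geometric reading. A minimal sketch under our own naming (`combine_boxes` is not from the paper):

```python
import numpy as np

def combine_boxes(boxes, mode="union", weights=None):
    """Combine matched boxes (N, 4) in [x1, y1, x2, y2] format.
    mode: 'union' (smallest enclosing box), 'intersection'
    (common overlap, possibly empty), or 'average'."""
    boxes = np.asarray(boxes, dtype=float)
    if mode == "union":
        return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                         boxes[:, 2].max(), boxes[:, 3].max()])
    if mode == "intersection":
        return np.array([boxes[:, 0].max(), boxes[:, 1].max(),
                         boxes[:, 2].min(), boxes[:, 3].min()])
    if weights is None:
        weights = np.ones(len(boxes))
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return w @ boxes  # per-coordinate weighted average
```

Union yields the largest fused region and intersection the smallest, which matches the observed ranking of \(\cup \) over \(\cap \) in the mean results.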

For the VinDr-CXR dataset, a smaller, similar experiment is performed, as shown in Table 6. As the Kaggle leaderboard already provides an aggregated ground truth and its labels are unavailable, only the training data are aggregated. Our findings indicate that training with plain repeated labels leads to better results. Given the low agreement of the dataset, training with repeated labels may be seen as a form of “label augmentation.” Interestingly, the methods used to aggregate the test data, such as WBF, do not outperform the other methods. However, ground truth estimation methods are not designed to boost performance but to provide a suitable estimate of the targeted outcome. Based on these results, the following experiments on VinDr-CXR are run with repeated labels for training.

Fig. 3. Qualitative results on three test images from VinDr-CXR. Left: the original image with the repeated labels indicated by different line types. Right: the four smaller images are, from top left to bottom right, MJV+\(\cap \), LAEM+\(\mu \), LAEM+\(\cup \), and WBF.

Appendix 3

This section shows three more comparisons between different ground truth aggregation methods, exemplified on the VinDr-CXR dataset [23]. All follow the same structure. Left: the original image with the repeated labels indicated by different line types. Right: the four smaller images are, from top left to bottom right, MJV+\(\cap \), LAEM+\(\mu \), LAEM+\(\cup \), and WBF (Fig. 3).


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tschirschwitz, D., Benz, C., Florek, M., Norderhus, H., Stein, B., Rodehorst, V. (2024). Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_39


  • DOI: https://doi.org/10.1007/978-3-031-54605-1_39


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54604-4

  • Online ISBN: 978-3-031-54605-1

