On-line object detection: a robotics challenge


Abstract

Object detection is a fundamental ability for robots interacting with their environment. While stunningly effective, state-of-the-art deep learning methods require huge amounts of labeled images and hours of training, which does not suit such scenarios. This work presents a novel pipeline, resulting from the integration of Maiettini et al. (in 2017 IEEE-RAS 17th international conference on humanoid robotics (Humanoids), 2017) and Maiettini et al. (in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2018), which enables a robot to be naturally trained to detect novel objects in a few seconds. Moreover, we report an extended empirical evaluation of the learning method, showing that the proposed hybrid architecture is key to leveraging powerful deep representations while maintaining the fast training time of large-scale kernel methods. We validate our approach on the Pascal VOC benchmark (Everingham et al. in Int J Comput Vis 88(2):303–338, 2010) and on a challenging robotic scenario (iCubWorld Transformations (Pasquale et al. in Rob Auton Syst 112:260–281, 2019)). We address real-world use cases and show how to tune the method for different speed/accuracy trade-offs. Lastly, we discuss limitations and directions for future development.


Notes

  1. http://amazonpickingchallenge.org/.

  2. https://github.com/LCSL/FALKON_paper.

  3. https://robotology.github.io/iCubWorld/.

  4. https://github.com/tzutalin/labelImg.

  5. https://youtu.be/eT-2v6-xoSs.

  6. https://www.csie.ntu.edu.tw/~cjlin/liblinear/.

  7. https://www.mathworks.com/.

  8. https://robotology.github.io/iCubWorld/.

  9. https://robotology.github.io/iCubWorld/#icubworld-transformations-modal/.

  10. https://github.com/tzutalin/labelImg.

References

  • Bajcsy, R., Aloimonos, Y., & Tsotsos, J. K. (2018). Revisiting active perception. Autonomous Robots, 42(2), 177–196.

  • Browatzki, B., Tikhanoff, V., Metta, G., Bülthoff, H. H., & Wallraven, C. (2012). Active object recognition on a humanoid robot. In 2012 IEEE international conference on robotics and automation, pp. 2021–2028.

  • Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 379–387). Curran Associates Inc.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In Jebara, T. and Xing, E. P., (Eds.), Proceedings of the 31st international conference on machine Learning (ICML-14), pp. 647–655. JMLR workshop and conference proceedings.

  • Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.

  • Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010a). Cascade object detection with deformable part models. In 2010 IEEE Computer society conference on computer vision and pattern recognition, pp. 2241–2248. IEEE.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010b). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Georgakis, G., Mousavian, A., Berg, A. C., & Kosecka, J. (2017). Synthesizing training data for object detection in indoor scenes. CoRR, arXiv:1702.07836.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the international conference on computer vision (ICCV).

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. In: 2017 IEEE international conference on computer vision (ICCV), pp. 2980–2988.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM international conference on multimedia—MM ’14, pp. 675–678. ACM Press.

  • Kaiser, L., Nachum, O., Roy, A., & Bengio, S. (2017). Learning to remember rare events. CoRR, arXiv:1703.03129.

  • Lin, T., Goyal, P., Girshick, R. B., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In IEEE international conference on computer vision, ICCV 2017, Venice, Italy, pp. 2999–3007.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV), Zürich.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., & Reed, S. E. (2015). SSD: Single shot multibox detector. CoRR, arXiv:1512.02325.

  • Maiettini, E., Pasquale, G., Rosasco, L., & Natale, L. (2017). Interactive data collection for deep learning object detectors on humanoid robots. In 2017 IEEE-RAS 17th international conference on humanoid robotics (Humanoids), pp. 862–868.

  • Maiettini, E., Pasquale, G., Rosasco, L., & Natale, L. (2018). Speeding-up object detection training for robotics with FALKON. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Metta, G., Fitzpatrick, P., & Natale, L. (2006). YARP: Yet another robot platform. International Journal of Advanced Robotic Systems, 3(1).

  • Metta, G., Natale, L., Nori, F., Sandini, G., Vernon, D., Fadiga, L., et al. (2010). The iCub humanoid robot: An open-systems platform for research in cognitive development. Neural Networks, 23(8–9), 1125–1134.

  • Parmiggiani, A., Fiorio, L., Scalzo, A., Sureshbabu, A. V., Randazzo, M., Maggiali, M., Pattacini, U., Lehmann, H., Tikhanoff, V., Domenichelli, D., Cardellino, A., Congiu, P., Pagnin, A., Cingolani, R., Natale, L., & Metta, G. (2017). The design and validation of the r1 personal humanoid. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 674–680.

  • Pasquale, G., Ciliberto, C., Odone, F., Rosasco, L., & Natale, L. (2019). Are we done with object recognition? The iCub robot’s perspective. Robotics and Autonomous Systems, 112, 260–281.

  • Pasquale, G., Ciliberto, C., Rosasco, L., & Natale, L. (2016a). Object identification from few examples by improving the invariance of a deep convolutional neural network. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 4904–4911.

  • Pasquale, G., Mar, T., Ciliberto, C., Rosasco, L., & Natale, L. (2016b). Enabling depth-driven visual attention on the iCub humanoid robot: Instructions for use and new perspectives. Frontiers in Robotics and AI, 3, 35.

  • Patten, T., Zillich, M., & Vincze, M. (2018). Action selection for interactive object segmentation in clutter. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 6297–6304.

  • Pinheiro, P. O., Collobert, R., & Dollar, P. (2015). Learning to segment object candidates. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems 28 (pp. 1990–1998). Curran Associates Inc.

  • Pinheiro, P. O., Lin, T.-Y., Collobert, R., & Dollár, P. (2016). Learning to refine object segments. In ECCV.

  • Pinto, L., Gandhi, D., Han, Y., Park, Y.-L., & Gupta, A. (2016). The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv:1604.01360 [cs].

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Redmon, J., & Farhadi, A. (2016). YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS).

  • Rudi, A., Carratino, L., & Rosasco, L. (2017). FALKON: An optimal large scale kernel method. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., (Eds.), Advances in neural information processing systems (Vol. 30, pp. 3888–3898). Curran Associates, Inc.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Saad, Y. (2003). Iterative methods for sparse linear systems (2nd ed.). Philadelphia, PA: Society for Industrial and Applied Mathematics.

  • Schwarz, M., Milan, A., Periyasamy, A. S., & Behnke, S. (2018). Rgb-d object detection and semantic segmentation for autonomous manipulation in clutter. The International Journal of Robotics Research, 37(4–5), 437–451.

  • Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1), 1–114.

  • Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) workshops.

  • Shelhamer, E., Long, J., & Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.

  • Shrivastava, A., Gupta, A., & Girshick, R. B. (2016). Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. IEEE Computer Society.

  • Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the seventeenth international conference on machine learning, ICML ’00 (pp. 911–918), San Francisco: Morgan Kaufmann Publishers Inc.

  • Sunderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., et al. (2018). The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4–5), 405–420.

  • Sung, K. K. (1996). Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA. AAI0800657.

  • Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30.

  • Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), Vol. 1, pp. 511–518.

  • Wang, K., Yan, X., Zhang, D., Zhang, L., & Lin, L. (2018). Towards human-machine cooperation: Self-supervised sample mining for object detection. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems 13 (pp. 682–688). MIT Press.

  • Yun, P., Tai, L., Wang, Y., Liu, C., & Liu, M. (2019). Focal loss in 3d object detection. IEEE Robotics and Automation Letters, 4(2), 1263–1270.

  • Zeng, A., Song, S., Yu, K., Donlon, E., Hogan, F. R., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., Fazeli, N., Alet, F., Dafle, N. C., Holladay, R., Morena, I., Nair, P. Q., Green, D., Taylor, I., Liu, W., Funkhouser, T., & Rodriguez, A. (2018). Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8.

  • Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges (pp. 391–405). Cham: Springer International Publishing.

Author information

Corresponding author

Correspondence to Elisa Maiettini.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research. L. R. acknowledges the financial support of the AFOSR projects FA9550-17-1-0390, BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS - DLV-777826 and Axpo Italia SpA.

Appendices

Complete Minibootstrap procedure

This section reports the complete pseudo-code (Alg. 2) for the Minibootstrap procedure described in Sect. 3.3.

[Algorithm 2: complete Minibootstrap pseudo-code]
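
As a complement to the pseudo-code, the following Python sketch illustrates the core idea of the Minibootstrap as described in Sect. 3.3: the pool of negative regions is split into random batches and the classifier is retrained while progressively accumulating hard negatives from each batch (the \(10 \times 2000\) configuration used in the experiments corresponds to 10 batches of 2000 regions). The functions train_fn and score_fn stand in for the FALKON-based learner; this sketch is purely illustrative and is not the code released with the paper.

```python
import numpy as np

def minibootstrap(pos_feats, neg_feats, train_fn, score_fn,
                  n_batches=10, batch_size=2000, hard_thresh=0.0):
    """Illustrative sketch of the Minibootstrap: approximate hard-negative
    mining over random batches of negative region features.
    `train_fn(pos, neg)` returns a binary classifier and `score_fn(clf, X)`
    returns its scores on X; both are placeholders for the FALKON learner."""
    rng = np.random.default_rng(0)
    # Split the (large) pool of negative region features into random batches.
    perm = rng.permutation(len(neg_feats))
    batches = [neg_feats[perm[i * batch_size:(i + 1) * batch_size]]
               for i in range(n_batches)]

    hard_negs = batches[0]                      # start from the first batch
    clf = train_fn(pos_feats, hard_negs)
    for batch in batches[1:]:
        # Select from the new batch the negatives the current model gets wrong.
        new_hard = batch[score_fn(clf, batch) > hard_thresh]
        hard_negs = np.concatenate([hard_negs, new_hard], axis=0)
        # (Optional) prune negatives that have become easy for the current model.
        keep = score_fn(clf, hard_negs) > hard_thresh
        if keep.any():
            hard_negs = hard_negs[keep]
        clf = train_fn(pos_feats, hard_negs)    # retrain on the updated set
    return clf
```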

The iCubWorld Transformations Dataset

To validate the proposed pipeline in a robotic scenario, we considered the iCubWorld Transformations dataset (iCWT). This dataset is part of a robotic project called iCubWorld (see footnote 8), whose main goal is to benchmark the development of the visual recognition capabilities of the iCub humanoid robot (Metta et al. 2010). The datasets of the iCubWorld project are collections of images recording the visual experience of iCub while it observes objects in its typical environment, a laboratory or an office. iCWT is the latest and largest dataset released within the project. We refer to Pasquale et al. (2019) for details about the acquisition setup.

1.1 Dataset description

iCWT contains images of 200 object instances belonging to 20 different categories (10 instances per category). Each object instance is acquired on two separate days, in a way that isolates, within each day, a different viewpoint transformation: planar 2D rotation (2D ROT), generic rotation (3D ROT), translation with changing background (BKG), scale (SCALE) and, finally, a sequence containing all transformations (MIX).

While the dataset was originally acquired as a benchmark for object recognition, we have recently also provided object detection annotations in an ImageNet-like format. Moreover, we manually annotated a subset of images that can be used to validate object detection methods trained with automatically collected data, as we did in Maiettini et al. (2017). Specifically, for this work we released a new and larger set of manually annotated images (see footnote 9).
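
The released detection annotations are bounding-box XML files in the ImageNet/Pascal-VOC style (the same layout produced by labelImg, used below for the manually annotated test images). As a minimal sketch of how such a file might be parsed, assuming the standard tag names, which may differ in detail in the actual release, one could use:

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Parse a Pascal VOC / labelImg-style XML annotation into a list of
    (label, xmin, ymin, xmax, ymax) tuples. Tag names assumed here are the
    standard labelImg ones; the released files may differ slightly."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((label,
                      int(float(bb.findtext("xmin"))),
                      int(float(bb.findtext("ymin"))),
                      int(float(bb.findtext("xmax"))),
                      int(float(bb.findtext("ymax")))))
    return boxes
```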

Fig. 6: Randomly sampled examples of detections on the PASCAL VOC 2007 test set, obtained with the proposed learning pipeline. The CNN backbone of Faster R-CNN used for feature extraction is ResNet-101, the training data is the VOC07++12 image set, and the configuration is FALKON + Minibootstrap \(10 \times 2000\) (training time of 1 min 40 s, 70.4% mAP).

For the experiments of this work, for the objects of the task at hand, we use as training set a subset of the union of the 2D ROT, 3D ROT, BKG and SCALE sequences, while as test set we use, for each object, a subset of 150 images from the first day of acquisition of the MIX sequence, manually annotated with the labelImg tool (see footnote 10). We adopted an annotation policy whereby an object must be annotated if at least 25–50% of its total shape is visible (i.e., not cut out of the image or occluded).

To define the FEATURE-TASK presented in Sect. 4.3, we consider all 10 instances of the categories ’cellphone’, ’mouse’, ’perfume’, ’remote’, ’soapdispenser’, ’sunglasses’, ’glass’, ’hairbrush’, ’ovenglove’ and ’squeezer’, while we define the TARGET-TASKs presented in Sect. 4.3 and in Sect. 6 over the remaining 10 categories, choosing the instances as follows (a compact configuration sketch is given after the list):

  • 1 Object Task: obtained by averaging the results over the single objects ’sodabottle2’, ’mug1’, ’sprayer6’ and ’hairclip2’

  • 10 Objects Task: ’sodabottle2’, ’mug1’, ’pencilcase5’, ’ringbinder4’, ’wallet6’, ’flower7’, ’book6’, ’bodylotion8’, ’hairclip2’, ’sprayer6’

  • 20 Objects Task: 10 Objects Task + ’sodabottle3’, ’mug3’, ’pencilcase3’, ’ringbinder5’, ’wallet7’, ’flower5’, ’book4’, ’bodylotion2’, ’hairclip8’, ’sprayer8’

  • 30 Objects Task: 20 Objects Task + ’sodabottle4’, ’mug4’ , ’pencilcase6’, ’ringbinder6’, ’wallet10’, ’flower2’, ’book9’, ’bodylotion5’, ’hairclip6’, ’sprayer9’

  • 40 Objects Task: 30 Objects Task + ’sodabottle5’, ’mug9’, ’pencilcase1’, ’ringbinder7’, ’wallet2’, ’flower9’, ’book1’, ’bodylotion4’, ’hairclip9’, ’sprayer2’
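
Since each larger task extends the previous one, the TARGET-TASK definitions can be written compactly as a small configuration, e.g. (instance identifiers exactly as listed above; purely illustrative):

```python
# Incremental TARGET-TASK definitions (instance identifiers as listed above).
TASK_10 = ["sodabottle2", "mug1", "pencilcase5", "ringbinder4", "wallet6",
           "flower7", "book6", "bodylotion8", "hairclip2", "sprayer6"]
TASK_20 = TASK_10 + ["sodabottle3", "mug3", "pencilcase3", "ringbinder5",
                     "wallet7", "flower5", "book4", "bodylotion2",
                     "hairclip8", "sprayer8"]
TASK_30 = TASK_20 + ["sodabottle4", "mug4", "pencilcase6", "ringbinder6",
                     "wallet10", "flower2", "book9", "bodylotion5",
                     "hairclip6", "sprayer9"]
TASK_40 = TASK_30 + ["sodabottle5", "mug9", "pencilcase1", "ringbinder7",
                     "wallet2", "flower9", "book1", "bodylotion4",
                     "hairclip9", "sprayer2"]

assert len(TASK_40) == 40   # sanity check on the incremental construction
```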

Examples of detected images

In Figs. 6 and 8 we report examples of detections predicted by FALKON + Minibootstrap on randomly sampled images from the test sets of Pascal VOC (Everingham et al. 2010) and iCWT (Pasquale et al. 2019), respectively.

Stopping Criterion for Faster R-CNN Fine-tuning

Fig. 7: Validation accuracy as a function of the number of training epochs for the Pascal VOC dataset (blue line). The red star marks the number of epochs chosen to train the Faster R-CNN baseline reported in Table 1.

Fig. 8: Randomly sampled examples of detections on iCWT, obtained with the proposed learning pipeline. The CNN backbone of Faster R-CNN used for feature extraction is ResNet-50, the FEATURE-TASK and TARGET-TASK are, respectively, the 100- and 30-object tasks described in Sect. 4.3, and the configuration is FALKON + Minibootstrap \(10 \times 2000\) (training time of 40 s, 71.2% mAP).

Fig. 9: Validation accuracy as a function of the number of training epochs for the iCWT dataset (blue line). The red star marks the number of epochs chosen to train the Faster R-CNN baselines reported in Table 2 and Fig. 3.

Fig. 10: Validation accuracy for different epoch configurations of the 4-step alternating training procedure (Ren et al. 2015) on the iCWT dataset (blue line). Each tick on the horizontal axis represents a different configuration: the numbers of epochs used for the RPN and for the Detection Network are reported after the labels RPN and DN, respectively. The red star marks the configuration chosen to train the Faster R-CNN baseline reported in Table 3.

In this section we report on the cross-validation carried out to study the convergence of Faster R-CNN and to choose when to stop training for the tasks considered in this work.

In Fig. 7 we report the validation accuracy trend on the Pascal VOC dataset when learning the last layers of Faster R-CNN for an increasing number of epochs. To this aim, we split the available Pascal VOC images by using the union of the Pascal VOC 2007 and 2012 validation sets as validation set and the union of the Pascal VOC 2007 and 2012 training sets as training set.

Similarly, in Fig. 9 we report the validation accuracy trend with respect to the number of epochs for the iCWT dataset. In this case, we used as training set the same \(\sim \)8k images used for the TARGET-TASK in Sect. 4, while we selected a disjoint set of 4.5k images as validation set, taken from the remaining images of the 2D ROT, 3D ROT, SCALE and TRANSL transformations.

Finally, in Fig. 10 we show the validation accuracy trend of the full training of Faster R-CNN (i.e., the optimization of the convolutional layers, RPN, feature extractor and output layers on the TARGET-TASK). Specifically, since in this case we used the 4-step alternating training procedure of Ren et al. (2015), we report the mAP trend for different numbers of epochs used to learn the RPN and the Detection Network. The two numbers reported for each tick of the horizontal axis therefore represent the number of epochs used (i) to learn the RPN during steps 1 and 3 of the procedure and (ii) to learn the Detection Network during steps 2 and 4. We consider the same training and validation split as in Fig. 9.

Note that we used these results as our stopping criterion, which consists in choosing the model at the epoch achieving the highest mAP on the validation set (we stopped when no further mAP gain was observed in the three plots). In the three plots we highlight in red the configurations chosen to train the baselines for the tasks at hand, reported in Tables 1, 2 and 3 and in Fig. 3.
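
In code, this criterion simply amounts to selecting the configuration (a number of epochs, or an (RPN, DN) epoch pair) with the highest validation mAP. A minimal sketch follows, assuming the per-configuration validation mAPs have already been computed as in Figs. 7, 9 and 10; the numbers in the example are hypothetical, not those reported in the paper.

```python
def select_best_configuration(val_map):
    """Return the configuration with the highest validation mAP.
    `val_map` maps a training configuration (e.g. a number of epochs or an
    (RPN epochs, DN epochs) pair) to its validation mAP."""
    best = max(val_map, key=val_map.get)
    return best, val_map[best]

# Hypothetical example (values are illustrative only):
val_map = {4: 68.9, 8: 70.1, 12: 70.4, 16: 70.3}
best_epochs, best_map = select_best_configuration(val_map)
print(best_epochs, best_map)   # -> 12 70.4
```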

Cite this article

Maiettini, E., Pasquale, G., Rosasco, L. et al. On-line object detection: a robotics challenge. Auton Robot 44, 739–757 (2020). https://doi.org/10.1007/s10514-019-09894-9
