On-line object detection: a robotics challenge


Abstract

Object detection is a fundamental ability for robots interacting with their environment. While stunningly effective, state-of-the-art deep learning methods require huge amounts of labeled images and hours of training, which does not suit such scenarios. This work presents a novel pipeline, resulting from the integration of Maiettini et al. (in 2017 IEEE-RAS 17th international conference on humanoid robotics (Humanoids), 2017) and Maiettini et al. (in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2018), which enables a robot to be naturally trained to detect novel objects in a few seconds. Moreover, we report an extended empirical evaluation of the learning method, showing that the proposed hybrid architecture is key to leveraging powerful deep representations while maintaining the fast training time of large-scale kernel methods. We validate our approach on the Pascal VOC benchmark (Everingham et al. in Int J Comput Vis 88(2):303–338, 2010) and on a challenging robotic scenario (iCubWorld Transformations (Pasquale et al. in Rob Auton Syst 112:260–281, 2019)). We address real-world use cases and show how to tune the method for different speed/accuracy trade-offs. Lastly, we discuss limitations and directions for future development.


Notes

  1. http://amazonpickingchallenge.org/.

  2. https://github.com/LCSL/FALKON_paper.

  3. https://robotology.github.io/iCubWorld/.

  4. https://github.com/tzutalin/labelImg.

  5. https://youtu.be/eT-2v6-xoSs.

  6. https://www.csie.ntu.edu.tw/~cjlin/liblinear/.

  7. https://www.mathworks.com/.

  8. https://robotology.github.io/iCubWorld/.

  9. https://robotology.github.io/iCubWorld/#icubworld-transformations-modal/.

  10. https://github.com/tzutalin/labelImg.

References

  • Bajcsy, R., Aloimonos, Y., & Tsotsos, J. K. (2018). Revisiting active perception. Autonomous Robots, 42(2), 177–196.

  • Browatzki, B., Tikhanoff, V., Metta, G., Bülthoff, H. H., & Wallraven, C. (2012). Active object recognition on a humanoid robot. In 2012 IEEE international conference on robotics and automation, pp. 2021–2028.

  • Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 379–387). Curran Associates Inc.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In Jebara, T. and Xing, E. P., (Eds.), Proceedings of the 31st international conference on machine Learning (ICML-14), pp. 647–655. JMLR workshop and conference proceedings.

  • Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.

  • Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010a). Cascade object detection with deformable part models. In 2010 IEEE Computer society conference on computer vision and pattern recognition, pp. 2241–2248. IEEE.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010b). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Georgakis, G., Mousavian, A., Berg, A. C., & Kosecka, J. (2017). Synthesizing training data for object detection in indoor scenes. CoRR, arXiv:1702.07836.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the international conference on computer vision (ICCV).

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. In: 2017 IEEE international conference on computer vision (ICCV), pp. 2980–2988.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM international conference on multimedia—MM ’14, pp. 675–678. ACM Press.

  • Kaiser, L., Nachum, O., Roy, A., & Bengio, S. (2017). Learning to remember rare events. CoRR, arXiv:1703.03129.

  • Lin, T., Goyal, P., Girshick, R. B., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In IEEE international conference on computer vision, ICCV 2017, Venice, Italy, pp. 2999–3007.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV), Zürich.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., & Reed, S. E. (2015). SSD: Single shot multibox detector. CoRR, arXiv:1512.02325.

  • Maiettini, E., Pasquale, G., Rosasco, L., & Natale, L. (2017). Interactive data collection for deep learning object detectors on humanoid robots. In 2017 IEEE-RAS 17th international conference on humanoid robotics (Humanoids), pp. 862–868.

  • Maiettini, E., Pasquale, G., Rosasco, L., & Natale, L. (2018). Speeding-up object detection training for robotics with FALKON. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Metta, G., Fitzpatrick, P., & Natale, L. (2006). YARP: Yet another robot platform. International Journal of Advanced Robotic Systems, 3(1).

  • Metta, G., Natale, L., Nori, F., Sandini, G., Vernon, D., Fadiga, L., et al. (2010). The iCub humanoid robot: An open-systems platform for research in cognitive development. Neural Networks, 23(8–9), 1125–1134.

  • Parmiggiani, A., Fiorio, L., Scalzo, A., Sureshbabu, A. V., Randazzo, M., Maggiali, M., Pattacini, U., Lehmann, H., Tikhanoff, V., Domenichelli, D., Cardellino, A., Congiu, P., Pagnin, A., Cingolani, R., Natale, L., & Metta, G. (2017). The design and validation of the r1 personal humanoid. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 674–680.

  • Pasquale, G., Ciliberto, C., Odone, F., Rosasco, L., & Natale, L. (2019). Are we done with object recognition? The iCub robot’s perspective. Robotics and Autonomous Systems, 112, 260–281.

  • Pasquale, G., Ciliberto, C., Rosasco, L., & Natale, L. (2016a). Object identification from few examples by improving the invariance of a deep convolutional neural network. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 4904–4911.

  • Pasquale, G., Mar, T., Ciliberto, C., Rosasco, L., & Natale, L. (2016b). Enabling depth-driven visual attention on the iCub humanoid robot: Instructions for use and new perspectives. Frontiers in Robotics and AI, 3, 35.

  • Patten, T., Zillich, M., & Vincze, M. (2018). Action selection for interactive object segmentation in clutter. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 6297–6304.

  • Pinheiro, P. O., Collobert, R., & Dollar, P. (2015). Learning to segment object candidates. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems 28 (pp. 1990–1998). Curran Associates Inc.

  • Pinheiro, P. O., Lin, T.-Y., Collobert, R., & Dollár, P. (2016). Learning to refine object segments. In ECCV.

  • Pinto, L., Gandhi, D., Han, Y., Park, Y.-L., & Gupta, A. (2016). The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv:1604.01360 [cs].

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Redmon, J., & Farhadi, A. (2016). YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS).

  • Rudi, A., Carratino, L., & Rosasco, L. (2017). FALKON: An optimal large scale kernel method. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., (Eds.), Advances in neural information processing systems (Vol. 30, pp. 3888–3898). Curran Associates, Inc.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Saad, Y. (2003). Iterative methods for sparse linear systems (2nd ed.). Philadelphia, PA: Society for Industrial and Applied Mathematics.

  • Schwarz, M., Milan, A., Periyasamy, A. S., & Behnke, S. (2018). Rgb-d object detection and semantic segmentation for autonomous manipulation in clutter. The International Journal of Robotics Research, 37(4–5), 437–451.

  • Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1), 1–114.

  • Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) workshops.

  • Shelhamer, E., Long, J., & Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.

  • Shrivastava, A., Gupta, A., & Girshick, R. B. (2016). Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. IEEE Computer Society.

  • Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the seventeenth international conference on machine learning, ICML ’00 (pp. 911–918), San Francisco: Morgan Kaufmann Publishers Inc.

  • Sunderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., et al. (2018). The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4–5), 405–420.

  • Sung, K. K. (1996). Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA. AAI0800657.

  • Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30.

  • Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), Vol. 1, pp. 511–518.

  • Wang, K., Yan, X., Zhang, D., Zhang, L., & Lin, L. (2018). Towards human-machine cooperation: Self-supervised sample mining for object detection. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems 13 (pp. 682–688). MIT Press.

  • Yun, P., Tai, L., Wang, Y., Liu, C., & Liu, M. (2019). Focal loss in 3d object detection. IEEE Robotics and Automation Letters, 4(2), 1263–1270.

  • Zeng, A., Song, S., Yu, K., Donlon, E., Hogan, F. R., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., Fazeli, N., Alet, F., Dafle, N. C., Holladay, R., Morena, I., Nair, P. Q., Green, D., Taylor, I., Liu, W., Funkhouser, T., & Rodriguez, A. (2018). Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8.

  • Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges (pp. 391–405). Cham: Springer International Publishing.

Author information

Corresponding author

Correspondence to Elisa Maiettini.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research. L. R. acknowledges the financial support of the AFOSR projects FA9550-17-1-0390, BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS - DLV-777826 and Axpo Italia SpA.

Appendices

Complete Minibootstrap procedure

This section reports the complete pseudo-code (Alg. 2) for the Minibootstrap procedure described in Sect. 3.3.

[Algorithm 2: complete Minibootstrap pseudo-code]
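
As a complement to the pseudo-code, the following Python sketch illustrates the core idea of the Minibootstrap as described in Sect. 3.3: the pool of negative regions is split into random batches and the classifier is retrained while progressively accumulating hard negatives from each batch (the \(10 \times 2000\) configuration used in the experiments corresponds to 10 batches of 2000 regions). The functions train_fn and score_fn stand in for the FALKON-based learner; this sketch is purely illustrative and is not the code released with the paper.

```python
import numpy as np

def minibootstrap(pos_feats, neg_feats, train_fn, score_fn,
                  n_batches=10, batch_size=2000, hard_thresh=0.0):
    """Illustrative sketch of the Minibootstrap: approximate hard-negative
    mining over random batches of negative region features.
    `train_fn(pos, neg)` returns a binary classifier and `score_fn(clf, X)`
    returns its scores on X; both are placeholders for the FALKON learner."""
    rng = np.random.default_rng(0)
    # Split the (large) pool of negative region features into random batches.
    perm = rng.permutation(len(neg_feats))
    batches = [neg_feats[perm[i * batch_size:(i + 1) * batch_size]]
               for i in range(n_batches)]

    hard_negs = batches[0]                      # start from the first batch
    clf = train_fn(pos_feats, hard_negs)
    for batch in batches[1:]:
        # Select from the new batch the negatives the current model gets wrong.
        new_hard = batch[score_fn(clf, batch) > hard_thresh]
        hard_negs = np.concatenate([hard_negs, new_hard], axis=0)
        # (Optional) prune negatives that have become easy for the current model.
        keep = score_fn(clf, hard_negs) > hard_thresh
        if keep.any():
            hard_negs = hard_negs[keep]
        clf = train_fn(pos_feats, hard_negs)    # retrain on the updated set
    return clf
```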

The iCubWorld Transformations Dataset

To validate the proposed pipeline in a robotic scenario, we considered the iCubWorld Transformations dataset (iCWT). This dataset is part of a robotic project called iCubWorld (see footnote 8), whose main goal is to benchmark the development of the visual recognition capabilities of the iCub humanoid robot (Metta et al. 2010). The datasets of the iCubWorld project are collections of images recording the visual experience of iCub while it observes objects in its typical environment, a laboratory or an office. iCWT is the latest and largest dataset released within the project. We refer to Pasquale et al. (2019) for details about the acquisition setup.

1.1 Dataset description

iCWT contains images of 200 object instances belonging to 20 different categories (10 instances per category). Each object instance is acquired on two separate days, in a way that isolates, within each day, a different viewpoint transformation: planar 2D rotation (2D ROT), generic rotation (3D ROT), translation with changing background (BKG), scale (SCALE) and, finally, a sequence containing all transformations (MIX).

While the dataset was originally acquired as a benchmark for object recognition, we have recently also provided object detection annotations in an ImageNet-like format. Moreover, we manually annotated a subset of images that can be used to validate object detection methods trained with automatically collected data, as we did in Maiettini et al. (2017). Specifically, for this work we released a new and larger set of manually annotated images (see footnote 9).
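
The released detection annotations are bounding-box XML files in the ImageNet/Pascal-VOC style (the same layout produced by labelImg, used below for the manually annotated test images). As a minimal sketch of how such a file might be parsed, assuming the standard tag names, which may differ in detail in the actual release, one could use:

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Parse a Pascal VOC / labelImg-style XML annotation into a list of
    (label, xmin, ymin, xmax, ymax) tuples. Tag names assumed here are the
    standard labelImg ones; the released files may differ slightly."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((label,
                      int(float(bb.findtext("xmin"))),
                      int(float(bb.findtext("ymin"))),
                      int(float(bb.findtext("xmax"))),
                      int(float(bb.findtext("ymax")))))
    return boxes
```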

Fig. 6: Randomly sampled examples of detections on the PASCAL VOC 2007 test set, obtained with the proposed learning pipeline. The CNN backbone of Faster R-CNN used for feature extraction is ResNet-101, the training data is the VOC07++12 image set, and the configuration is FALKON + Minibootstrap \(10 \times 2000\) (training time of 1 min 40 s, 70.4% mAP).

For the experiments of this work, for the objects of the task at hand, we use as training set a subset of the union of the 2D ROT, 3D ROT, BKG and SCALE sequences, while as test set we use, for each object, a subset of 150 images from the first day of acquisition of the MIX sequence, manually annotated with the labelImg tool (see footnote 10). We adopted an annotation policy whereby an object must be annotated if at least 25–50% of its total shape is visible (i.e., not cut out of the image or occluded).

To define the FEATURE-TASK presented in Sect. 4.3, we consider all 10 instances of the categories ’cellphone’, ’mouse’, ’perfume’, ’remote’, ’soapdispenser’, ’sunglasses’, ’glass’, ’hairbrush’, ’ovenglove’ and ’squeezer’, while we define the TARGET-TASKs presented in Sect. 4.3 and in Sect. 6 over the remaining 10 categories, choosing the instances as follows (a compact configuration sketch is given after the list):

  • 1 Object Task: obtained by averaging the results over the single objects ’sodabottle2’, ’mug1’, ’sprayer6’ and ’hairclip2’

  • 10 Objects Task: ’sodabottle2’, ’mug1’, ’pencilcase5’, ’ringbinder4’, ’wallet6’, ’flower7’, ’book6’, ’bodylotion8’, ’hairclip2’, ’sprayer6’

  • 20 Objects Task: 10 Objects Task + ’sodabottle3’, ’mug3’, ’pencilcase3’, ’ringbinder5’, ’wallet7’, ’flower5’, ’book4’, ’bodylotion2’, ’hairclip8’, ’sprayer8’

  • 30 Objects Task: 20 Objects Task + ’sodabottle4’, ’mug4’ , ’pencilcase6’, ’ringbinder6’, ’wallet10’, ’flower2’, ’book9’, ’bodylotion5’, ’hairclip6’, ’sprayer9’

  • 40 Objects Task: 30 Objects Task + ’sodabottle5’, ’mug9’, ’pencilcase1’, ’ringbinder7’, ’wallet2’, ’flower9’, ’book1’, ’bodylotion4’, ’hairclip9’, ’sprayer2’
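
Since each larger task extends the previous one, the TARGET-TASK definitions can be written compactly as a small configuration, e.g. (instance identifiers exactly as listed above; purely illustrative):

```python
# Incremental TARGET-TASK definitions (instance identifiers as listed above).
TASK_10 = ["sodabottle2", "mug1", "pencilcase5", "ringbinder4", "wallet6",
           "flower7", "book6", "bodylotion8", "hairclip2", "sprayer6"]
TASK_20 = TASK_10 + ["sodabottle3", "mug3", "pencilcase3", "ringbinder5",
                     "wallet7", "flower5", "book4", "bodylotion2",
                     "hairclip8", "sprayer8"]
TASK_30 = TASK_20 + ["sodabottle4", "mug4", "pencilcase6", "ringbinder6",
                     "wallet10", "flower2", "book9", "bodylotion5",
                     "hairclip6", "sprayer9"]
TASK_40 = TASK_30 + ["sodabottle5", "mug9", "pencilcase1", "ringbinder7",
                     "wallet2", "flower9", "book1", "bodylotion4",
                     "hairclip9", "sprayer2"]

assert len(TASK_40) == 40   # sanity check on the incremental construction
```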

Examples of detected images

In Figs. 6 and 8 we report examples of detections predicted by FALKON + Minibootstrap on randomly sampled images from the test sets of Pascal VOC (Everingham et al. 2010) and iCWT (Pasquale et al. 2019), respectively.

Stopping Criterion for Faster R-CNN Fine-tuning

Fig. 7: Validation accuracy as a function of the number of training epochs for the Pascal VOC dataset (blue line). The red star marks the number of epochs chosen to train the Faster R-CNN baseline reported in Table 1.

Fig. 8: Randomly sampled examples of detections on iCWT, obtained with the proposed learning pipeline. The CNN backbone of Faster R-CNN used for feature extraction is ResNet-50, the FEATURE-TASK and TARGET-TASK are, respectively, the 100- and 30-object tasks described in Sect. 4.3, and the configuration is FALKON + Minibootstrap \(10 \times 2000\) (training time of 40 s, 71.2% mAP).

Fig. 9: Validation accuracy as a function of the number of training epochs for the iCWT dataset (blue line). The red star marks the number of epochs chosen to train the Faster R-CNN baselines reported in Table 2 and Fig. 3.

Fig. 10: Validation accuracy for different epoch configurations of the 4-step alternating training procedure (Ren et al. 2015) on the iCWT dataset (blue line). Each tick on the horizontal axis represents a different configuration: the numbers of epochs used for the RPN and for the Detection Network are reported after the labels RPN and DN, respectively. The red star marks the configuration chosen to train the Faster R-CNN baseline reported in Table 3.

In this section we report on the cross-validation carried out to study the convergence of Faster R-CNN and to choose when to stop training for the tasks considered in this work.

In Fig. 7 we report the validation accuracy trend on the Pascal VOC dataset when learning the last layers of Faster R-CNN for an increasing number of epochs. To this aim, we split the available Pascal VOC images by using the union of the Pascal VOC 2007 and 2012 validation sets as validation set and the union of the Pascal VOC 2007 and 2012 training sets as training set.

Similarly, in Fig. 9 we report the validation accuracy trend with respect to the number of epochs for the iCWT dataset. In this case, we used as training set the same \(\sim \)8k images used for the TARGET-TASK in Sect. 4, while we selected a disjoint set of 4.5k images as validation set, taken from the remaining images of the 2D ROT, 3D ROT, SCALE and TRANSL transformations.

Finally, in Fig. 10 we show the validation accuracy trend of the full training of Faster R-CNN (i.e., the optimization of the convolutional layers, RPN, feature extractor and output layers on the TARGET-TASK). Specifically, since in this case we used the 4-step alternating training procedure of Ren et al. (2015), we report the mAP trend for different numbers of epochs used to learn the RPN and the Detection Network. The two numbers reported for each tick of the horizontal axis therefore represent the number of epochs used (i) to learn the RPN during steps 1 and 3 of the procedure and (ii) to learn the Detection Network during steps 2 and 4. We consider the same training and validation split as in Fig. 9.

Note that we used these results as our stopping criterion, which consists in choosing the model at the epoch achieving the highest mAP on the validation set (we stopped when no further mAP gain was observed in the three plots). In the three plots we highlight in red the configurations chosen to train the baselines for the tasks at hand, reported in Tables 1, 2 and 3 and in Fig. 3.
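
In code, this criterion simply amounts to selecting the configuration (a number of epochs, or an (RPN, DN) epoch pair) with the highest validation mAP. A minimal sketch follows, assuming the per-configuration validation mAPs have already been computed as in Figs. 7, 9 and 10; the numbers in the example are hypothetical, not those reported in the paper.

```python
def select_best_configuration(val_map):
    """Return the configuration with the highest validation mAP.
    `val_map` maps a training configuration (e.g. a number of epochs or an
    (RPN epochs, DN epochs) pair) to its validation mAP."""
    best = max(val_map, key=val_map.get)
    return best, val_map[best]

# Hypothetical example (values are illustrative only):
val_map = {4: 68.9, 8: 70.1, 12: 70.4, 16: 70.3}
best_epochs, best_map = select_best_configuration(val_map)
print(best_epochs, best_map)   # -> 12 70.4
```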

Cite this article

Maiettini, E., Pasquale, G., Rosasco, L. et al. On-line object detection: a robotics challenge. Auton Robot 44, 739–757 (2020). https://doi.org/10.1007/s10514-019-09894-9
