Abstract
Human-robot object handover is a key skill for the future of human-robot collaboration. The CORSMAL 2020 Challenge focuses on the perception side of this problem: the robot needs to estimate the filling mass of a container held by a human. Although powerful methods exist for image processing and audio processing individually, solving this problem requires fusing data from multiple sensors. The appearance of the container, the sound of the filling, and the depth data all provide essential information. We propose a multi-modal method that predicts three key indicators of the filling mass: filling type, filling level, and container capacity. These indicators are then combined to estimate the filling mass of the container. Our method obtained the Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge on both the public and private subsets, while showing no evidence of overfitting. Our source code is publicly available: github.com/v-iashin/CORSMAL.
V. Iashin, F. Palermo, G. Solak, and C. Coppola contributed equally.
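The final combination step described in the abstract can be illustrated with a short sketch: given a predicted container capacity, filling level, and filling type, the filling mass follows as capacity × level × density of the filling. This is a minimal illustration under stated assumptions, not the authors' exact pipeline; the filling class names and density values below are placeholder assumptions made for the example.

```python
# Minimal sketch of combining the three predicted indicators into a
# filling mass estimate:
#   mass = container capacity * filling level * density of filling type.
# The class labels and density values are illustrative assumptions only.

FILLING_DENSITY_G_PER_ML = {  # assumed densities for illustration
    "water": 1.00,
    "rice": 0.85,
    "pasta": 0.41,
    "empty": 0.00,
}


def estimate_filling_mass(capacity_ml: float,
                          filling_level_pct: float,
                          filling_type: str) -> float:
    """Combine capacity (ml), level (%), and type into a mass estimate in grams."""
    density = FILLING_DENSITY_G_PER_ML[filling_type]
    return capacity_ml * (filling_level_pct / 100.0) * density


if __name__ == "__main__":
    # e.g. a 500 ml container predicted to be 90% full of water -> 450 g
    print(estimate_filling_mass(500.0, 90.0, "water"))
```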
Notes
We also trained the GRU model on R(2+1)d features, yet it reached only F1 = 67.3.