Automatic semantic maps generation from lexical annotations


The generation of semantic environment representations is still an open problem in robotics. Most current proposals rely on metric representations and incorporate semantic information in a supervised fashion. The purpose of the robot is key in the generation of these representations, which has traditionally limited the inter-usability of maps created for different applications. We propose using the information provided by lexical annotations to generate general-purpose semantic maps from RGB-D images. We exploit the availability of deep learning models capable of describing any input image by means of lexical labels. Lexical annotations are more appropriate for computing the semantic similarity between images than state-of-the-art visual descriptors. From these annotations, we apply a bottom-up clustering approach that associates each image with a category. The use of RGB-D images also allows the robot pose associated with each acquisition to be obtained, thus complementing the semantic information with metric information.
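The lexical-similarity and bottom-up clustering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each image is assumed to carry a sparse label-to-score annotation (as a deep classifier such as those cited would produce), similarity between images is taken as the cosine over those annotations, and clusters are merged greedily by average linkage until no pair exceeds a similarity threshold. All function names, the threshold value, and the example annotations are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse label->score annotations (dicts)."""
    common = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bottom_up_clusters(annotations, threshold=0.5):
    """Greedy agglomerative clustering over lexical annotations.

    Repeatedly merges the most similar pair of clusters (average linkage
    over pairwise annotation similarities) until no pair exceeds the
    threshold. Returns clusters as lists of image indices.
    """
    clusters = [[i] for i in range(len(annotations))]

    def link(c1, c2):
        sims = [cosine_similarity(annotations[i], annotations[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        sim, i, j = max(
            (link(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters) if i < j
        )
        if sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy example: two kitchen-like images and one corridor-like image.
imgs = [
    {"stove": 0.7, "sink": 0.3},
    {"stove": 0.6, "fridge": 0.4},
    {"corridor": 0.9, "door": 0.1},
]
clusters = bottom_up_clusters(imgs, threshold=0.5)
# The two kitchen-like annotations share the "stove" label and merge;
# the corridor image has no labels in common and stays in its own cluster.
```

In a full pipeline, each cluster would then be assigned a category and anchored to the robot poses recovered from the RGB-D registration, yielding the combined semantic and metric map.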









Acknowledgments

This work was partially sponsored by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65686-C5-3-R, and by the Regional Council of Education, Culture and Sports of Castilla-La Mancha under grant PPII-2014-015-P. It was also supported by Spanish Government grant DPI2016-76515-R, with FEDER funds. Cristina Romero-González is funded by MECD grant FPU12/04387. José Carlos Rangel is funded by IFARHU grant 8-2014-166 of the Republic of Panamá.

Author information



Corresponding author

Correspondence to José Carlos Rangel.


About this article


Cite this article

Rangel, J.C., Cazorla, M., García-Varea, I. et al. Automatic semantic maps generation from lexical annotations. Auton Robot 43, 697–712 (2019).



Keywords

  • Semantic map
  • Lexical annotations
  • 3D registration
  • RGB-D data
  • Deep learning