InstaIndoor and multi-modal deep learning for indoor scene recognition

Glavan, Andreea; Talavera, Estefanía

doi:10.1007/s00521-021-06781-2

Original Article
Published: 22 January 2022

Volume 34, pages 6861–6877, (2022)

Neural Computing and Applications

Abstract

Indoor scene recognition is a growing field with great potential for behaviour understanding, robot localization, and elderly monitoring, among other applications. In this study, we approach scene recognition from a novel standpoint, using multi-modal learning and video data gathered from social media. The accessibility and variety of social media videos can provide realistic data for modern scene recognition techniques and applications. We propose a model that fuses text transcribed from speech with visual features, and we use it for classification on InstaIndoor, a novel dataset of social media videos of indoor scenes. Our model achieves up to 70% accuracy and an F1-score of 0.7. Furthermore, we highlight the potential of our approach by benchmarking it on a YouTube-8M subset of indoor scenes, where it achieves 74% accuracy and an F1-score of 0.74. We hope the contributions of this work pave the way for novel research in the challenging field of indoor scene recognition.
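
As a rough illustration of the two ideas the abstract names (fusing text and visual features, and reporting accuracy and F1), the sketch below may help. It is not the authors' pipeline: fusion by concatenating L2-normalised vectors, macro-averaged F1, the function names, the scene labels, and the feature values are all assumptions made for this example.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_features: Sequence[float], visual_features: Sequence[float]) -> list[float]:
    """Hypothetical fusion step: concatenate the normalised text and visual vectors."""
    return l2_normalize(text_features) + l2_normalize(visual_features)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-dimensional features; real embeddings would be far larger.
    fused = fuse([3.0, 4.0], [0.0, 2.0])
    print(len(fused), fused)

    # Made-up labels, only to exercise the metrics.
    truth = ["kitchen", "bedroom", "kitchen", "gym"]
    preds = ["kitchen", "kitchen", "kitchen", "gym"]
    print(accuracy(truth, preds), macro_f1(truth, preds))
```

In a setup like this, the fused vector would be passed to a classifier. Normalising each modality before concatenation is one common design choice, and the paper's full text should be consulted for the fusion strategy actually used.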

Notes

Both datasets and corresponding pipeline code are available at https://github.com/andreea-glavan/multimodal-audiovisual-scene-recognition.

Acknowledgements

We would like to thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high performance computing cluster.

Author information

Authors and Affiliations

Department of Computer Science, University of Groningen, Nijenborgh 9, 9747 AG Groningen, Netherlands
Andreea Glavan & Estefanía Talavera

Corresponding author

Correspondence to Andreea Glavan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Glavan, A., Talavera, E. InstaIndoor and multi-modal deep learning for indoor scene recognition. Neural Comput & Applic 34, 6861–6877 (2022). https://doi.org/10.1007/s00521-021-06781-2

Received: 18 June 2021
Accepted: 22 November 2021
Published: 22 January 2022
Issue Date: May 2022
DOI: https://doi.org/10.1007/s00521-021-06781-2
