
Dataset and semantic based-approach for image sonification

Multimedia Tools and Applications

Abstract

This paper presents an image-audio dataset and a mid-level image sonification system designed to help visually impaired users understand the semantic content of an image and access visual information through a combination of semantic audio and easily decodable audio generated in real time, both triggered by slide, tap, and hold actions as users explore the image on a touch screen or with a pointer. Firstly, we segment the original image using a label fusion model; based on the user's position in the image, a sonified signal is generated from musical notes and meaningful visual features of the active region, namely its color and luminance, then its gradient and texture. Secondly, we integrate semantic understanding of the image into our model using DeepLab semantic segmentation and create a dataset of audio clips and images aligned with the 20 classes of the PASCAL VOC 2012 dataset. The images are organized by color, gradient, and texture for low-level sonification, and by semantic content with associated sounds for mid-level sonification. Thirdly, to provide both types of information in a complementary way, the slide, tap, and hold actions of a touch screen are incorporated into the model: semantic audio giving a brief description of the visual object plays on slide, the generated signal encoding the object's color plays on tap, and the signal encoding its gradient and texture plays on hold. Finally, we validated our sonification model on the provided dataset in a pilot study; subjects were generally able to identify the objects in the image and their colors, and even to give a general description of the scene. Our system could be useful to visually impaired persons in a smartphone photo-sharing application or for describing paintings in a digital museum.
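To make the interaction model concrete, the sketch below shows one plausible way to dispatch the three touch actions to the two audio layers: a pre-recorded semantic clip on slide, a tone synthesized from the active region's color and luminance on tap, and a gradient-modulated tone on hold. This is a minimal illustration under stated assumptions, not the authors' implementation; the hue-to-note mapping, the region descriptor, and all function names are hypothetical.

```python
import numpy as np

SR = 22050  # sample rate in Hz

# Equal-tempered C-major scale (semitone offsets from middle C) used as the
# note palette for the low-level sonification layer (an assumed choice).
C_MAJOR_SEMITONES = [0, 2, 4, 5, 7, 9, 11]

def hue_to_note_freq(hue, base=261.63):
    """Map a hue in [0, 1) to a note frequency on the C-major scale."""
    idx = int(hue * len(C_MAJOR_SEMITONES)) % len(C_MAJOR_SEMITONES)
    return base * 2.0 ** (C_MAJOR_SEMITONES[idx] / 12.0)

def synth_tone(freq, amplitude, duration=0.3):
    """Render a plain sine tone; amplitude here stands in for luminance."""
    t = np.arange(int(SR * duration)) / SR
    return amplitude * np.sin(2 * np.pi * freq * t)

def sonify(gesture, region):
    """Dispatch the three touch actions to the two audio layers.

    `region` describes the active segment under the finger, e.g.
    {"label": "dog", "hue": 0.10, "luminance": 0.7, "gradient": 0.4}.
    """
    if gesture == "slide":
        # Mid-level layer: play the pre-recorded clip for the object class.
        return f"sounds/{region['label']}.wav"
    if gesture == "tap":
        # Low-level layer: pitch encodes hue, amplitude encodes luminance.
        return synth_tone(hue_to_note_freq(region["hue"]), region["luminance"])
    if gesture == "hold":
        # Low-level layer: stronger gradients -> faster amplitude modulation
        # (an assumed encoding of gradient/texture).
        tone = synth_tone(440.0, 0.8, duration=0.6)
        rate = 2.0 + 10.0 * region["gradient"]
        mod = 0.5 * (1.0 + np.sin(2 * np.pi * rate * np.arange(tone.size) / SR))
        return tone * mod
    raise ValueError(f"unknown gesture: {gesture}")
```

For example, sonify("tap", {"label": "dog", "hue": 0.10, "luminance": 0.7, "gradient": 0.4}) would return a short sine tone whose pitch encodes the region's hue and whose amplitude encodes its luminance, while the same region sonified on slide would return the path of the "dog" clip.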

Notes

  1. https://freesound.org/

  2. https://github.com/ohinitoffa/ImgSonficiation2
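Footnote 2 points to the authors' released code and dataset. For orientation only, here is a hedged sketch of the semantic lookup step behind the slide action, with an off-the-shelf DeepLabv3 from torchvision standing in for the DeepLab model used in the paper; the model choice, pretrained weights, and helper name are assumptions, not the authors' code.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# The 20 PASCAL VOC 2012 object classes plus background, in the index order
# used by torchvision's pretrained segmentation heads.
VOC_CLASSES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def class_at(image_path, x, y):
    """Return the VOC class name of the pixel at touch position (x, y)."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))["out"][0]  # (21, H, W)
    label_map = logits.argmax(0)  # per-pixel class indices
    return VOC_CLASSES[int(label_map[y, x])]
```

A slide gesture at pixel (x, y) would then call class_at(path, x, y) and play the matching clip from the audio side of the dataset.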


Author information


Correspondence to O. K. Toffa.

Ethics declarations

Conflict of interest

  • The authors have no relevant financial or non-financial interests to disclose, no competing interests relevant to the content of this article, and no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

  • The authors obtained a certificate of ethics from the Université de Montréal to perform the pilot study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

O. K. Toffa and M. Mignotte contributed equally to this work.

About this article

Cite this article

Toffa, O.K., Mignotte, M. Dataset and semantic based-approach for image sonification. Multimed Tools Appl 82, 1505–1518 (2023). https://doi.org/10.1007/s11042-022-12914-z

