
Dataset and semantic based-approach for image sonification

Multimedia Tools and Applications

Abstract

This paper presents an image-audio dataset and a mid-level image sonification system designed to help visually impaired users understand the semantic content of an image and access visual information through a combination of semantic audio and easily decodable audio generated in real time, both triggered by slide, tap, and hold actions as users explore the image on a touch screen or with a pointer. Firstly, we segment the original image using a label fusion model; based on the user's position in the image, a sonified signal is generated from musical notes and meaningful visual features of the active region, namely its color and luminance, then its gradient and texture. Secondly, we integrate semantic understanding of the image into our model using DeepLab semantic segmentation and create a dataset of audio clips and images aligned with the 20 classes of the PASCAL VOC 2012 dataset. The images are organized by color, gradient, and texture for low-level sonification, and by semantic content with associated sounds for mid-level sonification. Thirdly, to provide both types of information in a complementary way, the slide, tap, and hold actions of a touch screen are incorporated into the model: semantic audio giving a brief description of the visual object plays on slide, the generated signal encoding the object's color plays on tap, and the signal encoding its gradient and texture plays on hold. Finally, we validated our sonification model on the provided dataset in a pilot study; subjects were generally able to identify the objects in the image and their colors, and even to give a general description of the scene. Our system could be useful to visually impaired persons in a smartphone photo-sharing application or for describing paintings in a digital museum.
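To make the interaction model concrete, the sketch below shows one plausible way to dispatch the three touch actions to the two audio layers: a pre-recorded semantic clip on slide, a tone synthesized from the active region's color and luminance on tap, and a gradient-modulated tone on hold. This is a minimal illustration under stated assumptions, not the authors' implementation; the hue-to-note mapping, the region descriptor, and all function names are hypothetical.

```python
import numpy as np

SR = 22050  # sample rate in Hz

# Equal-tempered C-major scale (semitone offsets from middle C) used as the
# note palette for the low-level sonification layer (an assumed choice).
C_MAJOR_SEMITONES = [0, 2, 4, 5, 7, 9, 11]

def hue_to_note_freq(hue, base=261.63):
    """Map a hue in [0, 1) to a note frequency on the C-major scale."""
    idx = int(hue * len(C_MAJOR_SEMITONES)) % len(C_MAJOR_SEMITONES)
    return base * 2.0 ** (C_MAJOR_SEMITONES[idx] / 12.0)

def synth_tone(freq, amplitude, duration=0.3):
    """Render a plain sine tone; amplitude here stands in for luminance."""
    t = np.arange(int(SR * duration)) / SR
    return amplitude * np.sin(2 * np.pi * freq * t)

def sonify(gesture, region):
    """Dispatch the three touch actions to the two audio layers.

    `region` describes the active segment under the finger, e.g.
    {"label": "dog", "hue": 0.10, "luminance": 0.7, "gradient": 0.4}.
    """
    if gesture == "slide":
        # Mid-level layer: play the pre-recorded clip for the object class.
        return f"sounds/{region['label']}.wav"
    if gesture == "tap":
        # Low-level layer: pitch encodes hue, amplitude encodes luminance.
        return synth_tone(hue_to_note_freq(region["hue"]), region["luminance"])
    if gesture == "hold":
        # Low-level layer: stronger gradients -> faster amplitude modulation
        # (an assumed encoding of gradient/texture).
        tone = synth_tone(440.0, 0.8, duration=0.6)
        rate = 2.0 + 10.0 * region["gradient"]
        mod = 0.5 * (1.0 + np.sin(2 * np.pi * rate * np.arange(tone.size) / SR))
        return tone * mod
    raise ValueError(f"unknown gesture: {gesture}")
```

For example, sonify("tap", {"label": "dog", "hue": 0.10, "luminance": 0.7, "gradient": 0.4}) would return a short sine tone whose pitch encodes the region's hue and whose amplitude encodes its luminance, while the same region sonified on slide would return the path of the "dog" clip.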

Notes

  1. https://freesound.org/

  2. https://github.com/ohinitoffa/ImgSonficiation2
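Footnote 2 points to the authors' released code and dataset. For orientation only, here is a hedged sketch of the semantic lookup step behind the slide action, with an off-the-shelf DeepLabv3 from torchvision standing in for the DeepLab model used in the paper; the model choice, pretrained weights, and helper name are assumptions, not the authors' code.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# The 20 PASCAL VOC 2012 object classes plus background, in the index order
# used by torchvision's pretrained segmentation heads.
VOC_CLASSES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def class_at(image_path, x, y):
    """Return the VOC class name of the pixel at touch position (x, y)."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))["out"][0]  # (21, H, W)
    label_map = logits.argmax(0)  # per-pixel class indices
    return VOC_CLASSES[int(label_map[y, x])]
```

A slide gesture at pixel (x, y) would then call class_at(path, x, y) and play the matching clip from the audio side of the dataset.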


Author information


Correspondence to O. K. Toffa.

Ethics declarations

Conflict of interest

  • The authors have no relevant financial or non-financial interests to disclose, no competing interests relevant to the content of this article, and no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

  • The authors obtained a certificate of ethics from the Université de Montréal to perform the pilot study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

O. K. Toffa and M. Mignotte contributed equally to this work.

About this article

Cite this article

Toffa, O.K., Mignotte, M. Dataset and semantic based-approach for image sonification. Multimed Tools Appl 82, 1505–1518 (2023). https://doi.org/10.1007/s11042-022-12914-z

