Sound Analysis in Smart Cities



This chapter introduces the concept of smart cities and discusses the importance of sound as a source of information about urban life. It describes a wide range of applications for the computational analysis of urban sounds and focuses on two high-impact areas, audio surveillance and noise pollution monitoring, which sit at the intersection of dense sensor networks and machine listening. For sensor networks, we weigh the pros and cons of mobile versus static sensing strategies and describe a low-cost solution to acoustic sensing that supports distributed machine listening. For sound event detection and classification, we focus on the challenges presented by this task, on solutions including feature design and learning strategies, and on how a combination of convolutional networks and data augmentation results in the current state of the art. We close with a discussion of the potential and challenges of mobile sensing, the limitations imposed by the data currently available for research, and a few areas for future exploration.


Urban sound; Smart cities; Noise monitoring; Sensor network; Acoustic sensing; Internet of things (IoT); MEMS microphone; Audio surveillance; Sound event detection; Sound classification; Machine listening; Machine learning; Deep learning; Convolutional neural networks; Data augmentation



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Music and Audio Research Laboratory, New York University, New York, USA
  2. Center for Urban Science and Progress & Music and Audio Research Laboratory, New York University, New York, USA
