Future Perspective

  • Dan Ellis
  • Tuomas Virtanen
  • Mark D. Plumbley
  • Bhiksha Raj
Abstract

This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short to medium term.

Keywords

Audio content analysis · Sound catalogues · Sound vocabularies · Audio database collection · Audio annotation · Active learning · Weak labels · Applications of sound analysis

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Dan Ellis (1)
  • Tuomas Virtanen (2)
  • Mark D. Plumbley (3)
  • Bhiksha Raj (4)
  1. Google Inc., 111 8th Ave, New York, USA
  2. Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
  3. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
  4. Carnegie Mellon University, Pittsburgh, USA
