Existing acoustic scene classification (ASC) systems often fail to generalize across different recording devices. In this work, we present an unsupervised domain adaptation method for ASC based on data standardization and feature projection. First, log-amplitude spectro-temporal features are standardized in a band-wise fashion over samples and time. Then, both source- and target-domain samples are projected onto the span of the principal eigenvectors of the covariance matrix of source-domain training data. The proposed method, being devised as a preprocessing procedure, is independent of the choice of the classification algorithm and can be readily applied to any ASC model at a minimal cost. Using the TUT Urban Acoustic Scenes 2018 Mobile Development dataset, we show that the proposed method can provide an absolute increment of over 10% compared to state-of-the-art unsupervised adaptation methods. Furthermore, the proposed method consistently outperforms a recent ASC model that ranked first in Task 1-A of the 2021 DCASE Challenge when evaluated on various unseen devices from the TAU Urban Acoustic Scenes 2020 Mobile Development dataset. In addition, our method appears robust even when provided with a small amount of target-domain data, proving effective using as few as 90 seconds of test audio recordings. Finally, we show that the proposed adaptation method can also be employed as a feature extraction stage for shallower neural networks, thus significantly reducing model complexity.
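The two preprocessing steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the features are stored as an array of shape (samples, frequency bands, time frames), that each domain is standardized with its own band-wise statistics, and that each sample is flattened into a vector before the covariance matrix is computed. The function names and the `n_components` parameter are hypothetical; the abstract does not specify how many eigenvectors are retained or whether the data are re-centered before projection.

```python
import numpy as np


def bandwise_standardize(x, eps=1e-8):
    """Standardize log-amplitude features per frequency band.

    x has shape (n_samples, n_bands, n_frames). The mean and standard
    deviation of each band are computed over samples and time.
    """
    mean = x.mean(axis=(0, 2), keepdims=True)
    std = x.std(axis=(0, 2), keepdims=True)
    return (x - mean) / (std + eps)


def fit_principal_subspace(x_src, n_components):
    """Return the leading eigenvectors of the source-domain covariance.

    Assumption: each sample is flattened to a vector of length
    n_bands * n_frames before the covariance is estimated.
    """
    flat = x_src.reshape(len(x_src), -1)
    cov = np.cov(flat, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]  # shape (n_features, n_components)


def project(x, basis):
    """Orthogonally project samples onto the span of the basis columns.

    The result is returned in the original feature shape, so it can be
    passed to any classifier that expects spectro-temporal input.
    """
    flat = x.reshape(len(x), -1)
    projected = (flat @ basis) @ basis.T
    return projected.reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for source- and target-device features.
    source = rng.normal(2.0, 3.0, size=(40, 8, 10))
    target = rng.normal(-1.0, 0.5, size=(12, 8, 10))

    source_std = bandwise_standardize(source)
    target_std = bandwise_standardize(target)

    # The basis is fitted on source-domain data only.
    basis = fit_principal_subspace(source_std, n_components=5)
    source_proj = project(source_std, basis)
    target_proj = project(target_std, basis)
    print(source_proj.shape, target_proj.shape)
```

Because the basis is fitted on source-domain data alone and the target statistics are computed without labels, a pipeline of this kind needs no target-domain annotations, which is consistent with the unsupervised setting described in the abstract.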
Conflict of Interest
The authors declare that they have no conflict of interest.
Cite this article
Mezza, A.I., Habets, E.A.P., Müller, M. et al. Unsupervised Domain Adaptation via Principal Subspace Projection for Acoustic Scene Classification. J Sign Process Syst 94, 197–213 (2022). https://doi.org/10.1007/s11265-021-01720-9
- Acoustic scene classification
- Unsupervised domain adaptation
- Mismatched recording devices