Abstract
We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established, state-of-the-art algorithms: spectro-temporal filters such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) features, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open-source, reliable, and extensible framework built on top of the popular Kaldi speech processing library. The Python implementation makes it easy for non-technical users to adopt and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and the implemented algorithms. Three applications then illustrate its use. We first benchmark the speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performance of a speaker normalization model as a function of the amount of speech used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.
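To make the description above concrete, the sketch below shows what a minimal feature extraction run looks like in Python. It is an illustrative sketch only: the names Audio and MfccProcessor and the module paths shennong and shennong.processor follow the Shennong documentation as we understand it and are assumptions here, not quoted from this article; exact paths and options may differ between versions.

    # Minimal sketch of MFCC extraction with Shennong (assumed API, see note above).
    from shennong import Audio
    from shennong.processor import MfccProcessor

    # Load a mono WAV file; 'speech.wav' is a placeholder path.
    audio = Audio.load('speech.wav')

    # Configure the Kaldi-backed MFCC extractor for the file's sample rate.
    processor = MfccProcessor(sample_rate=audio.sample_rate)

    # Run the extraction; the result wraps a frames-by-coefficients matrix
    # together with per-frame timestamps.
    features = processor.process(audio)
    print(features.data.shape)  # e.g. (num_frames, 13)

The other extractors mentioned in the abstract (PLP, filterbanks, pitch estimators, pre-trained bottleneck networks) are expected to follow the same processor-and-process() pattern, but the exact class names should be checked against the installed version's documentation.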
Notes
We named Shennong after the legendary Chinese emperor who, according to Chinese mythology, popularized tea. The name is a nod to Kaldi, the speech recognition toolkit on which Shennong is built, itself named after the legendary Ethiopian goatherd said to have discovered the coffee plant.
Acknowledgements
This work is funded by Inria (Grant ADT-193), the Agence Nationale de la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL, ANR-19-P3IA-0001 PRAIRIE 3IA Institute), CIFAR (Learning in Minds and Brains) and Facebook AI Research (Research Grant).
Additional information
Mathieu Bernard and Maxime Poli contributed equally to this work.
About this article
Cite this article
Bernard, M., Poli, M., Karadayi, J. et al. Shennong: A Python toolbox for audio speech features extraction. Behav Res 55, 4489–4501 (2023). https://doi.org/10.3758/s13428-022-02029-6