
Shennong: A Python toolbox for audio speech features extraction

Behavior Research Methods


Abstract

We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open-source, reliable, and extensible framework built on top of the popular Kaldi speech processing library. Its Python implementation makes it easy to use for non-technical users and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and the implemented algorithms, then illustrates its use with three applications. We first benchmark the speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performance of a speaker normalization model as a function of the amount of speech used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.
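The spectro-temporal pipeline mentioned in the abstract (mel filterbank, log compression, discrete cosine transform, yielding MFCCs) can be sketched in plain NumPy. This is a toy illustration of the general technique, not Shennong's implementation; the frame size (25 ms), hop (10 ms), filter count, and cepstral dimension are typical defaults assumed here for the example.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Frame the signal, window it, take the power spectrum, pool it
    with mel filters, log-compress, and apply a type-II DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    feats = np.empty((n_frames, n_ceps))
    n = np.arange(n_filters)
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        logmel = np.log(fbank @ power + 1e-10)
        # Type-II DCT written out directly to avoid a scipy dependency
        feats[t] = [np.sum(logmel * np.cos(np.pi * c * (2 * n + 1) / (2 * n_filters)))
                    for c in range(n_ceps)]
    return feats

# One second of a synthetic speech-like signal: a 200 Hz tone plus noise
rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
sig = sig + 0.01 * rng.standard_normal(16000)
feats = mfcc(sig)
print(feats.shape)  # (98, 13): one 13-dimensional vector per 10 ms frame
```

In practice a toolbox such as Shennong delegates this computation to Kaldi's optimized implementation and wraps it with post-processing (deltas, normalization) behind a uniform API; the sketch above only shows what the spectro-temporal front end computes.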


Notes

  1. We named Shennong after the mythical Chinese emperor who, according to Chinese mythology, popularized tea. The name is a nod to Kaldi, the speech recognition toolkit on which Shennong is built, itself named after a legendary Ethiopian goatherd who discovered the coffee plant.

  2. https://github.com/bootphon/shennong

  3. https://github.com/bootphon/shennong/tree/v1.0/examples

  4. https://zerospeech.com/tracks/2015/results


Acknowledgements

This work is funded by Inria (Grant ADT-193), the Agence Nationale de la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL, ANR-19-P3IA-0001 PRAIRIE 3IA Institute), CIFAR (Learning in Minds and Brains) and Facebook AI Research (Research Grant).

Author information


Corresponding author

Correspondence to Mathieu Bernard.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mathieu Bernard and Maxime Poli contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bernard, M., Poli, M., Karadayi, J. et al. Shennong: A Python toolbox for audio speech features extraction. Behav Res 55, 4489–4501 (2023). https://doi.org/10.3758/s13428-022-02029-6

