Voice-Based Gender Identification in Multimedia Applications

Journal of Intelligent Information Systems

Abstract

In the context of content-based multimedia indexing, gender identification from the speech signal is an important task. In this paper, a set of acoustic and pitch features, along with different classifiers, is compared for the problem of gender identification. We show that the fusion of features and classifiers performs better than any individual classifier. Based on these conclusions, we built a system for gender identification in multimedia applications. The system uses a set of neural networks with acoustic and pitch-related features.
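As an illustration of classifier fusion, the sketch below averages the per-class scores of several experts (the sum rule) and picks the class with the highest fused score. It is a minimal example under stated assumptions, not the system described in the paper: the function names `fuse_expert_scores` and `pick_label`, the class order, and the equal weighting of experts are all hypothetical.

```python
from typing import List, Sequence


def fuse_expert_scores(expert_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class scores of several experts (sum-rule fusion).

    Each inner sequence holds one expert's scores, one value per class.
    Equal weighting of the experts is an assumption of this sketch.
    """
    if not expert_scores:
        raise ValueError("need at least one expert")
    n_classes = len(expert_scores[0])
    if any(len(scores) != n_classes for scores in expert_scores):
        raise ValueError("all experts must score the same number of classes")
    n_experts = len(expert_scores)
    return [
        sum(scores[c] for scores in expert_scores) / n_experts
        for c in range(n_classes)
    ]


def pick_label(fused: Sequence[float],
               labels: Sequence[str] = ("female", "male")) -> str:
    """Return the label whose fused score is highest."""
    best = max(range(len(fused)), key=lambda c: fused[c])
    return labels[best]


# Three hypothetical experts scoring one segment as (female, male).
scores = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(pick_label(fuse_expert_scores(scores)))  # prints "male"
```

In this toy input the first expert alone would answer "female", while the fused decision follows the two experts that agree.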

A classification accuracy of 90% is obtained for 1-second segments, independent of the language and the channel of the speech. Practical considerations, such as the continuity of speech and the use of a mixture of experts instead of a single expert, are shown to improve the classification accuracy to 93%. When used on a subset of the Switchboard database, the classification accuracy reaches 98.5% for 5-second segments.
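One way to exploit the continuity of speech is to smooth the per-segment scores over neighbouring segments before deciding, on the assumption that the speaker rarely changes from one short segment to the next. The sketch below uses a centered moving average. The window length, the 0.5 threshold, and the function names `smooth_scores` and `label_segments` are assumptions made for illustration; the abstract does not specify how the authors apply continuity.

```python
from typing import List, Sequence


def smooth_scores(p_male: Sequence[float], window: int = 5) -> List[float]:
    """Centered moving average of per-segment 'male' scores.

    The window is truncated at both ends of the sequence, so the output
    has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(p_male)):
        lo = max(0, i - half)
        hi = min(len(p_male), i + half + 1)
        smoothed.append(sum(p_male[lo:hi]) / (hi - lo))
    return smoothed


def label_segments(p_male: Sequence[float], window: int = 5,
                   threshold: float = 0.5) -> List[str]:
    """Label each segment after smoothing its score over its neighbours."""
    return ["male" if p > threshold else "female"
            for p in smooth_scores(p_male, window)]


# Hypothetical scores for five consecutive 1-second segments.
raw = [0.9, 0.8, 0.3, 0.9, 0.85]
print(label_segments(raw, window=1))  # no smoothing: third segment is "female"
print(label_segments(raw, window=3))  # smoothed: all five segments are "male"
```

A longer window suppresses more isolated errors but reacts more slowly to a genuine speaker change, so in practice the window would be chosen against the expected turn length.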

Author information

Correspondence to Hadi Harb.

Cite this article

Harb, H., Chen, L. Voice-Based Gender Identification in Multimedia Applications. J Intell Inf Syst 24, 179–198 (2005). https://doi.org/10.1007/s10844-005-0322-8

  • DOI: https://doi.org/10.1007/s10844-005-0322-8
