Investigating Fairness in Machine Learning-based Audio Sentiment Analysis using Spectrograms and Bag-of-visual-words

Audio sentiment analysis is a growing area of research; however, fairness in audio sentiment analysis is hardly investigated. We found research on the reliability and fairness of machine learning tools across various demographic groups, but fairness in audio sentiment analysis with respect to gender remains an uninvestigated field. In this research, we used 442 audio files of happiness and sadness -- representing equal samples of male and female subjects -- and generated a spectrogram for each file. We then used the bag-of-visual-words method for feature extraction and Random Forest, Support Vector Machines, and K-nearest Neighbors classifiers to investigate whether machine learning models for audio sentiment analysis are fair across the two genders. We found a need for gender-specific models for audio sentiment analysis instead of a gender-agnostic general model. Our results provide three pieces of evidence that the gender-agnostic model is biased in terms of the accuracy of the audio sentiment analysis task. Furthermore, we discovered that a gender-specific model trained with female audio samples does not perform well against male audio files and vice versa. The best accuracy for the female-model is 76% and for the male-model is 74%, both significantly better than the gender-agnostic model's accuracy of 66%.


Introduction
Artificial intelligence (AI) has been used in many classification and recognition tasks, including sentiment analysis. However, in the race to constantly build models with high accuracy, researchers have not always paid attention to uneven system capability development 1. Increasingly, researchers are starting to realize that high accuracy on standard benchmark datasets does not guarantee successful deployment in real-world scenarios 2,3,4. Many demographic factors play a role in real-world scenarios that standard benchmark datasets do not always capture, and due to this issue many machine learning models are never deployed to production. Nonetheless, researchers are becoming aware of this issue; therefore, artificial intelligence's reliability, credibility, and fairness have attracted attention, especially for human-in-the-loop or human-centered AI. Recent research has raised concerns over fairness implications 5. A growing number of researchers have realized that the "one-size-fits-all" approach does not fit at all in AI systems and are committed to creating and designing AI algorithms that are trustworthy 6, fair 7, and unbiased 8,9. Machine learning and AI tools are meant to make human life easier, with the desire that the solutions do what humans would have done but faster, more consistently, and without bias. The irony is that these solutions are frequently observed to perform better for some demographic groups than for others, making biased judgments and escalating inequality 10,11. Hence, incorporating as much demographic data as possible could bring a solution closest to how humans would have performed the task themselves without bias.
The fairness of AI algorithms is an evolving research area that stems from the general need for decision-making to be free from bias and discrimination 12. Schmitz et al. 13 experimented with how to balance the accuracy and fairness of multimodal architectures for emotion recognition and discovered that the fairest bimodal model is an audio + video fusion. Ricci et al. 14 discussed the meaning of fairness in medical image processing and commented on potential sources of bias and available strategies to mitigate it. To explore the issue of fairness in dialogue systems, Liu et al. 15 constructed a benchmark dataset and proposed quantitative measures to understand fairness in dialogue models; they found that popular dialogue models are significantly prejudiced against different genders and races, and provided two methods to mitigate bias in dialogue systems. In the context of current issues in healthcare, Chen et al. 16 summarized fairness in machine learning and its intersectional fields, outlining how algorithmic biases arise in existing clinical workflows and the healthcare disparities that result. Although research 1,16,17,18 shows that AI algorithms can be biased against specific populations or groups in various situations, there is a gap in understanding fairness in audio sentiment analysis.
We used 442 audio files with male and female voices, transformed them into spectrograms, and used a bag-of-visual-words feature representation with machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM), and K-nearest Neighbors (KNN) to investigate the fairness of the models in audio sentiment analysis. We found that models generated using the same parameters and algorithms do not perform the same way for different genders' audio files. Hence, using a general-purpose model to analyze the sentiment of different genders harms the accuracy of sentiment analysis. In earlier work 19, a model was generated and tested using only female audio files, obtaining an accuracy of 76%. Here, we generated a general-purpose model using both male and female audio files together, and the accuracy decreased to 66%. We then separated the audio files into male and female groups and built gender-specific customized models to address the poor accuracy of the general-purpose model, and tested the accuracy of each gender-specific model against the gender-specific dataset and a dataset representing both genders. The model customized for female audio files did not perform well with male audio files and vice versa. Both gender-specific models performed poorly when the model built for one gender was tested against audio files of the other gender, even after hyperparameter optimization of the machine learning algorithms, which verified that a personalized, gender-aware approach is needed to achieve better accuracy.

Results
We built three different models (general-model, female-model, and male-model) to demonstrate that utilizing the same model for different genders does not produce a fair solution for audio sentiment analysis, from which we can infer that machine learning models can show gender bias. Table 1 shows the parameters used for the three models.
General-model: This model was built using both male and female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 44,100 Hz, 100 keypoints (using an ORB extractor), 5 clusters while building the bag-of-visual-words, and the RF algorithm as the classifier.
Female-model: This model was customized for female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 22,050 Hz, 150 keypoints (using an ORB extractor), 10 clusters while building the bag-of-visual-words, and the RF algorithm as the classifier.
Male-model: This model was customized for male audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 11,025 Hz, 300 keypoints (using an ORB extractor), 10 clusters while building the bag-of-visual-words, and the RF algorithm as the classifier.

Table 2 shows the accuracy for different scenarios. We tested each of the three models (general-model, female-model, and male-model) with the audio files of the gender the model was customized for, the gender it was not customized for, and finally both male and female audio files together. First, we developed a general model trained on both male and female audio files and obtained an accuracy of 66%, which was very low compared to our previous work's accuracy of 76% using only female audio files 19. This difference in accuracy made us question the model's fairness, so we decided to separate the male and female audio files and create gender-specific models. We developed a model using only male audio files, performed hyperparameter optimization in the steps explained in the Methodology section, and obtained 74% accuracy, which is better than the general-model's accuracy.
Moreover, we passed audio files into the model that was not built for them and compared the accuracy with the gender-specific model's accuracy. When female audio files were passed to the male-model, the accuracy decreased to 60%, and when both genders' audio files were passed, we got 64% accuracy. Likewise, when male audio files were passed to the female-model, the accuracy decreased to 57%, and when a dataset of both genders' audio files was passed, we got 62% accuracy. Therefore, our results demonstrate gender bias in machine learning-based sentiment analysis. Table 3 shows the different classifiers' accuracy for all three models. Random Forest performed best among the three classifiers for all three models: female-model 76% accuracy, male-model 74% accuracy, and general-model 66% accuracy.

Discussion
We used audio files of both genders, male and female, for audio sentiment analysis. While running the experiments, we found that machine learning models are not fair in all instances and can be prone to bias when the audio instances used to train a sentiment analysis model belong to a different gender than the instances used to test it.
The three pieces of evidence presented in the Results section clearly show that an AI algorithm is not fair when a single one-size-fits-all model is used for audio sentiment analysis of male and female audio files. The three pieces of evidence are:
1. All three models (general-model, female-model, and male-model) did not perform well when tested with the audio files of both genders together.
2. The model that performed best with female audio files did not perform well with male audio files.
3. The model that performed best with male audio files did not perform well with female audio files.
The results displayed in Table 2 substantiate the need for gender-specific personalized models or algorithms that can handle these differences and still provide better results. Another valuable insight from the results is that the male-model performed better than the female-model in the general setting, i.e., when both genders' audio samples were used (male-model 64% vs. female-model 62%). Additionally, the male-model performed slightly better than the female-model when the opposite gender's audio samples were used, i.e., female audio samples for the male-model (60%) versus male audio samples for the female-model (57%).
However, the best model overall in terms of accuracy is the female-model, with 76% accuracy when tested against female audio samples. These comparisons demonstrate bias in the AI algorithms used for audio sentiment analysis.
In this experiment, we have tried to showcase a scenario that might happen in real-world settings. If we train a model with male audio files and, during the deployment phase, female audio samples are used for testing, or vice versa, the model will perform unfairly. Hence, we developed gender-specific customized models to treat both genders' audio files fairly. The dataset had demographic information such as gender in the filename, which made it easy for us to separate the files by gender and develop the gender-specific models. In many cases, however, datasets do not provide such demographic information. For such a scenario, one solution could be to first perform gender recognition and then pass the audio files to the gender-specific personalized model. Personalizing the model in this way could increase its accuracy for both genders: when we average the results of the two gender-specific models on the gender each was customized for, we get an accuracy of 75%, which is higher than in any case where both genders' audio files were used together without gender recognition. Even with hyperparameter optimization of the machine learning algorithms, the best accuracy we could obtain for both genders' audio files together was 64% from the male-model and 62% from the female-model. Developing an ensemble of gender recognition and gender-specific models and combining their results can give the model a personalized touch instead of a one-size-fits-all approach.

In our experiment, we were able to verify the bias of machine learning with respect to a demographic factor such as gender; if more demographic factors were available in the datasets, such as race, age, and ethnicity, there might be a chance of further increasing the accuracy of the model. Our main point is that we need to acknowledge demographic differences, be it gender, race, or ethnicity, in the datasets and take a personalized approach rather than a one-size-fits-all one. More researchers are seeing this issue and drawing attention to fairness in machine learning and AI tools, and we think accepting that there is an issue and pointing it out is the first step toward a solution.
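A minimal sketch of this routing idea follows. It assumes a pre-trained gender recognizer and the two gender-specific sentiment classifiers, all exposing a scikit-learn-style predict interface; the component names and the featurize helper are hypothetical placeholders, not components we release with this paper.

    def predict_sentiment(audio_path, gender_recognizer, female_model, male_model, featurize):
        """Route an audio file to the matching gender-specific sentiment model.

        All four arguments are placeholders for components trained as
        described in this paper: featurize maps an audio file to its
        bag-of-visual-words histogram, and the three models are
        scikit-learn-style classifiers.
        """
        features = featurize(audio_path)                    # spectrogram -> BoVW histogram
        gender = gender_recognizer.predict([features])[0]   # 'female' or 'male'
        model = female_model if gender == 'female' else male_model
        return model.predict([features])[0]                 # 'happiness' or 'sadness'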

Methodology
We employed the EmoFilm dataset 20 -- a multilingual corpus comprising 1,115 emotional utterances in English, Spanish, and Italian, taken from 43 films and 207 speakers (both male and female). This research used an existing dataset and did not involve direct human participants. To our knowledge, this dataset has not previously been used for sentiment analysis with the combination of spectrograms and the bag-of-visual-words method. Each audio file's label was included in the filename, making it very simple to identify the emotion represented by the audio signal (i.e., Fear, Disgust, Happiness, Anger, Sadness). For our binary sentiment classification goal, we chose 204 Happiness and 238 Sadness audio files, equally divided between males and females.
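Since both the emotion label and the speaker's gender are encoded in the filenames, selecting the binary-sentiment subset reduces to simple filename parsing, as the sketch below illustrates. The naming pattern, directory path, and code mappings shown here are assumptions for illustration only and must be adapted to the actual EmoFilm layout.

    import os

    AUDIO_DIR = 'EmoFilm/wav'   # hypothetical location of the corpus

    def parse_filename(name):
        # Hypothetical pattern, e.g. 'f_hap_0421_en.wav' -> ('female', 'happiness');
        # adjust the split and the code tables to the real naming convention.
        gender_code, emotion_code = os.path.splitext(name)[0].split('_')[:2]
        gender = {'f': 'female', 'm': 'male'}[gender_code]
        emotion = {'hap': 'happiness', 'sad': 'sadness'}.get(emotion_code)
        return gender, emotion   # emotion is None for the other three emotions

    # Keep only the Happiness and Sadness files for binary classification.
    dataset = []
    for name in sorted(os.listdir(AUDIO_DIR)):
        gender, emotion = parse_filename(name)
        if emotion is not None:
            dataset.append((os.path.join(AUDIO_DIR, name), gender, emotion))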
We used spectrograms as the data representation for audio files and used the bag-of-visual-words method to transform them into histograms, which we then passed to classifiers such as RF, SVM, and KNN for sentiment classification. Figure 1 shows the pipeline of the experiment; more detail on the methodology can be found in our previous work 19. Here we explain how we customized the model to be gender-specific. All the steps were the same while developing the gender-specific models; only the parameters differed. We conducted our experiment in two stages: first we generated the histograms from the audio files, and later we passed the histograms to the classifiers. In the first stage, we transformed the audio files into spectrograms using the Short-time Fourier Transform (STFT) with the Librosa library in Python, where different sample rates were applied for different genders (the best parameter combinations are listed in the Results section). Then we used the Oriented FAST and Rotated BRIEF 21 (ORB) algorithm to extract keypoints and descriptors. Using the descriptors, we generated a histogram per file and saved it as a .csv file for later analysis. After transforming the raw audio dataset into histogram .csv files, we split the dataset in a 75:25 ratio for training and testing, trained our classifier models on the training data, and predicted the sentiment of each audio file. Hyperparameter optimization was done in several stages, listed below; a sketch of the full feature-extraction and classification pipeline follows the list.

1. Generating spectrograms. We used different sample rates (11,025 Hz, 22,050 Hz, and 44,100 Hz) while generating spectrograms to check which one works best for each gender in our experiment. We found that a sample rate of 22,050 Hz works best for female audio and a sample rate of 11,025 Hz works best for male audio.
2. Building a visual dictionary. To create a visual dictionary, we used the ORB algorithm to extract keypoints from each spectrogram with 32-byte descriptors. We tested different numbers of keypoints (50, 100, 150, 300) to see whether we could improve the accuracy for male and female audio files; 150 keypoints for female audio files and 300 keypoints for male audio files worked best. Additionally, we tried multiple cluster counts (5, 10, 20) per image to compare and get the best performance for the models customized for females, males, and both. We found that 10 clusters worked best for both male and female audio files.
3. Hyperparameter optimization of the classifiers. We used Randomized Search and Grid Search to find the best parameters for the classifiers when building the customized models for males and females, as sketched after the pipeline example below.
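The following is a minimal end-to-end sketch of this pipeline, assuming k-means for building the visual dictionary (we do not prescribe a specific clustering algorithm above, so this is one reasonable choice) and using the female-model parameters from Table 1. The files list reuses the dataset list from the filename-parsing sketch above, and in-memory histograms stand in for the intermediate .csv files.

    import cv2
    import librosa
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    SAMPLE_RATE = 22050   # best for female audio (Table 1)
    N_KEYPOINTS = 150     # ORB keypoints for female audio (Table 1)
    N_CLUSTERS = 10       # visual-dictionary size (Table 1)

    def spectrogram_image(path):
        """Load audio, compute an STFT spectrogram, and scale it to an 8-bit image."""
        samples, _ = librosa.load(path, sr=SAMPLE_RATE)
        db = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=np.max)
        return cv2.normalize(db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    orb = cv2.ORB_create(nfeatures=N_KEYPOINTS)

    # The female subset of the (path, gender, emotion) triples built earlier.
    files = [(path, emotion) for path, gender, emotion in dataset if gender == 'female']
    des_per_file = [orb.detectAndCompute(spectrogram_image(p), None)[1] for p, _ in files]

    # Build the visual dictionary by clustering all descriptors with k-means.
    stacked = np.vstack([d for d in des_per_file if d is not None]).astype(np.float32)
    kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=0).fit(stacked)

    def bovw_histogram(des):
        """Quantize one file's descriptors into a normalized visual-word histogram."""
        hist = np.zeros(N_CLUSTERS)
        if des is not None:
            words = kmeans.predict(des.astype(np.float32))
            hist = np.bincount(words, minlength=N_CLUSTERS).astype(float)
        return hist / max(hist.sum(), 1.0)

    X = np.array([bovw_histogram(d) for d in des_per_file])
    y = np.array([label for _, label in files])

    # 75:25 train/test split, then a Random Forest classifier.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))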
We utilized hyperparameter optimization to increase the models' accuracy, performing it at every stage where it could improve the gender-specific models.
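For the classifier-level search (step 3 above), a compact sketch using scikit-learn's GridSearchCV follows. The parameter grid shown is illustrative rather than the exact grid we searched, and X_train and y_train are the training histograms and labels from the pipeline sketch above.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid; RandomizedSearchCV can be swapped in for larger spaces.
    param_grid = {
        'n_estimators': [100, 300, 500],
        'max_depth': [None, 10, 30],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring='accuracy')
    search.fit(X_train, y_train)   # histograms and labels from the pipeline above
    print(search.best_params_, 'cv accuracy:', search.best_score_)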

Declarations
Figure 1. Pipeline of the gender-specific audio sentiment analysis model.

Table 1. Best parameters for each gender-specific model and the general model.

Model           Sample rate    ORB keypoints    Clusters    Classifier
General-model   44,100 Hz      100              5           Random Forest
Female-model    22,050 Hz      150              10          Random Forest
Male-model      11,025 Hz      300              10          Random Forest

Table 2. Accuracy of the models based on the gender of the audio files.

Test set       General-model    Female-model    Male-model
Female files   --               76%             60%
Male files     --               57%             74%
Both genders   66%              62%             64%

Table 3. Different models' accuracy for different algorithms.