Decoding Demographic un-fairness from Indian Names

  • Conference paper
Social Informatics (SocInfo 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13618)

Abstract

Demographic classification is essential in fairness assessment in recommender systems or in measuring unintended bias in online networks and voting systems. Important fields like education and politics, which often lay a foundation for the future of equality in society, need scrutiny to design policies that can better foster equality in resource distribution constrained by the unbalanced demographic distribution of people in the country.

We collect three publicly available datasets to train state-of-the-art classifiers in the domain of gender and caste classification. We train the models in the Indian context, where the same name can have different styling conventions (Jolly Abraham/Kumar Abhishikta in one state may be written as Abraham Jolly/Abishikta Kumar in the other). Finally, we also perform cross-testing (training and testing on different datasets) to understand the efficacy of the above models.

We also perform an error analysis of the prediction models. Finally, we attempt to assess the bias in the existing Indian system as case studies and find some intriguing patterns manifesting in the complex demographic layout of the sub-continent across the dimensions of gender and caste.

Notes

  1. https://www.britannica.com/place/India/Indo-European-languages

  2. https://github.com/vahini01/IndianDemographics

  3. Detailed stats are available in Appendix A.

  4. https://www.kooapp.com/

  5. https://resultsarchives.nic.in

  6. Maintained on the same website as CBSE.

  7. https://en.wikipedia.org/wiki/Scheduled_Castes_and_Scheduled_Tribes

  8. https://www.kooapp.com/feed

  9. https://www.kooapp.com/

  10. IndicBERT supports the following 12 languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

  11. https://en.wikipedia.org/wiki/2011_Census_of_India

References

  1. Ambekar, A., Ward, C.B., Mohammed, J., Male, S., Skiena, S.: Name-ethnicity classification from open sources. In: KDD, pp. 49–58. Association for Computing Machinery, New York, NY, USA (2009)

  2. Gender API. https://gender-api.com (2021)

  3. Name API. https://www.nameapi.org/en/home/ (2021)

  4. Chakraborty, S., Dutta, P., Roychowdhury, S., Mukherjee, A.: CRUSH: contextually regularized and user anchored self-supervised hate speech detection. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1874–1886. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.findings-naacl.144. https://aclanthology.org/2022.findings-naacl.144

  5. Chakraborty, S., Goyal, P., Mukherjee, A.: Aspect-based sentiment analysis of scientific reviews. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pp. 207–216. JCDL 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3383583.3398541

  6. Chakraborty, S., Goyal, P., Mukherjee, A.: (IM) balance in the representation of news? an extensive study on a decade long dataset from India. International Conference on Social Informatics, SocInfo (2022). arXiv preprint arXiv:2110.14183

  7. Genderize. https://genderize.io/ (2021)

  8. Hu, Y., Hu, C., Tran, T., Kasturi, T., Joseph, E., Gillingham, M.: What's in a name? - gender classification of names with character based machine learning models (2021)

  9. Krüger, S., Hermann, B.: Can an online service predict gender? on the state-of-the-art in gender identification from texts. In: Proceedings of the 2nd International Workshop on Gender Equality in Software Engineering, pp. 13–16. GE 2019. IEEE Press, Canada (2019). https://doi.org/10.1109/GE.2019.00012

  10. Mueller, J., Stumme, G.: Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on Social Informatics 2016, Data Science 2016. MISNC, SI, DS 2016, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2955129.2955182

  11. Onograph. https://forebears.io/onograph/ (2021)

  12. Parasurama, P.: raceBERT - a transformer-based model for predicting race and ethnicity from names (2021). arXiv preprint arXiv:2112.03807

  13. Singh, A.K., et al.: What's kooking? characterizing India's emerging social network, Koo. In: Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 193–200. Association for Computing Machinery, New York, NY, USA (2021)

  14. Sood, G., Laohaprapanon, S.: Predicting race and ethnicity from the sequence of characters in a name (2018)

  15. Swami, S., Khandelwal, A., Shrivastava, M., Akhtar, S.: LRTC IIITH at IBEREVAL 2017: stance and gender detection in tweets on catalan independence. In: CEUR Workshop Proceedings 1881, 199–203 (2017), 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IBEREVAL (2017)

  16. Tang, C., Ross, K., Saxena, N., Chen, R.: What's in a name: a study of names, gender inference, and gender behavior in facebook. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds.) Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, International Workshops, pp. 344–356 (2011)

  17. Treeratpituk, P., Giles, C.L.: Name-ethnicity classification and ethnicity-sensitive name matching. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1141–1147. AAAI2012, AAAI Press, Canada (2012)

  18. Tripathi, A., Faruqui, M.: Gender prediction of Indian names. In: IEEE Technology Students' Symposium, pp. 137–141. IEEE, Kharagpur (2011). https://doi.org/10.1109/TECHSYM.2011.5783842

Author information

Corresponding author

Correspondence to Souvic Chakraborty.

Appendix

1.1 Dataset Statistics

Table 4 displays the dataset statistics.

Table 4. The table below contains information on datasets that are used to train models and conduct case studies.

1.2 Baseline APIs and Models

We used several publicly available gender classification APIs as baselines and compared them with the results obtained from our transformer-based methods.

Gender API [2]: Gender-API.com is a simple-to-implement solution that adds gender information to existing records. It receives input via an API and returns the split-up name (first name, last name) and gender to the app or the website. According to the website, it will search for the name in a database belonging to the specific country, and if it is not found, it will perform a global lookup. If it cannot find a name in a global lookup, it performs several normalizations on the name to correct typos and cover all spelling variants.

Onograph API [11]: OnoGraph is a set of services that predicts a person's characteristics based on their name. It can predict nationality, gender, and location (where they live). The services are based on the world's largest private database of living people, which contains over 4.25 billion people (as of July 2020). According to the documentation, "OnoGraph's results are the most accurate of any comparable service; and it recognizes around 40 million more names than the nearest comparable service."

Genderize API [7]: It is a simple API that predicts a person's gender based on their name. The request generates a response with the following keys: name, gender, probability, and count. The probability denotes the certainty of the assigned gender. The count indicates the number of data rows examined to calculate the response.
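As an illustration of how such a baseline response might be consumed, the sketch below builds a query URL and parses a response body offline. It assumes the public endpoint `https://api.genderize.io` with a `name` query parameter and a certainty value returned under the key `probability`; the helper names, the 0.5 threshold, and the sample response body are hypothetical and are not taken from the paper.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint of the Genderize service.
GENDERIZE_ENDPOINT = "https://api.genderize.io"


def build_query_url(first_name: str) -> str:
    """Build the request URL for a single name."""
    return f"{GENDERIZE_ENDPOINT}?{urlencode({'name': first_name})}"


def parse_response(body: str, min_probability: float = 0.5):
    """Return (gender, probability, count); gender is None when the API is unsure.

    The threshold `min_probability` is an illustrative choice, not a value from the paper.
    """
    record = json.loads(body)
    gender = record.get("gender")
    probability = float(record.get("probability") or 0.0)
    count = int(record.get("count") or 0)
    if gender is None or probability < min_probability:
        return None, probability, count
    return gender, probability, count


# Hypothetical response body, made up for illustration only.
sample_body = '{"name": "karishma", "gender": "female", "probability": 0.98, "count": 1234}'
print(build_query_url("karishma"))
print(parse_response(sample_body))
```

Treating low-certainty answers as "no prediction" is one possible design choice; a baseline evaluation could equally count them as errors.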

1.3 Model Description

Logistic Regression: We concatenate the different parts of the name and compute character n-grams. Next we obtain TF-IDF scores from the character n-grams and pass them as features to the logistic regression model.
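The feature extraction described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the lowercasing step, the smoothed IDF formula, and the function names are assumptions, and the classifier itself is omitted. The n-gram range 1 to 6 follows the hyperparameters listed later in the appendix.

```python
import math
from collections import Counter


def char_ngrams(name: str, n_min: int = 1, n_max: int = 6):
    """All character n-grams of the concatenated name parts, for n in [n_min, n_max]."""
    text = "".join(name.lower().split())  # concatenate the parts of the name (lowercasing is an assumption)
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_features(names):
    """Map each name to a sparse {n-gram: tf-idf score} dictionary."""
    grams_per_name = [Counter(char_ngrams(name)) for name in names]
    # Document frequency: in how many names each n-gram occurs at least once.
    doc_freq = Counter(g for grams in grams_per_name for g in grams)
    n_docs = len(names)
    features = []
    for grams in grams_per_name:
        total = sum(grams.values())
        features.append({
            # Term frequency times a smoothed inverse document frequency (assumed variant).
            g: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
            for g, count in grams.items()
        })
    return features


# Example names taken from the error analysis in this appendix.
features = tfidf_features(["Avunuri Aruna", "Karishma Chettri"])
print(len(features[0]), "distinct n-grams for the first name")
```

The resulting sparse vectors would then be passed to a logistic regression classifier.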

SVM: A support vector machine seeks a hyperplane in N-dimensional space (N = the number of features) that separates the data points of the two classes. Many hyperplanes may separate the two groups of data points; the algorithm selects the one with the greatest margin, that is, the greatest distance to the nearest data points of both classes.
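For reference, the maximum-margin idea can be written as the textbook hard-margin optimization problem below. The notation is an assumption introduced here: x_i is the feature vector of the i-th training name, y_i in {-1, +1} is its label, and m is the number of training names. This is the standard linear formulation, not necessarily the exact variant trained in the paper.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, m
```

The margin between the two classes equals 2 / ||w||, so minimizing ||w|| maximizes the margin. With a non-linear kernel, the same problem is solved in the feature space induced by the kernel.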

Char CNN: Character-level CNN (char-CNN) is a well-known text classification algorithm. Each character is encoded with a fixed-length trainable embedding. A 1-D CNN is applied to the matrix created by concatenating the above vectors. In our model, we utilize 256 convolution filters in a single hidden layer of 1D convolution with a kernel size of 7.

Char LSTM: A name is a sequence of characters. Like char-CNN, each character of the input name is transformed into trainable embedding vectors and provided as input. Our model employs a single LSTM layer with 64 features and a 20% dropout layer.

Transformer Models

  • We choose BERT for demographic classification, using full names as inputs, because it has proven highly effective at sequence modeling of English text.

  • mBERT is trained using a masked language modeling (MLM) objective on the 104 languages with the largest Wikipedias.

  • IndicBERT is a multilingual ALBERT model trained only on 12 major Indian languages (see Note 10). IndicBERT has far fewer parameters than other multilingual models.

  • MuRIL is pre-trained on 17 Indian languages and their transliterated counterparts. It employs a different tokenizer from the BERT model. This model is an appropriate candidate for categorization based on Indian names because it is pre-trained on Indian languages.

Hyperparameters

  • LR: learning rate = 0.003, n-gram range = (1–6)

  • SVM: kernel = rbf, n-gram range = (1–6), degree = 3, gamma = scale

  • Char CNN: learning rate = 0.001, hidden layers = 1, filters = 256, kernel size = 7, optimizer = adam

  • Char LSTM: learning rate = 0.001, dropout = 0.2, hidden layers = 1, features = 64, optimizer = adam

  • Transformer models: models = [bert-base-uncased, google/muril-base-cased, ai4bharat/indic-bert, bert-base-multilingual-uncased], epochs = 3, learning rate = 0.00005

1.4 Results

More detailed results are given in Tables 5 and 6.

Handling of Corner Cases: Because the same name can occur with more than one gender or caste label, we use majority voting to assign each name a single binary label in both the gender and caste classification tasks. In case of a tie, we assign the label arbitrarily.
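A minimal sketch of this labeling rule is given below. The function name and the toy records are hypothetical. Because the text only says that ties are resolved arbitrarily, the sketch makes the tie label an explicit parameter, which is an assumption about one reasonable way to keep the choice reproducible.

```python
from collections import Counter


def majority_labels(records, tie_label):
    """Collapse (name, label) pairs into one label per name by majority vote.

    `tie_label` is used when the two most frequent labels are tied; the fixed
    tie-breaking label is an assumption made for this sketch.
    """
    votes = {}
    for name, label in records:
        votes.setdefault(name, Counter())[label] += 1

    resolved = {}
    for name, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            resolved[name] = tie_label  # tie between the two labels
        else:
            resolved[name] = ranked[0][0]  # clear majority
    return resolved


# Toy records, made up for illustration only.
records = [("kiran", "M"), ("kiran", "M"), ("kiran", "F"),
           ("jolly", "F"), ("jolly", "M")]
print(majority_labels(records, tie_label="F"))
```

The same function applies to caste labels by passing, for example, general and reserved category labels instead of M and F.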

Table 5. Performance of the models for gender classification on each dataset.
Table 6. Performance of the models for caste classification on AIEEE dataset.

1.5 Error Analysis - Baseline APIs vs Our Models

Table 7 lists some of the best and worst test cases for the best-performing baselines and the best-performing transformer-based models. Both types of models perform best when the first name (first word) is a good representative of the gender (e.g., Karishma Chettri). Baselines usually fail in three cases: the presence of a parental name or surname (e.g., Avunuri Aruna), longer names where gender is represented by multiple words (e.g., Kollipara Kodahda Rama Murthy), and core Indian names (e.g., Laishram Priyabati, Gongkulung Kamei). The main reason for the better performance of the transformer models might be that they are trained on complete names and larger datasets, which lets them handle the complexity of Indian names. However, both types of models tend to fail in the presence of unusual and highly complicated names (e.g., Raj Blal Rawat, Pullammagari Chinna Maddileti).

Table 7. Common errors made by the best-performing baselines and the best-performing transformer models. W stands for wrong and C for correct. In each two-letter code, the first letter denotes the transformer model's result and the second the API's result; e.g., WC lists names that the transformer predicted wrongly while the API predicted correctly. The letter in brackets denotes the gender (M for male, F for female). The listed names have multiple instances in the datasets, so none of the names uniquely identifies a person.
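The two-letter coding used in Table 7 can be sketched as follows, assuming the first letter refers to the transformer model and the second to the API, as in the WC example in the caption. The function names and the placeholder rows are hypothetical and do not reproduce any entry of the table.

```python
from collections import defaultdict


def error_bucket(true_gender, model_pred, api_pred):
    """Two-letter code: first letter for the model, second for the API (C = correct, W = wrong)."""
    model_flag = "C" if model_pred == true_gender else "W"
    api_flag = "C" if api_pred == true_gender else "W"
    return model_flag + api_flag


def group_by_bucket(rows):
    """Group 'name (gender)' strings under their CC / CW / WC / WW code."""
    buckets = defaultdict(list)
    for name, true_gender, model_pred, api_pred in rows:
        code = error_bucket(true_gender, model_pred, api_pred)
        buckets[code].append(f"{name} ({true_gender})")
    return dict(buckets)


# Placeholder rows: (name, true gender, model prediction, API prediction).
rows = [
    ("name_a", "F", "F", "F"),
    ("name_b", "F", "F", "M"),
    ("name_c", "M", "F", "M"),
    ("name_d", "M", "F", "F"),
]
print(group_by_bucket(rows))
```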

1.6 Case Studies - Values of Median Percentile

Table 8 displays the values that are plotted in the left plot of Fig. 1.

1.7 Case Studies - State Wise Results

To understand the state-wise distribution of caste and gender, we answer the following additional research questions (ARQs).

  • ARQ1: Which states in India have the highest representation of females and backward castes in higher education relative to their population?

  • ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?

Fig. 2. Population-normalized distribution of women and backward caste students in AIEEE 2011 data across Indian states.

Fig. 3. Rate of change of the population-normalized percentage of women and backward caste students across Indian states appearing in AIEEE exams.

Table 8. Median percentile of women and reserved-category students in AIEEE data.

ARQ1: Which states in India have the highest representation of females and backward castes in higher education relative to their population?

The AIEEE dataset has the state information for each data point. We also collect state-wise population records from the 2011 Census (see Note 11). We compute the population-normalized fraction of women and backward caste people writing the AIEEE 2011 exam. From the results plotted in Fig. 2, we observe that the states with the highest population-normalized representation of women writing the AIEEE exam are Jammu & Kashmir, Himachal Pradesh, Punjab, West Bengal, and Maharashtra. Similarly, the states with the highest population-normalized representation of backward castes writing the AIEEE exam are West Bengal, Maharashtra, Punjab, Uttarakhand, and Jammu & Kashmir. We believe that the education policies of these states could serve as useful guidance for improving conditions in the other Indian states.
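The normalization can be sketched as below. The exact definition is not spelled out in the text, so this sketch assumes the simplest reading: the number of candidates of a group from a state divided by that state's census population, scaled to candidates per 100,000 residents. The function names, state labels, and numbers are hypothetical.

```python
def population_normalized(candidate_counts, populations, per=100_000):
    """Candidates per `per` residents for each state that has a non-zero population entry."""
    return {
        state: per * candidate_counts[state] / populations[state]
        for state in candidate_counts
        if populations.get(state)  # skip states with missing or zero population
    }


def top_states(values, k=5):
    """The k states with the highest normalized value, in descending order."""
    return sorted(values, key=values.get, reverse=True)[:k]


# Toy numbers, made up for illustration only.
women_candidates = {"State A": 200, "State B": 300, "State C": 50}
census_population = {"State A": 1_000_000, "State B": 3_000_000, "State C": 100_000}

normalized = population_normalized(women_candidates, census_population)
print(top_states(normalized, k=2))
```

Note that State B has the most candidates in absolute terms but ranks last after normalization, which is the effect the normalization is meant to capture.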

ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?

One way to measure the reduction (increase) in bias is to check for an increase (decrease) in the population-normalized percentage of women and backward caste candidates over time. To this end, we obtained the rate of change of the population-normalized percentage of women and backward caste candidates taking the AIEEE exam. For each state, the rate of change is measured as the slope of the best-fit line (linear regression) of the year versus population-normalized percentage scatter plot. The year range considered was 2004 to 2011.
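The slope of the best-fit line can be computed with ordinary least squares, as in the sketch below. The function name and the yearly values are hypothetical; only the year range 2004 to 2011 comes from the text.

```python
def best_fit_slope(years, values):
    """Slope of the ordinary least-squares line fitted to (year, value) points."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (year, value) pairs of equal length")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all years are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


# Toy series for one state, made up for illustration: a steady rise of 0.5 per year.
years = list(range(2004, 2012))
percentages = [1.0 + 0.5 * (year - 2004) for year in years]
print(best_fit_slope(years, percentages))
```

A positive slope indicates rising representation over the period, and a negative slope indicates a decline.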

Table 9. Gender and caste breakup (%) in the Koo data.
Table 10. % of users in the oldest 1% of the data, sorted by creation date.

From Fig. 3, we observe that the most successful states in reducing gender inequality are Himachal Pradesh, Andhra Pradesh (Seemandhra and Telangana), Haryana, and Maharashtra. With respect to reducing caste inequality, we find that West Bengal, Punjab, Uttarakhand, Maharashtra, and Karnataka are the most successful.

1.8 Distribution of Caste and Gender in Koo

Table 11. % of users in the most recent 1% of the data, sorted by creation date.
Table 12. % of users in the top 1% of the data, sorted by number of followers.
Table 13. % of users in the bottom 1% of the data, sorted by number of followers.

In Table 9 we show the % breakup of the cross-sectional categories in the Koo dataset. We observe that the largest representation is from general category males, while the smallest is from reserved category females. At the latest time point (see Table 11) we observe higher female representation than at the oldest time point (see Table 10). The % of females (both general and reserved) among the top 1% of users sorted by followers is relatively larger than among the bottom 1% (see Tables 12 and 13). The opposite holds for males (both general and reserved). We believe that a possible reason could be that women have closed coteries of followership.
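A breakup of this kind can be sketched as follows: rank the users by a field, keep the top or bottom 1%, and report the share of each caste and gender category in that slice. The field names (`followers`, `caste`, `gender`), the function name, and the synthetic users are hypothetical; the actual schema of the Koo data is not described here.

```python
from collections import Counter


def category_share(users, key, top=True, fraction=0.01):
    """% of each (caste, gender) category among the top or bottom `fraction` of users ranked by `key`."""
    ranked = sorted(users, key=lambda user: user[key], reverse=top)
    k = max(1, int(len(ranked) * fraction))  # keep at least one user in the slice
    counts = Counter((user["caste"], user["gender"]) for user in ranked[:k])
    return {category: 100.0 * count / k for category, count in counts.items()}


# Synthetic users, made up for illustration only.
users = [
    {"followers": i,
     "caste": "general" if i % 2 else "reserved",
     "gender": "F" if i >= 198 else "M"}
    for i in range(200)
]
print(category_share(users, key="followers", top=True))
print(category_share(users, key="followers", top=False))
```

The same function could be applied to a creation-date field to obtain the oldest and most recent 1% slices.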

1.9 Ethical Implications

Like any other classification technology, demographic classification can be misused in the hands of malicious actors. Instead of reducing bias, the same technology can be used to enforce discrimination. Hence, we request researchers to exercise caution while using this technology, as some demographic classification APIs are already publicly available. Further, to keep personally identifiable data private, we open-source the codebase used to collect the data points instead of sharing the datasets, a policy common among social science researchers.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Medidoddi, V., Bantupalli, J., Chakraborty, S., Mukherjee, A. (2022). Decoding Demographic un-fairness from Indian Names. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_33

  • DOI: https://doi.org/10.1007/978-3-031-19097-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19096-4

  • Online ISBN: 978-3-031-19097-1

  • eBook Packages: Computer Science, Computer Science (R0)
