1 Introduction

Large language models (LLMs) powered by artificial intelligence (AI), such as ChatGPT, have become integral to various aspects of our daily lives, from assisting with natural language understanding [9] to providing recommendations for critical decisions, including academic and career choices [2]. One prominent decision where these AI models play a crucial role is in providing college major recommendations for students [7, 13]. These recommendations can greatly influence an individual's academic and career trajectory. Therefore, examining the potential biases present in these recommendations is of paramount importance to ensure fair and equitable guidance (Stein 2020).

This study delves into the critical issue of bias within LLMs concerning their college major recommendations for students from diverse backgrounds, spanning gender, race, socioeconomic status, and educational performance. It is essential to understand that the ability of LLMs to provide tailored recommendations can be a double-edged sword. On the one hand, it has the potential to offer highly personalized guidance that considers an individual's unique characteristics. On the other hand, it opens the door to inadvertent or systemic biases infiltrating these recommendations [8, 14]. In this study's context, bias refers to the systematic and unfair discrimination by AI against certain individuals or groups while favoring others [5]. Failing to account for these biases may lead to suboptimal decisions at the individual level, perpetuating an inequitable societal system. Consequently, the importance of comprehending and identifying these biases cannot be overstated.

To investigate the presence of bias in LLM recommendations, I employ data from 2023 on 12th-grade students who have taken standardized tests such as the California Science Test (CAST). The choice of this dataset is motivated by its relevance to the high-stakes decision-making process that students face in their transition to higher education. I construct prompts for the ChatGPT API that include variables such as a student's academic performance, race, gender, and socioeconomic status, mirroring the data from my chosen dataset.

Bias in LLMs can manifest in various forms [1, 8]. It may result from a lack of representation in training data, inadvertent learning from historical biases present in text corpora, or structural issues in the architecture of the language model. Regardless of its origin, bias in recommendations can significantly impact the opportunities and choices available to individuals, perpetuating disparities in education and careers [4].

My analysis utilizes a comprehensive set of three metrics to evaluate the fairness and equity of recommendations [6, 11, 14, 15]. The Jaccard coefficient measures the (dis)similarity of recommended majors between demographic groups [11]. The STEM Disparity Score, built on the widely recognized fairness metric Disparate Impact [5], assesses STEM major recommendation fairness (bias). Finally, the Wasserstein metric examines distributional differences in semantic similarity between recommendations [6].

The findings of this research reveal substantial disparities in the sets of recommended majors, irrespective of the bias metric applied. Notably, the most pronounced disparities are observed for students in minority categories, such as LGBTQ+, Hispanic, or socioeconomically disadvantaged students. Within these groups, ChatGPT demonstrates a lower likelihood of recommending STEM majors compared to a baseline scenario in which these attributes are unspecified. For example, when employing the STEM Disparity Score metric, an LGBTQ+ student scoring at the 50th percentile has a 50% lower chance of receiving a STEM major recommendation than a male student. Additionally, an average Asian student is three times more likely to receive a STEM major recommendation than an African-American student. Meanwhile, students facing socioeconomic disadvantage have a 30% lower chance of being recommended a STEM major compared to their more privileged counterparts.

This study underscores the necessity of recognizing and addressing bias in LLMs when making personalized decisions [16]. It is imperative to ensure that the recommendations provided by these models do not mislead students in their college major choices, regardless of their demographic backgrounds. Addressing these biases is essential to foster a more equitable educational and career environment for all students [4].

2 Literature review

The intersection of AI and education has garnered significant attention in recent years, driven by the emergence of LLMs and their potential to assist students in making critical decisions. Alwahaidi [2] highlights how students have begun using AI, particularly LLMs like ChatGPT, in their university applications. However, the integration of AI in education has raised concerns about algorithmic bias, particularly in academic contexts. Baker and Hawn [4] emphasize the critical need to address and mitigate biases in AI systems used for educational purposes. Furthermore, the role and potential setbacks of ChatGPT in the college application process are discussed in CollegeData's resource, which underscores the importance of understanding AI's role in academic decisions and its potential impact on students' futures [7]. Stein et al. [13] have delved into the development of college major recommendation systems, reflecting the practical applications of recommendation systems in guiding students through academic choices.

This study is also related to the nascent literature on machine bias/fairness. In an era where AI increasingly influences decision-making (Ning et al. 2023), fairness and machine learning have become pivotal topics of interest. Barocas et al. [5] provide a comprehensive examination of fairness issues in machine learning, discussing the challenges, limitations, and prospects of ensuring fairness in AI systems and underscoring the significance of equitable outcomes for all individuals and groups.

Bias and fairness are especially important in high-stakes environments like healthcare. Sam Altman, the co-founder and CEO of OpenAI, has advocated for regulating LLMs in such environments [16]. Obermeyer et al. [10] reveal the profound consequences of bias in algorithmic systems, identifying that the healthcare system has consistently discriminated against Black patients when determining the severity of their health status. This work cautions about the broad impact of bias and the need for rigorous examination in all AI applications, including education. Liang et al. [8] investigate social biases in language models, emphasizing the need to understand and mitigate such biases. This research is relevant to educational contexts where language models like ChatGPT may inadvertently perpetuate such biases.

Regarding bias in recommendation systems, the incorporation of AI-driven recommendation systems has spurred research on bias and debiasing methods. Chen et al. [6] present a comprehensive survey that sheds light on the biases that can be inherent in recommendation systems and reviews directions for debiasing techniques, addressing the critical issue of ensuring that recommendation systems provide fair and unbiased recommendations to users. Bias and disparity in recommendation systems have been investigated by Tsintzou et al. (2018), who emphasize the need for a comprehensive understanding of the biases permeating recommendation algorithms.

The evaluation metrics of recommendation systems used in this study build on the work of Pazzani and Billsus [11] and Wang et al. [15]. Pazzani and Billsus [11] contributed to the understanding of content-based recommendation systems and proposed various metrics for evaluating their performance. Although their work predates the AI boom, the principles of content-based recommendations remain foundational in the design of modern recommendation systems. The theoretical analysis of performance measures in the context of recommendations is addressed by Wang et al. [15], who provide insights into the evaluation of recommendation system performance.

These studies collectively shed light on the multifaceted issue of bias in AI-driven recommendation systems and the broader implications for educational decision-making, calling for the development of fair and equitable AI solutions to guide students on their academic journeys. However, none of these studies examined how LLMs may introduce biases when recommending college majors to high school students.

3 Data

Two data sources are used in the experiment. The first is data on student profiles from the public data source CAASPP (California Assessment of Student Performance and Progress), which archives the results of California's assessments across school districts. The tests include English Language Arts/Literacy and Mathematics (ELA), Alternate English Language Arts/Literacy and Mathematics (CAAs), English Language Proficiency (ELPAC), the California Science Test (CAST), the California Alternate Assessment for Science, and Spanish Reading Language Arts (CSA). See Appendix A Fig. 4 for a screenshot of the website.

I retrieved the public 2023 California Statewide Research File aggregated at the student group level. The specific student groups and the associated demographic IDs are presented in Tables 2, 3 in Appendix A. For each school, the file tracks the test results for each demographic ID for each test, totaling 1,048,024 records (see Table 4 in Appendix A for a sample of 10 records). My research focuses on 12th-grade students and three demographic categories: Gender, Race, and Economic Status. This yields 945,635 records. Among all the students, Table 3 in the Appendix shows that 49.1% are female, 50.8% are male, and 0.01% are LGBTQ+. In terms of socioeconomic status, 62.6% are disadvantaged while the rest (37.4%) are not. The breakdown for race consists of 0.47% American Indian or Alaskan Native, 9.5% Asian, 5.3% Black or African American, 2.3% Filipino, 55.2% Hispanic or Latino, 0.4% Native Hawaiian or Pacific Islander, and 21.8% White.
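As an illustration of this filtering step, the following is a minimal sketch in Python. The file name, delimiter, column names, and demographic-ID set are hypothetical assumptions about the CAASPP research-file layout, not its exact schema.

```python
# Sketch of filtering the statewide research file to 12th-grade records and the
# Gender, Race, and Economic Status student groups. All column names, the file
# name, and the demographic-ID set below are illustrative placeholders.
import pandas as pd

df = pd.read_csv("ca2023_statewide_research_file.txt", delimiter="^")  # hypothetical file/format
df = df[df["Grade"] == 12]                                             # keep 12th-grade records
gender_race_econ_ids = {3, 4, 5, 31, 128, 74, 75, 76, 77, 78, 79, 80}  # hypothetical demographic IDs
df = df[df["Demographic ID"].isin(gender_race_econ_ids)]
print(len(df))  # the study reports 945,635 records after this filtering
```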

The second data source is the college major recommendations generated by ChatGPT through the GPT-3.5-Turbo API, which will be elaborated in Sect. 4.

4 Method

I elaborate on the data collection and analysis methods in this section, including the general procedure as well as the metrics measuring the disparity and bias of ChatGPT's recommendations.

4.1 Procedure

My method mainly consists of three steps, as depicted in Fig. 1.

1. In Step 1, I extract the data on student groups and their test scores as detailed in Sect. 3.

2. In Step 2, I construct prompts based on the data and then use the prompts to ask ChatGPT to recommend the top 10 college majors. I use the developer's version of the GPT-3.5-Turbo API—an upgraded version of GPT-3.5 fine-tuned using human feedback—to provide responses adhering to ethical standards [12]. The API was released by OpenAI on March 1, 2023. The sample prompts are illustrated in Fig. 1. The prompts vary along four dimensions: score (e.g., 0–20th percentile, …, 80–100th percentile), gender (male, female, or LGBTQ+), economic status (socioeconomically disadvantaged or not), and race (White, Black or African American, Asian, etc.). Sample recommendations made by ChatGPT are also shown in Fig. 1, and a minimal sketch of this step appears after the figure. This step is repeated at various temperatures of prompt completion to ensure accuracy and reduce possible variability in GPT-3.5-Turbo's generated output.

3. In Step 3, I compute the bias metrics based on GPT's output of college major recommendations.

Fig. 1 Method process
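To make Step 2 concrete, the following is a minimal sketch of prompt construction and a GPT-3.5-Turbo API call, assuming the legacy (2023-era) openai Python interface. The prompt wording, temperature value, and response-parsing logic are illustrative assumptions, not the study's exact setup.

```python
# Sketch of Step 2: build a prompt from a student profile and query GPT-3.5-Turbo.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def build_prompt(score_bracket, gender, econ_status, race):
    # Hypothetical prompt template mirroring the four profile dimensions.
    return (
        f"A 12th-grade student scored in the {score_bracket} percentile on the "
        f"California state assessments. The student is {gender}, {race}, and "
        f"{econ_status}. Recommend the top 10 college majors for this student "
        f"as a numbered list."
    )

def get_recommendations(profile, temperature=0.7):
    response = openai.ChatCompletion.create(   # legacy (pre-1.0) openai interface
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(*profile)}],
        temperature=temperature,
    )
    text = response["choices"][0]["message"]["content"]
    majors = []
    for line in text.splitlines():             # parse "1. Major" style lines
        head, _, rest = line.partition(".")
        if head.strip().isdigit() and rest.strip():
            majors.append(rest.strip())
    return majors[:10]

profile = ("40-60th", "female", "socioeconomically disadvantaged",
           "Hispanic or Latino")
print(get_recommendations(profile, temperature=0.7))
```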

4.2 Measuring bias

The level of bias in GPT's recommendations is measured as the degree to which the recommendations differ based on the student profile. Let Gi denote a distinct student group i. For example, Gi = (female, Black, disadvantaged) represents a student group that is female, Black, and socioeconomically disadvantaged. Let the test score S be categorized into percentiles, ranging from the 1st to the 100th percentile, and let R1, R2, …, R10 index the top 10 college majors recommended to the student group. I then measure the degree to which the recommendations differ across student profiles and scores with three metrics: the Jaccard coefficient, the Wasserstein metric, and the STEM Disparity Score.

4.2.1 Jaccard coefficient

To assess recommendation disparity resulting from demographics, I use the Jaccard coefficient to quantify the dissimilarity between two sets of recommended majors representing different demographic groups. The Jaccard coefficient has been widely used in measuring recommendation similarity or dissimilarity (e.g., [3]). Denote A and B as two sets of recommended majors for two groups. The Jaccard coefficient J (A, B) measures the similarity between A and B in the form of:

$$J(A,B)=\frac{|A \cap B|}{|A \cup B|}$$
(1)

In Eq. (1), |A ∩ B| represents the size of the intersection of sets A and B, i.e., the number of overlapping college majors recommended by ChatGPT across the two recommendation sets. |A ∪ B| denotes the size of the union of sets A and B, representing the total number of distinct majors recommended across the two sets. A high Jaccard coefficient indicates a high level of similarity or overlap between the two sets, suggesting less recommendation disparity. Conversely, a low Jaccard coefficient signifies a lower level of similarity and a higher degree of disparity. By employing the Jaccard coefficient conditional on student demographic groups, I can quantitatively evaluate the disparities between different groups.
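The following is a minimal sketch of Eq. (1) in Python; the example recommendation sets are hypothetical.

```python
# Sketch of Eq. (1): Jaccard coefficient between two sets of recommended majors.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

group_a = {"Computer Science", "Mathematics", "Economics", "Biology"}
group_b = {"Computer Science", "Psychology", "Economics", "Nursing"}
print(jaccard(group_a, group_b))  # 2 overlapping majors / 6 distinct majors = 0.333...
```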

4.2.2 Wasserstein metric

The second metric I employ is the Wasserstein metric, also known as the Earth Mover's Distance (EMD), a tool widely used for measuring the discrepancy between probability distributions. Denote u(A) and v(B) as two probability distributions representing the recommended majors for two demographic groups, and W(u, v) as the Wasserstein metric, which calculates the minimum "cost" required to transform one distribution into the other. The Wasserstein metric is mathematically expressed as:

$$W_{p}\left(u,v\right)=\left(\inf_{\gamma \in \Gamma (u,v)} \mathbb{E}_{(x,y)\sim \gamma }\, d(x,y)^{p}\right)^{1/p}$$
(2)

where \(\Gamma (u,v)\) represents the set of all possible couplings of u and v (e.g., recommending mathematics to student group A while recommending social science to student group B constitutes one such pairing); \(W_{\infty}\left(u,v\right)\) is defined as \(\lim_{p\to \infty } W_{p}\left(u,v\right)\) and corresponds to a supremum norm; and \(d\left(x,y\right)\) denotes the distance function. A coupling \(\gamma\) is a joint probability measure whose marginals are u and v, i.e., \(\int \gamma \left(x,y\right)dy=u(x)\) and \(\int \gamma \left(x,y\right)dx=v(y)\).

In this research context, let u be the empirical distribution with samples x1, …, xk representing the set of college major recommendations for student group A, and let v be the empirical distribution with samples y1, …, yk representing the recommendations for student group B. Here k denotes the total number of distinct majors considered, which is set to 100 in this study. I can then estimate the Wasserstein metric empirically as:

$$W(u(A),v(B))=\min \sum_{i,j=1}^{k} w_{ij}\, d_{ij}$$
(3)

s.t. \(\sum_{i=1}^{k} w_{ij}=v(B)_{j}, \quad j=1,\dots ,k\)

$$\sum_{j=1}^{k} w_{ij}=u(A)_{i}, \quad i=1,\dots ,k$$

where wij represents the amount of "mass" transported from the ith major in u(A) to the jth major in v(B); dij denotes the cost of transporting a unit of mass from i to j, with d(ui, vj) = x, where x is the semantic similarity coefficient between the majors ui and vj provided by Google's pre-trained Word2Vec model.

Thus, in the context of assessing recommendation disparity, the Wasserstein metric measures the minimum cost required to transform the distribution of recommended majors for one demographic group into the distribution for another. A lower value implies greater similarity between the two groups, signifying reduced recommendation disparity, while a higher metric suggests a greater degree of disparity. By employing the Wasserstein metric conditional on student demographic groups, I can mathematically quantify and analyze the disparities between different demographic categories, providing valuable insights into the fairness and equity of the recommendation system (e.g., [14]).
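The following is a minimal sketch of the empirical computation in Eq. (3), cast as a transportation linear program and solved with scipy. The toy distributions and the distance matrix are illustrative assumptions; in the study, the distances derive from Word2Vec semantic similarities between majors.

```python
# Sketch of Eq. (3): empirical Wasserstein (earth mover's) distance between two
# recommendation distributions, solved as a transportation linear program.
import numpy as np
from scipy.optimize import linprog

def emd(u, v, D):
    """u, v: probability vectors of length k; D: k x k distance (cost) matrix."""
    k = len(u)
    c = D.reshape(-1)                      # cost attached to each flow w_ij
    A_eq, b_eq = [], []
    for i in range(k):                     # sum_j w_ij = u_i
        row = np.zeros((k, k)); row[i, :] = 1
        A_eq.append(row.reshape(-1)); b_eq.append(u[i])
    for j in range(k):                     # sum_i w_ij = v_j
        col = np.zeros((k, k)); col[:, j] = 1
        A_eq.append(col.reshape(-1)); b_eq.append(v[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun                         # minimum total transport cost

# Toy example with k = 3 majors (frequencies and distances are made up).
u = np.array([0.5, 0.3, 0.2])              # group A's recommendation frequencies
v = np.array([0.2, 0.3, 0.5])              # group B's recommendation frequencies
D = np.array([[0.0, 0.4, 0.9],
              [0.4, 0.0, 0.6],
              [0.9, 0.6, 0.0]])            # e.g., distances derived from Word2Vec similarity
print(emd(u, v, D))
```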

4.2.3 STEM disparity score (SDS)

It is of particular interest to investigate whether there is a disparity in STEM (Science, Technology, Engineering, and Mathematics) major recommendations. To do this, I develop a new metric built on the Disparate Impact metric, which captures the differences between the outcomes of two groups [6]. Denote the set of 10 recommendations as (R1, R2, …, R10), and the weights of these recommendations as (W1 = 10, W2 = 9, …, W10 = 1). I then examine whether each recommendation is a STEM major; if it is, the STEM indicator (denoted as Ij) takes a value of 1, and otherwise it is 0. Hence, the STEM Disparity Score (SDS) is calculated as follows:

$$SDS= \frac{1}{10}\sum_{j=1}^{10} R_{j} W_{j} I_{j}$$
(4)

I normalize the score by 10, the number of majors recommended by ChatGPT in each recommendation. A higher SDS value indicates a higher likelihood of being recommended a STEM major, and vice versa. Differences in SDS across student groups inform us of the presence (or absence) of STEM recommendation biases.
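The following is a minimal sketch of Eq. (4). In this sketch, Rj is treated as the j-th recommendation slot, so the score reduces to the rank-weighted count of STEM majors divided by 10; this interpretation and the STEM classification set are illustrative assumptions.

```python
# Sketch of Eq. (4): STEM Disparity Score for a single ranked recommendation list.
# The STEM lookup set is an illustrative assumption, not the study's exact classification.
STEM_MAJORS = {"Computer Science", "Mathematics", "Biology", "Mechanical Engineering",
               "Physics", "Chemistry", "Statistics"}

def stem_disparity_score(recommendations):
    """recommendations: ranked list of 10 majors (index 0 = top recommendation)."""
    weights = range(10, 0, -1)              # W_1 = 10, ..., W_10 = 1
    score = sum(w * (major in STEM_MAJORS)  # I_j = 1 if the j-th major is STEM
                for w, major in zip(weights, recommendations))
    return score / 10                       # normalize by the 10 recommended majors

recs = ["Computer Science", "Economics", "Biology", "Psychology", "Mathematics",
        "English", "Sociology", "Nursing", "History", "Art"]
print(stem_disparity_score(recs))           # (10 + 8 + 6) / 10 = 2.4
```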

5 Results

5.1 Disparity measured by Jaccard coefficient

I first analyze disparity with the Jaccard coefficient (JC), where a high JC value indicates greater similarity in the set of recommended majors and vice versa. In my experiment, the JC metric is computed by comparing the recommendation set of a particular student group (e.g., gender is female) with that of a baseline group in which the target demographic feature is unspecified (e.g., gender is unspecified), while holding all other factors (e.g., score, race, and socioeconomic status) constant. I group the test scores in increments of twenty percentile points and vary the 12 demographic features one at a time within each bracket while holding the other features constant. This results in a total of 60 (= 12 × 5) combinations, as shown in Table 1. For each combination, I randomly sample 100 student groups within the score bracket and compute the average JC value as well as its 95% confidence interval. The mean and confidence interval (in square brackets) are reported in Table 1.

Table 1 Results of Jaccard coefficient metric
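The following is a minimal sketch of this sampling procedure for the gender dimension, reusing the get_recommendations and jaccard helpers sketched earlier. The profile sampling, the normal-approximation confidence interval, and the handling of the "unspecified" baseline are assumptions for illustration; in practice the baseline prompt simply omits the gender attribute.

```python
# Sketch of the Sect. 5.1 experiment: mean Jaccard coefficient and a 95%
# normal-approximation confidence interval over sampled student groups.
import random
import statistics

def jc_disparity(gender_value, baseline_value, score_bracket, n_samples=100):
    values = []
    for _ in range(n_samples):
        # Hold score fixed; sample the other profile attributes and keep them
        # identical for the target and baseline groups.
        race = random.choice(["White", "Asian", "Hispanic or Latino",
                              "Black or African American"])
        econ = random.choice(["socioeconomically disadvantaged",
                              "not socioeconomically disadvantaged"])
        target = get_recommendations((score_bracket, gender_value, econ, race))
        base = get_recommendations((score_bracket, baseline_value, econ, race))
        values.append(jaccard(target, base))
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean, (mean - half_width, mean + half_width)

print(jc_disparity("female", "of unspecified gender", "80-100th"))
```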

Table 1 indicates a notable disparity among the recommendations. For example, a female student who scores within the top 20% has an average JC value of 0.301, while her male counterpart's JC value is 0.336. The two confidence intervals, [0.287, 0.315] vs. [0.322, 0.349], do not overlap, indicating that the difference is statistically significant. The disparity is most striking for LGBTQ+ students scoring in the 40–60% bracket: their JC value of 0.264 is significantly lower than that of their male (0.314) or female (0.311) counterparts, as indicated by non-overlapping confidence intervals. These results show that ChatGPT's recommendations are not gender-agnostic: female, male, or LGBTQ+ students who achieve the same score may not receive the same recommendations.

Similarly, ChatGPT is not unbiased regarding socioeconomic status or race. For example, for an average student within the 40–60% bracket, the JC value for the socioeconomically disadvantaged group (0.317) is significantly different from that of the non-disadvantaged group (0.283). Concerning race, the Native Hawaiian and Pacific Islander group receives the lowest JC value across all score brackets, indicating that the recommendations made to this group are the most distinct from the baseline. I examine how the recommended majors differ (e.g., STEM vs. non-STEM majors) with the next two metrics.

5.2 Disparity measured by Wasserstein metric (WM)

The results are consistent when disparity is measured by the Wasserstein metric (WM), where a high WM value indicates high disparity and vice versa. For ease of exposition and space considerations, I visualize the results in charts rather than tables. Here I vary the test scores in increments of ten percentile points, while the rest of the experimental setup remains the same as described in Sect. 5.1. I plot the mean WM values for gender, socioeconomic status, and race in Fig. 2a–c, respectively.

Fig. 2 Results of WM by (a) gender (breakdown over every ten percentile points), (b) socioeconomic status, (c) race

Figure 2a indicates that the LGBTQ+ group receives markedly different recommendations than the other two groups. For example, for an average student with a 50th percentile score, the WM values for LGBTQ+ and male students are 7.4 versus 6.9 (statistically different when examining the confidence intervals or the paired t-statistic). This means that ChatGPT treats the LGBTQ+ group differently in college major recommendations. The difference between male and female students is not significant under this metric.

Regarding socioeconomic status (Fig. 2b), the disadvantaged group experiences a higher level of disparity across the whole spectrum of score percentiles. Figure 2c shows that there is significant disparity across races; noticeably, students of Native Hawaiian/Pacific Islander or Hispanic descent experience the highest levels of disparity.

One advantage of WM over JC is that the former is calculated based on probability distributions and so I can answer probabilistic questions such as: “To be recommended a STEM major within the top 3 recommendations, does a student with a different profile (e.g., LGBTQ + vs. male) need to obtain a significantly different score in the standardized test?”.

Using Bayes' rule, I can answer the above question by deriving the conditional probability of interest, P(Score | STEM, G), as follows, where G represents gender (a sketch of the empirical estimation follows the term definitions below):

$$P\left(Score \mid STEM, G\right) = \frac{P\left(STEM \mid Score, G\right)\, P\left(Score \mid G\right)}{P\left(STEM \mid G\right)}$$
(5)

where:

  • P(Score | STEM, G) is the probability of a student group having a specific test score given that a STEM major is recommended (e.g., at least one STEM major appeared among the top three recommendations) and a specific gender;

  • P(STEM | Score, G) is the probability of a STEM major being recommended given a student group’s test score and gender;

  • P(Score | G) is the probability for a student group to have a specific score given gender;

  • P(STEM | G) is the probability of recommending a STEM major to a student group with a specific gender.
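The following is a minimal sketch of how Eq. (5) can be estimated from the simulated recommendation records. The data-frame column names (score_pct, gender, stem_top3) are illustrative assumptions.

```python
# Sketch of Eq. (5): estimating P(Score | STEM, G) empirically from a table of
# simulated recommendations, one row per sampled student group.
import pandas as pd

def p_score_given_stem(df, gender):
    g = df[df["gender"] == gender]
    p_stem_given_score = g.groupby("score_pct")["stem_top3"].mean()  # P(STEM | Score, G)
    p_score = g["score_pct"].value_counts(normalize=True)            # P(Score | G)
    p_stem = g["stem_top3"].mean()                                    # P(STEM | G)
    return (p_stem_given_score * p_score / p_stem).sort_index()       # Bayes' rule

# Hypothetical usage: compare the score distributions, conditional on a STEM
# major appearing in the top three recommendations, for two gender groups.
# df = pd.DataFrame({"score_pct": ..., "gender": ..., "stem_top3": ...})
# print(p_score_given_stem(df, "male"))
# print(p_score_given_stem(df, "LGBTQ+"))
```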

Using Eq. (5), I find that there is a significant difference in test scores between LGBTQ+ and male students when STEM majors are recommended. For example, compared with an average male student whose score is at the 50th percentile and WM = 0.69, an LGBTQ+ student needs to score 15% higher (at the 35th percentile) to be recommended the same number of STEM majors.

5.3 Results on STEM disparity score (SDS)

The SDS metric captures the disparity in STEM major recommendations across student groups, providing a direct analysis of bias in STEM major recommendations. It is computed as a weighted disparity index, as shown in Eq. (4). A higher SDS value indicates a greater likelihood of being recommended a STEM major, and a significant difference in SDS across student groups would indicate the presence of recommendation bias. I plot three charts (Fig. 3a–c) for gender, socioeconomic status, and race, respectively. In each chart, the x-axis indexes the test score in increments of ten percentile points, while the y-axis represents the SDS value.

Fig. 3 SDS results by (a) gender, (b) socioeconomic status, (c) race

Several interesting patterns emerge from Fig. 3a. The differences across the three groups are insignificant at both ends (i.e., the top 20% or bottom 10%). Interestingly, the graph displays a significant bias in recommendations in the middle. For example, students at the 50th percentile have significantly different SDS values of 0.98, 0.77, and 0.51 for the male, female, and LGBTQ+ groups, respectively. Roughly, these male students are about 1.27 times (= 0.98/0.77) as likely to be recommended STEM majors as female students. LGBTQ+ students are far less likely to be recommended a STEM major: a student at the 50th percentile has only about half the chance of being recommended a STEM major compared to their male counterparts. This presents strong evidence of gender-related recommendation bias.

The disparity pattern is even more noticeable for socioeconomic status: at every score percentile, disadvantaged students are much less likely to be recommended STEM majors than their counterparts (Fig. 3b).

Regarding race, the Asian group has the highest likelihood of being recommended a STEM major: for an average student at the 50-60th percentile, this probability is three times higher than the corresponding probability for an African American student (Fig. 3c).

6 Discussion

As individuals rely more on LLMs to help them make decisions, there is a pressing need to understand whether systemic biases exist in AI decision-making processes. LLMs are trained on vast datasets derived from human interactions and feedback, which can introduce biases. Thus, it is critical to recognize that LLMs are also products of human influence. Recommendations made by these AI tools may not be impartial. Bias may result from a lack of representation in training data, inadvertent learning from historical biases or commonplace societal stereotypes, as well as structural issues in the architecture of the language model. The increasing prevalence of algorithmic decision-making systems necessitates a focus on ensuring the fairness of these systems. Decision-making systems, powered by LLMs, are employed in critical domains like criminal justice, recruitment, and social services.

My analysis of ChatGPT's recommendation disparity, as measured by the Jaccard coefficient, STEM Disparity Score, and Wasserstein metric, reveals significant biases in its college major recommendations. Specifically, the recommendations generated by ChatGPT exhibit a concerning bias, particularly against minority populations such as the economically disadvantaged. This raises concerns about potential unfairness in the system. For instance, when comparing the recommendation sets for students of different genders while keeping other factors constant, it becomes evident that gender plays a crucial role in the recommendations: even if two students achieve the same score, one male and one female, they will likely receive different major recommendations. The extent of this disparity becomes more pronounced for LGBTQ+ students with average scores. These findings indicate a systemic bias within ChatGPT's recommendations.

The implications of these biases are multifaceted. For students, such disparities may limit access to educational and career opportunities, potentially reinforcing existing inequalities. Algorithms like ChatGPT need refinement to reduce such biases and ensure fairer recommendations. Society at large faces the consequences of these biases, as they contribute to inequities and hinder social progress. Regulators and developers must address these issues, aiming to establish guidelines and safeguards that mitigate biases in AI systems and foster equitable opportunities for all students. Not coincidentally, while this paper was being completed, President Biden issued an executive order on Safe, Secure, and Trustworthy Artificial Intelligence on October 30, 2023, to ensure that America leads the way in seizing the promise and managing the risks of AI.

To conclude, the importance of mitigating bias in large language models (LLMs) like ChatGPT cannot be overstated, particularly as these models become integral to decision-making processes that profoundly affect individuals' lives. The presence of biases in AI systems, whether stemming from skewed training data, historical prejudices, or systemic societal stereotypes, risks perpetuating and exacerbating existing inequalities. This is especially pressing when these systems are deployed in sensitive areas such as education, employment, and legal decisions. Moreover, biased AI decision-making can undermine the credibility and trustworthiness of emerging technologies, posing ethical challenges and potential legal ramifications. By prioritizing the development of equitable AI, we safeguard not only the individual rights and opportunities of those directly affected but also uphold the collective values of fairness and justice within society. Regulatory oversight, diverse and representative datasets, transparent methodologies, and ongoing scrutiny of AI-driven recommendations are crucial steps toward reducing AI biases. By committing to these principles, we can work towards a future where AI supports equitable access to opportunities and facilitates a more inclusive society.