Social scientists spend much of their time constructing hypotheses about statistical associations between measurements. For instance, do political ideologies correlate with health behaviors, or does poor sleep predict suicidal intent? When producing and revising such hypotheses, researchers often cast a wide net for inspiration, drawing on written scholarship, anecdotes, and intricate theories. Here, we examine the ability of GPT-3, OpenAI’s large language model (LLM) and one of the most well-read entities in the world (Brown et al., 2020), to hypothesize about the directionality of correlations between social science measures.

LLMs have allowed for the automation of increasingly complex tasks, including some chores of social scientists (Argyle et al., 2022; Aher et al., 2022; Hansen & Hebart, 2022; see also Yeung et al., 2022). Researchers execute a large range of complex tasks that seem far from being automated, including conducting original research, teaching, university administration, and public outreach. Here, we focus on original research. A substantial part of conducting research consists of reading scholarship within (and often beyond) the researcher’s field of expertise. Mechanically, this step is not challenging for LLMs, which are routinely fed scientific texts too plentiful for a single person to read. However, developing a deep understanding, or encoding, of the scholarship—allowing for abstraction, hypothesis generation, and real-world application—is substantially more challenging and possibly beyond the capabilities of current LLMs. Groups of human experts, and even laypeople, can provide fairly accurate predictions about which study outcomes will materialize (often replicate), especially when their predictions are aggregated within competitions or prediction markets (e.g., Gordon et al., 2021; Hoogeveen et al., 2020). Thus, comparing GPT-3’s performance to a human accuracy baseline can reveal how difficult the current task is and whether GPT-3 can be helpful for hypothesis generation and testing.

If LLMs can provide accurate hypotheses about the empirical correlations in social science datasets, they could inspire and support social scientific research in the future. Currently, it is challenging for researchers to gain awareness of the wide range of interrelated social science constructs in their field, and exponentially more time-consuming to formulate hypotheses about the masses of inter-construct relationships, for instance when constructing large network graphs (e.g., McGlashan et al., 2016). An example is the recent phenomenon of the pandemic eliciting profound shifts in networks of political attitudes (Albarracín & Shavitt, 2018). Each potential relationship between two attitudes requires an integration of existing scholarship and current affairs. Fittingly, awareness of interrelated, real-world objects appears to be one of the strong suits of language models, whose parameters can encode encyclopedic knowledge from large text corpora. However, whether these models can accurately anticipate how measurements will covary in empirical studies is yet to be examined.

The current work focuses on the social scientific context of ideological attitudes as they steer many human behaviors and cognitions and are relevant across many social sciences (Albarracín et al., 2014; Bosnjak et al., 2020). Specifically, the challenges given to the LLM in the current work revolve around predicting correlations between two attitudes (e.g., towards feminism vs. towards Donald Trump). In psychological research, attitudes constitute a personal evaluation of a target object (Albarracín et al., 2014; Briñol et al., 2019). They are important for virtually all life contexts as they underlie what behaviors we engage in, how we relate to others, and what we deem moral or immoral (e.g., Maio et al., 2018).

Entire ideologies and belief systems are often conceptualized as networks of interrelated attitudes (Brandt & Sleegers, 2021; Dalege et al., 2016). For instance, there are typical patterns of correlations between people’s attitudes towards Donald Trump, Barack Obama, gun control, and immigration. Knowing a person’s attitudes towards these attitude objects defines part of a person’s overall political ideology and enables predictions of adjacent attitudes in the belief network (e.g., towards Hillary Clinton) and behaviors (e.g., voting in the next election; Dalege et al., 2017). Many social scientists examine relationships between attitudes or entire attitude networks. Examples include work on cognitive dissonance (Starzyk et al., 2009), planned behavior (Ajzen, 1991), or inter-group relations (Brewer et al., 1985)—all of which have attitudes at their core.

1 Can LLMs Forecast Empirical Correlations Between Attitudes?

The GPT-3 model family, and LLMs in general, are neural networks with sophisticated architectures (Vaswani et al., 2017) that were trained on large datasets and can be used for a variety of text processing tasks (Akhbardeh et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Dzendzik et al., 2021; Jordan et al., 2015; Lin et al., 2021; Stahlberg, 2020). Their inner parameters (i.e., regression weights) are often optimized to anticipate the next (or left out) words in a given text, essentially auto-completing any prompt they are given. Many current approaches transform input words into numerical vectors, recombine and transform these vectors (as done with predictor variables in logistic regression), and adjust the numerical weights to maximize word prediction accuracy for texts within the training data (for details, see Vig & Belinkov, 2019; Wolf et al., 2020).

While the idea to automate parts of the scientific process is not new (e.g., King et al., 2009), using LLMs for this purpose was only recently suggested (Krenn et al., 2022). Despite a current wave of enthusiasm around AI applications (cf., Gozalo-Brizuela & Garrido-Merchan, 2023), mechanical predictions of social science phenomena have had mixed levels of success in the past, often leading to the conclusion that human behavior outside the lab is too complex to be accurately predicted. For instance, Salganik and colleagues (2020) observed that many central life outcomes remain difficult to predict even when large teams of scientists have access to ostensibly high-value predictors. Similar observations were made when machine learning methods were used to predict romantic attraction between people who had yet to meet (Joel et al., 2017). Additionally, poorly performing models can inflict harm in applied settings, especially when self-reinforcing (e.g., in some cases of predictive policing; Berk, 2021).

Some encouraging findings have been made when predicting social science constructs from textual data. For instance, personality traits are sometimes predictable from a person’s social media posts (e.g., Christian et al., 2021; Schwartz et al., 2013; but see Sumner et al., 2012). When it comes to political attitudes—the context of the current work—previous work found that political affiliations can be predicted quite well from certain social media texts, even without relying on LLMs (Ullah et al., 2021). Given the plentiful political texts on which LLMs are trained, they can be a promising tool for predicting political phenomena like ideological attitudes (Argyle et al., 2022; Jiang et al., 2022). However, such highly relevant training data come with the caveat of ideological biases potentially distorting outputs (e.g., Liang et al., 2021). Another caveat of such large data dumps is the difficulty of determining whether a specific task is truly “new” for the model or can simply be reproduced from the training data (Rae et al., 2021).

Additionally, LLMs are argued to relate poorly to human interaction partners (Montemayor, 2021), to be restricted to a superficial learning strategy (identifying relationships between words; Lake & Murphy, 2021), and to therefore have no “understanding” of their own outputs (Tamkin et al., 2021; van der Maas et al., 2021). In short, most researchers still evaluate their capabilities as being far from human-level intelligence (Floridi et al., 2020; but see Wei et al., 2022), or argue that machines, albeit powerful, remain fundamentally different from humans. Accordingly, it is sometimes suggested that trying to re-engineer human-like intelligence might be a poor strategy for maximizing machine performance (Korteling et al., 2021).

Thus, when reviewing the literature, there are notable arguments for and against the idea that LLMs could accurately predict the outcome of social science studies about political attitudes. Furthermore, this task diverges quite far from those that LLMs are typically scored on, such as text translation, summarization, or simple information extraction (e.g., Akhbardeh et al., 2021; Barrault et al., 2019; Lai et al., 2017). Somewhat more relevant is the common task of extracting implied information from a text (e.g., an event’s duration and starting time are mentioned and the model is prompted to name the ending time; Dua et al., 2019; Chowdhery et al., 2022). Sometimes, such logical deduction tasks include human behaviors and cognitions. The following example task is taken from Chowdhery et al. (2022, p. 14):

“Input: Students told the substitute teacher they were learning trigonometry. The substitute told them that instead of teaching them useless facts about triangles, he would instead teach them how to work with probabilities. What is he implying? (a) He believes that mathematics does not need to be useful to be interesting. (b) He thinks understanding probabilities is more useful than trigonometry. (c) He believes that probability theory is a useless subject.

Answer: (b) He thinks understanding probabilities is more useful than trigonometry”.

For such multiple-choice tasks, current LLMs approach the best performance of human participants. Importantly, human performance is usually estimated with small batches of crowd-sourced efforts instead of work from dedicated experts (e.g., Chowdhery et al., 2022). Furthermore, it must be noted that these tasks usually remain substantially easier than developing novel psychological theories from complex academic texts.

As daunting as the scientific literature is, the reading and writing level of LLMs has increased very rapidly in recent years, and the barriers to automatic article processing and thesis generation might fall rather soon. In fact, a GPT-3-based startup is showing promising results when prompted to answer users’ research questions by providing summaries of relevant research papers (Elicit, 2022). As the application was designed to support literature reviews, it lacks the interface for generating new hypotheses or answering research questions that are yet to be addressed, but it showcases the abilities of current LLMs (specifically GPT-3) to process academic texts.

In the current work, we test the ability of GPT-3 to freely hypothesize about research questions. If successful, the model could participate in a crucial step in the work of an empirical scientist: predicting the outcome of a study. Below, we give GPT-3 exactly this task, pre-register its predictions, run the study, test the accuracy of the predictions, and evaluate whether it can serve as a useful hypothesis-generation machine.

2 Study 1

The goal of Study 1 is to quantify the accuracy with which GPT-3 can predict correlations between attitudes. We chose the topic of attitudes as they are immensely important across the social sciences. Similarly, we chose the challenge of predicting correlations as it is very common in scientific work.

The chosen attitude objects all pertain to US American politics and ideologies as GPT-3’s training data includes various texts about these objects, and scientific work on attitude networks is often conducted in this specific context. We estimate the difficulty of the task as fairly low compared to the many text-based tasks of social scientists (but not compared to typical LLM tasks). In order to assess this assumption, we contrast the performance of GPT-3 with the performance of 30 empirical scientists predicting the correlations.

We only had weak a priori hypotheses about GPT-3’s performance as the current task is not common in research on LLMs. Specifically, we simply expected the performance to lie between a chance baseline and 100% accuracy (uniform prior). The pre-registrations of the prediction and validation procedure can be found here: bit.ly/3ym9cWJ.

3 Method

We utilized 36 people, organizations, behaviors, and abstract concepts that are central to political ideologies and frequently discussed in a US American context. Of these attitude objects, 18 are associated with left-wing ideologies and 18 with right-wing ideologies. This split was chosen to ensure the presence of at least some strong attitude correlations (e.g., attitudes towards Donald Trump and Barack Obama are likely negatively associated). The full list is provided in the supplementary materials. The 36 objects were arranged into all 630 possible binary pairs, and GPT-3 was prompted to estimate whether the correlation between attitudes towards the paired objects was going to be positive or negative.
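As a concrete illustration, the pairing step can be reproduced in a few lines of Python; the object names below are placeholders for the actual list in the supplementary materials.

```python
from itertools import combinations

# Placeholder labels; the actual 36 attitude objects (18 left-leaning,
# 18 right-leaning) are listed in the supplementary materials.
attitude_objects = [f"object_{i}" for i in range(1, 37)]

# All unordered pairs of the 36 objects: 36 * 35 / 2 = 630 pairs.
pairs = list(combinations(attitude_objects, 2))
assert len(pairs) == 630
```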

3.1 GPT-3 and Prompt Specifications

Given that GPT-3’s performance can depend on the prompt in unanticipated ways, we always provided seven zero-shot prompts (see supplementary materials) with very different phrasings for binary correlation predictions. The seven individual predictions were averaged to arrive at the final prediction; performances of the individual prompts are reported as well. An example prompt is:

“If there was a correlation between being a fan of <object1> and being a fan of <object2>, would this correlation be positive or negative?”

We used the Python interface from OpenAI to access GPT-3. The selected engine was “davinci-2”, the most advanced engine at the time of writing this text. All other settings are listed in the pre-registration. GPT-3’s responses did not give a clear prediction about the direction of the correlation coefficient in 0.05% of cases. These cases all came from the same prompt and were coded as missing values.
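To make the procedure more tangible, a minimal sketch of the prediction loop is shown below. It uses the legacy (pre-1.0) openai Python package; the engine string, decoding settings, second prompt template, and tie-breaking rule are our own simplifying assumptions rather than the pre-registered specifications.

```python
import openai  # legacy (<1.0) package exposing the Completion endpoint

openai.api_key = "YOUR_API_KEY"

# Two illustrative paraphrases of the seven pre-registered zero-shot templates
# (the full set is in the supplementary materials).
PROMPT_TEMPLATES = [
    ("If there was a correlation between being a fan of {a} and being a fan "
     "of {b}, would this correlation be positive or negative?"),
    ("Do people who hold a positive attitude towards {a} tend to hold a "
     "positive or a negative attitude towards {b}?"),
]

def predict_sign(obj_a, obj_b, template):
    """Query GPT-3 once and map the completion onto +1, -1, or None."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed to correspond to "davinci-2"
        prompt=template.format(a=obj_a, b=obj_b),
        max_tokens=5,
        temperature=0,              # decoding settings: our assumption
    )
    text = response["choices"][0]["text"].lower()
    if "positive" in text:
        return 1
    if "negative" in text:
        return -1
    return None  # unclear answers are treated as missing values

def aggregate_prediction(obj_a, obj_b):
    """Average the per-template signs into one final prediction."""
    votes = [predict_sign(obj_a, obj_b, t) for t in PROMPT_TEMPLATES]
    votes = [v for v in votes if v is not None]
    # Sign of the mean vote; ties broken towards "positive" in this sketch.
    return 1 if sum(votes) >= 0 else -1
```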

3.2 Accuracy Scoring

After preregistering GPT-3’s predictions, we collected actual attitude measurements for all 36 objects from 600 US American adults (300 female, 300 male; mean age = 42.25 years, SD = 13.33) through prolific.com. All attitudes were measured with the item “What is your attitude towards <object>?”, and responses were given on a seven-point scale (0 = very negative; 7 = very positive). All 630 inter-attitude Pearson correlation coefficients were computed and treated as the ground-truth labels for GPT-3’s previous predictions. Due to measurement error, these binary directionality labels are unlikely to be perfect for very small correlation coefficients, and the unbiased accuracy ceiling could therefore be below 100%.
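The scoring step can be summarized with a short pandas sketch; the file name and column layout are hypothetical.

```python
import pandas as pd
from itertools import combinations

# One row per participant (n = 600), one column per attitude object
# (hypothetical file and column names).
responses = pd.read_csv("attitude_ratings.csv")

corr = responses.corr(method="pearson")  # 36 x 36 correlation matrix

# Ground-truth sign for each of the 630 pairs.
ground_truth = {
    (a, b): 1 if corr.loc[a, b] > 0 else -1
    for a, b in combinations(responses.columns, 2)
}

# Accuracy = share of pairs where GPT-3's pre-registered sign matches, e.g.:
# accuracy = sum(predicted[pair] == sign for pair, sign in ground_truth.items()) / len(ground_truth)
```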

3.3 Performance of Human Experts

In order to contextualize the performance of GPT-3 in predicting correlations, we asked 30 experts to complete the same task. All experts either held or were currently working on a PhD in political science research or research methodology. All experts were currently conducting empirical research themselves as part of their employment at a research institution in the USA, Germany, or the Netherlands. Each expert was presented with 93 random attitude pairings (e.g., attitude towards Hillary Clinton and attitude towards smoking weed) and had to indicate whether the correlation between these attitudes in an adult US sample would be positive or negative. The number of 93 pairings per person was primarily dictated by the printed version of the questionnaire: 93 pairs fit on an A4 sheet (front and back), and more might have been perceived as overtaxing. Participants were allowed to use all resources available to them to come to their predictions, but based on anecdotal interviews, most people relied on their intuition, occasionally supported by a short Google query. Notice that the collection of expert predictions was suggested during the revision of the current work and is therefore not included in the original preregistration.

4 Results

The accuracy of GPT-3’s prediction of correlation signs was 78.25% across all predictions (95% CI [76, 82]; see Fig. 1). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 77.62% (95% CI [74, 81]; for details and plots see supplementary materials).

Fig. 1 Results of Study 1. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a uniform distribution.
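The credible intervals reported here and in the following studies can be reproduced with a standard beta-binomial posterior; a minimal sketch with a uniform Beta(1, 1) prior (as in Fig. 1) and illustrative counts is shown below.

```python
from scipy.stats import beta

def accuracy_credible_interval(n_correct, n_total, a=1.0, b=1.0, level=0.95):
    """Credible interval for accuracy under a Beta(a, b) prior."""
    posterior = beta(a + n_correct, b + n_total - n_correct)
    return (posterior.ppf((1 - level) / 2),
            posterior.ppf(1 - (1 - level) / 2))

# Illustrative call: roughly 78% of 630 signs correct under a uniform prior.
print(accuracy_credible_interval(493, 630))  # approximately (0.75, 0.81)
```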

The baseline accuracy of always predicting “negative correlation” would have led to an accuracy of 50%. The baseline of always predicting positive associations between same-ideology (e.g., left-wing) concepts and negative associations between different-ideology concepts would have led to an accuracy of 98.57% (621 of the 630 signs), with all nine mistakes involving people’s attitudes towards John McCain and liberal concepts/people. The confusion matrix below (see Table 1) shows that GPT-3 erred mostly on the side of predicting negative correlations.

Table 1 Confusion matrix for Study 1

Figure 1 summarizes the central findings of Study 1. Empirical correlations showed a bi-modal distribution (likely given ideological polarization in the USA). The magnitude of the true correlation coefficients was associated with an increased probability of correct predictions by GPT-3. Some prompts performed better than others while the aggregated predictions were about as accurate as the best-performing prompt. The largest residuals occurred for the associations between Barack Obama and Hillary Clinton (true: r = 0.76; predicted: “negative”), and between abortion and gay marriage (true: r = 0.69; predicted: “negative”).

Human experts showed an average prediction accuracy of 92% (95% CI [89, 94]). As indicated in Fig. 2, no expert made flawless predictions, while some appeared inattentive, dragging the average human performance down slightly.

Fig. 2 Thin lines indicate the posterior distribution of each expert’s accuracy, while the thick line integrates the results of all experts. The prior distribution was uniform between 0.8 and 1, leading to the sharp cut-off on the left.

Our suspicion of inattentiveness is supported by examining the most error-prone attitude objects. While “John McCain” and “Joe Biden” understandably elicited relatively many mispredictions, given that their careers have led to cross-party reputations, the most common error source was “gun control,” which has a fairly established position on the political spectrum. This suggests that some participants might simply have inverted the meaning of the term when providing their correlation predictions.

5 Discussion

On the task of predicting correlations between attitudes, GPT-3 performed substantially above chance, but far below the highly accurate baseline “predict correlations based on same vs. other ideological group.” The task was relatively easy in reference to common prediction scenarios in the social sciences as it merely involved predicting bivariate relationships between polarized, prominent objects, although none of the human experts performed flawlessly either. Conversely, the task was relatively complex and uncommon in reference to typical reasoning tasks that LLMs are tested on. Nonetheless, relating ideological attitude objects to each other should have played into the strengths of GPT-3 as it can largely rely on an encyclopedic knowledge of political concepts and partisanship, which is a luxury that it would not have for many social science questions.

It is important to note that the results above were collected with an out-of-the-box GPT-3 pipeline. That is, all predictions were generated in a zero-shot format (i.e., no labeled, task-clarifying examples were shown to the model). Furthermore, the model was not instructed to solve the reasoning task in a step-by-step fashion (i.e., chained reasoning), which can increase the performance of LLMs (Wei et al., 2022).

In sum, the out-of-the-box performance of GPT-3 was not high enough to be useful to human experts, who outperformed the model by a substantial margin. In Study 2, we complement this first finding by giving the model a much better shot at achieving a high performance by customizing it to the task at hand.

6 Study 2: Top-Down Pipeline Customization and Few-Shot Learning

In Study 2, we optimize the prediction pipeline in two ways. First, we change the previous zero-shot setup to a few-shot setup, and second, we utilize a chained prompting pipeline. We will briefly discuss both additions below.

The essence of few-shot learning is that the model is presented with a few example instructions and correct responses within the prompt and can thereby recognize the nature of the presented NLP task, whereas zero-shot learning does not include exemplary responses. Through the added context, few-shot prompting often increases model performance across various NLP tasks (e.g., Wei et al., 2021). In our current study, few-shot learning consists of providing examples of correctly predicting correlations to the model before requesting the prediction of the targeted correlation.

The concept of chained prompting consists of breaking down a complex reasoning task into smaller steps that the model has to solve consecutively (cf., chain-of-thought prompting, Wei et al., 2022). If set up favorably, the procedure of stepwise reasoning can allow the solver to avoid learning unknown, complex procedures in favor of multiple simpler procedures (Wu et al., 2022). An intuitive example for humans would be to solve the task “3 × 3.5” by consecutively computing “3 + 3 + 3” and “0.5 + 0.5 + 0.5” and adding the results. For LLMs, an example would be to transform a complex reasoning task into consecutive look-up tasks. For the current context, this is achieved by breaking down the task of predicting correlations into multiple steps where GPT-3 first labels attitude objects as either left-wing or right-wing objects and then integrates this additional insight into a consecutive prompt to predict attitude correlations.

For Study 2, we expect the simple classification of left vs. right-wing objects to be completed with a high accuracy (> 90%) as it constitutes a simple look-up task which was a strong suit of GPT-3 in the past (Brown et al., 2020). Furthermore, we expect the accuracy for subsequently predicting correlations to improve from Study 1 (> 80%; pre-registration bit.ly/3ym9cWJ). Notice that customizing the prediction pipeline in this way can potentially improve model performance but entails that the model would no longer be useful across different social scientific tasks. Furthermore, an improved performance would not constitute evidence of a more intelligent model, as part of the thinking is done beforehand by the researcher (“use the concept of ideological polarization to predict attitudes”). Whereas Study 1 explored GPT-3’s performance in a virtually unassisted manner, Study 2 provides strong assistance to the model through a heavily customized prediction pipeline.

7 Method

First, we prompted GPT-3 to categorize all 36 objects into left-wing and right-wing objects (pre-registration and method details: bit.ly/3ym9cWJ). Subsequently, we explicitly integrated the generated labels into the prompts for predicting correlations. Furthermore, we continued to use an integration of seven prompts, but now prefacing each prompt with five examples of correct correlation predictions. An example of a single prompt (gun control vs. Hillary Clinton) is:

“Given that Donald Trump is endorsed by the politically [prediction:] right and Barack Obama is endorsed by the politically [prediction:] left, what do you think is the correlation between people’s attitude towards Donald Trump and Barack Obama? Please answer positive or negative.

negative

Given that the Bible is endorsed by the politically [prediction:] right and the NRA is endorsed by the politically [prediction:] right, what do you think is the correlation between people’s attitude towards the Bible and the NRA? Please answer positive or negative.

positive

[…]

Given that gun control is endorsed by the politically [prediction:] left and Hillary Clinton is endorsed by the politically [prediction:] left, what do you think is the correlation between people’s attitude towards gun control and Hillary Clinton? Please answer positive or negative.”

The attitude objects referred to in the five labeled examples were randomly sampled from all objects except for the two objects in the current prediction target. The engine specifications as well as the performance analyses remained the same as in Study 1 (pre-registration: bit.ly/3ym9cWJ).
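A sketch of how such a chained few-shot prompt might be assembled is shown below. The template wording paraphrases the example above, and the left/right labels are assumed to come from the first prompting step of the chain.

```python
import random

def build_chained_prompt(target_a, target_b, side_lookup, solved_examples, k=5):
    """Assemble a few-shot prompt whose examples embed GPT-3's own
    left/right labels from the preceding step of the chain.

    side_lookup: {object: "left" or "right"}, produced by the labeling prompt.
    solved_examples: [((obj_a, obj_b), "positive" or "negative"), ...] derived
        from the partisan-split heuristic.
    """
    template = (
        "Given that {a} is endorsed by the politically {sa} and {b} is "
        "endorsed by the politically {sb}, what do you think is the "
        "correlation between people's attitude towards {a} and {b}? "
        "Please answer positive or negative."
    )
    # Exclude examples that mention either target object, then sample k shots.
    pool = [ex for ex in solved_examples
            if target_a not in ex[0] and target_b not in ex[0]]
    shots = random.sample(pool, k)

    lines = []
    for (a, b), answer in shots:
        lines.append(template.format(a=a, b=b, sa=side_lookup[a], sb=side_lookup[b]))
        lines.append(answer)
    lines.append(template.format(a=target_a, b=target_b,
                                 sa=side_lookup[target_a], sb=side_lookup[target_b]))
    return "\n\n".join(lines)
```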

8 Results

The simple categorization into left-wing and right-wing concepts was accurate for 35 out of the 36 concepts. The concept of “targeted policing” was predicted to be a left-wing policy. This might have been due to our phrasing of the concept being less politicized than, for instance, “stop-and-frisk,” but we still scored it as a mistake: most current forms of policing can be expected to be less popular among the politically left (e.g., Silver & Pickett, 2015), and all real attitude correlations between targeted policing and the other concepts were consistent with a right-wing categorization. Note that the mistake in this intermediate step was consciously carried forward in the pipeline before predicting attitude correlations.

The accuracy of GPT-3’s prediction of correlation signs was 93.97% across all predictions (95% CI [92, 95]). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 93.02% (95% CI [91, 95]; for details and plots see supplementary materials).

Table 2 shows that GPT-3 still slightly erred on the side of predicting negative correlations.

Table 2 Confusion matrix for Study 2

Figure 3 summarizes the central statistics of Study 2. The magnitude of the true correlation coefficients was again associated with an increased probability of correct predictions by GPT-3. Prompts performed similarly well and were marginally outperformed by the aggregated predictions.

Fig. 3 Results of Study 2. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a beta distribution with a = 4 and b = 2.

In total, GPT-3 made 38 incorrect predictions, depicted as edges in Fig. 4. The true absolute correlation coefficients that were mispredicted ranged from 0.01 to 0.33 with a mean of 0.19. One source of confusion quite obviously resulted from the previous stage in the prediction pipeline, where GPT-3 mislabeled “targeted policing” as a concept endorsed by the political left. The other errors primarily mirror those of the simple “partisan-split” baseline: correlations between attitudes towards John McCain and left-wing concepts were also mispredicted.

Fig. 4 All edges represent correlations (gray = positive; red = negative; size = magnitude) that were mispredicted by GPT-3 (i.e., if the edge is gray/positive, GPT-3 predicted a negative correlation, and vice versa).

9 Discussion

The accuracy when using a few-shot approach and a highly customized prompting procedure is much improved compared to Study 1. It now lies within the range of performances that were observed among human experts. This mirrors previous findings that deliberate prompt engineering and chaining is a very potent method for tackling narrow problems (Hansen & Hebart, 2022). Such an approach is well suited to repetitive tasks that allow for the reuse of highly task-specific prompts (e.g., Tanana et al., 2021). However, it is apparent that much of the cognitive work in such an approach is put on the researcher, who has to explicitly guide GPT-3 to consider specific aspects or strategies when generating hypotheses. In short, the researcher takes over much of the I from the AI. Furthermore, through the high amount of customization of the prompt chain, this pipeline can no longer function as a general hypothesis generator and will remain restricted to correlations within the context of political partisanship. In Study 3, we test whether these issues can be addressed through data-based modifications to the model.

10 Study 3: Finetuning

Instead of researcher-driven improvements of the model performance, Study 3 focuses on data-driven improvements. Specifically, we expand on the previous few-shot setup by retraining the GPT-3 model through OpenAI’s finetuning API. Thus, human researchers no longer provide context-specific information in the prompt. Rather, the model must extract relevant information (here: “political sidedness matters for ideological attitude correlations”) autonomously from training data. This way, the procedure of auto-generating hypotheses can be generalized to other research contexts.

The details of OpenAI’s finetuning procedure, as well as details of their commercial models, are not publicly accessible. However, the general practice of finetuning transformer models is very common (e.g., Bhatia & Richie, 2021; Gutiérrez et al., 2022; Liu, 2019). Often, task-specific data is used to retrain (parts of) the original network and/or a new layer of regression weights. In past studies, finetuning transformer models has led to performance increases across tasks, and many prominent transformer models are developed specifically to be finetuned later on (e.g., Devlin et al., 2018; López-Úbeda, 2021; Liu, 2019). GPT-3 is one of the models that were explicitly devised for finetuning (OpenAI, 2022). The strategy of finetuning is intuitive, as seeing more task-relevant data and focusing model outputs has the potential to improve any statistical model, albeit at the risk of overfitting on the newly provided data (Xu et al., 2021).

For Study 3, we expect the accuracy of GPT-3 to lie in between Study 1 (zero-shot) and Study 2 (few-shot + customized prompting chain). The reason is that we expect top-down customization to have contributed more to the performance increase from Study 1 to Study 2 than the bottom-up customization (few-shot instead of zero-shot). Given that Study 3 removes the top-down prompt adjustment to improve the generalizability of the prediction pipeline, we tentatively expect a performance decrease compared to Study 2.

11 Method

We again examined the performance of GPT-3 on predicting correlations between the ideological target objects from Studies 1 and 2. We utilized the same seven prompts from Study 1, requiring us to finetune GPT-3 separately for each prompt. We utilized 46 new ideological concepts (i.e., their 1035 intercorrelations) as finetuning data, which can be found here: bit.ly/3ym9cWJ. Three examples are Nancy Pelosi, tough love, and focusing on the future.

The desired responses (i.e., finetuning labels) were generated through a simple partisanship split (same-side objects are positively correlated and vice versa). Notice that we know at this point that these labels are noisy, as there are exceptions to this rule (e.g., pro-Biden and pro-McCain attitudes). However, the sidedness heuristic is highly accurate (cf. Study 1) and might even allow us to speculate about the reason for errors in Study 3 (e.g., overreliance on the sidedness heuristic from training data). The dataset of 1035 rows was uploaded and the models were finetuned through OpenAI’s command-line interface. The returned model IDs are included in the analysis scripts and can be utilized by anyone (bit.ly/3ym9cWJ). Method decisions, finetuning data, and hypotheses were pre-registered under the same link.
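A minimal sketch of how such a finetuning file might be assembled is shown below; the record layout follows OpenAI’s prompt-completion convention for the legacy finetuning endpoint, and the template, lookup table, and file path are placeholders rather than the pre-registered specifications.

```python
import json

def write_finetune_file(pairs, side_lookup, template, path="finetune_data.jsonl"):
    """Write one prompt-completion record per object pair, labeled with the
    simple partisanship split (same side -> "positive", else "negative")."""
    with open(path, "w") as f:
        for obj_a, obj_b in pairs:
            label = "positive" if side_lookup[obj_a] == side_lookup[obj_b] else "negative"
            record = {
                "prompt": template.format(a=obj_a, b=obj_b),
                # A leading space in the completion follows OpenAI's finetuning guidance.
                "completion": " " + label,
            }
            f.write(json.dumps(record) + "\n")

# The resulting file is then uploaded and one model is finetuned per prompt
# template, e.g., via the legacy command-line interface (roughly):
#   openai api fine_tunes.create -t finetune_data.jsonl -m davinci
```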

12 Results

The accuracy of GPT-3’s prediction of correlation signs was 97.46% across all predictions (95% CI [96, 98]). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 96.51% (95% CI [95, 98]; for details and plots see supplementary materials).

Table 3 shows that GPT-3 again overpredicted negative correlations. In fact, it never erroneously predicted positive correlations.

Table 3 Confusion matrix for Study 3

Figure 5 summarizes the central statistics of Study 3.

Fig. 5 Results of Study 3. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a beta distribution with a = 4 and b = 2 on the interval from 0.5 to 1.

The magnitude of the true correlation coefficients was again associated with an increased probability of correct predictions by GPT-3. Prompts performed about as well as the aggregated predictions. The true absolute correlation coefficients that were incorrectly predicted ranged from 0.01 to 0.46 with a mean of 0.23. Two central nodes again account for most of the errors depicted in Fig. 6.

Fig. 6 All edges represent correlations that were mispredicted by GPT-3. All errors were positive correlations predicted to be negative.

As in Study 2, correlations between attitudes towards John McCain and various left-wing objects were erroneously predicted to be negative. Furthermore, “defunding the police” was the most error-eliciting object, as many of its correlations with left-wing objects were erroneously predicted to be negative. Notice that a similar pattern emerged in Study 2 with the object “targeted policing.”

13 General Discussion

Large AI models are commonly trained on texts featuring social and psychological phenomena and can, by encoding word associations, produce statements about the relationships between these phenomena (Hansen & Hebart, 2022; y Arcas, 2022). Furthermore, they can produce outputs that appear creative or even unique, which is currently changing the landscape of artistic work (Anantrasirichai & Bull, 2022; Stevenson et al., 2022). Does this mean that such models can automate the creative reasoning task of hypothesis generation in the social and behavioral sciences?

The current work demonstrates that GPT-3 can, at least sometimes, accurately predict the direction of correlations between ideological attitudes. While zero-shot prompting elicited poor results, prompting with a highly task-specific, multi-step procedure or finetuning the model with relevant training data led to accuracies around 95% (where real-world study outcomes were considered the ground truth). A replication study collecting the same attitude data from a representative sample of 600 US American adults showed the same results (see supplementary materials). These performances were above the average performance of our expert sample, which supports previous work showing that LLMs can do useful autonomous work in the social and behavioral sciences (Hansen & Hebart, 2022). We want to emphasize that the expert performance observed here is not equal to the peak performance that researchers would show when generating hypotheses about their own empirical studies. Here, some experts participated very casually or conducted only marginally related research themselves (sometimes focused on other geographical areas). Still, the current work encourages the testing of more general study prediction models. Most study outcomes within the social sciences are more difficult to predict than the ones used here. Especially when outcomes have to be predicted at the individual level, rather than the collective level as in the current work, prediction accuracy often plummets (Joel et al., 2017; Salganik et al., 2020).

If AI models remained competitive across research contexts, they could function as a “second opinion” before conducting studies, a reference point for power analyses, an indicator of how intuitive new study results are, or a creative thinking tool for eliciting probable hypotheses in a given field. Note however that a few glaring errors remained in all configurations and that the gradual extension and retesting of these models in research outside of political attitudes remains a task for future endeavors. The question of how general such models should finally be (one model for all social sciences or specific to subfields) requires empirical testing. Naturally, the training corpus for new models should consist of results from empirical studies, similarly to the simulated training data we used for finetuning in Study 3. Efforts to make publications machine readable or to scrape them into a standardized format can directly support this goal (Lakens & DeBruine, 2021; Rosenbusch et al., 2020).

While the accumulation and pre-processing of such a training corpus constitutes a large project, we assume that it is generally superior to task-specific prompt engineering (cf. Study 2), which always remains dependent on the human researcher and cannot easily extend beyond its original scope of application (cf., the research robot “Adam,” King et al., 2009). However, if task-specific pipeline customization were to become the primary strategy for auto-generating research hypotheses, meta-methods like PromptChainer (Wu et al., 2022) might start playing a central role (Chase, 2022).

While encouraging for AI-supported research, we want to emphasize that even an LLM which is 100% accurate in predicting study outcomes is not “doing social science.” Specifically, we highlight three key limitations that can be at least partly addressed in future studies.

First, making a prediction about a study’s outcome is not the same as developing a theory about whether and how study outcomes will come about. The job of a social scientist is not just to state how attitudes towards A and B will correlate, but to justify their hypotheses, consider the impact of their results, review the methods used to generate findings, integrate findings into a body of literature, and many more activities. Thus, even if an AI were to reliably outperform researchers in predicting study outcomes, it would be far from being an autonomous scientist.

Second, in the current study, we utilized ideological concepts that are well documented in the LLM’s training corpus. If they had been absent from the data (say, a politician who started their career after model training, or a newly suggested psychological construct), this would have effectively prevented the generation of valid predictions. Similarly, bivariate correlations are among the simplest statistics to predict. Had the task instead been to, for instance, anticipate complex network structures, model performance would very likely have decreased substantially (as does the success rate of human hypothesizers).

Third, all current LLMs should be expected to regularly violate research ethics. Most prominently, models are likely to generate discriminatory hypotheses based on the many biases in their training data (Weidinger et al., 2021). Current efforts explore possibilities to counteract such biases by adjusting training data, model architecture, and training procedures, an intervention that is not feasible for human hypothesizers. However, especially for ethically loaded hypotheses, an LLM based on word associations in large internet scrapes should not be given much credibility (cf., Bender et al., 2021; Corbett-Davies & Goel, 2018).

These limitations together bring forward the question of what a machine-generated hypothesis would need to look like to be given equal consideration as a human-generated hypothesis. We assume that a convincing, coherent rationale justifying the hypothesis would be sufficient, even if it arises mechanically and without human-like understanding, as supervising researchers would primarily care about the argument behind a mechanically generated hypothesis (Krenn et al., 2022). Given that the production of such rationales is an independent challenge, machine-generated predictions of study outcomes might become empirically superior to human hypotheses without being able to provide much-needed theoretical explanations. In addition, LLMs are only exposed to knowledge in the form of written text, whereas human knowledge is gained and captured in many alternative modes. The near future will show how AI products and the research community come together to address these issues. A fruitful next step might be to enter hypothesis generators into study prediction contests and replication markets (Liu et al., 2020).

In conclusion, the journey towards an artificial intelligence winning a Nobel Prize for scientific discovery has begun (Kitano, 2016). The road towards a universally persuasive hypothesis engine remains long, and new tasks often take a while to be mastered in the field of machine learning (Grace et al., 2018). However, the current work highlights the possibility of automatically generating hypotheses for simple questions. The promising next steps are to illuminate the philosophical and ethical implications of accurate hypotheses generated by machines that have no built-in incentive to tell the truth (Sobieszek & Price, 2022).