Creativity research requires assessing the quality of ideas and products. In practice, conducting creativity research often involves asking several human raters to judge participants’ responses to creativity tasks, such as judging the novelty of ideas from the alternate uses task (AUT). Although such subjective scoring methods have proved useful, they have two inherent limitations—labor cost (raters typically code thousands of responses) and subjectivity (raters vary in their perceptions and preferences)—raising classic psychometric threats to reliability and validity. We sought to address the limitations of subjective scoring by capitalizing on recent developments in automated scoring of verbal creativity via semantic distance, a computational method that uses natural language processing to quantify the semantic relatedness of texts. In five studies, we compare the top-performing semantic models (e.g., GloVe, continuous bag of words) previously shown to have the highest correspondence to human relatedness judgments. We assessed these semantic models in relation to human creativity ratings from a canonical verbal creativity task (AUT; Studies 1–3) and novelty/creativity ratings from two word association tasks (Studies 4–5). We find that a latent semantic distance factor—comprising the common variance of five semantic models—reliably and strongly predicts human creativity and novelty ratings across a range of creativity tasks. We also replicate an established experimental effect in the creativity literature (i.e., the serial order effect) and show that semantic distance correlates with other creativity measures, demonstrating convergent validity. We provide an open platform to efficiently compute semantic distance, including tutorials and documentation (https://osf.io/gz4fc/).
Creativity researchers have long grappled with how to measure creativity. Indeed, the question of how best to capture creativity remains open and active, with a special issue on creativity assessment recently published in Psychology of Aesthetics, Creativity, and the Arts (Barbot, Hass, & Reiter-Palmon, 2019). Over the years, a range of assessment approaches have been developed, from methods that rely on experts to judge the creative quality of products (i.e., the Consensual Assessment Technique; Amabile, 1983; Cseh & Jeffries, 2019) to frequency-based methods that use standardized norms (Forthmann, Paek, Dumas, Barbot, & Holling, 2019; Torrance, 1972) to subjective scoring methods that rely on layperson judgments (Silvia et al., 2008). Although each method has shown some degree of utility for creativity research, each comes with challenges and limitations. Two challenges common to most creativity assessments are subjectivity (raters don’t always agree on what’s creative) and labor cost (raters often have to score thousands of responses by hand)—both of which pose threats to the reliable and valid assessment of creativity (Barbot, 2018; Forthmann et al., 2017; Reiter-Palmon, Forthmann, & Barbot, 2019). To address these issues, researchers have begun to explore whether the process of scoring responses for their creative quality can be automated and standardized using computational methods, and preliminary evidence suggests that such tools can yield reliable and valid indices of creativity (Acar & Runco, 2014; Dumas, Organisciak, & Doherty, 2020; Heinen & Johnson, 2018; Kenett, 2019; Prabhakaran, Green, & Gray, 2014). In the present research, we aim to capitalize on these promising advances by developing and validating an open-source platform for the automated assessment of creativity, allowing researchers to objectively quantify the creative quality of ideas across a range of common creativity tasks.
Measuring creativity: The status quo
Creative thinking is widely assessed with tests of divergent thinking, which present open-ended prompts and ask people to think of creative responses (Acar & Runco, 2019). One of the most widely used tests of divergent thinking is the Alternate Uses Task (AUT), where people are presented a common object (e.g., a box) and asked to think of as many creative and uncommon uses for it as possible within a given time period (usually 2–3 min; Benedek, Mühlmann, Jauk, & Neubauer, 2013). A strength of the AUT is that it seems to offer a good approximation of a person’s general capacity to come up with original ideas. Although it has limitations (Barbot, 2018), the AUT and other divergent thinking tests have shown consistent evidence of validity, with several studies reporting moderate to large correlations between AUT performance and real-world creative achievement in the arts and sciences (Beaty et al., 2018; Jauk, Benedek, & Neubauer, 2014; Plucker, 1999). Paul Torrance, who developed a widely used creativity assessment (the Torrance Test of Creative Thinking; TTCT), provided perhaps the most compelling longitudinal evidence for the validity of divergent thinking tests: highly creative children—assessed by performance on the TTCT—grew up to be highly creative adults, reporting significantly more creative accomplishments when assessed decades later in adulthood (Plucker, 1999; Torrance, 1981); remarkably, a 50-year follow-up of Torrance’s data further confirmed the validity of divergent thinking in predicting students’ future creative accomplishment (Runco, Millar, Acar, & Cramond, 2010). These findings indicate that divergent thinking tests provide a measure of “domain-general” creative ability that may support real-world creative actions (cf. Jauk et al., 2014; but see Barbot, Besançon, & Lubart, 2016; Dietrich, 2015; and Zeng, Proctor, & Salvendy, 2011 for alternate views on the domain-generality and utility of divergent thinking tasks).
Divergent thinking responses are often scored on two dimensions: fluency (the total number of responses) and originality (the creative quality of responses). Fluency offers a proxy of generative ability; however, it has been criticized for a lack of reliability, with inter-item fluency correlations on the AUT often as low as .3 to .4 (Barbot, 2018; cf. Dumas & Dunbar, 2014). Recent work suggests that this low inter-item correlation could be due to variability in item (object) characteristics such as semantic object features (Beaty, Kenett, Hass, & Schacter, 2019) and word frequency (Forthmann et al., 2016). At the same time, low inter-task fluency correlations have not been consistently reported in the literature; for example, Jauk et al. (2014) reported high standardized factor loadings on an AUT fluency latent variable (suggesting strong reliability) and Forthmann, Holling, Çelik, Storme, and Lubart (2017) reported inter-task correlations for AUT items ranging from .57 to .71. Nevertheless, perhaps the most notable limitation of fluency is that it does not take into consideration the quality of ideas. Thus, a given person may produce many ideas on the AUT—which would be captured by calculating their fluency score—but, absent an index of quality, whether those ideas were actually creative (i.e., qualitatively different from common ideas) would be unknown.
Originality scoring, in contrast, can capture the creative quality of responses. A popular approach to originality scoring is the subjective scoring method (Hass, Rivera, & Silvia, 2018; Silvia et al., 2008). The subjective scoring method is based on the Consensual Assessment Technique (CAT; Amabile, 1983; Cseh & Jeffries, 2019; Kaufman, Lee, Baer, & Lee, 2007), a procedure that involves convening a panel of experts to judge a series of products, ranging from ideas to poems to inventions. When applied to divergent thinking assessment via the subjective scoring method, a group of raters (often undergraduate students) are briefly trained on how to assess the creative quality of responses, typically using a 1 (not at all creative) to 5 (very creative) scale (Benedek et al., 2013; Silvia et al., 2008). Notably, the subjective scoring method, like the CAT, provides only limited guidance to raters as to what constitutes a creative response (e.g., uncommon, remote, clever), largely deferring to raters’ own subjective perception of creativity (Cseh & Jeffries, 2019; Mouchiroud & Lubart, 2001). Although subjective scoring methods have shown evidence of convergent validity, including positive correlations with frequency-based originality (Forthmann, Holling, Çelik, et al., 2017) and measures of creative activities and achievements (Jauk et al., 2014), inter-rater agreement is not always high, raising issues of reliability (Barbot, 2018). Reconciling such disagreements is a common feature of the CAT—where experts can meet to discuss their ratings and work toward agreement—but many studies using subjective scoring with divergent thinking responses do not employ this approach, likely due to its time-consuming nature. Moreover, the undergraduate students that often serve as raters for these tests are typically tasked with scoring thousands of responses, leading to rater fatigue and contributing to poor reliability (Forthmann, Holling, Zandi, et al., 2017). 
Taken together, although the CAT and subjective scoring method have been valuable to creativity research, the approaches are marked by the key limitations of subjectivity and labor cost.
Automating creativity assessment
To address the limitations of subjective scoring, researchers have begun to explore the utility of automated scoring approaches using computational tools (Acar & Runco, 2014; Dumas et al., 2020; Dumas & Runco, 2018; Green, 2016; Hass, 2017b; Heinen & Johnson, 2018; Kenett, 2019; Prabhakaran et al., 2014; Zedelius, Mills, & Schooler, 2019). One such approach uses latent semantic analysis (LSA; Landauer, Foltz, & Laham, 1998) to quantify the “semantic distance” between concepts in a given semantic space. LSA and other computational linguistic tools can quantify the semantic relatedness between words in large corpora of texts, for example, by counting the number of co-occurrences between words and documents (i.e., count models) or by deriving co-occurrence weights by trying to predict word-context links (i.e., predict models), all in a high-dimensional word-vector space (Günther, Rinaldi, & Marelli, 2019). For example, the words “hammer” and “nail” are likely to occur in similar contexts and would thus yield a higher similarity value; in contrast, the words “hammer” and “tissue” are less likely to occur in similar contexts and would thus yield a relatively lower similarity value. Application of LSA in creativity research is rooted in the associative theory of creativity (Kenett, 2019; Mednick, 1962) which proposes that creative thinking requires making connections between seemingly “remote” concepts. The associative theory has received increasing support from several recent computational modeling studies showing that high-creative individuals, defined by performance on a battery of creativity tasks, show a more flexible semantic network structure, characterized by low modularity and high connectivity between concepts (Christensen, Kenett, Cotter, Beaty, & Silvia, 2018; Gray et al., 2019; Kenett et al., 2018; Kenett, Anaki, & Faust, 2014; Kenett & Faust, 2019). 
According to Kenett and colleagues, this flexible (or small-world) semantic network architecture is conducive to creative thinking because it allows people to form conceptual combinations between concepts that are typically represented further apart (e.g., hammer and tissue).
Prabhakaran et al. (2014) provided an early test of LSA for creativity assessment in the context of the classic verb generation task (see also Bossomaier, Harre, Knittel, & Snyder, 2009; Forster & Dunbar, 2009). When presented with nouns and instructed to “think creatively” while searching for verbs to relate to the nouns, participants produced responses that were significantly more semantically distant, defined as the inverse of semantic similarity, compared to when they were not cued to think creatively (and simply generated common verbs). Here, the simple instruction to “think creatively” yielded more creative (i.e., semantically distant) responses, consistent with prior work showing explicit instruction to think creatively improves creative task performance (Acar, Runco, & Park, 2019; Nusbaum, Silvia, & Beaty, 2014; Said-Metwaly, Fernández-Castilla, Kyndt, & Van den Noortgate, 2019). Critically, at the individual subject level, the authors found that semantic distance values in the cued creativity condition correlated positively with a range of established creativity measures, including human ratings of creativity on divergent thinking tests, performance on a creative writing task, and frequency of self-reported creative achievement in the arts and sciences. Prabhakaran et al. (2014) thus provided validity evidence of LSA for creativity research in the context of the verb generation task, demonstrating the potential of using automated scoring approaches to measure verbal creativity.
The initial LSA findings of Prabhakaran et al. (2014) have since been replicated using a different computational model and corpora (Heinen & Johnson, 2018) and extended to other creativity tasks, including the AUT (Hass, 2017b), albeit with mixed evidence for validity (Forster & Dunbar, 2009; Forthmann, Holling, Çelik, et al., 2017; Forthmann, Oyebade, Ojo, Günther, & Holling, 2018; Harbison & Haarmann, 2014; Hass, 2017a; Hass, 2017b). As LSA has been increasingly employed in creativity research, researchers have begun to identify limitations of the approach and best practices in data processing. In a study on the AUT, for example, Forthmann et al. (2019) found that LSA values are confounded by elaboration—the more words used to describe a response, the higher the LSA-based cosine similarity (i.e., the lower the semantic distance derived from similarity)—but this confound was partially mitigated by removing “stop words” from responses (e.g., the, an, but) prior to computing LSA. Another consideration with semantic distance-based scoring concerns the balance of novelty and usefulness (or appropriateness), the two criteria that jointly define a creative idea or product (Diedrich, Benedek, Jauk, & Neubauer, 2015). In addition to detecting novelty, Heinen and Johnson (2018) found that LSA can also be used to assess the combination of novelty and usefulness/appropriateness, depending on the type of instruction given to participants: semantic distance was lowest with a “common” instruction, highest with a “random” instruction, and between common and random with a “creative” instruction. They found that when participants were asked to “be creative,” they spontaneously tended to give creative responses constrained by appropriateness, as opposed to giving highly novel, but nonsensical responses. These findings demonstrate the utility of semantic distance metrics as a means to quantify creativity in the context of verbal idea generation tasks.
The present research
Subjective scoring methods are commonly used to assess the creative quality of responses on verbal creativity tasks. Although subjective methods and other manual-based approaches have shown evidence of reliability and validity (Silvia et al., 2008), they suffer from two fundamental issues: subjectivity and labor cost. Regarding subjectivity, raters don’t always agree on what constitutes a creative response, and they are often given little guidance—consistent with the widely adopted guidelines of the Consensual Assessment Technique (Cseh & Jeffries, 2019)—leading to low inter-rater reliability. Moreover, raters are often asked to code hundreds or thousands of responses, leading to rater fatigue and further threatening reliability (Forthmann, Holling, Zandi, et al., 2017). Critically, these issues can also act as a barrier to entry for people without the time and resources to code thousands of responses by hand, such as researchers without teams of research assistants, or educators without the time to score creativity tests. To address the limitations of subjective scoring methods, automated scoring methods such as LSA have begun to be employed, with preliminary evidence pointing to their potential to provide a reliable and valid index of creative thinking ability, particularly with tasks that require single word responses (Heinen & Johnson, 2018; Kenett, 2019; Prabhakaran et al., 2014), with more mixed findings for tasks that require multi-word responses, like the AUT.
In the present research, we aim to capitalize on recent progress in the automated scoring of verbal creativity. We develop and test a new online platform that computes semantic distance called SemDis. SemDis was built to handle a range of verbal creativity and association tasks, including single word associations and word phrase associations, with a focus on the AUT. SemDis complements and extends recent efforts to compare the relative performance of various computational approaches to computing semantic distance in predicting human creativity ratings. For example, Dumas et al. compared several semantic models (TASA-LSA, EN_100k_lsa, GloVe 840B, and word2vec-skipgram) in predicting human creativity ratings on the AUT, reporting evidence for the reliability and validity of these different models, particularly GloVe (Dumas et al., 2020). Here we build on the work of Dumas and colleagues by: 1) comparing additive and multiplicative composition of vectors, 2) modeling various semantic spaces within a latent variable approach (reducing biases of any single text corpus; Kenett, 2019), 3) including multiple published and unpublished datasets, 4) considering both AUT and word association responses, and 5) including a variety of external validity criteria.
Using latent variable modeling, we extract common measurement variance across multiple metrics of semantic distance and test how well this latent factor predicts human creativity ratings. As a further test of validity, we examine whether the semantic distance factor predicts established creativity measures, including real-world creative achievement and creative self-efficacy, as well as other cognitive assessments of verbal creativity (e.g., creative metaphor production). Our goal is to provide a reliable, valid, and automated assessment of creativity. To our knowledge, we provide the first comparison between the application of additive and multiplicative compositional semantic models in the context of creativity assessment. Compositional semantic models are relevant when participants give multi-word responses (e.g., AUT) and a researcher needs to combine each individual word vector into a single compositional vector. There is some preliminary evidence that multiplicative models may show higher correlations with human ratings because, compared to an additive model, similar meanings between two responses get more weight, and dissimilar meanings get less weight in the final compositional vector (Mitchell & Lapata, 2010). In addition, prior research suggests one substantial weakness of applying an additive compositional model in creativity assessment is that it penalizes (i.e., reduces) semantic distance scores for more elaborate creativity responses (Forthmann et al., 2018). We attempt to replicate this finding and determine whether or not multiplicative models similarly penalize semantic distance scores, with the goal of explaining maximal variance in human creativity ratings.
Although similar tools are currently available (e.g., lsa.colorado.edu; snaut, Mandera, Keuleers, & Brysbaert, 2017), we provide more robust text processing via optional methods of text cleaning, more flexibility in the creation of the underlying semantic model (i.e., allowing users to select which semantic space and which compositional model to include in the computation of semantic distance), and latent variable-extracted factor scores from diverse semantic spaces. In addition, in contrast to some tools, SemDis is web-based, so it runs on Macs or PCs alike and does not require downloading software.
Our first study aimed to provide preliminary evidence for the reliability and validity of our approach to automated creativity assessment using latent variable modeling. To this end, we test whether combining multiple models of semantic distance into a single latent variable can approximate human creativity ratings. Latent variables can suppress methodological variance specific to each model, mitigating unreliability by reducing the influence of any one semantic model and extracting common measurement variance across multiple models (cf. Beketayev & Runco, 2016). We reanalyzed AUT responses from a recently published study (Beaty et al., 2018) and tested the relative performance of five semantic models in predicting human creativity ratings. We focused on additive and multiplicative semantic models that have previously shown adequate correspondence to human ratings and semantic similarity (Mitchell & Lapata, 2010). Regarding validity, we examined the extent to which a latent variable, comprised of common variance of the five semantic models, relates to several other measures of creativity, assessed via task performance and self-report. Previous research using word association tasks and semantic distance values found that semantic distance on these tasks correlated with both human ratings (Heinen & Johnson, 2018; Johnson, Cuthbert, & Tynan, 2019) and a range of other creativity measures (Prabhakaran et al., 2014). We thus expected our combined semantic distance latent variable to positively correlate with human creativity ratings on the AUT and other creativity measures.
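The logic of pooling common variance across semantic models can be illustrated with a simple sketch. The latent variable models reported here were estimated with structural equation modeling; the Python code below instead uses the first principal component of simulated scores as a rough stand-in for the common factor, and all data values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated semantic distance scores: 200 responses x 5 semantic models.
# Each model reflects a common "true" distance plus model-specific noise
# (all values invented for illustration).
true_distance = rng.normal(size=200)
scores = np.column_stack(
    [true_distance + rng.normal(scale=0.5, size=200) for _ in range(5)]
)

# Standardize each model's scores, then take the first principal component
# of their correlation matrix as a rough stand-in for the common factor.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
loading = eigvecs[:, -1]      # eigenvector with the largest eigenvalue
factor_scores = z @ loading   # composite semantic distance per response

# The composite tracks the simulated common distance more closely than a
# single model does, illustrating the reliability gain from pooling.
r_composite = abs(np.corrcoef(factor_scores, true_distance)[0, 1])
r_single = abs(np.corrcoef(scores[:, 0], true_distance)[0, 1])
assert r_composite > r_single
```

Because model-specific noise is uncorrelated across the five simulated models, averaging over them attenuates that noise, which is the intuition behind extracting a latent semantic distance factor.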
Participants were recruited as part of a larger project on individual differences in creativity (see Adnan, Beaty, Silvia, Spreng, & Turner, 2019; Beaty et al., 2018; Maillet et al., 2018). The total sample consisted of 186 adults from the University of North Carolina at Greensboro (UNCG) and surrounding community. Participants were paid up to $100 based on their level of completion in the three-part study, which included magnetic resonance imaging (MRI), daily-life experience-sampling, and laboratory assessments. Of the total sample, 172 participants completed both divergent thinking assessments; one participant was excluded as a multivariate outlier (Cook’s distance > 10), yielding a final sample of 171 (123 females, mean age = 22.63 years, SD = 6.29). All participants were right-handed with normal or corrected-to-normal vision, and they were not enrolled in the study if they reported a history of neurological disorder, cognitive disability, or medication and other drugs known to affect the central nervous system. The study was approved by the UNCG Institutional Review Board, and participants provided written informed consent prior to completing the study.
Participants completed a battery of tasks and questionnaires that measure different aspects of verbal creative ability (divergent thinking; novel metaphor production), real-world creative behavior (activities and achievements), and creative self-concept (self-efficacy and identity). Cognitive assessments were administered in a laboratory setting using MediaLab; questionnaires were administered both in the lab via MediaLab and online via Qualtrics.
Participants completed two trials of the AUT. The two trials (box and rope) were completed in a conventional testing environment on a computer running MediaLab (3 minutes of continuous idea generation). As in our prior work (Nusbaum et al., 2014), participants were instructed to “think creatively” while coming up with uses for the objects; notably, the instructions explicitly emphasized quality over quantity, as well as novelty over usefulness. Responses were subsequently scored for creative quality using the subjective scoring method (Benedek et al., 2013; Silvia et al., 2008). Four raters scored responses using a 1 (not at all creative) to 5 (very creative) scale. We provide task instructions and rater guidelines in the Supplemental Materials (also available via OSF; https://osf.io/vie7s/).
We administered a battery of questionnaires to measure two facets of creative behavior: 1) creative activities (i.e., hobbies) and 2) creative achievements. Creative activities were assessed using the Biographical Inventory of Creative Behavior (BICB; Batey, 2007), which presents a list of 34 creative activities (e.g., making a website) and asks participants if they have participated in each activity within the past year (yes/no response). The Inventory of Creative Activities and Achievements (ICAA; Diedrich et al., 2018) includes two subscales that capture both creative activities/hobbies and higher-level accomplishments across eight domains of the arts and sciences. The Creative Achievement Questionnaire (CAQ; Carson, Peterson, & Higgins, 2005) assesses publicly-recognized creative achievements across ten creative domains.
The Short Scale of Creative Self (SSCS; Karwowski, 2014) assessed creative self-perceptions. The SSCS (11 items) captures two components of creative self-concept: creative self-efficacy (CSE) and creative personality identity (CPI). The CSE subscale measures the extent to which people perceive themselves as capable of solving creative challenges, such as “I am good at proposing original solutions to problems.” The CPI measures the extent to which creativity is a defining feature of the self-concept, such as “Being a creative person is important to me.”
As a further test of validation with a cognitive assessment of creativity with human ratings, we included two creative metaphor production prompts (Beaty & Silvia, 2013). Participants were presented with two open-ended prompts (i.e., common everyday experiences) and asked to produce novel metaphors to describe these experiences. One prompt asked participants, “Think of the most boring high-school or college class that you’ve ever had. What was it like to sit through?” Another prompt asked participants, “Think about the most disgusting thing you ever ate or drank. What was it like to eat or drink?” (Beaty & Silvia, 2013; Silvia & Beaty, 2012). Four raters scored the two metaphors using a 1 (not at all creative) to 5 (very creative) scale; the same four raters that scored the divergent thinking responses scored the metaphor responses.
Past work indicates that fluid intelligence (Gf)—the ability to solve novel problems through reasoning—correlates positively with human creativity ratings on divergent thinking tests (Beaty, Silvia, Nusbaum, Jauk, & Benedek, 2014; Benedek, Jauk, Sommer, Arendasy, & Neubauer, 2014; Jauk et al., 2014). We thus included several measures of Gf to determine whether automated creativity ratings similarly relate to Gf, including: 1) the series completion task from Cattell’s Culture Fair Intelligence Test (Cattell & Cattell, 1973), which presents a row of boxes containing changing patterns and asks participants to choose the next image in the sequence based on the rule governing their change (13 items, 3 min); 2) the letter sets task (Ekstrom, French, Harman, & Dermen, 1976), which presents sequences of changing sets of letters and asks participants to choose the next letter set in the sequence (16 items, 4 min); and 3) the number series task (Thurstone, 1938), which presents sequences of changing sets of numbers and asks participants to choose the next number set in the sequence (15 items, 4.5 min).
We administered the 240-item NEO PI-R to assess the five major factors of personality (McCrae, Costa, & Martin, 2005). The full NEO includes six facet-level subscales for each personality factor, which were averaged to form composites for each of the five personality factors: neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness. Participants were presented with a series of statements and asked to indicate their level of agreement using a five-point Likert scale (1 = strongly disagree, 5 = strongly agree).
Five semantic spaces were selected based on the following criteria: 1) available validity evidence linking semantic distance to human judgments of semantic relatedness, 2) variety in the model used to compute word vectors, and 3) variety in the corpora used in the computational model. We used multiple computational models to build word vectors and various corpora because prior research indicates each model has idiosyncratic strengths and weaknesses in predicting human performance, with some models exhibiting advantages in predicting free association and others showing advantages in predicting human relatedness judgments (Mandera et al., 2017). Given the variety of methodologies employed to assess creativity, we reasoned that varied model selection would provide the highest generalizability and validity. Two semantic spaces were built using a neural network architecture, which moves a sliding window through the text corpora and tries to predict a central word from its surrounding context, similar to algorithms first developed in word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). These two continuous bag of words (CBOW) models have previously demonstrated robust associations with human judgments of relatedness, lexical decision speed, and free associations (Mandera et al., 2017). The first CBOW model was built on a concatenation of the ukwac web crawling corpus (~ 2 billion words) and the subtitle corpus (~ 385 million words). The second CBOW model was built on the subtitle corpus only. Each of these spaces used a context window of 12 words (six to the left and six to the right of the target word), 300 dimensions, and the 150,000 most frequent words (for more details, see Mandera et al., 2017).
The third semantic space was also built using CBOW, but on a concatenation of the British National Corpus (~ 2 billion words), the ukwac corpus, and the 2009 Wikipedia dump (~ 800 million tokens), using a context window of 11 words, 400 dimensions, and the 300,000 most frequent words. This space also shows robust associations with human judgments of relatedness and was shown to be the best-performing model compared to multiple CBOW and LSA-based count models (Baroni, Dinu, & Kruszewski, 2014).
The fourth semantic space, TASA, has the longest history; it was built using LSA and obtained from the Günther, Dudschig, and Kaup (2015) website and the lsa.colorado.edu interactive website. Termed a count model, it was built by computing the co-occurrence of words within documents, followed by a singular value decomposition of the resulting sparse matrix. The corpus contained over 37,000 documents, comprising 92,393 different words, and was reduced to 300 dimensions. Primary text sources were middle and high school textbooks and literary works. This space has demonstrated validity in its application to a creative word association task (Prabhakaran et al., 2014).
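The count-model pipeline just described (a word-by-document co-occurrence matrix followed by singular value decomposition) can be illustrated in miniature. The words, counts, and dimensionality below are invented for illustration and bear no relation to the actual TASA space:

```python
import numpy as np

# Toy word-by-document count matrix (rows = words, columns = documents).
# Counts are invented; TASA uses >37,000 documents and 300 dimensions.
words = ["hammer", "nail", "tissue", "sneeze"]
counts = np.array([
    [4, 3, 0, 0],   # hammer
    [3, 5, 0, 1],   # nail
    [0, 0, 6, 4],   # tissue
    [0, 1, 4, 5],   # sneeze
], dtype=float)

# Singular value decomposition of the (sparse) count matrix.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep only the top k dimensions (k=2 suffices for the toy data).
k = 2
word_vectors = U[:, :k] * S[:k]   # each row is a reduced word vector

def sim(w1: str, w2: str) -> float:
    """Cosine similarity of two words in the reduced space."""
    u, v = word_vectors[words.index(w1)], word_vectors[words.index(w2)]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words that co-occur in the same documents end up closer in the space:
assert sim("hammer", "nail") > sim("hammer", "tissue")
```

The dimensionality reduction is what distinguishes LSA from raw co-occurrence counting: it lets words that never co-occur directly still receive similar vectors if they occur in similar documents.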
The fifth space was also built using a count model, but in contrast to LSA, it capitalizes on global information across the text using weighted least squares, called global vectors (GloVe; Pennington, Socher, & Manning, 2014). It was built on a concatenation of a 2014 Wikipedia dump and the Gigaword corpus, which contains numerous news publications from 2009–2010. The model was trained on ~ 6 billion tokens, with a final dimensionality of 300 and the top 400,000 words. GloVe has shown robust associations with human judgments of relatedness, comparable to other CBOW models (Pennington et al., 2014).
All five spaces can be used to compute the semantic distance between two words, where the cosine angle between the word vectors represents semantic similarity, and distance is then computed by subtracting this similarity from 1 (Beaty, Christensen, Benedek, Silvia, & Schacter, 2017; Green, 2016; Kenett, 2019; Prabhakaran et al., 2014). Because cosine similarity ranges from –1 to 1, semantic distance ranges from 0 to 2, with higher scores indicating that the two words are more distantly related ideas or concepts. The cosine was computed between word vectors using the LSAfun package of Günther et al. (2015) in R. However, when comparing words to phrases or phrases to phrases, the word vectors must be combined in some way to compute semantic distance. We describe this procedure in the Compositional Vector Models section below.
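As a concrete illustration of this distance computation (a minimal Python sketch, not the R/LSAfun pipeline used in the study; the toy vectors are invented and far lower-dimensional than real semantic spaces):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance: 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy 4-dimensional vectors (invented, not from any real semantic space):
hammer = np.array([0.9, 0.8, 0.1, 0.0])
nail   = np.array([0.8, 0.9, 0.2, 0.1])
tissue = np.array([0.1, 0.0, 0.9, 0.8])

# Related words yield a smaller distance than unrelated words:
assert semantic_distance(hammer, nail) < semantic_distance(hammer, tissue)
```

With these toy values, hammer-nail distance is close to 0 while hammer-tissue distance is much larger, mirroring the hammer/nail versus hammer/tissue example discussed earlier.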
Compositional vector models
All five of the above spaces consist of word vectors across a variable number of dimensions. When comparing texts that contain multiple words, a number of challenges arise, foremost being how to combine word vectors into a single vector for comparison. Mitchell and Lapata (2008, 2010) tested various vector composition models against human relatedness judgments. Although additive models can perform adequately, multiplicative composition models performed best, even compared to more complex models such as a weighted additive model. Consequently, most of our results are based on multiplicative composition models, in which elementwise multiplication is used to combine vectors. However, the SemDis app gives users the option to choose additive or multiplicative models for semantic distance computations.
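To make the two composition schemes concrete, here is a minimal sketch (the toy vectors and function names are hypothetical, not the SemDis implementation):

```python
def additive_compose(vectors):
    """Additive composition: elementwise sum of the word vectors in a phrase."""
    return [sum(dims) for dims in zip(*vectors)]

def multiplicative_compose(vectors):
    """Multiplicative composition: elementwise product of the word vectors."""
    composed = list(vectors[0])
    for vec in vectors[1:]:
        composed = [a * b for a, b in zip(composed, vec)]
    return composed

# Hypothetical 3-dimensional vectors for the words of the phrase "paper weight".
paper = [0.2, 0.5, 0.1]
weight = [0.4, 0.3, 0.6]
print(additive_compose([paper, weight]))        # elementwise sums
print(multiplicative_compose([paper, weight]))  # elementwise products
```

Either composed vector can then be compared against a cue word's vector via the cosine, exactly as for single words.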
The other major challenge when dealing with phrases is how to clean the text. Should all special characters be stripped? Should filler or stop words be removed? SemDis provides options for basic text cleaning, in which only special characters and numbers are stripped, or for additionally removing filler or stop words. The stop words removed are based on the list from the tm R package (Feinerer, 2012). There is evidence that, when applying latent semantic analysis to the AUT, removing stop words improves validity (Forthmann et al., 2018).
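A rough sketch of these two cleaning options (with an abbreviated, illustrative stop-word list rather than the full tm list):

```python
import re

# A tiny illustrative stop-word list; the actual SemDis list comes from the tm R package.
STOP_WORDS = {"a", "an", "the", "to", "of", "it", "for"}

def clean_response(text, remove_stop_words=False):
    """Basic cleaning (lowercase, strip special characters and numbers),
    with optional stop-word removal, mirroring the two SemDis options."""
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

print(clean_response("Use it as a doorstop!!"))                         # "use it as a doorstop"
print(clean_response("Use it as a doorstop!!", remove_stop_words=True))  # "use as doorstop"
```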
We provide a step-by-step tutorial with example data in the SemDis app with materials on OSF (https://osf.io/gz4fc/).
Manual text preprocessing
Although SemDis provides preprocessing options, it does not include spellchecking, so users must spellcheck responses manually. This decision was made because of the imprecision of available spellchecking software and the difficulty of integrating it with the app; moreover, human intervention is often needed to resolve ambiguities in spellchecking. We recommend that users employ the spellchecking tools available in conventional software packages before uploading data files to SemDis. However, Johnson et al. (2019) did not employ spellchecking and instead set misspelled words to missing data (the current default setting of SemDis). Combining misspelled words and words that the semantic model did not recognize resulted in a 4.1% loss of data. This minimal loss seems worth the labor savings relative to having human raters perform spellchecking. In the current study, AUT responses were screened for misspellings, and unambiguous spelling errors were corrected. As an additional optional step, the cue words (e.g., box and rope), as well as their plurals (e.g., boxes and ropes), were manually removed from responses to avoid potential bias of semantic distance values.¹
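The default missing-data behavior described above can be illustrated with a toy vocabulary (the words, vocabulary, and function name here are hypothetical; real semantic spaces contain vastly larger vocabularies, e.g., GloVe's 400,000 words):

```python
# Hypothetical toy vocabulary standing in for a semantic model's word list.
VOCAB = {"use", "box", "storage", "hat", "boat"}

def words_in_vocabulary(response):
    """Mirror the SemDis default: words absent from the model's vocabulary
    (including misspellings) are treated as missing data."""
    known = [w for w in response.lower().split() if w in VOCAB]
    return known if known else None  # None flags the whole response as missing

print(words_in_vocabulary("storage boat"))  # ['storage', 'boat']
print(words_in_vocabulary("storrage"))      # None (misspelling -> missing)
```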
Study 1 had two primary goals: 1) to compare semantic distance scores from several semantic spaces to human creativity ratings on the AUT, and 2) to further validate these semantic distance scores against established creativity measures (e.g., creative behavior and achievement). Semantic distance scores, along with the other creativity measures, were modeled as indicators of their respective latent variables, which allowed us to extract the common variance underlying each factor. Latent variables were estimated using maximum likelihood estimation with robust standard errors in Mplus 8. The factor variances were fixed to 1, and the loadings for latent variables with fewer than three indicators were constrained to be equal (Kline, 2015).
In a first step, we conducted confirmatory factor analyses to model correlations between human creativity ratings and the five semantic distance variables. Next, we identified the best-performing semantic distance metric and probed its convergent validity in a series of structural equation models with the other creativity measures. To determine how human and automated creativity metrics differentially relate to creative activities and achievements, we modeled them as two separate latent variables (see below). All task variables were standardized prior to analysis. The standardized effects are presented in the r metric and can be interpreted using the conventional small (.10), medium (.30), and large (.50) guidelines (Cumming, 2013).
Table 1 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models.
Predicting human creativity ratings
Our first set of analyses compared the relative prediction of human creativity ratings from additive versus multiplicative compositional models of semantic distance. We began by conducting a confirmatory factor analysis to assess latent correlations between an additive semantic distance factor and human ratings on the two AUT items (box and rope): χ2(132) = 266.582, p < .001; CFI = .927; RMSEA = .077; SRMR = .113. We found a moderate negative correlation between the additive semantic distance factor and human ratings (r = –.37, p = .04), which explains only 14% of the variance in human creativity ratings.
Prior research using additive composition models found that responses with higher word counts received a penalty in semantic distance, that is, lower semantic distance scores (Forthmann et al., 2018). Replicating this result, we found that word count per response was negatively correlated with semantic distance scores (r = –0.25). This is problematic because responses with higher word counts (i.e., responses higher in elaboration) were rated by humans as more creative (r = 0.41 between response word count and the mean creativity score across raters). With humans giving higher ratings to longer responses and the additive model assigning them lower semantic distance values, word count appears to explain the negative correlation between human ratings and the additive semantic distance factor. Next, we test whether a multiplicative composition model can mitigate this issue.
We specified a model assessing the relationship between a multiplicative semantic distance model and human creativity ratings (Fig. 1). This model fit the data well: χ2(132) = 185.785, p < .001; CFI = .970; RMSEA = .049; SRMR = .079. Results revealed a large correlation between latent semantic distance and human ratings: r = .91, p < .001 (Fig. 2). Thus, 83% of the variance in human ratings could be explained by a latent factor of five multiplicative semantic distance models. Notably, this is much higher than the 14% of variance explained by the latent semantic factor derived from additive models.
In addition, the multiplicative composition model reversed the correlation between response word count and semantic distance (r = .47). For this model, responses with more elaboration now receive a boost in semantic distance. This is consistent with human creativity ratings, which also give a boost to more elaborate responses (r = .41), as noted above. This is a critical new finding because it shows a multiplicative model can substantially mitigate the elaboration bias demonstrated in prior research using semantic distance to capture creativity (Forthmann et al., 2018).
Validation with external measures
Having found that multiplicative models outperform additive models in predicting human creativity ratings, we turned to further validate multiplicative models with a range of external creativity measures, spanning cognition (novel metaphor), behavior (creative achievement), and self-report (creative self-efficacy).
We began by specifying a CFA with the same latent semantic distance and human creativity variables, adding a higher-order novel metaphor factor comprised of two lower-order metaphor prompts, with four raters per prompt (χ2(293) = 362.249, p < .001; CFI = .973; RMSEA = .037; SRMR = .070). The model yielded significant correlations between creative metaphor and both AUT creativity (r = .49, p = .001) and AUT semantic distance (r = .41, p = .005), indicating convergent validity of semantic distance at the cognitive level.
Our next analysis focused on creative behavior. We specified a latent variable comprised of the four creative behavior scales, along with the same AUT creativity and AUT semantic distance variables (χ2(204) = 276.402, p < .001; CFI = .967; RMSEA = .046; SRMR = .078). Consistent with past work, creative behavior correlated significantly with AUT creativity (r = .43, p < .001). The model also showed a small effect for AUT semantic distance (r = .21, p = .04).
Next, we assessed effects of creative self-efficacy, specifying a latent variable comprised of its two lower-order facets, along with AUT creativity and AUT semantic distance (χ2(166) = 225.198, p < .001; CFI = .971; RMSEA = .046; SRMR = .078). The model showed significant correlations between creative self-efficacy and both AUT creativity (r = .36, p < .001) and AUT semantic distance (r = .32, p = .002), replicating prior work on AUT creativity and providing further convergent validity for semantic distance at the level of creative personality.
Finally, we examined effects of fluid intelligence and personality. We first specified a model with fluid intelligence, AUT creativity, and AUT semantic distance (χ2(184) = 246.041, p < .001; CFI = .970; RMSEA = .044; SRMR = .073). Fluid intelligence correlated significantly with AUT creativity (r = .36, p = .003), consistent with past work, but it showed a small and nonsignificant effect on AUT semantic distance (r = .10, p = .39). Regarding personality, we specified a model with the five factors of personality correlating with the two AUT variables (χ2(222) = 357.001, p < .001; CFI = .938; RMSEA = .060; SRMR = .080); only openness correlated with AUT creativity (r = .30, p < .001), and it did not correlate with AUT semantic distance (r = .03, p = .77). No other personality factors showed significant effects on AUT creativity or semantic distance.
Study 1 provided preliminary evidence for the validity of semantic distance in predicting human judgements of creativity on the AUT. We found that a latent variable comprised of five semantic distance metrics strongly correlated with human subjective creativity ratings. Semantic distance scores also correlated positively with cognitive and self-report measures related to creativity (metaphor production, creative self-efficacy) but not to other cognitive and personality factors (fluid intelligence and openness). In Study 2, we aimed to replicate a subset of findings from Study 1, using the same AUT items from a previously published dataset (Silvia, Nusbaum, & Beaty, 2017). To this end, we employed the same approach to computing semantic distance from Study 1, and we reanalyzed subjective creativity scores obtained from the original study. We hypothesized that the latent semantic distance variable would again predict human judgements of creativity on the AUT. Notably, Silvia et al. (2017) found that human creativity ratings on the AUT did not significantly correlate with creative behavior, so it remains unclear whether semantic distance scores could predict creative behavior in this sample. We also again tested whether semantic distance correlated with openness, which was not the case in Study 1, but was reported by Prabhakaran et al. (2014) in their study with the noun-verb task. Furthermore, Study 2 sought to extend Study 1 by examining whether semantic distance scores also relate to participants’ self-ratings of creativity on the AUT.
Data for this study were reanalyzed from Silvia et al. (2017), which aimed to validate the old/new scoring method for divergent thinking. The final sample included 142 adults from the University of North Carolina at Greensboro (UNCG; mean age = 19.22, SD = 3.07; 70% female). The study was approved by the UNCG IRB and participants received credit toward a voluntary research option for their time.
Participants completed a series of divergent thinking tasks (AUT) and self-report measures related to creativity (creative behavior and openness). All measures were administered on laboratory computers running MediaLab.
As in Study 1, participants completed two AUT items: box and rope. Likewise, they were asked to “think creatively” while coming up with their responses. Participants were given three minutes to type their responses (note that Study 1 had a 2-min time limit). AUT responses were again scored using the subjective scoring method (Silvia et al., 2008). Three trained raters scored each response on a 1 (not at all creative) to 5 (very creative) scale. As in Study 1, raters were blind to participants’ identities and the serial order of their responses.
After completing the two AUT items, participants were shown their responses and asked to rate the creativity of each response on the same five-point scale. Specifically, they were asked to indicate “how creative, in your opinion, each idea is.” Responses were presented in the order in which they were produced by each participant.
Participants completed two of the same measures of creative behavior from Study 1. To assess creative activities, they completed the BICB (Batey, 2007); to assess creative achievements, they completed the CAQ (Carson et al., 2005).
Openness to experience
Personality was measured using the HEXACO (Lee & Ashton, 2004), which assesses four facets of openness to experience: aesthetic appreciation, inquisitiveness, unconventionality, and creativity. Participants responded to each openness item using a five-point scale (1 = strongly disagree, 5 = strongly agree).
Table 2 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models. Consistent with Study 1, we found that the five multiplicative compositional models correlated positively and variably with individual human raters, with the largest correlations again observed for CBOW models.
We began by specifying a CFA with two latent variables—multiplicative semantic distance and human creativity ratings—using the same model specification as in Study 1: χ2(101) = 199.483, p < .001; CFI = .924; RMSEA = .083; SRMR = .070. Figure 3 depicts the measurement model. Similar to Study 1, all semantic models loaded highly onto their respective latent variable, with the highest loadings from CBOW models. Moreover, consistent with Study 1, we found a large positive correlation between human creativity ratings and latent semantic distance: r = .75, p < .001 (Fig. 4). Thus, approximately half of the variance in human ratings could be explained by the common variance extracted from five semantic distance models.
Next, we tested whether semantic distance similarly predicts participants’ own assessments of their ideas’ creativity. We thus specified a within-person, multilevel structural equation model, with latent semantic distance scores predicting self-ratings of creativity for each AUT item separately. For the box model (χ2(9) = 52.957, p < .001; CFI = .984; RMSEA = .064; SRMR = .025), latent semantic distance scores significantly predicted the self-ratings (unstandardized b = 1.03, SE = .42, p = .01): as semantic distance scores increased, participants rated their ideas as more creative. For the rope model (χ2(9) = 108.672, p < .001; CFI = .983; RMSEA = .097; SRMR = .027), we also found a significant linear effect (unstandardized b = 1.71, p < .001), indicating that multiplicative semantic distance models track participants’ own assessments of their ideas’ creativity.
Validation with external measures
The external validation analysis assessed whether latent semantic distance and human creativity ratings relate to creative behavior and openness to experience. Creative behavior was modeled as a latent variable comprised of everyday hobbies (BICB) and real-world achievements (CAQ), along with AUT creativity and AUT semantic distance (χ2(131) = 243.750, p < .001; CFI = .920; RMSEA = .078; SRMR = .069). Creative behavior did not significantly predict AUT creativity (r = .16, p = .09) or AUT semantic distance (r = .03, p = .84). Regarding openness, a model with latent openness and the two AUT variables (χ2(165) = 300.830, p < .001; CFI = .913; RMSEA = .076; SRMR = .070) showed a significant correlation between openness and both AUT creativity (r = .48, p < .001) and AUT semantic distance (r = .24, p = .02).
Study 2 replicated the latent correlation between human creativity ratings and semantic distance: a latent variable comprised of semantic distance values from five multiplicative compositional models strongly predicted human creativity ratings on the AUT. We also found that semantic distance relates to participants’ self-ratings of creativity, providing further evidence that semantic distance captures variance associated with creativity. Notably, Study 2 found that AUT semantic distance did not correlate with creative behavior; however, the same pattern was found for human creativity ratings on the AUT, suggesting that, in this study, performance on the AUT—assessed via semantic distance and human ratings—did not capture variance associated with creative behavior. On the other hand, contrary to Study 1, openness significantly predicted AUT semantic distance, consistent with the verb generation study of Prabhakaran et al. (2014).
In Study 3, we sought to replicate and extend the findings of our first two studies. To this end, we reanalyzed data using a new and more commonly used AUT item (i.e., brick). We again tested whether semantic distance values correlated with human creativity ratings and with other measures associated with creativity (i.e., openness and metaphor production); we also reassessed the relation between AUT semantic distance and fluid intelligence. Furthermore, we tested an established experimental effect in the divergent thinking literature known as the serial order effect: the tendency for ideas to be rated as more original over time (Acar, Abdulla Alabbasi, Runco, & Beketayev, 2019; Beaty & Silvia, 2012; Christensen, Guilford, & Wilson, 1957; Hass & Beaty, 2018). Although prior work has reported serial order effects on the AUT with LSA (Hass, 2017b), we sought to replicate this effect and extend it using a broader range of compositional semantic models.
Data for this study were reanalyzed from Beaty and Silvia (2012) and Silvia and Beaty (2012), which used the same dataset. The final sample included 133 adults from the University of North Carolina at Greensboro (UNCG; mean age = 19.60, SD = 3.20; 69% female). The study was approved by the UNCG IRB and participants received credit toward a voluntary research option for their time.
Participants completed an extended AUT to assess temporal trends in idea generation (see Beaty & Silvia, 2012). They were given 10 min to continually generate uses for a brick. Note that the task duration was considerably longer than previous studies due to the temporal focus of Beaty and Silvia (2012). Each response was time-stamped to model serial order effects.
The NEO PI-R was administered to assess the five major factors of personality: neuroticism, extraversion, openness, agreeableness, and conscientiousness (McCrae et al., 2005). Each of the five factors was measured with 12 items (60 items total). Participants used a five-point scale to indicate their level of agreement with each item.
Participants completed the same creative metaphor prompts from Study 1 (i.e., ‘boring class’ and ‘gross food’; see Silvia & Beaty, 2012). As in Study 1, metaphor responses were scored for creative quality using the subjective scoring method (Silvia et al., 2008).
Six nonverbal measures of fluid intelligence were administered: 1) a short version of the Raven’s Advanced Progressive Matrices (18 items, 12 min); 2) a paper folding task (ten items, 3 min; Ekstrom et al., 1976); 3) a letter sets task (16 items, 4 min; Ekstrom et al., 1976); 4) the matrices task from the Cattell Culture Fair Intelligence Test (CFIT; 13 items, 3 min; Cattell & Cattell, 1961/2008); 5) the series task from the CFIT (13 items, 3 min); and 6) a number series task (15 items, 4.5 min; Thurstone, 1938).
Table 3 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models.
Predicting human creativity ratings
Our first analysis assessed the association between human creativity ratings and latent semantic distance. We thus specified a CFA with these two latent variables: χ2(19) = 50.090, p < .001; CFI = .951; RMSEA = .111; SRMR = .046. The model yielded a positive and moderately large latent correlation between semantic distance and human ratings (r = .45, p < .001); the magnitude of this correlation is comparable to that of the effects for single AUT items in Studies 1 and 2.
Semantic distance and serial order
Next, we examined whether semantic distance scores showed a serial order effect, i.e., a tendency for responses to become more original over time. To this end, we specified a within-person regression model, with time predicting latent semantic distance; factor loadings were constrained to be equal for model convergence: χ2(13) = 286.036, p < .001; CFI = .976; RMSEA = .108; SRMR = .090. This model showed a significant effect of time on semantic distance (unstandardized b = .05, SE = .01, p < .001): as time increased from 0 to 10 min, so did the semantic distance of AUT responses, demonstrating a serial order effect.
Validation with external measures
Our first external validation analysis assessed correlations between the Big Five personality factors, AUT human ratings, and semantic distance scores: χ2(59) = 130.418, p < .001; CFI = .909; RMSEA = .095; SRMR = .075. Consistent with past work, of the five personality factors, only openness to experience correlated significantly with human creativity ratings: r = .57, p < .001. Replicating Study 2, the model also showed a significant correlation between openness and semantic distance scores: r = .19, p = .007.
Next, we assessed correlations between creative metaphor, human ratings, and semantic distance: χ2(73) = 141.225, p < .001; CFI = .922; RMSEA = .084; SRMR = .059. Replicating Study 1, we found that creative metaphor positively correlated with both human ratings (r = .39, p = .005) and semantic distance scores (r = .20, p = .05).
Regarding fluid intelligence, Beaty and Silvia (2012) previously reported a positive relation between fluid intelligence and AUT creativity ratings (r = .26). We specified a model with fluid intelligence and latent semantic distance (χ2(43) = 56.795, p = .078; CFI = .978; RMSEA = .049; SRMR = .047). We found that fluid intelligence correlated with AUT semantic distance to approximately the same degree as reported for AUT creativity in Beaty and Silvia (2012): r = .24, p = .01.
Study 3 replicated and extended findings from Studies 1 and 2. Using a new and more commonly used AUT object (i.e., brick), we found that the latent semantic distance factor again predicted human creativity ratings, indicating that the relation between creativity ratings and semantic distance is not item-dependent. We also found that semantic distance was sensitive to an established experimental effect in the creativity literature known as the serial order effect: as time increased, the semantic distance of responses also increased, consistent with findings for human ratings. Regarding external measures, semantic distance significantly related to openness and fluid intelligence, partially replicating Studies 1 and 2.
In Study 4, we sought to extend our application of semantic distance beyond the AUT to a word association task employed in the creativity literature: the verb generation task (Prabhakaran et al., 2014). The verb generation task presents a series of nouns and asks participants to generate either common or creative verbs related to the nouns; responses have commonly been assessed via LSA (Green, 2016). Here, we reanalyze verb generation data from two studies conducted by Heinen and Johnson (2018) that include human ratings on multiple dimensions: novelty, creativity, and appropriateness. Heinen and Johnson (2018) previously reported moderate to large correlations between these dimensions and semantic distance computed via LSA. We test whether a latent semantic distance factor extracted from the five semantic models improves the prediction of human creativity ratings compared to average scores computed via LSA.
We reanalyzed data from two samples of participants from Heinen and Johnson (2018). Sample 1 (n = 62, 39 women, mean age = 37 years, age range: 20–60) and Sample 2 (n = 56, 30 women, mean age = 37 years, age range: 20–69) were recruited from Amazon Mechanical Turk (MTurk). All participants were compensated $0.50 for completing the half-hour study.
Sample 1 and Sample 2 included the same stimulus set, which consisted of 60 common nouns taken from the Appendix of Prabhakaran et al. (2014). The nouns varied in level of constraint, or the extent to which they tended to yield a restricted range of responses (e.g., for the noun ‘scissors’, most participants produce the verb ‘cut’; Heinen & Johnson, 2018; Prabhakaran et al., 2014). A goal of Heinen and Johnson (2018) was to test whether varying instructional cues to generate common, random, or creative verbs affected semantic distance. The authors thus created three lists of 20 nouns that were matched on constraint (see Heinen & Johnson, 2018, for details on the stimulus set).
Sample 1 was a within-subjects design and Sample 2 was a between-subjects design. In Sample 1, all participants received the three cued instructions in a fixed order: common, random, and creative, respectively (cf., Harrington, 1975; Wilken, Forthmann, & Holling, 2019). In Sample 2, participants were randomly assigned to one of four conditions: common, random, specific creative (additional creativity instruction), and nonspecific creative (minimal creativity instruction; see Heinen & Johnson, 2018). For the present analysis, we only included the creative trials in both samples (one creative condition in Sample 1, two creative conditions in Sample 2), given our goal of validating semantic distance in the assessment of creativity (not randomness or commonness). In both samples, after responding to demographic questions, participants completed a self-paced verb generation task. Each trial presented a noun on the screen; participants were asked to think of a verb (based on instruction condition) and to advance to the next slide as soon as their response was in mind. They were then instructed to type their verb response into a textbox. Participants were given a break after 20 trials, during which they received a new instruction set (Sample 1; within-subjects) or a brief reminder of how to respond in each instruction condition (Sample 2; between-subjects).
Two trained raters blind to experimental condition coded responses from Sample 1 and Sample 2 on three dimensions: creativity, novelty, and appropriateness. The purpose of this coding scheme was to determine whether LSA was sensitive to variation in cued instruction (i.e., creative, random, and common) and whether LSA values correlate with human ratings. One rater coded 31 of the 60 nouns, another rater coded 27 nouns, and both raters coded two nouns to assess interrater reliability. The ratings were completed using the same five-point scale (1 = not X to 5 = definitely X), with X corresponding to creative, novel, or related. Similar to the AUT, responses were coded using a subjective scoring method that followed the guidance and definitions of the consensual assessment technique (Amabile, 1983). Specifically, appropriateness/relatedness was defined as the extent to which a response was “comprehensible, understandable, and accessible”; novelty was defined as “originality or newness…a novel response can be completely unrelated to the noun”; and creativity was defined as a combination of novelty and appropriateness (cf., Diedrich et al., 2015), with the addition of “cleverness” and “non-obviousness.” In both samples, raters coded creativity first, followed by novelty and appropriateness.
Latent factor extraction
One goal of this study was to test how a latent factor score, comprised of the five semantic spaces, relates to human ratings of novelty, creativity, and appropriateness. Notably, this approach extends our first three studies by computing factor scores at the trial level, rather than the individual-subject level. We built this capability into the SemDis platform, allowing users to leverage the power of SEM regardless of their level of expertise. The latent variable, derived from the five semantic models, was modeled at the item level in the lavaan R package using the cfa() function, and factor scores were computed with the lavPredict() function (Rosseel, 2012).
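The factor scores themselves were computed with lavaan's cfa() and lavPredict() in R. As a rough, hedged stand-in for readers without lavaan, the common variance of several model scores can be approximated by projecting standardized scores onto the first principal component of their correlation matrix; this is a sketch of the general idea, not the SEM-based estimator SemDis uses, and the data below are made up:

```python
import math

def standardize(column):
    """Convert raw scores to z-scores (sample standard deviation)."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]

def first_component_scores(columns, iterations=200):
    """Approximate a single-factor score: power iteration on the correlation
    matrix yields a loading vector; each trial's standardized scores are
    then projected onto it."""
    z = [standardize(col) for col in columns]      # variables x trials
    n_var, n_obs = len(z), len(z[0])
    corr = [[sum(z[i][t] * z[j][t] for t in range(n_obs)) / (n_obs - 1)
             for j in range(n_var)] for i in range(n_var)]
    w = [1.0] * n_var
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(n_var)) for i in range(n_var)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return [sum(w[i] * z[i][t] for i in range(n_var)) for t in range(n_obs)]

# Three toy "semantic model" score columns over four trials.
scores = first_component_scores([[0.1, 0.4, 0.6, 0.9],
                                 [0.2, 0.3, 0.7, 0.8],
                                 [0.0, 0.5, 0.5, 1.0]])
print(scores)  # one composite score per trial, increasing across these trials
```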
Table 4 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models. A total of 1,240 verb responses were included in the analysis. The response set was cleaned following the procedure described in Heinen and Johnson (2018), i.e., correction of unambiguous spelling errors and removal of additional words and ending suffixes. Interrater reliability for the two nouns scored by both raters (n = 115 verb responses) was strong across the three scoring dimensions: creativity (r = .72), novelty (r = .74), and appropriateness (r = .75).
Correlations between human ratings and semantic distance
We began by computing Pearson correlations between mean human ratings (creativity, novelty, appropriateness) and semantic distance (latent factor and five semantic models; see Table 4 and Fig. 5). Novelty and appropriateness were strongly negatively correlated (r = – .63); novelty and creativity were strongly positively correlated (r = .80); and creativity and appropriateness were negatively (but not significantly) correlated (r = – .19, p = .14).
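The Pearson correlations reported throughout these analyses follow the standard formula; a minimal sketch (with made-up ratings, not study data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of ratings or scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up ratings in which one dimension falls as the other rises,
# echoing the inverse novelty-appropriateness relation.
print(pearson_r([1, 2, 3, 4, 5], [5, 4, 2, 2, 1]))  # roughly -0.96
```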
Next, we computed correlations between human ratings and semantic distance for each semantic model. Regarding creativity, the five semantic spaces showed comparable but variable correlations with creativity ratings, with the highest found for GloVe (r = .61) and cbowukwac (r = .55); the latent semantic distance factor showed a moderately large correlation with human creativity ratings (r = .49). Regarding novelty, we found larger correlations with semantic distance, with GloVe again showing the largest effect size (r = .84), followed by cbowukwac (r = .78); the latent factor correlated with human novelty ratings to a similar, though somewhat attenuated, degree (r = .73), likely because of the lower correlations for the remaining models. Regarding appropriateness, we found similarly large but negative correlations with semantic distance (e.g., GloVe: r = –.75), consistent with the inverse relation between novelty and appropriateness (see Table 4); the latent factor showed a comparably strong negative correlation with appropriateness (r = –.81).
Table 5 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models. A total of 3,360 verb responses were included in the analysis. The same preprocessing procedure was applied as in Sample 1 (e.g., correction of unambiguous spelling errors). Interrater reliability was generally high but varied across the four instruction conditions and dependent measures (see Heinen & Johnson, 2018, Appendix B).
Correlations between human ratings and semantic distance
We computed Pearson correlations between the three human ratings (creativity, novelty, appropriateness) and semantic distance (latent factor and five semantic models; see Table 5 and Fig. 5). Creativity and novelty ratings were positively correlated (r = .83); novelty and appropriateness were strongly negatively correlated (r = – .91); and creativity and appropriateness were negatively correlated (r = – .55). The pattern of human rating correlations is thus comparable to Sample 1.
Next, we assessed correlations between the human creativity ratings and semantic distance. Replicating Sample 1 (and consistent with results reported in Heinen & Johnson, 2018), the five semantic models showed large correlations with creativity ratings; the largest correlations were found between creativity ratings and GloVe (r = .78), but similarly large correlations were seen for cbowukwac (r = .76) and TASA (r = .75). The latent semantic distance variable showed a large effect consistent with the correlational pattern of the five individual models (r = .73; Fig. 5).
Regarding novelty, all five semantic models showed near-perfect correlations with novelty ratings: GloVe (r = .97), cbowukwac (r = .96), cbowsubs (r = .91), TASA (r = .92), and cbowBNC (r = .89). These large effect sizes were reflected in the latent factor score: r = .95.
Regarding appropriateness, a similar pattern emerged, albeit in the opposite direction. Human ratings of appropriateness were strongly negatively correlated with semantic distance: GloVe (r = – .91), cbowukwac (r = – .92), cbowsubs (r = – .94), TASA (r = – .88), and cbowBNC (r = – .91); the latent semantic distance factor was comparable in magnitude (r = – .93).
Study 4 extended the recent word association work of Heinen and Johnson (2018) examining the correspondence between semantic distance and human ratings of creativity, novelty, and appropriateness. In two datasets, using data from a noun-verb association task, we found that the five semantic models showed the strongest associations with human ratings of novelty and appropriateness—with correlations approaching unity—and moderate associations with human ratings of creativity. These results extend the application of semantic distance in creativity assessment to the verb generation task, which is increasingly used in creativity research.
In Study 5, we sought to replicate and extend the findings using a second word association task requiring multiple responses. To this end, we reanalyzed data from Johnson et al. (2019), who employed a noun association task to study the “idea diversity” of responses, i.e., the extent to which responses semantically diverge from each other, rather than from the response cue—similar to the “flexibility” metric in divergent thinking tasks. In these studies, participants were asked to generate either two or four creative words in response to a given noun. Novelty was assessed by computing the average semantic distance between the noun prompt and a participant’s creative responses. To assess whether human ratings of creativity correspond to semantic distance, we obtained creativity ratings from three independent raters. This approach allowed us to test the relative performance of five semantic models, along with the latent factor score, in predicting human creativity ratings on a newly developed assessment of creative association making.
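The novelty computation described above—one minus the cosine between a cue’s vector and each response’s vector, averaged over a participant’s responses—can be sketched in a few lines. The vectors below are made-up toy values for illustration only; in practice one would load pretrained embeddings (e.g., GloVe) rather than define vectors by hand:

```python
import numpy as np

def semantic_distance(cue_vec, resp_vec):
    """Semantic distance = 1 - cosine similarity between two word vectors."""
    cos = np.dot(cue_vec, resp_vec) / (np.linalg.norm(cue_vec) * np.linalg.norm(resp_vec))
    return 1.0 - cos

def novelty_score(cue_vec, response_vecs):
    """Average semantic distance between a cue and each of a participant's responses."""
    return float(np.mean([semantic_distance(cue_vec, v) for v in response_vecs]))

# Toy 4-dimensional vectors for illustration only (real spaces have 300+ dimensions)
cue = np.array([1.0, 0.2, 0.0, 0.5])
responses = [np.array([0.9, 0.1, 0.1, 0.4]),   # close associate -> small distance
             np.array([0.0, 1.0, 0.8, 0.0])]   # remote associate -> large distance
print(novelty_score(cue, responses))
```

Identical vectors yield a distance of 0, orthogonal vectors a distance of 1, so a more remote response pushes the novelty score upward.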
We reanalyzed response data from the “any” condition in Study 2 of Johnson et al. (2019). Participants (n = 58, 57% women, mean age = 38 years, age range: 18–82) were recruited from Amazon Mechanical Turk (MTurk) and were compensated $1.50 to complete the half-hour study.
In their original study, Johnson et al. (2019) designed a task to assess idea diversity, i.e., the conceptual variance between individual ideas, akin to flexibility in divergent thinking assessment. Participants completed a new word association task—the Corpus-based Assessment of Novelty and Diversity (C-BAND)—that presents a series of nouns and asks participants to generate four noun responses. The nouns could be of any type with the exception of proper nouns. Participants were asked to “think creatively” and come up with associations that could be creatively linked to the given noun.
To assess the effect of instruction on idea diversity, Johnson et al. (2019) randomly assigned participants to three conditions. We included participants from the “any” condition because the instructions are closest to those commonly given to participants in creativity studies (see Supplemental Materials). Following demographic questions and task instruction, participants completed one practice trial with the cue word dog. Then, they completed eight experimental trials of the C-BAND. Each trial presented a noun and asked participants to generate four associations with no time limit.
The aim of this study was to validate semantic distance against human creativity ratings using a new word association task. We therefore obtained creativity ratings from three MTurk workers who were thoroughly briefed on the scoring protocol (see Supplemental Materials). MTurk workers have previously provided reliable ratings for creativity tasks such as the AUT (Hass et al., 2018; Hass & Beaty, 2018). During instruction, they were told that the responses they would rate came from “MTurkers who were asked to generate the most creative words they could, but linked to a given noun. Creative words are clever or surprising words that very few other people come up with.” Raters were given practice items with feedback to maximize reliability across raters. For example, if they rated the example item glass-jaw as low in originality, they received feedback saying “incorrect.” Once they completed the instruction phase, they were given a spreadsheet with a list of responses to rate. Raters coded the responses for originality using a 1 (low originality) to 5 (extremely original) scale. A composite average of the three raters’ scores was computed for analysis.
Table 6 presents zero-order correlations and descriptive statistics for creativity ratings and semantic distance models. A total of 1,856 noun responses were included in the analysis. The response set was cleaned for spelling errors and inappropriate responding (i.e., proper nouns). Interrater reliability across the three raters was high (Cronbach’s alpha = .88).
Correlations between human ratings and semantic distance
We began by computing Pearson correlations between the mean originality ratings and the five semantic distance models. All five semantic models correlated highly and positively with originality ratings: cbowukwac (r = .82), GloVe (r = .77), cbowsubs (r = .81), cbowBNC (r = .76), and TASA (r = .78). Notably, raters often agreed more strongly with individual semantic models than they did with other raters (see Table 6); for example, rater 3 showed the highest correlation with cbowukwac (r = .80) and the lowest correlation with GloVe (r = .67), whereas rater 1 showed the highest correlation with GloVe (r = .77) and lowest correlation with cbowBNC (r = .68). This suggests that the five semantic models capture non-redundant variance in human originality ratings. Consistent with this observation, we found a large correlation between mean originality and latent semantic distance (r = .85), indicating a strong correspondence between the common variance associated with human originality ratings and semantic distance.
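To illustrate how a latent semantic distance factor pools the common variance of several models, the sketch below extracts the first principal component from the standardized scores of five hypothetical models. This is only a rough stand-in for the structural equation models used in the present studies, and every number below is invented for illustration:

```python
import numpy as np

# Hypothetical semantic distance scores for eight responses from five models
# (columns: e.g., GloVe, cbowukwac, cbowsubs, cbowBNC, TASA) -- invented values
scores = np.array([
    [0.62, 0.58, 0.60, 0.55, 0.59],
    [0.80, 0.77, 0.74, 0.72, 0.79],
    [0.35, 0.40, 0.38, 0.33, 0.36],
    [0.90, 0.85, 0.88, 0.83, 0.87],
    [0.50, 0.52, 0.49, 0.47, 0.51],
    [0.70, 0.66, 0.69, 0.64, 0.68],
    [0.25, 0.30, 0.28, 0.26, 0.27],
    [0.55, 0.60, 0.57, 0.54, 0.58],
])

# Standardize each model's scores, then take the first principal component
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
cov = np.cov(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = z @ eigvecs[:, -1]                 # loadings for the largest eigenvalue

# The sign of a principal component is arbitrary; orient it so the factor
# increases with the mean score across models
if np.corrcoef(pc1, scores.mean(axis=1))[0, 1] < 0:
    pc1 = -pc1

print(np.corrcoef(pc1, scores.mean(axis=1))[0, 1])
```

When the models agree closely, as here, the factor score is nearly interchangeable with a simple average of the five models’ scores; the factor approach earns its keep when models capture partially non-redundant variance.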
Creativity research has long relied on the subjective judgements of human raters to evaluate the novelty and utility of ideas and products. Although such manual scoring approaches have proved useful for the field, they face two key limitations (labor cost and subjectivity), which threaten reliability and can act as a barrier for researchers without the resources to code thousands of responses. We sought to address these limitations of subjective scoring by capitalizing on recent progress in the automated assessment of creativity via semantic distance. In five studies, we demonstrate that a latent semantic distance variable—reflecting the common variance of five multiplicative compositional models—can reliably predict human judgements on a widely used task of divergent thinking (i.e., the AUT) and on two newly developed word association tasks. Evidence for the convergent validity of semantic distance was found across three studies that included other creativity measures: AUT semantic distance correlated positively with some established measures of creativity, including cognition, personality, and behavior. Together, these findings indicate that semantic distance provides a reliable and valid alternative to human creativity ratings.
Study 1 established evidence for the utility of semantic distance in predicting human creativity on the AUT. Approximately 80% of the variance in human ratings could be explained by a higher-order latent variable comprised of two AUT items and five multiplicative compositional models. These findings are consistent with the recent work of Dumas et al. (2020), who found that human creativity ratings on the AUT correlated strongly with semantic distance, particularly GloVe. Notably, our study used multiplicative models, whereas Dumas and colleagues used additive models, and we found that multiplicative models predicted human ratings substantially better than additive models, a finding that replicated in Study 2 and is consistent with recent work comparing additive and multiplicative models in the context of predicting human similarity judgments (Mitchell & Lapata, 2010). In addition, Forthmann et al. (2018) showed that additive compositional models penalize longer creative responses by reducing their semantic distance, when in fact elaboration should often increase creativity. Although removing stop words mitigates this penalty (Forthmann et al., 2018), we showed for the first time that a multiplicative compositional model reversed the correlation between elaboration and semantic distance: the multiplicative model showed positive correlations between semantic distance and elaboration, whereas the additive model showed negative correlations. More work is needed to systematically investigate the elaboration bias that may still affect semantic distance (see Footnote 2), but we show that employing a multiplicative model substantially improved the correspondence between human creativity ratings and semantic distance.
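The difference between additive and multiplicative composition can be made concrete with toy vectors. Following Mitchell and Lapata (2010), an additive model sums the vectors of a multiword response, whereas a multiplicative model takes their element-wise product, before semantic distance to the cue is computed. The vectors below are invented for illustration and happen to be all positive; real embeddings contain negative components, which is one reason the two compositions can diverge in practice:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def compose_additive(word_vecs):
    """Additive composition: element-wise sum of the word vectors."""
    return np.sum(word_vecs, axis=0)

def compose_multiplicative(word_vecs):
    """Multiplicative composition: element-wise product of the word vectors."""
    return np.prod(word_vecs, axis=0)

# Toy vectors for a cue word and a two-word response -- illustration only
cue = np.array([0.8, 0.1, 0.3, 0.6])
response_words = [np.array([0.2, 0.9, 0.4, 0.1]),
                  np.array([0.1, 0.7, 0.8, 0.2])]

for compose in (compose_additive, compose_multiplicative):
    vec = compose(response_words)
    print(compose.__name__, 1.0 - cosine(cue, vec))  # semantic distance to cue
```

Intuitively, the product amplifies dimensions on which the response words overlap and suppresses the rest, whereas the sum lets every word contribute, so longer responses drift toward the corpus average under addition.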
Study 3 replicated the semantic distance-human rating effect with a new and commonly used AUT object (i.e., brick); here, the correlation between semantic distance and human scores was smaller than in Studies 1 and 2, likely because Study 3 used a single task, which highlights the benefits of using multiple trials/tasks to assess creative potential (Barbot, 2018).
It is important to mention that, although semantic distance explained a sizeable proportion of variance in human creativity ratings on the AUT, a non-negligible proportion of variance was left unexplained, potentially because human raters weight factors other than novelty (e.g., cleverness, usefulness) when rating alternative uses for objects. In our studies, raters were instructed to prioritize novelty and remoteness (see Supplemental Materials), likely boosting observed correlations with semantic distance. But because the AUT requires people to produce a workable use for an object, raters should also consider usefulness/appropriateness. Indeed, the semantic distance approach with the AUT can be “hacked” if participants simply respond with random or task-unrelated words, which would yield highly semantically distant but meaningless responses. We therefore encourage users to carefully screen their response files during the preprocessing stage to ensure data quality (a procedure that is notably not necessary for human ratings). In addition, Heinen and Johnson (2018) showed that simply emphasizing that the goal of the task is to “be creative” leads participants to implicitly ensure that their responses are both novel and appropriate.
Another goal of this research was to assess the convergent validity of semantic distance in the context of the AUT. Previous studies using the verb generation task have found that semantic distance correlates with a battery of established creativity measures (human creativity ratings on the AUT, creative achievement, creative writing) and other cognitive/personality variables linked to creativity (openness, fluid intelligence; Prabhakaran et al., 2014). Our three studies with the AUT provide a partial replication and extension of this work across a diverse range of cognitive and self-report measures. Regarding cognition, Study 1 and Study 3 found that AUT semantic distance correlated positively and moderately with human ratings of creative metaphor quality. Regarding self-report, the findings were more variable across studies. Semantic distance positively predicted openness to experience in two out of three samples, whereas semantic distance predicted creative behavior in one out of two samples. These mixed findings could be explained in part by the inherent limitations of the AUT: although AUT semantic distance did not significantly relate to creative achievement in Study 2, neither did AUT human ratings. Moreover, the scale used to assess creative achievement in Study 2 (i.e., the CAQ) typically yields a highly skewed distribution in younger, college samples (Silvia, Wigert, Reiter-Palmon, & Kaufman, 2012), who have had little time to produce publicly recognizable creative products. Nevertheless, we found additional evidence of validity when the focus was more local to the AUT: a semantic distance factor correlated positively with participants’ self-ratings of creativity (Study 2) and with increasing time on task (i.e., the serial order effect; Study 3), indicating that this automated metric captures information important to the task. We also found mixed evidence for the association between semantic distance and fluid intelligence.
A notable difference between Study 1 and Study 3, however, concerns the duration of the AUT trials (3 min vs. 10 min, respectively). One possibility is that, consistent with the serial order effect, the greater number of more distant responses in Study 3 provided more variance in performance, increasing the power to detect an effect. Future work should further examine the link between semantic distance and intelligence, employing experimental manipulations of task parameters (e.g., trial duration) to determine the extent to which the ability to generate semantically-distant ideas relates to fluid intelligence.
Study 4 and Study 5 extended our application of semantic distance to two word association tasks: the verb generation task (Study 4; Prabhakaran et al., 2014) and the C-BAND (Study 5; Johnson et al., 2019). Study 4 found that, across two samples, the five semantic models correlated positively (but variably) with human ratings of novelty, appropriateness, and creativity. The highest correlations were found for novelty, with correlations between semantic distance and human novelty ratings approaching unity in Sample 2. Interestingly, despite relatively high inter-rater agreement, the human raters often agreed more with individual semantic models than with other human raters. In a similar vein, specific raters tended to correlate more with some semantic models than others, suggesting that the five semantic models capture nonredundant variance in human judgements and lending support to a latent variable approach. Moreover, Study 5 found a large correlation between a latent semantic distance factor and a latent variable comprised of human originality ratings (r = .85), approaching the near-perfect correlation between semantic distance and novelty ratings found in Study 4. This finding illustrates the importance of instruction at both the front-end (participant) and back-end (rater): when novelty is emphasized over appropriateness, the correlation between human ratings and semantic distance will likely increase. But with a greater emphasis on appropriateness, the correlation is likely to be attenuated, consistent with the increasing pattern of correlation reported in Study 4 for appropriateness, creativity, and novelty, respectively (cf. Heinen & Johnson, 2018).
Summary, limitations, and future directions
The present study is the first, to our knowledge, to leverage latent variable modeling to combine multiple semantic distance models in the context of creativity assessment (cf. Beketayev & Runco, 2016). Recent work has provided evidence for the utility of individual semantic models in predicting human ratings of creativity (Dumas et al., 2020; Johnson et al., 2019; Prabhakaran et al., 2014), with a majority of work focusing on the TASA model of LSA (Kenett, 2019). A strength of this approach is that it can address some previous limitations of semantic distance applications to creativity research, such as biases introduced by corpus choice and the algorithms used to compute semantic distance. Although the current approach is not immune to such limitations, the inclusion of several of the top-performing models currently available—which include corpora from diverse sources of naturalistic language (e.g., subtitles)—partially mitigates this source of bias. It is important to note, however, that some of the text corpora used in the current study overlapped (i.e., the three CBOW spaces shared some of the same texts), which may have influenced the current findings; indeed, comparable validity might be achieved with a simpler set of semantic spaces, as suggested by the recent validation study of Dumas et al. (2020), which showed similarly high correlations between human ratings and semantic distance using single semantic models.
Although we generally recommend that future users adopt the latent variable approach, there may be some cases where a specific semantic model (or an average of the five models’ semantic distance scores) would be best. Factor models require large amounts of data for model convergence and reliability (n > 100). Small samples can also lead to inadequate fit of the data to the specified structural models, as was the case with some models presented in the present study, which occasionally yielded fit statistics above recommended cutoffs. Consequently, if data sets are small, then we would recommend using a single semantic model or averaging the semantic distance scores across the five semantic models. We therefore included a feature on the online platform (SemDis) that gives users the option to extract and download a latent factor score comprised of the five semantic models (see Supplemental Tutorial), alongside semantic distance scores from the five individual models and an average semantic distance score across the five models.
The present work contributes to the growing study of creativity in the context of semantic networks (Christensen & Kenett, 2019; Kenett & Faust, 2019; Zemla, Cao, Mueller, & Austerweil, 2020). Kenett and colleagues have published several recent papers empirically validating the longstanding associative theory of creativity (Mednick, 1962), which posits that creative thinking involves making connections between remote concepts in semantic memory. Several studies have found that individual creative thinking is characterized by a more flexible network structure, marked by short path lengths and high connectivity between nodes, coupled with low modularity of the network structure; these networks can be modeled by applying network science tools to free association data (Kenett et al., 2014) and human relatedness judgements (Kenett, Levi, Anaki, & Faust, 2017), which can in turn be related to measures of creative thinking (Christensen et al., 2018; Kenett, 2019; Kenett et al., 2014; Kenett & Faust, 2019). One mechanism thought to facilitate conceptual combination is spreading activation—activation of one concept in semantic memory spreads to other connected concepts and quickly decays over time. De Deyne and colleagues proposed a spreading activation metric derived from word association data (De Deyne, Navarro, Perfors, & Storms, 2016), which they showed was capable of capturing weak but reliable similarity judgments. Future research could explore whether this approach can complement other semantic measures to quantify individual differences in creative thinking along the lines of the current study and the network-based methods of Kenett et al.
It is important to consider the limitations of semantic distance for creativity assessment. Although the semantic distance approach provides a useful tool for creativity research, it is not necessarily more reliable or valid than subjective ratings. Along these lines, we found that human ratings tended to show numerically higher validity with respect to correlations with other creativity measures. Importantly, however, we also found that semantic distance reliably correlates with these same measures, suggesting that this automated approach provides a reliable and valid alternative to human ratings. Another notable feature of semantic distance is its relative correspondence to human novelty vs. creativity ratings. Indeed, our data suggest that semantic distance is slightly more sensitive to novelty than creativity, consistent with the similarity-based algorithms used to compute these values. A semantically distant response is thus likely to be perceived by humans as novel because both humans and semantic models are sensitive to conceptual remoteness. But the creativity criterion has the added burden of usefulness, i.e., whether the response is fitting, witty, or clever, which is not currently captured by semantic distance. In the end, semantic distance is a novelty metric, not a direct line to creativity, but a proxy with demonstrable validity. At the same time, we would argue that undergraduate students, who often rate responses to creativity tasks, are also not a direct line to creativity. Indeed, past work has highlighted issues with their data as well (e.g., fatigue, bias, disagreement). Moreover, creativity researchers do not all agree on what constitutes a creative idea, so semantic distance and human raters may both be imperfect, just in different ways. Ultimately, given the burdens of subjective human ratings, if automated assessments even come close to the levels of validity of human ratings, we see this as a substantial step forward.
We encourage future research to continue to explore automated approaches to creativity assessment. Indeed, we encourage active debate on the limits of such automated tools. Can computational tools perfectly capture human creativity? Are there some cases where human judgements are preferred over computational metrics? The current study indicates that computational linguistic measures of semantic distance can explain considerable variance in human creativity and novelty ratings, but our findings are limited to verbal creativity and word association tasks. The extent to which automated methods can capture creativity in the context of non-verbal tasks (e.g., drawing) remains unknown. To this end, future work could leverage machine learning methods to uncover the features of visuospatial creative products that predict human creativity judgements. Moreover, semantic distance is best suited to capture novelty, but creativity is thought to require both novelty and usefulness (Diedrich et al., 2015). The semantic distance approach could thus be supplemented in the future by an algorithm that weighs novelty and usefulness much as people do when making aesthetic judgements, which may bring us closer to achieving a fully automated assessment of human creativity.
The file size limitations and current computational processing speed for SemDis can be found in the SemDis Tutorial tab (Troubleshooting and File Size Limits) of the webapp at semdis.wlu.psu.edu.
We thank a reviewer for highlighting this point and for performing preliminary simulations demonstrating elaboration bias still exists when using multiplicative models.
Acar, S., Abdulla Alabbasi, A. M., Runco, M. A., & Beketayev, K. (2019). Latency as a predictor of originality in divergent thinking. Thinking Skills and Creativity, 33. https://doi.org/10.1016/j.tsc.2019.100574
Acar, S., & Runco, M. A. (2014). Assessing associative distance among ideas elicited by tests of divergent thinking. Creativity Research Journal, 26(2), 229–238. https://doi.org/10.1080/10400419.2014.901095
Acar, S., & Runco, M. A. (2019). Divergent thinking: New methods, recent research, and extended theory. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 153–158. https://doi.org/10.1037/aca0000231
Acar, S., Runco, M. A., & Park, H. (2019). What should people be told when they take a divergent thinking test? A meta-analytic review of explicit instructions for divergent thinking. Psychology of Aesthetics, Creativity, and the Arts. https://doi.org/10.1037/aca0000256
Adnan, A., Beaty, R. E., Silvia, P. J., Spreng, R. N., & Turner, G. R. (2019). Creative aging: Functional brain networks associated with divergent thinking in older and younger adults. Neurobiology of Aging, 75, 150–158. https://doi.org/10.1016/j.neurobiolaging.2018.11.004
Amabile, T. M. (1983). The social psychology of creativity: A componential conceptualization. Journal of Personality and Social Psychology, 45(2), 357–376. https://doi.org/10.1037/0022-3514.45.2.357
Barbot, B. (2018). The dynamics of creative ideation: Introducing a new assessment paradigm. Frontiers in Psychology, 9. https://doi.org/10.3389/fpsyg.2018.02529
Barbot, B., Besançon, M., & Lubart, T. (2016). The generality-specificity of creativity: Exploring the structure of creative potential with EPoC. Learning and Individual Differences, 52, 178–187. https://doi.org/10.1016/j.lindif.2016.06.005
Barbot, B., Hass, R. W., & Reiter-Palmon, R. (2019). Creativity assessment in psychological research: (Re)setting the standards. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 233–240. https://doi.org/10.1037/aca0000233
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 1, pp. 238–247). https://doi.org/10.3115/v1/p14-1023
Batey, M. D. (2007). A psychometric investigation of everyday creativity. University of London, University College London (United Kingdom).
Beaty, R. E., Christensen, A. P., Benedek, M., Silvia, P. J., & Schacter, D. L. (2017). Creative constraints: Brain activity and network dynamics underlying semantic interference during idea production. NeuroImage, 148, 189–196. https://doi.org/10.1016/j.neuroimage.2017.01.012
Beaty, R. E., Kenett, Y. N., Christensen, A. P., Rosenberg, M. D., Benedek, M., Chen, Q., … Silvia, P. J. (2018). Robust prediction of individual creative ability from brain functional connectivity. Proceedings of the National Academy of Sciences of the United States of America, 115(5), 1087–1092. https://doi.org/10.1073/pnas.1713532115
Beaty, R. E., Kenett, Y. N., Hass, R. W., & Schacter, D. L. (2019). A fan effect for creative thought: Semantic richness facilitates idea quantity but constrains idea quality. PsyArXiv. https://doi.org/10.31234/osf.io/pfz2g
Beaty, R. E., & Silvia, P. J. (2012). Why do ideas get more creative across time? An executive interpretation of the serial order effect in divergent thinking tasks. Psychology of Aesthetics, Creativity, and the Arts, 6(4), 309–319. https://doi.org/10.1037/a0029171
Beaty, R. E., & Silvia, P. J. (2013). Metaphorically speaking: Cognitive abilities and the production of figurative language. Memory and Cognition, 41(2), 255–267. https://doi.org/10.3758/s13421-012-0258-5
Beaty, R. E., Silvia, P. J., Nusbaum, E. C., Jauk, E., & Benedek, M. (2014). The roles of associative and executive processes in creative cognition. Memory and Cognition, 42(7), 1186–1197. https://doi.org/10.3758/s13421-014-0428-8
Beketayev, K., & Runco, M. A. (2016). Scoring divergent thinking tests by computer with a semantics-based algorithm. Europe’s Journal of Psychology, 12(2), 210–220. https://doi.org/10.5964/ejop.v12i2.1127
Benedek, M., Jauk, E., Sommer, M., Arendasy, M., & Neubauer, A. C. (2014). Intelligence, creativity, and cognitive control: The common and differential involvement of executive functions in intelligence and creativity. Intelligence, 46(1), 73–83. https://doi.org/10.1016/j.intell.2014.05.007
Benedek, M., Mühlmann, C., Jauk, E., & Neubauer, A. C. (2013). Assessment of divergent thinking by means of the subjective top-scoring method: Effects of the number of top-ideas and time-on-task on reliability and validity. Psychology of Aesthetics, Creativity, and the Arts, 7(4), 341–349. https://doi.org/10.1037/a0033644
Bossomaier, T., Harre, M., Knittel, A., & Snyder, A. (2009). A semantic network approach to the Creativity Quotient (CQ). Creativity Research Journal, 21(1), 64–71. https://doi.org/10.1080/10400410802633517
Carson, S. H., Peterson, J. B., & Higgins, D. M. (2005). Reliability, validity, and factor structure of the creative achievement questionnaire. Creativity Research Journal, 17(1), 37–50. https://doi.org/10.1207/s15326934crj1701_4
Cattell, R. B., & Cattell, A. K. S. (1961/2008). Measuring intelligence with the Culture Fair Tests. Oxford, UK: Hogrefe.
Cattell, R. B., & Cattell, A. (1973). Measuring intelligence with the culture fair tests. Institute for Personality and Ability Testing. Champaign Ill.: Institute for Personality and Ability Testing.
Christensen, A. P., & Kenett, Y. N. (2019). Semantic Network Analysis (SemNA): A tutorial on preprocessing, estimating, and analyzing semantic networks. PsyArXiv. https://doi.org/10.31234/osf.io/eht87
Christensen, A. P., Kenett, Y. N., Cotter, K. N., Beaty, R. E., & Silvia, P. J. (2018). Remotely close associations: Openness to experience and semantic memory structure. European Journal of Personality, 32(4), 480–492. https://doi.org/10.1002/per.2157
Christensen, P. R., Guilford, J. P., & Wilson, R. C. (1957). Relations of creative responses to working time and instructions. Journal of Experimental Psychology, 53(2), 82–88. https://doi.org/10.1037/h0045461
Cseh, G. M., & Jeffries, K. K. (2019). A scattered CAT: A critical evaluation of the consensual assessment technique for creativity research. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 159–166. https://doi.org/10.1037/aca0000220
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
De Deyne, S., Navarro, D. J., Perfors, A., & Storms, G. (2016). Structure at every scale: A semantic network account of the similarities between unrelated concepts. Journal of Experimental Psychology: General. https://doi.org/10.1037/xge0000192
Diedrich, J., Benedek, M., Jauk, E., & Neubauer, A. C. (2015). Are creative ideas novel and useful? Psychology of Aesthetics, Creativity, and the Arts, 9(1), 35–40. https://doi.org/10.1037/a0038688
Diedrich, J., Jauk, E., Silvia, P. J., Gredlein, J. M., Neubauer, A. C., & Benedek, M. (2018). Assessment of real-life creativity: The inventory of creative activities and achievements (ICAA). Psychology of Aesthetics, Creativity, and the Arts, 12(3), 304–316. https://doi.org/10.1037/aca0000137
Dietrich, A. (2015). How creativity happens in the brain. Springer. https://doi.org/10.1057/9781137501806
Dumas, D., & Dunbar, K. N. (2014). Understanding fluency and originality: A latent variable perspective. Thinking Skills and Creativity, 14, 56–67. https://doi.org/10.1016/j.tsc.2014.09.003
Dumas, D., Organisciak, P., & Doherty, P. (2020). Measuring divergent thinking originality with human raters and text-mining models: A psychometric comparison of methods. Psychology of Aesthetics, Creativity, and the Arts.
Dumas, D., & Runco, M. (2018). Objectively scoring divergent thinking tests for originality: A re-analysis and extension. Creativity Research Journal, 30(4), 466–468. https://doi.org/10.1080/10400419.2018.1544601
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Manual for Kit of Factor-Referenced Cognitive Tests. Retrieved from http://www.ets.org/Media/Research/pdf/Manual_for_Kit_of_Factor-Referenced_Cognitive_Tests.pdf
Feinerer, I. (2012). tm (Text Mining package for R). Retrieved from http://tm.r-forge.r-project.org/
Forster, E. A., & Dunbar, K. N. (2009). Creativity evaluation through latent semantic analysis. In N. A. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st annual conference of the cognitive science society (pp. 602–607). Austin: Cognitive Science Society.
Forthmann, B., Gerwig, A., Holling, H., Çelik, P., Storme, M., & Lubart, T. (2016). The be-creative effect in divergent thinking: The interplay of instruction and object frequency. Intelligence, 57, 25–32. https://doi.org/10.1016/j.intell.2016.03.005
Forthmann, B., Holling, H., Çelik, P., Storme, M., & Lubart, T. (2017). Typing speed as a confounding variable and the measurement of quality in divergent thinking. Creativity Research Journal, 29(3), 257–269. https://doi.org/10.1080/10400419.2017.1360059
Forthmann, B., Holling, H., Zandi, N., Gerwig, A., Çelik, P., Storme, M., & Lubart, T. (2017). Missing creativity: The effect of cognitive workload on rater (dis-)agreement in subjective divergent-thinking scores. Thinking Skills and Creativity, 23, 129–139. https://doi.org/10.1016/j.tsc.2016.12.005
Forthmann, B., Oyebade, O., Ojo, A., Günther, F., & Holling, H. (2018). Application of latent semantic analysis to divergent thinking is biased by elaboration. Journal of Creative Behavior, 53(4), 559–575. https://doi.org/10.1002/jocb.240
Forthmann, B., Paek, S. H., Dumas, D., Barbot, B., & Holling, H. (2019). Scrutinizing the basis of originality in divergent thinking tests: On the measurement precision of response propensity estimates. British Journal of Educational Psychology. https://doi.org/10.1111/bjep.12325
Gray, K., Anderson, S., Chen, E. E., Kelly, J. M., Christian, M. S., Patrick, J., … Lewis, K. (2019). “Forward flow”: A new measure to quantify free thought and predict creativity. American Psychologist, 74(5), 539–554. https://doi.org/10.1037/amp0000391
Green, A. E. (2016). Creativity, within reason: Semantic distance and dynamic state creativity in relational thinking and reasoning. Current Directions in Psychological Science, 25(1), 28–35. https://doi.org/10.1177/0963721415618485
Günther, F., Dudschig, C., & Kaup, B. (2015). LSAfun - An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 47(4), 930–944. https://doi.org/10.3758/s13428-014-0529-0
Günther, F., Rinaldi, L., & Marelli, M. (2019). Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science, 14(6), 1006–1033. https://doi.org/10.1177/1745691619861372
Harbison, J. I., & Haarmann, H. (2014). Automated scoring of originality using semantic representations. In P. Bello, M. Guarini, M. McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society (CogSci 2014) (pp. 2327–2332). Quebec City: Cognitive Science Society.
Harrington, D. M. (1975). Effects of explicit instructions to “be creative” on the psychological meaning of divergent thinking test scores. Journal of Personality, 43(3), 434–454. https://doi.org/10.1111/j.1467-6494.1975.tb00715.x
Hass, R. W. (2017a). Semantic search during divergent thinking. Cognition, 166, 344–357. https://doi.org/10.1016/j.cognition.2017.05.039
Hass, R. W. (2017b). Tracking the dynamics of divergent thinking via semantic distance: Analytic methods and theoretical implications. Memory and Cognition, 45(2), 233–244. https://doi.org/10.3758/s13421-016-0659-y
Hass, R. W., & Beaty, R. E. (2018). Use or consequences: Probing the cognitive difference between two measures of divergent thinking. Frontiers in Psychology, 9, 2327. https://doi.org/10.3389/fpsyg.2018.02327
Hass, R. W., Rivera, M., & Silvia, P. J. (2018). On the dependability and feasibility of layperson ratings of divergent thinking. Frontiers in Psychology, 9, 1343. https://doi.org/10.3389/fpsyg.2018.01343
Heinen, D. J. P., & Johnson, D. R. (2018). Semantic distance: An automated measure of creativity that is novel and appropriate. Psychology of Aesthetics, Creativity, and the Arts, 12(2), 144–156. https://doi.org/10.1037/aca0000125
Jauk, E., Benedek, M., & Neubauer, A. C. (2014). The road to creative achievement: A latent variable model of ability and personality predictors. European Journal of Personality, 28(1), 95–105. https://doi.org/10.1002/per.1941
Johnson, D. R., Cuthbert, A. S., & Tynan, M. E. (2019). The neglect of idea diversity in creative idea generation and evaluation. Psychology of Aesthetics, Creativity, and the Arts. https://doi.org/10.1037/aca0000235
Karwowski, M. (2014). Creative mindsets: Measurement, correlates, consequences. Psychology of Aesthetics, Creativity, and the Arts, 8(1), 62–70. https://doi.org/10.1037/a0034898
Kaufman, J. C., Lee, J., Baer, J., & Lee, S. (2007). Captions, consistency, creativity, and the consensual assessment technique: New evidence of reliability. Thinking Skills and Creativity, 2(2), 96–106. https://doi.org/10.1016/j.tsc.2007.04.002
Kenett, Y. N. (2019). What can quantitative measures of semantic distance tell us about creativity? Current Opinion in Behavioral Sciences, 27, 11–16. https://doi.org/10.1016/j.cobeha.2018.08.010
Kenett, Y. N., Anaki, D., & Faust, M. (2014). Investigating the structure of semantic networks in low and high creative persons. Frontiers in Human Neuroscience, 8, 407. https://doi.org/10.3389/fnhum.2014.00407
Kenett, Y. N., & Faust, M. (2019). A semantic network cartography of the creative mind. Trends in Cognitive Sciences, 23(4), 271–274. https://doi.org/10.1016/j.tics.2019.01.007
Kenett, Y. N., Levi, E., Anaki, D., & Faust, M. (2017). The semantic distance task: Quantifying semantic distance with semantic network path length. Journal of Experimental Psychology: Learning, Memory, and Cognition. https://doi.org/10.1037/xlm0000391
Kenett, Y. N., Levy, O., Kenett, D. Y., Stanley, H. E., Faust, M., & Havlin, S. (2018). Flexibility of thought in high creative individuals represented by percolation analysis. Proceedings of the National Academy of Sciences of the United States of America, 115(5), 867–872. https://doi.org/10.1073/pnas.1717362115
Kline, R. B. (2015). Principles and practice of structural equation modeling. Guilford Publications.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028
Lee, K., & Ashton, M. C. (2004). Psychometric properties of the HEXACO personality inventory. Multivariate Behavioral Research, 39, 329–358. https://doi.org/10.1207/s15327906mbr3902_8
Maillet, D., Beaty, R. E., Jordano, M. L., Touron, D. R., Adnan, A., Silvia, P. J., … Kane, M. J. (2018). Age-related differences in mind-wandering in daily life. Psychology and Aging, 33(4), 643–653. https://doi.org/10.1037/pag0000260
Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57–78. https://doi.org/10.1016/j.jml.2016.04.001
McCrae, R. R., Costa, P. T., & Martin, T. A. (2005). The NEO-PI-3: A more readable Revised NEO Personality Inventory. Journal of Personality Assessment, 84(3), 261–270. https://doi.org/10.1207/s15327752jpa8403_05
Mednick, S. (1962). The associative basis of the creative process. Psychological Review, 69(3), 220–232. https://doi.org/10.1037/h0048850
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Neural Information Processing Systems Foundation.
Mitchell, J., & Lapata, M. (2008). Vector-based models of semantic composition. In ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 236–244).
Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429. https://doi.org/10.1111/j.1551-6709.2010.01106.x
Mouchiroud, C., & Lubart, T. (2001). Children’s original thinking: An empirical examination of alternative measures derived from divergent thinking tasks. Journal of Genetic Psychology, 162(4), 382–401. https://doi.org/10.1080/00221320109597491
Nusbaum, E. C., Silvia, P. J., & Beaty, R. E. (2014). Ready, set, create: What instructing people to “be creative” reveals about the meaning and mechanisms of divergent thinking. Psychology of Aesthetics, Creativity, and the Arts, 8(4), 423–432. https://doi.org/10.1037/a0036549
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 1532–1543). https://doi.org/10.3115/v1/d14-1162
Plucker, J. A. (1999). Is the proof in the pudding? Reanalyses of Torrance’s (1958 to present) longitudinal data. Creativity Research Journal, 12(2), 103–114. https://doi.org/10.1207/s15326934crj1202_3
Prabhakaran, R., Green, A. E., & Gray, J. R. (2014). Thin slices of creativity: Using single-word utterances to assess creative cognition. Behavior Research Methods, 46(3), 641–659. https://doi.org/10.3758/s13428-013-0401-7
Reiter-Palmon, R., Forthmann, B., & Barbot, B. (2019). Scoring divergent thinking tests: A review and systematic framework. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 144–152. https://doi.org/10.1037/aca0000227
Rosseel, Y. (2012). Lavaan: An R package for structural equation modeling. Journal of Statistical Software. https://doi.org/10.18637/jss.v048.i02
Runco, M. A., Millar, G., Acar, S., & Cramond, B. (2010). Torrance tests of creative thinking as predictors of personal and public achievement: A fifty-year follow-up. Creativity Research Journal, 22(4), 361–368. https://doi.org/10.1080/10400419.2010.523393
Said-Metwaly, S., Fernández-Castilla, B., Kyndt, E., & Van den Noortgate, W. (2019). Testing conditions and creative performance: Meta-analyses of the impact of time limits and instructions. Psychology of Aesthetics, Creativity, and the Arts. https://doi.org/10.1037/aca0000244
Silvia, P. J., & Beaty, R. E. (2012). Making creative metaphors: The importance of fluid intelligence for creative thought. Intelligence, 40(4), 343–351. https://doi.org/10.1016/j.intell.2012.02.005
Silvia, P. J., Nusbaum, E. C., & Beaty, R. E. (2017). Old or new? Evaluating the Old/New scoring method for divergent thinking tasks. Journal of Creative Behavior, 51(3), 216–224. https://doi.org/10.1002/jocb.101
Silvia, P. J., Wigert, B., Reiter-Palmon, R., & Kaufman, J. C. (2012). Assessing creativity with self-report scales: A review and empirical evaluation. Psychology of Aesthetics, Creativity, and the Arts, 6(1), 19–34. https://doi.org/10.1037/a0024071
Silvia, P. J., Winterstein, B. P., Willse, J. T., Barona, C. M., Cram, J. T., Hess, K. I., … Richard, C. A. (2008). Assessing creativity with divergent thinking tasks: Exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts, 2(2), 68–85. https://doi.org/10.1037/1931-3896.2.2.68
Thurstone, L. L. (1938). Primary mental abilities (Psychometric Monographs No. 1). Chicago: University of Chicago Press.
Torrance, E. P. (1972). Predictive validity of the Torrance Tests of Creative Thinking. The Journal of Creative Behavior, 6(4), 236–262. https://doi.org/10.1002/j.2162-6057.1972.tb00936.x
Torrance, E. P. (1981). Predicting the creativity of elementary school children (1958-80) — and the teacher who “made a difference”. Gifted Child Quarterly, 25(2), 55–62. https://doi.org/10.1177/001698628102500203
Wilken, A., Forthmann, B., & Holling, H. (2019). Instructions moderate the relationship between creative performance in figural divergent thinking and reasoning capacity. Journal of Creative Behavior. https://doi.org/10.1002/jocb.392
Zedelius, C. M., Mills, C., & Schooler, J. W. (2019). Beyond subjective judgments: Predicting evaluations of creative writing from computational linguistic features. Behavior Research Methods, 51(2), 879–894. https://doi.org/10.3758/s13428-018-1137-1
Zemla, J. C., Cao, K., Mueller, K. D., & Austerweil, J. L. (2020). SNAFU: The Semantic Network and Fluency Utility. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01343-w
Zeng, L., Proctor, R. W., & Salvendy, G. (2011). Can traditional divergent thinking tests be trusted in measuring and predicting real-world creativity? Creativity Research Journal, 23(1), 24–37. https://doi.org/10.1080/10400419.2011.545713
Open Practices Statement
The data and materials for all studies are available at https://osf.io/gz4fc/; studies were not preregistered.
R.E.B. is supported by a grant from the National Science Foundation [DRL-1920653].
Cite this article
Beaty, R.E., Johnson, D.R. Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behav Res 53, 757–780 (2021). https://doi.org/10.3758/s13428-020-01453-w
- Divergent thinking
- Semantic distance
- Word association