1 Introduction

Text embedding models from Natural Language Processing (NLP) can map texts (e.g. words, sentences, articles) to supposedly semantically meaningful, numeric vectors (i.e. embeddings), typically with a few hundred to a few thousand dimensions (e.g. [1, 2]). Intuitively speaking, this means that the embeddings of similar texts (e.g. words like “big” and “large”) would be closer to one another than those of dissimilar texts (e.g. “big” and “paper”) in the vector space.

Such models are often pretrained on an enormous amount of text data (e.g. Wikipedia, websites, news) and made publicly available (e.g. [14]). This allows researchers to obtain off-the-shelf pretrained text embeddings for downstream applications, without the need to spend substantial computational resources on training the models from scratch. Researchers can also choose to further train the text embedding models on additional task-specific data (i.e. fine-tuning) or domain-specific data (i.e. continued pre-training) for better performance.

Because of their capability for meaningful text representation and their convenient use, text embedding models have found various applications in (computational) social science. For instance, they have been employed to encode the Big-Five personality questionnaire for personality trait prediction [5], social media posts for suicide risk assessment [6], Tweets and TV captions for emotion detection [7], and relevant terms (e.g. gender pronouns, names, occupations) to quantify societal trends of gender and ethnic stereotypes in the US [8].

Despite the popularity of text embedding models, an important question to ask is: how good are the text representations, given a research task? High-quality representations are desirable because they encode relevant information, which leads to greater generalization capabilities [9]. Typically, the quality of text embeddings is evaluated by established benchmarks like GLUE [10] and SentEval [9]. GLUE is a collection of common natural language understanding (NLU) tasks like question answering, sentiment analysis and textual entailment recognition. Similarly, SentEval consists of a broad and diverse set of NLU tasks, such as sentiment analysis, semantic textual similarity, natural language inference, and verb tense prediction.

While such benchmarks are useful for evaluating an embedding model’s capability for certain NLU tasks, they do not necessarily inform researchers about whether an embedding model is suitable for a specific downstream task at hand. In computational social science research, the goal of using embedding models is often to tackle a substantive question (e.g. personality prediction, bias detection) that requires the embedding models to be capable of handling tasks that are different from or not directly addressed by the tasks in existing benchmarks. Moreover, the type of text data can also be different from the evaluation data sets in those benchmarks. For instance, in [5], the authors used embedding models to encode personality questionnaires and social media posts for personality prediction. This requires the embedding models to encode personality-relevant information from both questionnaire texts and social media text data. However, knowing how well an embedding model performs many of the NLU tasks (e.g. natural language inference, verb tense prediction) on other types of data (e.g. movie reviews, image captions) does not tell us much about whether this embedding model is appropriate for the task of personality prediction.

Therefore, we need an additional framework that can help to guide the quality evaluation of text embedding models for a specific research task, which benefits model selection, debugging and improvement. Concretely, this framework should evaluate whether an embedding model encodes what it is supposed to for the research task. Furthermore, it should be easily adaptable to different research tasks.

In this work, we propose such a framework based on the classic construct validity testing framework widely used in the social sciences. In measurement theory, construct validity concerns whether the measure of a construct “measures what it is supposed to”. Constructs are theoretical, latent variables (e.g. personality traits, emotions and biases). They are not directly observable and hence, need to be translated into something measurable. This process is called operationalization, which results in some concrete measure for the construct of interest [11]. For instance, the emotion “anger” can be translated into actionable measures like one’s facial expression, one’s tone of voice or a simple question “how angry are you?”. However, the operationalization procedure inevitably introduces some degree of incongruence between the intended construct and the measure. When this incongruence is high, we say the construct validity of a measure is low and vice versa. Needless to say, the construct validity of a measure needs to be checked and established before it is adopted in research. To achieve this, social scientists resort to an established construct validity testing framework, which involves examining various types of construct validity such as face validity, content validity, convergent validity, discriminant validity and predictive validity [11]. This testing framework is grounded in measurement theory and is the standard approach to testing the validity of a measure in the social sciences.

That an embedding model should “encode what it is supposed to” is comparable to the idea of construct validity (i.e. a measure should measure what it is supposed to). Specifically, we can treat what an embedding model needs to encode as the construct of interest, the embedding model as the measure, and the embeddings as the measurements. This allows us to approach the quality evaluation of text embedding models from the perspective of construct validity testing.

Thus, in this paper, we propose using the construct validity testing framework to evaluate the quality of text embedding models, given a research task. This framework has the advantages of being theory-driven and having established testing procedures. It is also easily adaptable to different research tasks. Furthermore, we focus on the validity of the text embeddings (rather than model predictions), because this can help researchers choose more “valid” models, which encode information relevant for the task at hand. This can help to lower the risk of overfitting due to the model learning from noise.

This paper has three main parts. First, we review several popular text embedding models in Sect. 2. Second, in Sect. 3, we describe the classic construct validity testing framework and how it can be adapted to the quality evaluation of text embedding models, given a specific research task. The main challenge is that established procedures for construct validity analysis only apply to low-dimensional measurements (e.g. responses to survey questions), while text embeddings are much higher-dimensional. We show how to adapt the classic construct validity framework to high-dimensional settings by, for instance, borrowing tools from interpretable NLP. Finally, in Sect. 4, we apply our adapted construct validity testing framework to a concrete application, where we show how to evaluate the quality of embedding models when they are used to encode survey questions.

2 Background: text embeddings

In this section, we review several popular text embedding models, including word2vec, fastText, GloVe, BERT, Sentence-BERT and Universal Sentence Encoder. We are aware of other (near) state-of-the-art text embedding models, such as GPT-3 [12], ALBERT [13] and XLNet [14]. Nevertheless, it is not our goal to cover all text embedding models.

2.1 word2vec and fastText

One influential family of text embedding algorithms is word2vec [1, 15]. Simply put, word2vec is a two-layer neural network model that takes as input a large corpus of text and gives as output a vector space. This vector space typically has several hundred dimensions (e.g. 300), with each unique word in the training corpus being assigned a corresponding continuous vector. Such a vector is also called an embedding. The objective of the algorithm is to predict words from surrounding words (or the other way around). In this way, the final word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located close to one another. Under the Distributional Hypothesis [16, 17], which states that words occurring in similar contexts tend to have similar meanings, closely located words in the vector space are expected to be semantically similar. The similarity between two word vectors can be measured by cosine similarity. Mathematically, it is simply the cosine of the angle between two vectors, which can be calculated as the dot product of the two vectors divided by the product of the lengths of the two vectors. Cosine similarity scores are bounded in the interval \([-1, 1]\), where −1 indicates complete lack of similarity while 1 suggests the other extreme.
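For illustration, the cosine similarity computation described above can be sketched in a few lines of Python with NumPy; the word vectors here are made-up toy values, not actual pretrained embeddings.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: their dot product divided
    by the product of their lengths. Bounded in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "word embeddings" for illustration only
big = np.array([0.8, 0.1, 0.3, 0.5])
large = np.array([0.7, 0.2, 0.4, 0.5])
paper = np.array([-0.2, 0.9, 0.1, -0.4])

print(cosine_similarity(big, large))  # relatively high (similar words)
print(cosine_similarity(big, paper))  # relatively low (dissimilar words)
```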

Word2vec has been shown to produce text representations that capture syntactic and semantic regularities in language, in such a way that vector-oriented reasoning can be applied to the study of word relationships [19]. A classic example is that the male/female relationship is automatically learned in the training process, such that a simple, intuitive vector operation like “King − Man + Woman” results in a vector very close to that of “Queen” in the vector space [19]. Many studies have also made use of this characteristic to study human biases (e.g. gender and racial bias) encoded in texts [8, 21–23].
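The vector arithmetic behind the “King − Man + Woman ≈ Queen” example can likewise be sketched as follows; with actual pretrained word2vec vectors one would perform the same operation over a vocabulary of millions of words (e.g. via a library such as gensim), but the toy vectors below are constructed purely for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(query, vocab, exclude):
    """Return the vocabulary word whose vector is closest (by cosine
    similarity) to the query vector, ignoring the words in `exclude`."""
    scores = {w: cosine_similarity(query, vec)
              for w, vec in vocab.items() if w not in exclude}
    return max(scores, key=scores.get)

# Toy vectors chosen so that the male/female offset is roughly constant
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.3, 0.8]),
    "apple": np.array([0.1, 0.5, 0.2]),
}

query = vocab["king"] - vocab["man"] + vocab["woman"]
print(most_similar(query, vocab, exclude={"king", "man", "woman"}))  # "queen"
```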

One popular extension of word2vec is fastText [3], which is trained on subwords in addition to whole words. This allows fastText to estimate embeddings even for words that do not appear in the training corpus. fastText was shown to outperform its word2vec predecessors across various benchmarks [24].

2.2 GloVe

GloVe, which stands for Global Vector (for word representations) [25], is another popular text embedding model. Similar to word2vec, GloVe also produces word representations that capture syntactic and semantic regularities in language. However, a major difference is that GloVe is trained on a so-called global word-word co-occurrence matrix, where matrix factorisation is used to learn word embeddings of typically 25 to 300 dimensions.

2.3 BERT and sentence-BERT

More sophisticated embedding models have recently become available. A prominent one is BERT, which stands for Bidirectional Encoder Representations from Transformers [4]. Like word2vec and fastText, BERT is a type of neural network trained on a large text corpus in order to learn a good representation of natural language. Notably, BERT produces context-dependent embeddings. That is, while word2vec, fastText and GloVe models produce a fixed embedding for each word, BERT can produce different embeddings for the same word depending on the context (e.g. the neighbouring words). For instance, the word “bank” can have different meanings (e.g. a financial establishment or land alongside a river) or functions (i.e. as a noun or a verb) across linguistic contexts.

BERT has achieved state-of-the-art performance on various natural language tasks [4]. BERT embeddings have also been shown to encode syntactic and semantic knowledge about the original texts [26].

Sentence-BERT [2], an extension of BERT, differs from the original BERT in that its architecture is optimised for generating semantically meaningful sentence embeddings that can be compared using cosine similarity [2].

2.4 Universal sentence encoder

Universal Sentence Encoder (USE) is another text embedding model meant for greater-than-word length texts like sentences, phrases and short paragraphs [27]. USE uses both a Transformer model and a Deep Averaging Network model. The former focuses on achieving high accuracy at the cost of greater resource consumption and model complexity, while the latter targets efficient inference at the cost of slightly lower accuracy [27]. It is trained on various language understanding tasks and data sets, with the goal of learning general properties of sentences and thus producing sentence-level embeddings that should work well across various downstream tasks. Pretrained USE embeddings have been shown to outperform word2vec-based pretrained embeddings across different language tasks.

3 Construct validity testing

In this section, we first introduce the classic construct validity testing framework. Then, we describe how this framework can be adapted for the quality evaluation of text embedding models.

3.1 The classic framework

As mentioned in the introduction, construct validity concerns the extent to which a measure matches the intended construct in theory. In practice, evaluating the construct validity of a measure means answering questions like: Does the measure capture all relevant aspects of the construct? Do the measurements correlate with other measurements of the same construct, or the measurements of similar constructs? Do the measurements differ from measurements of unrelated constructs? Each of these questions focuses on a particular aspect of construct validity.

Therefore, construct validity testing involves carefully thinking about the relevant aspects of construct validity that apply (i.e. need investigation) and designing the appropriate tests to evaluate those aspects. The literature often refers to such aspects as the (sub)types of construct validity [11]. We describe below the most common and important ones, as well as how they are usually tested: face validity, content validity, convergent validity, discriminant validity, predictive validity and concurrent validity. We use the measurement of emotional intelligence (EI) as a recurring example.

3.1.1 Face validity

Face validity concerns checking whether the measure “on its face” seems like a good operationalization of the construct. For example, EI was defined by Goleman [28] as consisting of five domains: self-awareness (i.e. knowing one’s emotions), self-regulation (i.e. managing one’s emotions), motivation (i.e. motivating oneself), empathy (i.e. recognizing emotions in others), and social skills (i.e. handling relationships). If a self-report questionnaire is designed to measure this definition of EI, then the questions in the questionnaire should appear to reflect the definition in the eyes of, for instance, experts on this topic.

3.1.2 Content validity

Content validity concerns whether a measure adequately covers all relevant aspects of the intended construct. For example, per the earlier definition of EI, the questionnaire should consist of questions that measure all five domains of EI. To judge the content validity of such a test, expert opinions can be requested. Often, researchers also ask a sample of individuals to respond to the questionnaire. Then, they use latent variable models like factor analysis to test whether the data support the theoretical structure of the construct: What is the most likely number of EI domains based on the data? Do the questions intended to measure the same domain fall under that domain and only that domain? Are there questions that do not belong to any domain?

3.1.3 Convergent and discriminant validity

Convergent validity concerns whether the measure of a construct is similar to other measures that it should be similar to. For instance, say there is already an established questionnaire-based measure of EI, but we wish to develop a shorter version of it (to save time and costs associated with filling out the questionnaire). Then, given that the same people respond to both the long and the short questionnaires, the responses to one version (e.g. aggregated scores on an EI domain) should highly correlate with the responses to the other.

In contrast, discriminant validity concerns whether the measure of a construct is different from other measures that it should not be similar to. For instance, EI was proposed to be distinct from constructs like IQ and personality traits. Therefore, measurements of EI should not highly correlate with IQ or personality traits.

3.1.4 Predictive validity

Predictive validity concerns the ability of a measure to predict some future outcome that it should theoretically be able to predict. For instance, Goleman [28] claimed that, per his theory about EI, EI is crucial for success in life. Then, EI measurements should be able to predict one’s future performance in, for example, career and academics.

3.2 Construct validity testing for text embedding models

To apply the idea of construct validity to text embedding models, we can view what an embedding model needs to encode as the construct of interest, the embedding model as the measure, and the embeddings as the measurements. Then, similar to the regular way of construct validity testing, we need to think about the relevant aspects of construct validity to evaluate. However, because text embeddings are high-dimensional data without interpretable labels for the dimensions, we need to adapt the testing procedures described earlier accordingly, which we describe next.

3.2.1 Face validity

We can evaluate the face validity of a text embedding model by looking at whether some fundamental aspects of the model (e.g. architecture, training data) are suitable for the task at hand. For instance, if the task is to predict hate speech from Tweets, a sentence embedding model is likely more valid than a word embedding model; a model trained on Twitter data is also likely more valid than one trained on Wikipedia entries.

3.2.2 Content validity

Text embeddings are high-dimensional and opaque, which makes it difficult to find out what information is encoded in them. We therefore borrow an approach from interpretable NLP: probing classifiers. The idea is to train a classifier that takes text representations as input and predicts some property of interest (e.g. sentence length). If the classifier performs well, this suggests that the representations encode information relevant to the property [29].

A recommended practice in choosing a classifier is to select a linear model like (multinomial) logistic regression, because a more complex probing classifier may run the risk of inferring properties not actually present in the text representations [30–34]. Furthermore, it is recommended to always include baselines for comparison [29]. The better the probing classifier based on some text representation performs relative to the baselines, the more evidence that the probed property is present and that content validity is supported. Previous studies [35–38] suggest using two forms of baselines: simple majority in the training data and random embeddings.
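The probing setup with the two recommended baselines can be sketched as follows using scikit-learn; the arrays are random placeholders standing in for survey-question embeddings and labels of a probed property (e.g. formulation), so the printed accuracies are not meaningful in themselves.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)

# Placeholder data: in practice, X holds the text embeddings and y holds
# the labels of the probed property (e.g. one of five formulation types).
n_train, n_test, dim = 400, 100, 300
X_train, X_test = rng.normal(size=(n_train, dim)), rng.normal(size=(n_test, dim))
y_train, y_test = rng.integers(0, 5, n_train), rng.integers(0, 5, n_test)

# Linear probing classifier trained on the actual embeddings
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_probe = probe.score(X_test, y_test)

# Baseline 1: always predict the majority class of the training data
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
acc_majority = majority.score(X_test, y_test)

# Baseline 2: the same probe trained on random embeddings of the same shape
# (in the analysis below, random word embeddings are averaged per question; see Sect. 4.4.1)
X_train_rand = rng.uniform(-1, 1, size=X_train.shape)
X_test_rand = rng.uniform(-1, 1, size=X_test.shape)
acc_random = LogisticRegression(max_iter=1000).fit(X_train_rand, y_train).score(X_test_rand, y_test)

print(acc_probe, acc_majority, acc_random)
```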

3.2.3 Convergent and discriminant validity

We define convergent validity as the extent to which a reference embedding is similar to some embedding that it is supposed to be similar to. To investigate convergent validity, we introduce perturbations to a given text (i.e. the reference text) without changing its label or core meaning. For instance, if the research task is to detect hate speech, then a text considered as hate speech should not be modified in a way that would render it no longer hate speech. High cosine similarity between the embeddings of the reference and the modified text indicates convergent validity.

Likewise, we define discriminant validity as the extent to which a reference embedding is dissimilar to some embedding that it is not supposed to be similar to. To investigate discriminant validity, we introduce perturbations to the reference text in such a way that the new text is substantially different from the reference text (e.g. that would lead to changing the label or the core meaning). Accordingly, we would expect low cosine similarity between the two embeddings, which would then indicate discriminant validity.

In the social sciences, there are guidelines to interpret the sizes of correlations for convergent and discriminant validity testing. However, there are no such agreed-upon rules for cosine similarity in NLP. Therefore, it is difficult to say how high or low cosine similarity scores need to be to provide sufficient evidence for convergent or discriminant validity. To partly mitigate this problem, we propose using the difference between the two cosine similarity scores (i.e. one from convergent validity analysis and the other from discriminant validity analysis) as a joint indicator for convergent and discriminant validity. This provides the benefit of having a natural baseline: zero. If the difference score is above zero, it provides some degree of support for both convergent and discriminant validity.
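A minimal sketch of this joint indicator is given below; the embeddings are random placeholders, where the similar embedding is simulated as a small perturbation of the reference embedding and the dissimilar embedding as an unrelated vector.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def joint_indicator(e_reference, e_similar, e_dissimilar):
    """cos(reference, similar) - cos(reference, dissimilar): scores above
    zero lend some support to both convergent and discriminant validity."""
    return (cosine_similarity(e_reference, e_similar)
            - cosine_similarity(e_reference, e_dissimilar))

# Placeholder embeddings; in practice these come from encoding a reference
# text, a meaning-preserving variant and a meaning-changing variant.
rng = np.random.default_rng(seed=1)
e_reference = rng.normal(size=768)
e_similar = e_reference + 0.1 * rng.normal(size=768)
e_dissimilar = rng.normal(size=768)

print(joint_indicator(e_reference, e_similar, e_dissimilar) > 0)  # expected: True
```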

Our proposed testing procedure for convergent and discriminant validity shares some commonality with several existing model testing methods in NLP, including robustness tests [39], adversarial changes [40], invariance checks and directional expectation tests [41]. However, they focus on evaluating model predictions, while we focus on text representations.

3.2.4 Predictive validity

How well an embedding model performs the research task of interest already provides information about the predictive validity of the model. In addition, we can test how well the embeddings can perform tasks that are related to the research task but not part of it. For instance, if an embedding model is designed to detect hate speech in Tweets, then the embeddings should also be able to predict whether a Tweet gets reported later. If an embedding model is used for political stance detection, then the embeddings should also be able to predict a person’s party affiliation or their future voting behaviour.

4 Application: embedding models for survey questions

There is growing research interest in using embedding models to encode subjective survey questions. Such questions are subjective because they aim to measure information that only exists in the respondent’s mind (e.g. opinions, feelings). For instance, Vu et al. [5] used pretrained BERT to encode participants’ social media posts and the questions from the Big-Five personality questionnaire, to predict individual-level responses to out-of-sample Big-Five questions. Pellegrini et al. [42] used the skip-gram embedding algorithm [1] to encode psychiatric questionnaires and patients’ responses to those questionnaires. They showed that the resulting embeddings can be used for effective diagnosis of some mental health conditions.

However, to the best of our knowledge, there is no work on evaluating whether text embedding models can provide valid representations of subjective survey questions. In this section, we demonstrate how our proposed construct validity testing procedures can be used to investigate and compare the validity of text embedding models for representing subjective survey questions. Nevertheless, because survey questions exist in various forms and are used differently across research fields, we are unable to cover all kinds of survey questions in our analysis. In this work, we focus on the ones that are more typical in sociological research.

We begin with an overview of the embedding models we use (Sect. 4.1.1), and describe how to generate sentence-level embeddings (Sect. 4.1.2). Then, we describe the data that we use for the analysis, with a focus on the synthetic data set (Sect. 4.2). Next, we present our analysis of face validity (Sect. 4.3), content validity (Sect. 4.4), convergent and discriminant validity (Sect. 4.5), as well as predictive validity (Sect. 4.6).

4.1 From survey questions to pretrained sentence embeddings

4.1.1 Pretrained embedding models

We focus on the following embedding models: fastText, GloVe, BERT and USE. We use their pretrained versions (as opposed to fine-tuned models) because of their widespread and convenient use. However, our approach to construct validity analysis also applies to fine-tuned text embeddings and other vector representations of texts.

Table 1 lists the pretrained embedding models we adopt. For fastText, we use the pretrained model developed by [3]. It is trained on Common Crawl with 600B tokens, and produces word embeddings with 300 dimensions for 2M words. For GloVe, we use the model pretrained on Common Crawl with 840B tokens and a vocabulary of 2.2M words. It also outputs word vectors of 300 dimensions.

Table 1 Overview of Pretrained Text Embedding Models Investigated in this Study

As for pretrained Sentence-BERT models, there are many to choose from, which differ not only in the specific natural language tasks that they have been optimised for, but also in their model architecture. We select two pretrained models which have been trained on various data sources (e.g. Reddit, Wikipedia, Yahoo Answers; over 1B pairs of sentences) and are thus designed as general-purpose models [2]. They are “All-DistilRoBERTa” and “All-MPNet-base”, where “DistilRoBERTa” [43, 44] and “MPNet” [45] are two different extensions of the original BERT. “Base” indicates that the embedding dimension is 768, as opposed to “Large” where the dimension is 1024. Both “All-DistilRoBERTa” and “All-MPNet-base” have been shown to have the top average performance across various language tasks. For the purpose of comparison, we also include two pretrained models of the original BERT model [4]: “BERT-base-uncased” and “BERT-large-uncased”. “Uncased” refers to BERT treating upper and lower cases equally. Both models have been trained on Wikipedia (2.5B words) and BookCorpus (800M words).

As for USE, we use the most recent (i.e. 4th) version of its pretrained model, which outputs a 512 dimensional vector given an input sentence. The sources of training data are Wikipedia, web news, web question-answer pages, discussion forums and the Stanford Natural Language Inference corpus [46].

4.1.2 Sentence-level embeddings

For survey questions, we need to obtain sentence-level representations (as opposed to word-level). Namely, we treat a survey question as one sentence and represent it as a single embedding. Obtaining sentence-level embeddings is straightforward with Sentence-BERT and USE models, because they have been designed for this specific purpose. In contrast, word2vec and GloVe models only produce word embeddings. When using these models, it is therefore necessary to combine the word embeddings into a sentence-level representation. Among various methods, simple averaging across all the word embeddings (e.g. taking the means along each dimension) has been shown to either outperform other methods [47] or approximate the performance of more sophisticated ones [48]. Therefore, we use simple averaging to compute sentence-level embeddings for survey questions from fastText and GloVe word embeddings. The resulting embeddings have the same number of dimensions as the word-level embeddings, as we average the word embeddings along each dimension. However, one disadvantage of this approach is that information like word order is absent in the aggregated representation.
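Concretely, the averaging step can be sketched as follows; `word_vectors` is a placeholder for a lookup of pretrained fastText or GloVe vectors keyed by word (here filled with random values), and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def mean_sentence_embedding(sentence, word_vectors, dim=300):
    """Average the word embeddings along each dimension to obtain a
    sentence-level embedding of the same dimensionality."""
    tokens = [t.strip(".,?!").lower() for t in sentence.split()]
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Hypothetical lookup standing in for pretrained word vectors
rng = np.random.default_rng(seed=2)
word_vectors = {w: rng.normal(size=300)
                for w in ["how", "good", "are", "the", "health", "services"]}

embedding = mean_sentence_embedding("How good are the health services?", word_vectors)
print(embedding.shape)  # (300,)
```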

As for the original BERT models, we follow the advice of [2, 26] to average the word embeddings produced at the last layer of BERT to form sentence-level embeddings. This way, the resultant sentence-level representation has the same dimension as that of the word embeddings.
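As an illustration, the mean pooling of BERT’s last-layer token embeddings could be implemented roughly as follows; the snippet assumes the Hugging Face transformers and PyTorch libraries and the "bert-base-uncased" checkpoint, which is one possible way to obtain these representations rather than the exact pipeline used here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

question = "How good are the health services in your country?"
inputs = tokenizer(question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Average the last-layer token embeddings, masking out padding positions
token_embeddings = outputs.last_hidden_state    # shape: (1, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)   # shape: (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                 # torch.Size([1, 768])
```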

4.2 Data sets

4.2.1 Synthetic data set

For the analysis of content validity, convergent validity and discriminant validity, we use a data set of subjective survey questions which we generate by ourselves (hence the name “synthetic”). The reason for using a synthetic data set over existing survey questions is two-fold. First, because we would like to study how well a text embedding model encodes subjective survey questions in general, it is important that the questions we study cover a diverse range of concepts and formulations. Existing questionnaires like the European Social Surveys only concern a small set of concepts and formulations. Second, for the analysis of convergent and discriminant validity, we need to introduce small perturbations to a reference survey question to create similar and dissimilar survey questions. This is difficult when the survey questions themselves are problematic, which is the case for many existing survey questions in use. For instance, many surveys use this question to measure generalized trust: “Generally speaking, would you say that most people can be trusted, or that you can’t be too careful in dealing with people?”. The problem with this question is that it is unclear what this question is measuring: trust or caution [49]? This makes it difficult to make variants of the question that are similar or different in terms of the underlying concept. Using synthetic questions, in contrast, allows us to focus on high-quality questions that are also easier to vary.

Therefore, we create a synthetic data set of survey questions for the analysis of content, convergent and discriminant validity. To generate a diverse set of high-quality survey questions, we follow the three-step procedure described in [50]. The full details are described in Appendix A. We summarize below three main characteristics of this data set.

First, we focus on covering a wide selection of survey questions. According to Saris and Gallhofer [50], subjective survey questions normally fall under one (or more) of the following 13 basic concepts: “evaluation”, “importance”, “feelings”, “cognitive judgment”, “causal relationship”, “similarity”, “preferences”, “norms”, “policies”, “rights”, “action tendencies”, “expectation”, and “beliefs” (see Appendix B for definitions and examples of these 13 basic concepts). The questions in our data set therefore cover these 13 basic concepts.

Second, for every subjective concept, we assign three reference concrete concepts. Take the basic concept “evaluation” as an example: we specify “the state of health services”, “the quality of higher education” and “the performance of the government” as the three corresponding reference concrete concepts. Next, for every reference concrete concept, we specify one similar concrete concept and one dissimilar concrete concept (see Appendix C for a complete list of concrete concepts). Finally, for each concrete concept, we create survey questions that vary in their formulation. [50] provided many templates for each type of formulation. We adopt 19 templates (see Appendix D) and thus create differently formulated survey questions for each concrete concept. Our final data set contains 5436 unique subjective survey questions.

Table 2 shows six example questions from the data set. They all fall under the basic concept “evaluation”. The main concrete concept here is evaluation about “the state of health services”, while the corresponding similar and dissimilar concepts are evaluation about “the state of medical services” and “the state of religious services”. Each concrete concept has two differently formulated questions in the table: DR (i.e. direct request) and InDe (i.e. interrogative-declarative request).

Table 2 Example Questions from Our Survey Question Data Set. InDe: interrogative-declarative request. DR: direct request

4.2.2 European social survey wave 9

For the analysis of predictive validity, we use the publicly available European Social Survey (ESS) Wave 9 data [51]. More information is available in Sect. 4.6.1.

4.3 Analysis of face validity

Survey questions are carefully curated sentences that are used to measure some concepts or constructs. This means that an embedding model should ideally: (1) provide sentence-level representations; (2) have been trained on texts similar to survey questions. Therefore, to examine the face validity of the embedding models (i.e. fastText, GloVe, BERT, Sentence-BERT and USE), we can check whether they fulfil these two requirements.

With regard to the first aspect, only Sentence-BERT and USE can generate sentence-level embeddings by design. The other models require aggregating word-level embeddings to form sentence-level embeddings. However, because BERT can produce context-dependent embeddings, its representations of survey questions are likely of higher quality than those of fastText and GloVe.

For the second aspect, none has been trained specifically on survey questions. However, all the models have been trained on common data sources (e.g. Common Crawl, Wikipedia and Reddit) which certainly contain some survey questions. Among them, Sentence-BERT models are additionally trained on many question answering data sets, which may allow them to represent survey questions better.

Taking both aspects into account, Sentence-BERT seems to show the highest face validity, followed by USE, then BERT, then fastText and GloVe.

4.4 Analysis of content validity

The analysis of content validity concerns whether text embeddings encode information about all relevant aspects of survey questions. Naturally, not all aspects are equally important, and we also cannot provide an exhaustive list of them. In this paper, we consider four such aspects.

The first aspect concerns the underlying concepts. According to [50], most subjective survey questions can be categorised into one of 13 so-called basic concepts, such as “feelings”, “cognitive judgement” and “expectations” (see Appendix B). In addition to the basic concept, a survey question also has a concrete concept, such as “happiness” (under the basic concept “feelings”) and “political orientation” (under “cognitive judgement”) (see Appendix C).

Furthermore, survey questions can differ in terms of formulation. Specifically, five types of formulation often apply in survey research [50]: direct request (DR), imperative-interrogative request (ImIn), interrogative-interrogative request (InIn), declarative-interrogative request (DeIn) and interrogative-declarative request (InDe). See Appendix D for examples of these different formulations.

Lastly, complexity is another important aspect of survey questions which can affect how respondents understand and answer a survey question [52]. A simple proxy for complexity is the length of a survey question [50].

Therefore, we investigate whether text embeddings encode information about the following aspects of survey questions: basic concepts, concrete concepts, formulation and length. We refer to them as properties in the remainder of the paper.

4.4.1 Methods

We use probing classifiers for content validity analysis. Following [35–38], we include two baselines: simple majority in the training data and random embeddings. To generate random embeddings for each survey question, we draw, from a uniform distribution on (−1, 1), a unique fixed-size embedding for each word in the training data. Then, we simply average the word embeddings along each dimension to derive sentence-level embeddings for the survey questions.
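This random-embedding baseline can be sketched as follows; the vocabulary is a toy placeholder for the set of words in the training questions.

```python
import numpy as np

def build_random_lookup(vocabulary, dim=300, seed=0):
    """Assign each training word a fixed embedding drawn from a uniform
    distribution on (-1, 1)."""
    rng = np.random.default_rng(seed)
    return {word: rng.uniform(-1, 1, size=dim) for word in sorted(vocabulary)}

def random_sentence_embedding(sentence, lookup, dim=300):
    """Average the random word embeddings along each dimension; words not
    seen in the training data are skipped."""
    vectors = [lookup[t] for t in sentence.lower().split() if t in lookup]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

vocabulary = {"how", "good", "are", "the", "health", "services"}  # toy training vocabulary
lookup = build_random_lookup(vocabulary)
print(random_sentence_embedding("how good are the health services", lookup).shape)  # (300,)
```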

A common problem with probing classifiers is that the good performance of the classifier could simply be due to it making use of other properties present in the embeddings that are correlated with the properties of interest [53]. For instance, if we want to find out whether text embeddings encode information about basic concepts, our training data should differ only in terms of the probed property (i.e. basic concepts). In other words, for survey questions corresponding to a particular basic concept (e.g. “feelings”), the distribution of other properties should be similar to that of questions belonging to another basic concept (e.g. “expectation”). Otherwise, we cannot conclude that the performance of our classifier is explained by whether the text embeddings encode knowledge about basic concepts.

Unfortunately, with natural language data such as survey questions, it is extremely difficult, if not impossible, to construct a data set where properties like concepts, length and formulation are completely uncorrelated. To mitigate this issue, we need to construct our training and test sets in a way that they do not share the same distribution of the correlated properties. In this way, the probing classifier is less likely to make use of the correlated properties to achieve good performance on the test sets. We explain below how we achieve this.

In our data, we see that sentence length (as a four-level categorical variable, see Table 3) is highly correlated with all the other properties. Chi-square tests of independence show that sentence length is statistically significantly related to basic concepts (\(\chi ^{2} = 3636.7\), \(df = 36\), \(p < 0.05\)), concrete concepts (\(\chi ^{2} = 4612.1\), \(df = 348\), \(p < 0.05\)) and formulation (\(\chi ^{2} = 1252.7\), \(df = 12\), \(p < 0.05\)), with multiple testing corrected for. Therefore, when probing these properties (i.e. basic concepts, concrete concepts and formulation), we split the data into a training set and a test set of similar sizes, where the length of the survey questions in the training set is different from that in the test data (see Fig. 1 for illustration). Likewise, when probing sentence length, we make sure that our training and test data do not share the same concepts or formulation. Furthermore, when probing basic concepts, because concrete concepts are nested within basic concepts (and hence highly correlated), we make sure that the concrete concepts between the training set and the test set do not overlap.

Figure 1: Illustration of Train-test Splitting for the Probing Analysis. Note that in this figure, the probing target is basic concept, while the correlated feature is sentence length.
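The splitting logic can be sketched with pandas as follows; the data frame and its column names are hypothetical stand-ins for the synthetic survey question data with pre-computed properties.

```python
import pandas as pd

# Hypothetical question-level data with pre-computed properties
questions = pd.DataFrame({
    "text": ["..."] * 8,  # survey question texts (elided)
    "basic_concept": ["feelings", "feelings", "norms", "norms",
                      "policies", "policies", "rights", "rights"],
    "length_level": ["0-10", "13-15", "0-10", "16-25",
                     "11-12", "16-25", "11-12", "13-15"],
})

# When probing basic concepts, keep the length distributions disjoint:
# shorter questions form the training set, longer ones the test set.
train_levels = {"0-10", "11-12"}
train = questions[questions["length_level"].isin(train_levels)]
test = questions[~questions["length_level"].isin(train_levels)]
print(len(train), len(test))
```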

Table 3 Results of Content Validity Analysis: Prediction Accuracy Scores of Probing Classifiers. Note that sentence length is converted into a categorical variable with four levels including “0-10”, “11-12”, “13-15” and “16-25”; basic concept, concrete concept and formulation are also categorical with 13, 117 and 5 levels, respectively

Nevertheless, even separating the training and test sets in terms of sentence length is not enough for effective probing of concrete concepts. We find that regardless of whether we use random embeddings or the actual text embeddings, the classifier always achieves perfect performance on the test set. This absence of difference in performance prevents us from concluding whether the embedding models encode information about concrete concepts to different extents. It is likely due to the fact that predicting concrete concepts may rely solely on the presence of certain words, which is a simple task that can be fully captured even by random embeddings. We therefore increase the difficulty of the probing task for concrete concepts. Specifically, we make the classifier predict, for a survey question, its similar concrete concept (such as “the importance of achievement” and “the importance of success”; see Sect. 4.2.1), while ensuring that the training set and the test set do not contain the exact same concrete concepts.

Using the probing approaches above, if we observe any positive difference between the performance of the probing classifier and that of the baseline using random embeddings, we can more confidently attribute it to the relevant survey question property being encoded in the text embeddings (on top of simple word-level information). In this way, we can learn about whether one text embedding model encodes more information about a property than does another model.

4.4.2 Results

Table 3 summarises the performance of the probing classifier (multinomial logistic regression) across different embedding models. The baseline classification accuracy scores are based on simple majority voting, random embeddings of three dimension sizes, and two simple count-based text representation approaches: term frequency (TF) and term frequency-inverse document frequency (TF-IDF).

If the classifier performs better on a particular type of text embeddings than on the random embedding baselines for a survey question property, we can conclude that the corresponding text embeddings of survey questions likely encode information about that property.

For sentence length, we see that the BERT-based and the USE embeddings perform better than all the baselines, which suggests that they likely encode information about sentence length. Among them, both original BERT models have better performance than any of the Sentence-BERT models.

For basic concepts, none of the pretrained text embeddings seems able to beat the performance of the baseline random embeddings, TF and TF-IDF vectors. The fact that all text embeddings (including the random embeddings) have similar performance and perform better than the simple majority baseline suggests that only simple word-level information could be used by the classifier. A possible explanation is that the basic concepts as defined by [50] are too abstract to be explicitly encoded in the embeddings.

As for concrete concepts, all types of pretrained text embeddings perform better than the baselines. This suggests that text embeddings likely encode information about concrete concepts of survey questions. We also see that both Sentence-BERT embeddings (0.916 and 0.929) show better performance than do the original BERT embeddings (0.815 and 0.739). Furthermore, USE (0.903) and GloVe (0.908) have similarly good performance.

Lastly, we can see that random embeddings themselves can already achieve good prediction of the types of formulation, likely because single words are indicative of formulation. This holds true also for TF and TF-IDF. The random embeddings even outperform fastText and GloVe, although the margin is relatively small. The original BERT representations, as with sentence length, again perform the best, suggesting that they encode sentence-level information about formulation. Both Sentence-BERT models and USE also perform better than the random baselines, though by a much smaller margin.

To conclude, we find that different text embedding models encode somewhat different kinds of information about survey questions, and to different degrees. If we rank the importance of the properties of survey questions in the order of concepts, formulation and sentence length, then USE seems to demonstrate the highest level of content validity with regard to survey questions on average. The Sentence-BERT and original BERT models follow closely. fastText and GloVe, as word embedding models, do encode some information about survey questions like concrete concepts, but not sentence length or formulation.

4.5 Analysis of convergent & discriminant validity

We analyse convergent validity of text embeddings for survey questions as the extent to which the text embeddings of two conceptually similar survey questions are similar to each other. High convergent validity (as is desired) would be indicated by a high degree of similarity between the two text embeddings. In contrast, discriminant validity concerns the degree to which two conceptually dissimilar survey questions differ in their text embeddings. High discriminant validity (as is desired) is signalled by low similarity between the text embeddings.

4.5.1 Data

We use the same synthetic data set described in Sect. 4.2.1.

4.5.2 Methods

We take a joint approach to examining convergent and discriminant validity. That is, if text embedding models possess both convergent and discriminant validity, the embeddings of conceptually similar survey questions would be closer to one another while the embeddings of conceptually dissimilar survey questions would be further apart. Two hypotheses naturally follow:

With Hypothesis 1, we expect cosine similarity scores to be higher between the embeddings of conceptually similar survey questions than between those of conceptually dissimilar survey questions, with all other aspects of the survey questions being the same.

With Hypothesis 2, we expect cosine similarity scores to be higher between the embeddings of conceptually identical but differently formulated survey questions than between those of conceptually dissimilar but identically formulated survey questions.

In the synthetic survey question data set, we can find many pairs of survey questions that only differ in their concrete concepts and those that differ in their formulation but not in their concrete concepts.

For Hypothesis 1, we first calculate the cosine similarity between the embedding of a given survey question (i.e. \(E_{\text{reference}}\)) and the embedding of the corresponding conceptually similar question (i.e. \(E_{\text{similar}}\)). Then, we calculate the cosine similarity between \(E_{\text{reference}}\) and the embedding of the corresponding conceptually dissimilar question (i.e. \(E_{\text{dissimilar}}\)). In this way, we obtain \(\cos{(E_{\text{reference}}, E_{\text{similar}})}\) and \(\cos{(E_{\text{reference}}, E_{\text{dissimilar}})}\). We expect the difference between these two scores (i.e. \(\cos{(E_{\text{reference}}, E_{\text{similar}})} - \cos{(E_{ \text{reference}}, E_{\text{dissimilar}})}\)) for a given survey question to be larger than zero. As an example, in Table 2, the two scores of interest are \(\cos{(E_{\text{ID1}}, E_{\text{ID3}})}\) and \(\cos{(E_{\text{ID1}}, E_{\text{ID5}})}\). Note that the two comparison questions differ from the reference question only in terms of the underlying concrete concepts; all other aspects like the formulation and sentence length are identical. This applies to all the question triads, which allows us to attribute any observed differences in similarity scores to the differences in the concrete concepts.

For Hypothesis 2, we first calculate the cosine similarity between the embedding of a given survey question (i.e. \(E_{\text{reference}}\)) and the embedding of the corresponding conceptually identical but differently formulated question (i.e. \(E_{\text{identical}}\)). Then, we calculate the cosine similarity between \(E_{\text{reference}}\) and the embedding of the corresponding conceptually dissimilar but identically formulated question (i.e. \(E_{\text{dissimilar}}\)). In this way, we obtain \(\cos{(E_{\text{reference}}, E_{\text{identical}})}\) and \(\cos{(E_{\text{reference}}, E_{\text{dissimilar}})}\). We expect the difference between these two scores (i.e. \(\cos{(E_{\text{reference}}, E_{\text{identical}})} - \cos{(E_{\text{reference}}, E_{\text{dissimilar}})}\)) for a given survey question to be larger than zero. In the exemplar Table 2, the two scores of interest are \(\cos{(E_{\text{ID1}}, E_{\text{ID2}})}\) and \(\cos{(E_{\text{ID1}}, E_{\text{ID5}})}\). Note that each comparison question differs from the reference question only in terms of one aspect: either concept or formulation.

4.5.3 Results

Figure 2 shows the distribution of the difference between \(\cos{(E_{\text{reference}}, E_{\text{similar}})}\) and \(\cos{(E_{\text{reference}}, E_{\text{dissimilar}})}\) scores for Hypothesis 1 across various baselines and text embedding approaches. The more positive the difference scores are, the more support for convergent and discriminant validity.

Figure 2: The distribution of the cosine similarity difference scores for Hypothesis 1 across text representation approaches. The y-axis indicates the size and direction of the differences. The more positive the difference scores are, the more support for convergent and discriminant validity.

We can see in Fig. 2 that the only models that consistently score above zero are the two Sentence-BERT models (“All-DistilRoBERTa” and “All-MPNet-Base”) and USE, with the percentages of positive scores being 98.3%, 96.8% and 95.4%, respectively. This result shows evidence of convergent and discriminant validity for the three models. Only in a small percentage of cases does this observation not hold. In stark contrast, none of the baseline models (i.e. TF, TF-IDF, random embeddings) shows performance comparable to any of the Sentence-BERT and USE models. To our surprise, this observation also holds for fastText, GloVe and the two original BERT pretrained models, suggesting that these text embeddings lack convergent and discriminant validity. However, for the original BERT embeddings, one other possible explanation is that cosine similarity might not be a suitable measure, as earlier research suggested [2].

Figure 3 shows the distribution of the difference between \(\cos{(E_{\text{reference}}, E_{\text{identical}})}\) and \(\cos{(E_{\text{reference}}, E_{\text{dissimilar}})}\) scores for Hypothesis 2. Note that Jaccard similarity is explicitly included here as an additional baseline of similarity between two survey questions. It is calculated as the ratio of the number of unique overlapping words (i.e. intersection) to the total number of unique words between two survey questions (i.e. union). Naturally, Jaccard similarity scores are bounded in the interval [0, 1]. For Hypothesis 1, because any pair of comparison questions differs in only one word, the difference in Jaccard similarity for any pair of comparison questions would always be zero and therefore not a useful measure.
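For reference, the Jaccard similarity between two questions can be computed as follows.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Ratio of the number of unique overlapping words (intersection) to
    the total number of unique words across both texts (union)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

q1 = "how good are the health services"
q2 = "how good are the medical services"
print(jaccard_similarity(q1, q2))  # 5 shared words out of 7 unique words: ~0.71
```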

Figure 3: The distribution of the cosine similarity difference scores for Hypothesis 2 across text representation approaches. The y-axis indicates the size and direction of the differences. The more positive the difference scores are, the more support for convergent and discriminant validity.

Similar to Fig. 2, the two Sentence-BERT models and USE again score consistently above zero (in 98.8%, 97.9% and 87.2% of the cases, respectively). Only in a few cases does this observation not hold (e.g. the concrete concept “petition institutional racism”). We can thus say that for these models, there is reasonable evidence for convergent and discriminant validity. Most of the other approaches (including the baselines, the random embedding models and the original BERT models) score either around or below zero. The only exception is TF-IDF, which scores above zero in 94.5% of cases, suggesting evidence for convergent and discriminant validity. However, this conclusion should be treated with great caution, because when we generate the TF-IDF vectors, we build the vocabulary based on all the survey questions. We follow this approach because in this analysis, it is unclear what the training and testing data should be. In real research applications, TF-IDF may not perform as well due to differences in vocabulary between training and test data.

Finally, it is worth noting that the two Sentence-BERT models perform either about equally well or better on Hypothesis 2 than on Hypothesis 1, in terms of the percentages of scores above zero. This is somewhat surprising, considering that the task in the second hypothesis is supposedly more difficult because the survey questions differ in one extra aspect: formulation.

Overall, we can conclude that text embeddings of survey questions based on Sentence-BERT and USE demonstrate the highest convergent and discriminant validity. Meanwhile, there is not enough evidence to suggest the same for the other approaches.

4.6 Analysis of predictive validity

If embeddings are valid measures of survey questions, we should be able to use them to help with the prediction of people’s responses to survey questions. For instance, knowing someone’s response to a survey question about gender issues, we should be able to predict that person’s response to a question about family values because the two are conceptually related. Also, knowing how someone responds to a particular question formulation may also be helpful in predicting how that person would respond to new questions of similar formulation. Therefore, we test the predictive validity of survey question embeddings by checking whether they can improve the prediction of a respondent’s answer to new survey questions, compared to not using text embeddings. Specifically, we can inspect the correlation between the predicted responses and the actual responses. The higher the correlation (compared to some baseline), the more evidence for predictive validity.

4.6.1 Data: European social survey wave 9

We use the publicly available European Social Survey (ESS) Wave 9 data [51]. The ESS is a research-oriented cross-national survey that has been conducted every two years since 2001 with newly selected, cross-sectional samples [54]. The survey aims to measure attitudes, beliefs and behaviour patterns of diverse populations in Europe, concerning topics like media and social trust, politics, subjective well-being, human values and immigration. We focus on the UK sample (\(N= 2204\)), because the official language of the UK is English, which is consistent with the language of the data on which our text embedding models are pretrained.

Out of the more than 200 questions asked of the participants, we select only those which measure subjective concepts and are continuously or ordinally scaled, totalling 94 questions. This choice is consistent with the type of survey questions we examined previously in the analysis of content, convergent and discriminant validity. To harmonise the differences in response scales across the survey questions, we rescale all the responses to be between 0 and 1.

In addition to these 94 survey questions and the individual responses to each of them, we included the following background variables for each participant: region, gender, education, household income, religion, citizenship, birthplace, language, minority status, marital past and marital status.

The reason for including background variables is two-fold. First, background variables are predictive of many subjective opinions (like the ones measured by the ESS survey questions). Second, we also expect that different people (defined by their background characteristics) will interpret survey questions differently, which in turn influences their responses. This can be modelled by allowing the background variables to interact with the text embedding dimensions, which may help the model to achieve better prediction performance.

Note that we use existing survey questions here (despite some of them having quality problems like unclear underlying concepts), because for this predictive validity analysis, we need response data, which we cannot generate synthetically.

4.6.2 Methods

The data set includes two types of features: (1) various background variables and (2) text representations of survey questions. The prediction target concerns respondents’ answers to survey questions. Every respondent answered all or most of the 94 survey questions. Each row in the data corresponds to a unique respondent and survey question combination.

To estimate the average performance of a prediction model, we apply 10-fold cross-validation [55]. Namely, we split the data into 10 folds by survey question IDs, whereby each data partition contains its own unique set of survey questions and responses (see Fig. 4 for illustration). Then, in each of the 10 cross-validation loops, we use 9 folds as the training set (covering 90% of the survey questions) and keep the remaining fold as the test set. We train the prediction model on the whole training set, again using 10-fold cross-validation but this time for hyperparameter selection (i.e. tuning). Once we have the tuned model, we apply it to the test set and obtain an evaluation score. Because we apply 10-fold cross-validation to the entire data set, we obtain 10 evaluation scores in total. Finally, we average the 10 scores to arrive at the performance estimate of the prediction model.

Figure 4: Illustration of Train-test Splitting for the Analysis of Predictive Validity
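A rough sketch of this nested, question-grouped cross-validation using scikit-learn is shown below; the feature matrix, responses and question IDs are random placeholders, and the hyperparameter grid is illustrative rather than the one actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# Placeholder data: one row per respondent-question pair
rng = np.random.default_rng(seed=3)
n_rows, n_features, n_questions = 2000, 50, 94
X = rng.normal(size=(n_rows, n_features))     # background variables + question representation
y = rng.uniform(size=n_rows)                  # responses rescaled to [0, 1]
question_id = rng.integers(0, n_questions, n_rows)

outer_cv = GroupKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=question_id):
    # Inner cross-validation (also grouped by question) for hyperparameter selection
    inner_splits = list(GroupKFold(n_splits=10).split(
        X[train_idx], y[train_idx], groups=question_id[train_idx]))
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          param_grid={"n_estimators": [100, 300]},
                          cv=inner_splits)
    search.fit(X[train_idx], y[train_idx])
    predictions = search.predict(X[test_idx])
    fold_scores.append(np.corrcoef(predictions, y[test_idx])[0, 1])  # Pearson's r per fold

print(np.mean(fold_scores))  # average over the 10 outer folds
```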

Text representation methods

Apart from the various embedding models, we include TF, TF-IDF and random embeddings of dimension 300, 768 and 1024 as baseline text representation methods. Note that with these baseline approaches, we build the feature vocabulary based only on the training data, similar to how we conducted the content validity analysis earlier. That is, new words encountered in the test set would be assigned zero weight.
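For the count-based baselines, fitting the vocabulary on the training questions only can be sketched as follows; the example questions are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_questions = ["how good are the health services",
                   "how important is achievement to you"]
test_questions = ["how good are the medical services"]  # "medical" is unseen

# The vocabulary is built from the training questions only, so unseen
# test-set words such as "medical" receive zero weight.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_questions)
X_test = vectorizer.transform(test_questions)
print(X_train.shape, X_test.shape)  # same number of feature columns
```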

Lasso and random forest

We adopt two popular prediction models. The first is Lasso regression [56], which differs from OLS regression by including an additional regularisation term in the loss function. The regularisation term has the advantage of reducing model variance (i.e. lower prediction error, at the cost of slightly higher bias). Furthermore, it can zero-out the parameter estimates of those predictors considered by the model to be “unimportant”, thus simplifying the model and easing interpretation.

The second model is Random Forest (RF) [57], which constructs multiple regression trees at training time and outputs the average prediction of all the trees. This approach of combining multiple models falls under the so-called ensemble learning technique, which generally provides the benefit of more powerful prediction. In addition, RF automatically considers interactions among the predictors, which Lasso regression does not. This may enable RF to learn more fine-grained patterns from the data.

Evaluation metric

We use Pearson’s correlation r to evaluate the predictive validity of text embeddings. Specifically, we measure the Pearson’s correlation between the predicted responses to survey questions and the observed responses. This metric has the advantage that it is bounded between −1 and 1, hence easier to interpret.
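Concretely, the metric can be computed as follows (the predicted and observed responses below are made up for illustration):

```python
from scipy.stats import pearsonr

# Hypothetical predicted and observed responses for one test fold
predicted = [3.1, 4.0, 2.2, 4.6, 1.8]
observed = [3, 4, 2, 5, 2]

r, _ = pearsonr(predicted, observed)   # bounded between -1 and 1
```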

As a prediction baseline (as opposed to TF, TF-IDF and random embeddings, which are baselines for text embedding models), we use the average response of each participant in the training data as the prediction for that participant’s responses in the test set. We choose this baseline because previous research has shown that a respondent’s answer behaviour (e.g. choosing a particular answer option) can be quite consistent across survey questions [58]. We also tried using model predictions based on only the background variables of respondents, but the model performance (the best Pearson’s r based on RF: 0.187) is similar to using the within-person average baseline.
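The within-person average baseline can be sketched as follows (the long-format toy data are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical long-format data: one row per respondent-question combination
train = pd.DataFrame({"respondent": [1, 1, 2, 2],
                      "question":   ["q1", "q2", "q1", "q2"],
                      "response":   [4, 5, 2, 3]})
test = pd.DataFrame({"respondent": [1, 1, 2, 2],
                     "question":   ["q3", "q4", "q3", "q4"],
                     "response":   [5, 4, 2, 3]})

# Each respondent's training-set mean is used as the prediction for all of
# their responses to the (unseen) test-set questions.
person_mean = train.groupby("respondent")["response"].mean()
baseline_pred = test["respondent"].map(person_mean)
baseline_r, _ = pearsonr(baseline_pred, test["response"])
```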

4.6.3 Results

Table 4 summarises the prediction performance of all text representation methods, measured as the average Pearson's correlation r across all 10 folds. Δ% in parentheses indicates the percentage change in r relative to the prediction baseline r. Higher scores mean that the predicted responses are more strongly correlated with the observed responses than the baseline predictions are. The more positive the value of Δ%, the more evidence for predictive validity. The 95% confidence intervals (CI) around r are also provided. The prediction baseline r is 0.187 with a 95% CI of [0.184, 0.190].

Table 4 Results of the Predictive Validity Analysis. r is the average Pearson’s correlation between predicted and observed scores. Δ% in the parentheses indicates the percentage change in r in comparison to the baseline r: 0.187. 95% CI refers to the 95% confidence interval around r

We can make three observations. First, RF consistently performs substantially better than Lasso regression. This is not a surprising result, as we know that RF can learn more complex patterns (like interactions) from data and impose stronger model variance reduction, compared to Lasso regression. Furthermore, because Lasso regression performs worse than or about the same as the prediction baseline, we can infer that the interaction between background variables and the survey question representations is crucial for a good prediction performance.

Second, although RF fares better overall, the exact r scores depend on the text representation method used. Simply using RF with the baseline text representations (TF, TF-IDF and random embeddings) already leads to a substantial improvement over the prediction baseline. All the embedding approaches achieved higher r scores than TF, TF-IDF and random embeddings did. This suggests that predictive validity holds (to some extent) for the embedding models, especially the BERT-based models and the USE.

Lastly, in Appendix E, we summarise the prediction performance in terms of RMSE (Root Mean Square Error) and MAE (Mean Absolute Error). We see that the performance pattern (i.e. which text representation performs better or worse) stays similar. However, the magnitude of improvement over the baseline prediction is much smaller, compared to when Pearson’s correlation is used as the evaluation metric. This may suggest that the prediction models are more capable of predicting the trend in survey responses (i.e. higher or lower) rather than the exact values.

In summary, all the text embeddings exhibit some degree of predictive validity: they can be used, to some extent, to predict respondents' answers to completely new survey questions. However, the level of predictive validity that can be demonstrated depends heavily on the specific prediction algorithm, the embedding model and the evaluation metric used.

4.7 Summary

We find that different text embedding models demonstrate different degrees of validity evidence for the purpose of survey question representation. For instance, the USE and the two Sentence-BERT models show the best overall performance across all the construct validity analyses. In contrast, fastText fails to achieve performance comparable to the BERT-based approaches and the USE in any of the validity analyses. Furthermore, even for the same text embedding model, performance varies depending on the specific type of construct validity analysis. For instance, in our probing experiments, the USE and the two Sentence-BERT models show the best overall content validity. However, the original BERT models perform best in probing the formulation of survey questions. GloVe achieves prediction accuracy comparable to the Sentence-BERT models when probing concrete concepts, but it does poorly on the other probing tasks.

5 Conclusion and discussion

In this paper, we propose to view the quality evaluation of text embedding models as a measurement validity problem. That is, we can view what needs to be encoded from texts for a research task as the construct of interest, the text embedding model as the corresponding measure, and the embeddings as the measurements. In this way, we can apply the classic construct validity testing framework to evaluating the quality of text embeddings. This approach has the advantage of being grounded in measurement theory, providing a clear framework about what needs to be tested (i.e. in terms of the different types of validity), and being applicable to many substantive research questions in computational social science.

We describe how we can adapt this framework to suit text embeddings. For face validity, we recommend looking at readily available information about a text embedding model, such as its architecture and the data it was pretrained on. For content validity analysis, we suggest using probing classifiers to investigate the explicit information encoded in text embeddings. For convergent and discriminant validity analysis, we recommend constructing test cases where the texts differ only on a main property of interest. Then, we can inspect whether the cosine similarity score for a pair of texts is in line with our theoretical expectation. As for predictive validity, we need to define a relevant prediction task whose observed scores can serve as the prediction target. The better the prediction, the more evidence for predictive validity.
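For example, the convergent/discriminant check can be sketched as follows, where `embed` stands for whichever embedding model is under evaluation (a hypothetical callable, not a specific library function):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_pair(embed, text_a, text_b):
    """Return the cosine similarity between the embeddings of two test-case texts.

    Convergent validity: a question and its paraphrase should score high.
    Discriminant validity: texts differing on the property of interest should score lower.
    """
    return cosine_similarity(embed(text_a), embed(text_b))
```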

We also demonstrate this framework on the case of survey question representation. More concretely, we investigate the face, content, convergent, discriminant and predictive validity of various popular pretrained text embedding models, including fastText, GloVe, BERT, Sentence-BERT and USE. Note that although we use pretrained embeddings, our approach to construct validity analysis applies equally to fine-tuned embeddings and to embeddings obtained from continued pre-training.

5.1 Limitations

We are aware of several limitations of this study. First, we focus only on the validity of text embeddings. However, validity is only one aspect of measurement quality; the other is reliability (i.e. the stability of the measurements). Previous work has shown that text embeddings can be unstable, for instance across random seeds and hyperparameters [59–61]. This instability can therefore also affect the quality of text representation, and even the results we obtain in our application study. Nevertheless, addressing the instability of embedding models is beyond the scope of our study. Second, the synthetic survey question data set and the ESS data cover only a limited range of survey question types (in terms of, for instance, the concrete concepts measured and the presence of problematic features), and they do not necessarily resemble the type of survey questions commonly used in research. This means that our findings concerning the construct validity of the investigated text embedding models may not generalise to other/all survey questions. Nevertheless, the goal of our paper is not to perform a comprehensive analysis of the construct validity of text embeddings for all survey questions, but to show how to conduct construct validity analysis for text embeddings, with our specific choice of survey questions as an application example. Last but not least, while we follow Saris and Gallhofer’s [50] recommended procedure to generate the synthetic survey question data set (according to the authors, this procedure, “if properly applied, will always lead to a measurement instrument that measures what is supposed to be measured”), the measurement quality of those synthetic survey questions is assumed rather than empirically tested and supported.

5.2 Future directions

Our proposed approach to the quality evaluation of text embeddings, based on construct validity testing, can benefit many research areas that make use of text embedding models. For instance, predicting personality traits from Twitter data has been a difficult task, where models often do not outperform a simple majority-class baseline [62]. One possible reason is that relevant linguistic cues are not encoded in the text embeddings used to represent Tweets. Our construct validity analysis can help to uncover possible reasons: Do the embeddings encode relevant linguistic cues (i.e. content validity)? Are the embeddings robust to noise and sensitive to altered texts that would indicate a different personality type (i.e. convergent and discriminant validity)? Can the embeddings predict outcomes (e.g. personal achievements and health outcomes) that personality traits are known to predict (i.e. predictive validity)? We also hope that our proposed approach can provide insight into the strengths and weaknesses of current text embedding models and thus help the field develop more valid embedding models. Lastly, our paper focuses on the construct validity of text embeddings, while the construct validity of downstream models’ predictions is equally important and should be studied in the future.