The idea that language use reveals information about personality has long circulated in the social and medical sciences. Previous research has demonstrated that the way people use words conveys a great deal of information about themselves and their mental health conditions [1,2,3,4], including academic success [5]; however, much of this research has focused on the analysis of self-reports or essays. In contrast, implicit motives, which are indicators used by professional psychologists during aptitude diagnosis, are not readily accessible to the conscious mind and therefore cannot be detected through self-reports of personal needs or through essay writing [6]. Instead, they are primarily assessed using indirect measures that rely on projective techniques, which instruct individuals to produce imaginative stories based on ambiguous picture stimuli that depict people in different situations (examples of these stimuli are shown in Fig. 1). These pictures aim to influence the content of the subjects’ fantasy and to prompt them to project that fantasy onto the characters through a short (textual) story (see Table 1 for examples of the stories produced for the top-left image in Fig. 1). Consequently, the motivational response emerges through the content of the written imaginative material and can be coded for its motive imagery using standardized and validated content coding systems.

Fig. 1

Sample images that are shown to subjects during the OMT test. Image credit: organizers of the GermEval 2020 shared task (Footnote 2)

Table 1 Example answers produced for the stimulus image on the top-left side of Fig. 1

The most frequently used measures of implicit motives are the Picture Story Exercise (PSE) [7], the Thematic Apperception Test (TAT) [8], the Multi-Motive Grid (MMG) [9], and the Operant Motive Test (OMT) [10, 11]. Generally speaking, these tests are based on operant methods, i.e., participants are asked ambiguous questions or are shown simple images, which they have to describe. Specifically, the OMT is a projective instrument in which participants are presented with ambiguous pictures (e.g., sketched scenarios) and asked to think of a story that describes what is happening in the picture. Participants are asked to first pick the main protagonist, think of a story involving this person, and then answer the following three questions as spontaneously as possible: “What is important to this person in this situation and what is the person doing?”, “How does the person feel?”, and “Why does the person feel that way?” [11, 12]. Then, trained psychologists label these textual answers with one of five motives, namely M-power, A-affiliation, L-achievement, F-freedom, and 0-zero. Each motive is associated with a corresponding level or emotion (from 0 to 5), resulting in a total of 30 (5 motives \(\times\) 6 levels) different OMT categories. Table 2 briefly describes the meaning of the operant motives; the interested reader is referred to [11, 12].

Table 2 Brief outline of the motives (imagery types) [11, 12]

Even though there is currently a huge demand for psychological data and its automated analysis, as shown by the work presented at forums such as the CLPsych workshop,Footnote 1 little research had been performed until recently on the Operant Motive classification task [13,14,15,16,17,18]. The primary reasons are the lack of available labeled psychological text data, as [19] point out, and the difficulty of capturing psychological traits from text data, especially from very short texts.

Accordingly, in this paper, we aim to mitigate the lack of research on this issue and explore how recent advances in natural language processing (NLP) can be applied to the task of automatically identifying psychological traits from short textual data. Specifically, we perform an exhaustive evaluation of the classification of Operant Motives from text. We evaluate the impact of very recent deep learning architectures such as transformers [20] (BERT, XLM, DistilBert), recent generalization techniques such as supervised autoencoders [21], and traditional classification methods, e.g., fully connected neural networks and support vector machines. To perform our experiments, we use the dataset provided during the “GermEval 2020 Task on the Classification and Regression of Cognitive and Emotional Style from Text” [22].Footnote 2

The present paper is a substantial extension of our previously reported participation [18] in GermEval 2020 [22]. The main addition is the adaptation and evaluation of recent generalization techniques, namely deep supervised autoencoders (SAE). To the best of our knowledge, this is the first time such an exhaustive evaluation of SAE techniques has been performed on the OMT classification task. Additionally, we perform a thorough analysis of how the attention mechanism of the transformer architectures is affected when solving the OMT task.

In summary, the main contributions of this paper are as follows:

  1. To the best of our knowledge, this paper represents the first systematic exploration and comparison of several recent NLP technologies and machine learning techniques on the OMT classification task;

  2. We propose a supervised autoencoder architecture for solving the OMT classification task; as per our literature review, no recent research has applied this type of technology to the posed task. At the same time, we evaluate the impact of different feature types, ranging from char n-grams to recent contextual embeddings, as inputs to the proposed SAE;

  3. Finally, we conduct an analysis of how the attention mechanism of the transformer-based architectures is adapted during the OMT classification task. This type of analysis provides, to some extent, transparency on how the classifier makes its decisions. During this process, we observed some strong connections between our results and psychometrics research.

The rest of the paper is organized as follows. Related work is discussed in “Related Work”. “Dataset” describes the dataset used in this work and its main characteristics. Details of our applied methodology are given in “Methodology”. The experimental setup, results, and analysis are provided in “Results and Discussion”. Finally, we share our main conclusions and future work directions in “Conclusion”.

Related Work

Nowadays, there is an acknowledged need for digital solutions that address the burden of mental health diagnosis and treatment. It is recognized that it will not be possible for professionals alone to treat everyone, and even if it were possible, some people might need alternative modalities to receive mental health support [23]; this situation has become more evident with the current COVID-19 pandemic. Examples of recent efforts to build technology in this direction are the Woebot [24] and Wysa [25] dialog systems, which provide health and therapy support for patients with depression symptoms, and Expressive Interviewing [26], a conversational agent aimed at helping users cope with COVID-19 issues.

The underlying hypothesis of most of these works is that language is a powerful indicator of our personality, our social or emotional status, and our mental health [3, 27]. Accordingly, the NLP community has proposed several methods to identify different psychological traits from texts and to examine the connection between language and mental health. Examples of this type of research include dementia identification [28, 29], depression detection [27, 30, 31], crisis counselling [32], suicide risk identification [31, 33, 34], mental illness classification [35, 36], anxiety detection [37], and personality trait identification [38, 39].

Although plenty of research has been done on the detection of mental disorders and personality traits, there has been very little research on identifying motivation, success, or similar characteristics from psychological projective tests. One representative work, which reopened the discussion of how these traits could be automatically detected through traditional NLP techniques, is the research reported in [16]. The authors performed feature engineering to train a logistic model tree (LMT) [40] to classify a reduced set of implicit motives (0, M, A, and L). An LMT is a decision tree that performs logistic regression at its leaves. The authors found that the perplexity of the language models for each motive, closed-class words, and ratios (words per sentence ratio, type/token ratio) were the most discriminating features during the classification process. In [15], the authors proposed using a Long Short-Term Memory (LSTM) neural network combined with an attention mechanism for classifying OMT motives (0, M, A, and L) from text data. For their experiments, the authors employed pre-trained German fastText word embeddings [41] to represent tokens in the OMT data. They report that, when they reviewed the tokens with high attention weights and compared them with the Linguistic Inquiry and Word Count (LIWC) tool [42], they found a weak connection between LIWC categories and the OMT theory.

More recently, during the 5th SwissText & 16th KONVENS Joint Conference 2020,Footnote 3 a shared task on the classification and regression of cognitive and motivational style from text was organized, GermEval 2020.Footnote 2 This was the first time a shared task for detecting and classifying OMT motives was organized, and it provided a large dataset that was carefully curated for this purpose [22]. An important characteristic of the provided dataset is that it includes the F-motive and the labeling of six emotions (or levels); it is the first dataset with these characteristics ever released. Three research teams participated in the shared task, and their main results showed the suitability of Bidirectional Encoder Representations from Transformers (i.e., BERT) for solving the posed task. The winning approach employed the pre-trained Digitale Bibliothek Münchener Digitalisierungszentrum (DBMDZ) German model, achieving an F-score of 70% in the prediction of motives and levels [17]. Generally speaking, and based on the shared task submissions [22], systems using pre-trained BERT embeddings and attention-based models tend to perform better than linear models. We refer the interested reader to the system description papers presented during the shared task [13, 17, 18].

Hence, with the aim of providing a more detailed analysis of the performance of different NLP techniques and machine learning approaches on the recently released GermEval 2020 dataset, we expect this paper to support future research on the comprehension of implicit motive theory and its automatic detection. Accordingly, we present a substantial extension of our system description paper [18], presented at GermEval 2020, with the following main differences: i) we introduce and perform a series of experiments using novel generalization techniques, namely deep supervised autoencoders, which have never been tested on the OMT task before; ii) we evaluate the performance of three different transformer-based architectures; and iii) we analyze the attention mechanism of transformer-based technologies, which helps to understand why this type of technology is well suited for this particular problem and shows strong connections with previous findings from the psychometrics research field.


Dataset

Table 3 Distribution of the train partition across the OMT’s motive and level values
Table 4 Distribution of the dev partition across the OMT’s motive and level values
Table 5 Distribution of the test partition across the OMT’s motive and level values

To perform our experiments, we employed the dataset available in the GermEval 2020 shared task on the “Classification and Regression of Cognitive and Motivational style from the text”.1 The provided data, mostly written in standard German, was collected from around 14,000 subjects who participated in the OMT test.Footnote 4 Each answer was manually labeled with a motive (0, A, L, M, F) and a level (from 0 to 5). This annotation was performed by expert psychologists, trained with the OMT manual as described in [11]. The data set comprises 209,000 texts: 167,200 form the training (train) partition,Footnote 5 20,900 the development (dev) partition, and 20,900 the test (test) partition. Tables 3, 4, and 5 show the distribution of the instances across the different classes for the train, dev, and test partitions, respectively.

As can be observed in Tables 3, 4, and 5, the dataset is highly unbalanced, making the classification task more challenging. The majority of the instances (\(\approx 41\%\)) belong to the power motive (M), followed by the achievement motive (L) (\(\approx 20\%\)). Regarding the levels, most of the instances are grouped in classes 4 (\(\approx 30.7\%\)), 5 (\(\approx 20.4\%\)), and 2 (\(\approx 20.84\%\)). It is important to mention that the same distribution holds in the dev and test partitions (Tables 4 and 5). Table 6 shows some statistics of the GermEval 2020 dataset for the train, dev, and test partitions. We compute the average number of tokens, the vocabulary size, and the lexical richness of each text in the dataset. Lexical richness (LR), also known as the “type/token ratio”, indicates how the terms from the vocabulary are used within a text. LR is defined as the ratio between the vocabulary size and the number of tokens of a text (\(LR=|V|/|T|\)). Thus, a value close to 1 indicates a higher LR, which means that most vocabulary terms are used only once, while values near 0 indicate that tokens are used more frequently (i.e., the text is more repetitive).
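The LR definition above can be illustrated with a minimal sketch. The function name `lexical_richness`, the whitespace tokenization, and the lowercasing step are assumptions made for illustration only; the tokenizer actually used to produce Table 6 is not specified in the text.

```python
def lexical_richness(text: str) -> float:
    """Type/token ratio LR = |V| / |T| of a single text.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; this is an illustrative choice, not the paper's tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # convention chosen here for empty texts
    vocabulary = set(tokens)
    return len(vocabulary) / len(tokens)


# Every word appears once, so LR reaches its maximum.
print(lexical_richness("er will die aufgabe allein schaffen"))  # 1.0
# One word repeated four times, so LR is low.
print(lexical_richness("gut gut gut gut"))  # 0.25
```

Applied to a whole partition instead of a single text, the same ratio shrinks as the number of tokens grows much faster than the number of distinct words, which is consistent with the low global LR values reported for the dataset.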

Two main observations can be made at this point. On the one hand, notice that for the three partitions (i.e., train, dev, and test), textual descriptions are very short, on average 20 tokens with a vocabulary of 18 words, resulting in a very high LR (0.92). The high LR value means that very few words are repeated within each textual description, i.e., there is very little redundancy. On the other hand, globally speaking, the complete dataset has a low LR (0.08 for train and 0.13 for dev and test). Although these values are not directly comparable due to the size of each partition, they indicate, to some extent, that information across texts is very repetitive, i.e., tested subjects use similar types of words to describe different images, even though the texts belong to different classes (motives and levels). Overall, this initial analysis helped us to gauge the complexity and nature of the data. Finally, we measure how well the German word embeddings cover our dataset (EmbC),Footnote 6 obtaining a coverage of 68.26% for the training partition, 83.93% for the dev set, and 84.10% for the test set. As with the LR score, we cannot directly compare EmbC results due to the size of each partition. Full coverage was not expected, given the noise contained in the OMT data, e.g., the many spelling and grammar errors. Nevertheless, it is relevant to highlight the low coverage in the training set (\(\sim 68\%\)). As will be explained in “Results Analysis”, this low coverage has an important impact on the experiments based on these word embeddings.
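A coverage ratio of this kind could be computed along the following lines. This is only a sketch under stated assumptions: the text does not say whether EmbC is measured over distinct word types or over all tokens, so the version below assumes word types; the function name `embedding_coverage` and the whitespace tokenization are likewise hypothetical.

```python
def embedding_coverage(texts, embedding_vocab):
    """Fraction of distinct corpus words that have a pre-trained vector.

    Assumptions: coverage is computed over word types (not tokens) and
    texts are tokenized by lowercasing and splitting on whitespace.
    """
    corpus_vocab = {token for text in texts for token in text.lower().split()}
    if not corpus_vocab:
        return 0.0
    covered = corpus_vocab & set(embedding_vocab)
    return len(covered) / len(corpus_vocab)


# Toy example: the misspelled word "glücklcih" has no embedding.
texts = ["sie ist glücklich", "sie ist glücklcih"]
vocab_with_vectors = {"sie", "ist", "glücklich"}
print(embedding_coverage(texts, vocab_with_vectors))  # 0.75
```

The toy example mirrors the explanation given above: misspelled words add new types to the corpus vocabulary without adding covered entries, which lowers the ratio.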

Table 6 Statistics of the OMT dataset in terms of number of tokens, vocabulary size, and lexical richness. The minimum length of the texts is 1 token, while the maximum length is 99, 90, and 96 tokens for the train, dev, and test partitions, respectively. In all partitions, 75% of the texts have a length of at most 27 tokens


Methodology

Figure 2 shows a general view of the methodology applied in our experiments. As shown, given a textual description, we extract different types of features: character n-grams, word n-grams, non-contextual word embeddings (FastText), and pre-trained contextual embeddings (BERT). Then, depending on the selected learning strategy, the computed features are fed to a specific learning technique, e.g., a supervised autoencoder (SAE), a fully connected network (FC), or a transformer-based architecture (ST), which is trained to detect motives and levels from text, i.e., the OMT task. It is important to mention that instead of treating the OMT task as a 30-class classification problem, we split it into two separate classification tasks: motive detection (5 classes) and level detection (6 classes). This decision follows the operant motive (OMT) theory [11], which states that motives and levels are orthogonal and thus not directly connected. At the end of our methodology, we fuse the predicted labels to obtain the motive-level combination of the given instance. Notice the dashed lines that go from BERT to the FC and SAE; these lines represent a series of experiments done after fine-tuning BERT on the posed task, where the newly computed embeddings are used as features for the FC and SAE learning techniques (see “Simple Transformer” for details of the fine-tuning process). Finally, the solid line that goes directly from the input textual description to the transformers block represents a series of experiments where the STs are configured as classifiers by adding a simple dense layer at the end.
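The final fusion step can be sketched as follows. The label format (motive letter followed by the level digit, e.g., `M4`) and the function name `fuse_predictions` are assumptions made for this illustration; the text only states that the two predictions are combined into one motive-level label.

```python
MOTIVES = ("0", "A", "L", "M", "F")   # five motive classes
LEVELS = (0, 1, 2, 3, 4, 5)           # six level classes


def fuse_predictions(motive_preds, level_preds):
    """Combine per-instance motive and level predictions into one label.

    Assumption: the combined label is written as "<motive><level>".
    """
    if len(motive_preds) != len(level_preds):
        raise ValueError("both classifiers must predict the same instances")
    fused = []
    for motive, level in zip(motive_preds, level_preds):
        if motive not in MOTIVES or level not in LEVELS:
            raise ValueError(f"unknown prediction: {motive!r}, {level!r}")
        fused.append(f"{motive}{level}")
    return fused


# Outputs of the two independent classifiers for three texts.
print(fuse_predictions(["M", "A", "0"], [4, 2, 0]))  # ['M4', 'A2', '00']
```

Because each classifier is trained separately, the fused output can be any of the 5 x 6 = 30 combinations, which allows a direct comparison with systems that treat the task as a single 30-class problem.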

Fig. 2

Proposed methodology. Given a textual description, we evaluated several text representations as input for different recent learning algorithms, including simple transformers, fully connected neural networks, and our proposed deep supervised autoencoder. In the end, predicted motive and level are combined to produce the final motive-level classification output

The following sections describe in detail the methodology applied in all our experiments. The proposed supervised autoencoder is detailed in “Supervised Autoencoders”. The fine-tuning of transformer-based architectures is described in “Simple Transformer”, and the configuration of the traditional fully connected neural network is described in “Fully Connected Neural Network”. “Preprocessing” details the preprocessing operations applied to the dataset. “Evaluation Metrics” describes the considered evaluation metrics. Finally, “Baseline and Validation Approaches” defines the considered baseline and the validation strategies employed (see Fig. 3).

Supervised Autoencoders

Fig. 3

An example of a Supervised Autoencoder, where the supervised component (y-labels) is included. The input of the SAE is any type of pre-defined features computed over the document collection, e.g., character n-grams, word embeddings, or sentence encodings

An autoencoder (AE) is a neural network that learns a representation (encoding) of input data and then learns to reconstruct the original input from the learned representation. Autoencoders are mainly used for dimensionality reduction or feature extraction [43]. Normally, AEs are used in an unsupervised learning fashion, meaning that we leverage the neural network for the task of representation learning. By learning to reconstruct the input, the AE extracts the underlying abstract attributes that facilitate an accurate reconstruction of the input.

A supervised autoencoder (SAE) is an autoencoder with an additional supervised loss on the representation layer (see Fig. 3). With a single hidden layer, the supervised loss is added at the output layer; in a deeper autoencoder, the supervised loss is attached to the innermost (smallest) layer, i.e., the bottleneck layer that is usually handed over to the supervised layer after training the autoencoder.

In supervised learning, the goal is to learn a function that maps a vector of inputs \(\mathbf {x} \in \mathbb {R}^d\) to a vector of targets \(\mathbf {y} \in \mathbb {R}^m\). Consider an SAE with a single hidden layer of size k, and with weights for the first layer defined as \(\mathbf {F} \in \mathbb {R}^{k \times d}\). The function is trained on a finite batch of independent and identically distributed (i.i.d.) data, \((\mathbf {x}_1,\mathbf {y}_1), ...,(\mathbf {x}_t,\mathbf {y}_t),\) with the goal of a more accurate prediction on new samples generated from the same distribution. The output layer consists of weights \(\mathbf {W}_p \in \mathbb {R}^{m \times k}\) to predict \(\mathbf {y}\) and weights \(\mathbf {W}_r \in \mathbb {R}^{d \times k}\) to reconstruct \(\mathbf {x}\). Let \(L_p\) be the supervised loss and \(L_r\) be the loss for the reconstruction error. In the case of regression, both losses might be represented by a squared error, resulting in the objective:

$$\begin{aligned} \frac{1}{t} \sum _{i=1}^{t} \Big [L_p(\mathbf {W}_{p}\mathbf {F}\mathbf {x}_i,\mathbf {y}_i) + L_r(\mathbf {W}_r\mathbf {F}\mathbf {x}_i,\mathbf {x}_i)\Big ] = \nonumber \\ \frac{1}{2t} \sum _{i=1}^{t} \Big [||\mathbf {W}_p\mathbf {F}\mathbf {x}_i - \mathbf {y}_i||_2^2 + ||\mathbf {W}_r\mathbf {F}\mathbf {x}_i - \mathbf {x}_i||_2^2\Big ] \end{aligned}$$

The addition of a supervised loss to the autoencoder loss function (Eq. 1) acts as a regularizer and results in learning a better representation for the desired task [21]. Although SAEs have been extensively evaluated on image classification tasks [21], their suitability for thematic and non-thematic text classification tasks has not been extensively evaluated; addressing this gap is an important contribution of this work.
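To make the objective in Eq. 1 concrete, the following minimal sketch evaluates it for a linear single-hidden-layer SAE using plain Python lists. The helper names (`matvec`, `squared_error`, `sae_objective`) and the toy dimensions are hypothetical, and the sketch only evaluates the loss; it does not train anything, and it omits the ReLU activations and deeper layers of the model actually used in the experiments.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def squared_error(prediction, target):
    """Squared Euclidean distance between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def sae_objective(F, W_p, W_r, batch):
    """Eq. 1 with squared-error losses: prediction term plus reconstruction term.

    F   : k x d encoder weights
    W_p : m x k weights predicting y from the hidden representation
    W_r : d x k weights reconstructing x from the hidden representation
    """
    total = 0.0
    for x, y in batch:
        hidden = matvec(F, x)                              # F x_i
        total += squared_error(matvec(W_p, hidden), y)     # supervised loss
        total += squared_error(matvec(W_r, hidden), x)     # reconstruction loss
    return total / (2 * len(batch))


# Toy setting with d = 2, k = 2, m = 1.
F = [[1.0, 0.0], [0.0, 1.0]]
W_r = [[1.0, 0.0], [0.0, 1.0]]   # perfect reconstruction
W_p = [[1.0, 1.0]]               # predicts x_1 + x_2

print(sae_objective(F, W_p, W_r, [([1.0, 2.0], [3.0])]))  # 0.0
print(sae_objective(F, W_p, W_r, [([1.0, 2.0], [1.0])]))  # 2.0
```

In the second call the reconstruction is still perfect, so the whole loss comes from the supervised term; this is the part of the objective that pushes the shared representation toward the labels.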

Consequently, in order to perform a broad evaluation of this approach, we passed different types of text representation as input features to the SAE, namely pre-trained BERT encodings and fine-tuned BERT encodings, in both cases using as representation the information extracted from the last hidden layer (LastHL) and the concatenation of the last 4 hidden layers (Concat4LHL).Footnote 7 Additionally, we tested several traditional text representation techniques: word and char n-grams (with ranges 1–2 and 1–3). Finally, we also evaluated the performance of the SAE architecture with non-contextual embeddings as the representation type; in particular, we tried the German FastText embeddings trained on 2 million German Wikipedia articles.Footnote 8 All these variations can be observed at the bottom part of Fig. 2. For all our experiments, the SAE model was configured with a nonlinear activation function (ReLU) and 3 hidden layers, the number of nodes in the representation layer was set to 300, and we trained for a maximum of 100 epochs.

Simple Transformer

The transformer model [44] introduces an architecture based solely on the attention mechanism; it does not use any recurrent networks, yet it produces results superior in quality to Seq2Seq [45] models while also addressing the long-term dependency problem found in Seq2Seq models.

For our experiments using simple transformer (ST) architectures, we set up three different state-of-the-art configurations:

  1. Bert [46]: we use a pre-trained model referred to as bert-base-german-cased, with 12 layers, 768 hidden units, 12 heads, and 110M parameters.Footnote 9 The model is pre-trained on a German Wikipedia dump (6 GB of raw text files), the OpenLegalData dump (2.4 GB), and news articles (3.6 GB). We refer to this configuration as ST-Bert in our experiments.

  2. XLM [47]: for this configuration we use a model with 6 layers, 1024 hidden units, and 8 heads, which is an English-German model trained on the concatenation of English and German Wikipedia documents (xlm-mlm-ende-1024). We refer to this configuration as ST-XLM in our experiments.

  3. DistilBert [48]: for this configuration we use a model with 6 layers, 768 hidden units, 12 heads, and 66M parameters (distilbert-base-german-cased). We refer to this configuration as ST-DistilBert in our experiments.

For all the previous configurations, in order to perform the fine-tuning of the ST architecture as a classifier, a simple dense layer with softmax activation is added on top of the final hidden state h of the first token [CLS], through a weight matrix W, and we predict the probability of label c the following way:

$$\begin{aligned} p(c|\mathbf {h}) = \text {softmax}(W\mathbf {h}) \end{aligned}$$

Then, all weights, including the model’s ones and W, are adapted, in order to maximize the log-probability of the correct label. The training is done using a Cross-Entropy loss function. To perform these experiments, we used the Simple Transformers library which allows us to easily implement this setup.Footnote 10
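The classification head in Eq. 2 and its training signal can be sketched in a few lines of plain Python. The function names (`softmax`, `classify`, `cross_entropy`) and the two-dimensional toy vectors are hypothetical; in the actual setup, h is the hidden state of the [CLS] token produced by the transformer, and the Simple Transformers library handles these steps internally.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, h):
    """p(c | h) = softmax(W h), as in Eq. 2 (no bias term, following the equation)."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)


def cross_entropy(probabilities, gold_index):
    """Negative log-probability of the correct label."""
    return -math.log(probabilities[gold_index])


# Toy example: 2 classes, hidden state of size 2.
W = [[1.0, 0.0], [0.0, 1.0]]
h = [2.0, 2.0]
probs = classify(W, h)
print(probs)  # [0.5, 0.5]
print(cross_entropy(probs, gold_index=0))  # equals log(2)
```

Minimizing this cross-entropy is equivalent to maximizing the log-probability of the correct label, which is the criterion stated above.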

As the main configuration parameters, we set the max_length parameter to 90,Footnote 11 and we re-trained the models for up to two epochs. Hereafter, we refer to the fine-tuned experiments as Bert-(FT). It is important to mention that the considered models, i.e., BERT, XLM, and DistilBERT, represent state-of-the-art language models available for the German language. Although there are many other recent technologies, none of them had been trained on German at the time of our experiments. Further details of the employed models can be found on the Hugging Face web page.Footnote 12

Fully Connected Neural Network

As an additional classification method, we configured a fully connected neural network (FC). This type of artificial neural network is configured such that all the nodes, or neurons, in one layer, are connected to all neurons in the next layer. The topology of the employed network and its configuration parameters are mentioned in Table 7.

Table 7 Fully connected neural network configuration parameters. Notice that the number of input neurons depends on the representation type, while the number of output neurons is determined by the classification task, e.g., for the motive task, there are 5 output neurons

For the experiments using FCs, we passed as input features the sentence representation generated using BERT encodings. To generate the representation of the input text, we evaluated several configurations, namely: last hidden layer (LHL), concatenation of the last 4 hidden layers (Concat4LHL), and min, max, and mean pooling of the last hidden layers. However, we only report the configurations with the best performance during the validation stage, i.e., LHL and Concat4LHL. On the one hand, to generate the Concat4LHL representation we concatenate the values of the last four layers for the [CLS] token. As is well known, the [CLS] token at the beginning of the sentence is treated as the sentence representation. On the other hand, for the LHL configuration, we keep as the text representation the values of the last hidden layer for the [CLS] token.
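The two reported representations can be illustrated with a small sketch that operates on an already computed stack of hidden states. The data layout (a list of layers, each a list of token vectors with [CLS] at position 0), the function names, and the concatenation order (from the fourth-to-last layer to the last) are assumptions for illustration; the text does not specify the order in which the four layers are concatenated.

```python
def last_hidden_layer(hidden_states):
    """LHL: the [CLS] vector (token 0) of the last layer."""
    return list(hidden_states[-1][0])


def concat_last_four(hidden_states):
    """Concat4LHL: [CLS] vectors of the last four layers, concatenated.

    Assumption: layers are concatenated from fourth-to-last to last.
    """
    if len(hidden_states) < 4:
        raise ValueError("at least four layers are required")
    features = []
    for layer in hidden_states[-4:]:
        features.extend(layer[0])
    return features


# Toy stack: 5 layers, 2 tokens per layer, hidden size 2.
# Token 0 plays the role of [CLS]; token 1 is an ordinary word piece.
hidden_states = [[[float(i), i + 0.5], [9.0, 9.0]] for i in range(5)]

print(last_hidden_layer(hidden_states))  # [4.0, 4.5]
print(concat_last_four(hidden_states))   # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
```

With a hidden size of 768, as in the BERT base model described earlier, LHL yields a 768-dimensional vector and Concat4LHL a 3072-dimensional one, which is why the number of input neurons of the FC depends on the representation type.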

For the reported experiments with the FC method (see Fig. 2), two configurations of BERT were tested for generating the LHL and Concat4LHL representations: i) pre-trained German encodings of BERT (distilbert-base-german-cased), referred to as Bert(pre-trained); and ii) the fine-tuned BERT encodings resulting from the re-training explained in “Simple Transformer”, referred to as Bert(fine-tuned).


Preprocessing

In our experiments, we apply different preprocessing operations depending on the selected representation type, i.e., word/char n-grams or contextual/non-contextual word embeddings (see Fig. 2).

In particular, for all the experiments performed with the pre-trained BERT embeddings, or in the fine-tuning process of the transformer architectures, we did not perform any type of preprocessing operation. In contrast, when char/word n-grams or FastText embeddings are employed, we apply the following preprocessing operations to the input text: 1) we remove all non-alphabetical symbols, e.g., numbers and unusual symbols; 2) every word is lowercased. Other preprocessing techniques, like removing stopwords or punctuation, or replacing German umlauts (ä, Ä, ö, Ö, ü, Ü) and ligatures (e.g., ß), were not applied, as previous research indicates that they yield no improvement [17].
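The two preprocessing operations used for the n-gram and FastText representations could be sketched as follows. The regular expression, the function name `preprocess`, and the decision to keep basic punctuation marks (since punctuation removal is listed among the operations that were not applied) are assumptions; the exact set of symbols removed in the experiments is not given in the text.

```python
import re

# Assumption: letters (including German umlauts and ß), whitespace, and basic
# punctuation are kept; digits and any other symbols are removed.
_UNWANTED = re.compile(r"[^a-zA-ZäöüÄÖÜß\s.,;:!?-]")


def preprocess(text: str) -> str:
    """Remove non-alphabetical symbols and lowercase every word."""
    cleaned = _UNWANTED.sub(" ", text)
    # Collapse the whitespace left behind by removed symbols.
    return " ".join(cleaned.lower().split())


print(preprocess("Größe 42 @ÄRGER"))
# größe ärger
print(preprocess("Sie fühlt sich GUT, weil sie 2 Aufgaben gelöst hat #top!"))
# sie fühlt sich gut, weil sie aufgaben gelöst hat top!
```

Note that umlauts and ß pass through unchanged apart from lowercasing, and misspellings are left as they are, in line with the choices described in this section.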

Finally, it is also worth mentioning that we did not apply any type of spelling or grammar correction. We decided against it because such errors have been shown to be important style markers in several authorship analysis tasks [49].

Evaluation Metrics

To measure the overall effectiveness of the classification process, we use standard set-based evaluation measures, namely precision, recall, and macro F-score. This choice is consistent with previous work and with the official OMT classification task, which reports and ranks results with these metrics [22], specifically using the macro F1.

Generally speaking, when evaluating a classification task, there are four types of outcomes that occur:

  1. True positives (TP) refer to the case when the classifier predicts that an observation belongs to class c and it actually belongs to that class.

  2. True negatives (TN) refer to the case when the classifier predicts that an observation does not belong to class c and it actually does not belong to that class.

  3. False positives (FP) occur when the classifier predicts that an observation belongs to class c when in reality it does not.

  4. False negatives (FN) occur when the classifier predicts that an observation does not belong to class c when in fact it does.

Thus, precision (P) and recall (R) are defined as shown in Eqs. 3 and 4, respectively.

$$\begin{aligned} P = \frac{TP}{TP+FP} \end{aligned}$$
$$\begin{aligned} R = \frac{TP}{TP+FN} \end{aligned}$$

The F-score (or F1), defined as the harmonic mean of P and R, is computed as follows:

$$\begin{aligned} \text {F1} = 2\times \frac{P\times R}{P + R} \end{aligned}$$

Although the F1 score is a good metric to compare the performance of classifiers, the macro F1-score (F1-macro) is recommended for assessing performance on problems with multiple binary labels or multiple classes. Accordingly, the F1-macro is defined as the mean of the class-wise F1-scores (Eq. 6):

$$\begin{aligned} \text {F1-macro} = \frac{1}{N}\sum ^{N}_{i=1}\text {F1}_{i} \end{aligned}$$

where i is the class index and N the total number of classes. Notice that the F1-macro gives every class the same weight regardless of its size, so a high score cannot be obtained by performing well on the majority classes alone.

Although accuracy (\(\text {Acc}=\frac{TP+TN}{TP+FP+TN+FN}\)) is a common metric to compare performance results, it is not recommended for classification problems with a large class imbalance. In such a scenario, a model that predicts the majority class for every instance can achieve a high classification accuracy without being useful for the posed task, a phenomenon known as the accuracy paradox [50].
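The contrast between accuracy and F1-macro can be reproduced with a short sketch. The function names are hypothetical, and two conventions are assumptions made here: precision, recall, and F1 are set to 0 when their denominator is 0, and the set of classes is taken as the union of gold and predicted labels.

```python
def f1_macro(gold, predicted):
    """Mean of the class-wise F1 scores (Eqs. 3 to 6)."""
    classes = sorted(set(gold) | set(predicted))
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def accuracy(gold, predicted):
    """Share of instances whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Imbalanced toy data: 8 texts of motive M, 2 of motive A.
gold = ["M"] * 8 + ["A"] * 2
majority_only = ["M"] * 10  # a classifier that always predicts M

print(accuracy(gold, majority_only))            # 0.8
print(round(f1_macro(gold, majority_only), 3))  # 0.444
```

The always-M classifier looks acceptable in terms of accuracy, while its F1-macro is low because the F1 of class A is 0; this is the effect that motivates ranking systems by F1-macro on this dataset.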

Baseline and Validation Approaches

As a baseline, we replicated the approach proposed by the GermEval 2020 OMT task organizers, i.e., a linear Support Vector Classifier (SVC) over a traditional tf-idf representation of the documents. The baseline consists of 30 binary (one-vs-all) SVCs, one per combined motive/level label.

In order to report robust and stable results, we implemented two different validation strategies. On the one hand, we performed stratified k-fold cross-validation with \(k=5\) using the entire dataset (train+dev+test); we refer to this configuration as the ‘5CFV’ experiments. On the other hand, we report results on the dev and test partitions, which allows direct comparison with the GermEval 2020 shared task participants.
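The idea behind the stratified split can be sketched as follows. This is a simplified, deterministic illustration: the function name `stratified_folds` is hypothetical, no shuffling is performed, and the instances of each class are dealt to the folds in round-robin order. It is not the implementation used in the experiments.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign instance indices to k folds while preserving class proportions.

    Assumption: instances of each class are distributed round-robin over the
    folds, without shuffling, so the result is deterministic.
    """
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy data with a 2:1 class ratio (10 texts of motive M, 5 of motive A).
labels = ["M"] * 10 + ["A"] * 5
for fold in stratified_folds(labels, k=5):
    held_out = [labels[i] for i in fold]
    print(held_out.count("M"), held_out.count("A"))  # 2 1 for every fold
```

In each of the k rounds, one fold is held out for evaluation and the remaining folds are used for training; stratification keeps the imbalanced motive and level distributions roughly the same in every round.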

Results and Discussion

The results of each of the considered approaches (see Fig. 2) are reported in Tables 8 and 9 for the fivefold validation strategy and for the dev partition, respectively. Results are reported in terms of F1-macro, precision, and recall metric.

Table. 8 Average performance (\(\mu\)) obtained across the fivefold cross-validation strategy (5CFV); the number in parentheses is the standard deviation (\(\sigma\))

From the results reported in Table 8, we can observe that the proposed baseline reaches an F1-macro of 64.5%, even though it treats the OMT task as a 30-class problem. A similar behavior is observed in Table 9, where the SVC baseline yields good performance on the dev partition (F1 = 63.9%). Hence, we can conclude that the proposed SVC is a strong baseline that shows stable results under both validation strategies.

In general, we can observe that our proposed supervised autoencoder (SAE) does not generalize as well as the ST and FC methods. In the 5CFV configuration, the best SAE result is achieved when fine-tuned encodings are used as the text representation (Table 8, SAE(Bert-FT), F1 = 67.4%), whereas in the dev partition the best performance is obtained when the input features are word n-grams in the range 1 to 2 (F1 = 63.4%). Word n-grams are known to capture the identity of a word and its context. Thus, these results indicate, to some extent, that the SAE attempts to exploit this information when solving the classification task. A similar performance is obtained when character n-grams are used as input features, specifically n-grams of size 1-3. These results are also interesting, as they are aligned with previous research findings that demonstrate the relevance of character n-grams in different non-thematic classification tasks [51, 52]. Character n-grams provide an excellent trade-off between sparseness and word identity, while at the same time combining different types of information: punctuation, the morphological makeup of a word, lexicon, and even context. As the main observations on the SAE performance, we can highlight that the fine-tuned BERT encodings produced the best results under the 5CFV strategy (outperforming the proposed baseline), whereas word and character n-grams did not improve on the baseline. Similarly, in the experiments on the dev partition (Table 9), the results obtained with the fine-tuned BERT encodings are similar to those obtained with word and character n-grams, but none of these configurations improved on the SVC baseline (63.9%).

Table. 9 Obtained results on the dev partition of the OMT classification task

Regarding the performance of the FCs, the best performance in the 5CFV experiments is obtained when we use as features the fine-tuned BERT encodings extracted from the last hidden layer (F1 = 69.3%), and the best performance in the dev partition is obtained with the concatenation of the last 4 hidden layers (F1 = 67.5%). In both cases (Tables 8 and 9), we can observe an important difference in the performance of the FC depending on whether pre-trained or fine-tuned encodings are used. Generally speaking, fine-tuning improves the performance of the neural networks (as expected), which outperform the proposed baseline in both cases.

Overall, based on these experiments (Tables 8 and 9), the best performance (in terms of F1 score) was obtained by a simple transformer (ST) using BERT embeddings. The attention-based architecture proved more effective than the FC and SAE methods. Consequently, for the GermEval 2020 competition, we submitted a subset of the configurations that we found to be most effective. Table 10 shows the performance of our submitted systems.

As can be observed in Table 10, our best performing system was the simple transformer architecture using BERT encodings. Specifically, this configuration obtained second place in the GermEval 2020 competition [22]. As a reference, we include at the bottom of the table the performance of the baseline system and the performance obtained by the first and second places. As expected, the SAEs were not able to improve on the baseline system. However, our configuration based on the fully connected network with fine-tuned BERT encodings was able to outperform the proposed baseline. It is worth mentioning that the winning approach in GermEval 2020 is also based on a BERT methodology [17], using the pre-trained Digitale Bibliothek Münchener Digitalisierungszentrum (DBMDZ, see footnote 13) German model, which validates the positive impact of transformer-based methods. Although the methodology employed by the winning approach [17] and by our best configuration is the same, there are a few important differences. We use as base model the BERT pre-trained on German Wikipedia, Open-Legal-Data, and news articles, whereas [17] used the DBMDZ model. We did not apply any data correction process, while the authors of [17] explored the data to find all non-German texts, automatically translated them into German, and applied a spellchecker to correct spelling mistakes. Nevertheless, in spite of this extra effort, the results obtained by both techniques are very close.

Table. 10 Obtained results on the test partition of the OMT classification task. Performance results are reported as given by the GermEval 2020 organizers [22]

Fig. 4 Confusion matrices for the MOTIVES classification task: top, baseline performance; bottom, ST performance

Fig. 5 Confusion matrices for the LEVELS classification task: top, baseline performance; bottom, ST performance

Results Analysis

In this section, we present a more detailed analysis of the results obtained by our best configuration. Accordingly, Figs. 4 and 5 compare the detailed classification results of the SVC baseline and our best configuration (ST-BERT). For ease of understanding, we split the problem into the detection of motives (Fig. 4) and the detection of levels (Fig. 5). It can be observed that the ST architecture increases the number of correctly classified instances in motives M (+2.5%), L (+6.2%), F (+14.38%), and A (+5.9%); however, this is not the case for motive zero (-4.5%). A similar situation occurs in the levels detection task (Fig. 5). For all level categories, the ST increases the number of correctly classified instances: 5 (+9.7%), 4 (+2.0%), 3 (+29.5%), 2 (+0.07%), 1 (+7.4%), 0 (+2.3%).

As mentioned in “Methodology”, instead of solving a 30-class problem, we split the OMT classification task into two separate problems, i.e., the motives and the levels classification problems. The obtained results are thus aligned with the OMT theory [11]: according to our experiments, it is possible to detect motives and levels separately, which supports the view that motives and levels are not directly connected. Nevertheless, some of the methods presented at GermEval 2020 did treat the OMT problem as a 30-class classification task [22], which suggests that the OMT theory should be revisited and compared against what the NLP community has found.

In addition to the previous analysis, and given that a common concern is the lack of transparency of many deep learning architectures, we analyze what the attention mechanism focuses on when solving the OMT task. The result of this analysis is shown in Table 12. The main intention of this type of analysis is to provide a better understanding of the connection between machine learning algorithms and language usage. To perform this analysis, we randomly selected 5 sample texts produced by the evaluated subjects in the dev partition. We show the results only for the distinct motives (A, F, L, M, 0). For a fair comparison, we selected textual samples belonging to the same level class; in this case, all samples belong to level 4.

Table. 11 Criteria for the highlight colors used in the visualization of the attention mechanism
Table. 12 Attention mechanism visualization for the OMT classification task

To visualize the attention mechanism, we extract the attention weights based on the [CLS] token from the last layer, average each token weight across all attention heads, and finally normalize the weights across all tokens so that each weight is a value between 0.0 (very low attention) and 1.0 (very high attention) (see footnote 14). The highlight criteria of the words are shown in Table 11.
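The three steps described above can be sketched as follows. The code operates on a plain nested list standing in for the last-layer attention tensor (heads × tokens × tokens) and makes two assumptions that are not specified in the text: the [CLS] token is at position 0 and acts as the query (its row is taken from each head), and the final normalization is min-max scaling. The function name and the toy weights are hypothetical.

```python
def cls_attention_scores(last_layer_attention):
    """Per-token attention scores in [0, 1] derived from the [CLS] row.

    last_layer_attention[h][i][j] is the weight that token i assigns to
    token j in head h. Token 0 is assumed to be [CLS].
    """
    num_heads = len(last_layer_attention)
    seq_len = len(last_layer_attention[0][0])

    # Steps 1 and 2: take the [CLS] row of every head and average over heads.
    averaged = [
        sum(head[0][j] for head in last_layer_attention) / num_heads
        for j in range(seq_len)
    ]

    # Step 3 (assumed min-max scaling): map weights onto [0.0, 1.0].
    lowest, highest = min(averaged), max(averaged)
    if highest == lowest:
        return [0.0] * seq_len
    return [(w - lowest) / (highest - lowest) for w in averaged]


# Toy tensor with 2 heads and 3 tokens ([CLS], token_1, token_2).
toy_attention = [
    [[0.5, 0.25, 0.25], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.25, 0.125, 0.625], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
]
print(cls_attention_scores(toy_attention))  # [0.75, 0.0, 1.0]
```

With min-max scaling, the least attended token always maps to 0.0 and the most attended one to 1.0, so the scores are comparable within a text but not across texts.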

Table 12 visualizes how the attention mechanism works in the Operant Motive Test classification task. Attention weights are extracted after fine-tuning BERT. As an important observation, notice the attention paid to function words: ist, und, an, der, das, zu, sie, sind, ein, einer, von (is, and, at, the, the, to, they, are, a, one, from). This indicates that, for the simple transformer architecture, writing style becomes more relevant when solving the classification task.

An additional observation is the attention paid to word endings. During the tokenization process, unknown words are split into smaller tokens. When this is the case, the prefix ‘##’ is added to the generated continuation tokens. Especially for motive L, we can notice many cases where tokens with the ‘##’ prefix receive attention from the simple transformer architecture. Notice also that negation words, e.g., nicht, are very important, as are some punctuation marks, e.g., ‘.’, ‘?’.
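For readers unfamiliar with sub-word tokenization, the following minimal sketch illustrates the greedy longest-match-first splitting used by WordPiece-style tokenizers, in which every piece after the first carries the ‘##’ continuation marker. The tiny vocabulary and the function name are hypothetical and chosen only for this example; the real BERT vocabulary contains tens of thousands of entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"mit", "##bekommen", "nicht", "weis", "##s"}
print(wordpiece("mitbekommen", toy_vocab))  # ['mit', '##bekommen']
print(wordpiece("weiss", toy_vocab))        # ['weis', '##s']
```

Because continuation pieces often coincide with inflectional endings, attention on ‘##’ tokens can be read as attention on word endings.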

Furthermore, in Fig. 6, we show how the top 25 terms with the highest attention values are used. For this analysis, we obtained the most important words, i.e., the words with the highest attention values, for each motive category (A, F, L, M). Then, to obtain these 25 words, we intersected the corresponding sets. The figure illustrates the relative frequency of each of these words according to the motive class. As can be seen in the figure, subjects from different categories use these words with different frequencies. This frequency analysis also explains the good performance of the SVM classifier, which is based on a traditional tf-idf vector representation. However, even though frequency counts help the SVM to accurately separate the classes, the context in which these words appear is also important, i.e., how users employ these lexical units is relevant for solving the task.
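The procedure behind this frequency analysis can be sketched as follows: rank the words of each motive category by attention, keep the top k per category, intersect the resulting sets, and compute the relative frequency of the shared words within a category. How a single attention value per word is aggregated is not detailed here, so the sketch simply assumes a precomputed word-to-score mapping per motive; all names and toy values are hypothetical.

```python
from collections import Counter


def top_k_words(attention_by_word, k):
    """Set of the k words with the highest attention value."""
    ranked = sorted(attention_by_word, key=attention_by_word.get, reverse=True)
    return set(ranked[:k])


def shared_top_words(attention_by_motive, k):
    """Words that are among the top k of every motive category."""
    sets = [top_k_words(scores, k) for scores in attention_by_motive.values()]
    return set.intersection(*sets)


def relative_frequency(tokens, vocabulary):
    """Share of all tokens of one category taken up by each shared word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: counts[word] / total for word in sorted(vocabulary)}


# Toy attention scores per motive (hypothetical values).
attention_by_motive = {
    "A": {"nicht": 0.9, "sind": 0.8, "haus": 0.1},
    "M": {"nicht": 0.7, "sind": 0.9, "chef": 0.2},
}
shared = shared_top_words(attention_by_motive, k=2)
print(sorted(shared))
print(relative_frequency("sie sind nicht allein".split(), shared))
```

Intersecting the per-category sets keeps only words that matter for every motive, which is what makes their frequencies comparable across classes.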

Fig. 6 Top 25 terms with the highest attention across MOTIVES categories

Accordingly, in Figs. 7 and 8, we illustrate the context in which the words nicht (not) and sind (are) are employed in our dataset. For this analysis, we took all the text generated by users from the same category (i.e., M, A, F, and L) and performed a collocation analysis with a fixed target word (in this case, nicht or sind). A collocation is a sequence of words that co-occur with high frequency within a corpus. Thus, to generate the visualization of each tree, we kept the most frequent collocations from each category. From this analysis, it is possible to observe that, even though these target words are frequently used by all subjects, the contexts employed by each category are very different from each other. For example, subjects labeled with motive M (power) use the word nicht in an imperative/control fashion, e.g., kommen nicht auf (do not come up), while subjects categorized with the A (affiliation) motive use it to show concern about others, e.g., sie nicht alline (they are not alone). Similarly, for the F (freedom) class, subjects use this negation to indicate concern about themselves, e.g., möchte nicht mitbekommen (don’t want to notice); and in the L (achievement) motive, the common context denotes insecurity, e.g., sie nicht weiss (she doesn’t know). Hence, this analysis helps to explain the good performance of transformer-based NN architectures, as the attention mechanism of BERT is able to learn contextual relations between words (and sub-words) from the input text. Notice that these findings are aligned with previous psychometrics research (see Table 2).
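A collocation analysis of this kind can be approximated by counting the word sequences that surround a fixed target word. The sketch below is one simple way to do so, assuming whitespace tokenization and a symmetric window of one word on each side; the function name, the window size, and the toy sentences are assumptions made for illustration and are not taken from the dataset or from the tool used to draw the trees.

```python
from collections import Counter


def collocations(texts, target, window=1, top_n=3):
    """Most frequent word sequences of `window` words around `target`."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                context = tokens[start:i + window + 1]
                counts[" ".join(context)] += 1
    return counts.most_common(top_n)


# Toy texts standing in for the answers of one motive category.
toy_texts = [
    "sie nicht weiss was",
    "er weiss nicht weiter",
    "sie nicht weiss warum",
]
print(collocations(toy_texts, "nicht", top_n=1))  # [('sie nicht weiss', 2)]
```

Running the same function separately on the texts of each motive category yields the per-category contexts that the trees display.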

This analysis provided interesting insights, which we can summarize as follows: the ST architecture pays higher attention to the use of personal pronouns, stop words, negation, punctuation marks, unknown words, and some conjugation styles, filtering out most of the unimportant elements such as content words. Beyond isolated words, the context in which these words appear provides important information to the transformer-based NN methods when detecting motives and levels. In other words, the writing style (how we write) is more relevant than content words (what we write) for solving the OMT classification task. Additionally, it is particularly interesting that negation words (nicht, and ##t, which corresponds to “–n’t” contractions) are frequently used in the power (M) and the freedom (F) motives. This finding is partially aligned with previous reports from psychological theory [53], where it has been shown that the so-called activity inhibition (AI) trait is mainly described as negations in combination with the power motive. Finally, our collocation analysis helped to understand and visualize the type of context that helps the transformer-based NN architecture to solve the problem more accurately. Overall, these findings could foster implicit psychometrics theory and, consequently, advanced aptitude diagnostics supported by NLP technologies.

Fig. 7 Contextual tree of the word nicht. Words in the leaves represent the most frequent words appearing next to the target word, in this case, nicht. Below each tree, its corresponding motive class is mentioned

Fig. 8 Contextual tree of the word sind. Words in the leaves represent the most frequent words appearing next to the target word, in this case, sind. Below each tree, its corresponding motive class is mentioned

The aptitude test (i.e., OMT) is a type of psychological test that could affect the subjects’ lives, especially if performed automatically without any human intervention [54]. Hence, there is an urgent need to understand how this type of automatic decision is made by recent machine learning technologies. As stated in [55], explainable methods are becoming more relevant, particularly in the health-care domain. Thus, it is necessary to consider many aspects when designing explainable ML methods, e.g., who is the domain expert and who are the affected users, among others [56, 57]. Accordingly, and as part of our future work, we plan to extend our interpretability analysis towards the design of responsible artificial intelligence algorithms (i.e., explainable and transparent) in the context of automated mental health analysis by applying some of the recommendations proposed in [56].


This paper represents a first step towards the analysis of recent NLP technologies for solving the OMT classification task. To this end, we performed a comparative analysis among state-of-the-art simple transformer-based architectures, e.g., BERT, XLM, and DistilBert, very recent generalization techniques such as supervised autoencoders, and traditional machine learning techniques. Notably, transformer-based methods exhibit the best empirical results, obtaining a relative improvement of 7.9% over the baseline suggested as part of the GermEval 2020 challenge [22]. We explored how the attention mechanism works in this particular task, and the obtained results revealed that features associated with the writing style are more important than content-based words. Some of these findings show strong connections to behavioral research on implicit psychometrics theory. For example, our analysis showed that negations are used in combination with the power motive, an observation that is supported by the research in [53]. As future work, we plan to evaluate the impact of hyperparameter tuning through optimization methods such as the Bayes optimizer [58], to evaluate the impact of early-fusion strategies on the performance of the SAE, and to perform further analysis of how the attention mechanism of the transformer architecture works in the OMT task.

Finally, we would like to emphasize the ethical necessity of carefully understanding the research being done at the intersection of NLP and psychology. Although NLP technologies indicate that solving this type of task is, to some extent, possible, further research needs to be conducted to carefully explain the relation between psychological tests and subjects’ aptitudes. The authors would like to clearly state that we are against the use of this type of technology to discriminate against people in any daily life situation. Even though we believe that this research is important, as it can be useful for professional psychologists, we do not agree with the claim that the NLP/ML community is able to accurately classify users according to their professional aptitudes and personality traits. We support the idea that this type of research can help to validate previous theories and can support mental health care practitioners in evaluating, or gaining important insights from, closed and controlled studies.