1 Introduction

A relatively new application field for natural language processing (NLP) is the prediction of psychological traits, which we call NLPsych. Computer systems in clinical psychology, massive amounts of textual interactions in social networks and the rise of blogs and online communities have made large amounts of data available. This availability has catalyzed research with NLP methods on psychological phenomena such as mental diseases, connections between intelligence and the use of language, or a data-driven understanding of dream language, to name but a few.

Possible applications range from detecting and monitoring the course of mental illnesses by analyzing language [1], through finding more objective measures and language clues for the subsequent academic success of college applicants [2], to discovering that dream narratives show highly complex cognitive processes [3]. Promising scenarios for future work could explore connections between personality traits or characteristics and subsequent development, or investigate the current emotional landscape of people through their use of language.

Even though the potential is high, NLPsych is a rather fragmented field. Results of included works vary in accuracy and often show room for improvement, whether by using best-practice methods, by shifting the research focus onto e.g. function words, by using data-driven methods or by combining established approaches. This survey provides an overview of some of the recent approaches, utilized data sources and methods, as well as findings and promising pointers. Furthermore, the aim is to hypothesize about possible connections between different findings in order to draw a ‘big picture’.

1.1 Research Questions

Even though the most broadly studied research questions target mental health due to the high relevance of findings in clinical psychology, this survey takes a broader view. Thus the following research questions, which have been derived from included work, address mental changes, cognitive performance and emotions. For each question, three exemplary works that address similar questions are mentioned as well. Important ethical considerations are out of scope of this paper (see e.g. dedicated workshops; Footnote 1).

Research Question (i) Does a change of the cognitive apparatus also change the use of language and if so, in what way? (e.g. [3, 4, 5])

Research Question (ii) Does the use of language correspond to cognitive performance and if so, which aspects of language are indicators? (e.g. [2, 6, 7])

Research Question (iii) Is a current mood or emotion detectable by the use of language besides explicit descriptions of the current mental state? (e.g. [1, 7, 8])

1.2 Structure of This Paper

Firstly, this survey aims to give an overview of some popular problem domains, with an idea of employed approaches and data sources, in Sect. 2 and of broadly utilized tools in Sect. 3. Widely employed measurements of different categories are discussed in Sect. 4. Section 5 describes utilized methods and is divided into two strands: on the one hand, data-driven approaches that use supervised machine learning and, on the other hand, statistics on manually defined features and so-called ‘inquiries’, i.e. counts over word lists and linguistic properties. Secondly, this work surveys parallels and connections between important findings in order to draw a ‘bigger picture’ in Sect. 6.

1.3 Target Audience

This paper targets an audience that is familiar with the basics of both natural language processing and psychology and has some experience in machine learning (ML), deep learning (DL) or the use of standard tools for those fields, even though some explanations will be provided.

1.4 Criteria for Inclusion

Criteria for the inclusion of surveyed work can be divided into three aspects. Firstly, often cited work and very influential findings were included. Secondly, the origin of a work was considered, such as well established associations, authors or journals. Lastly, the fit of the content with this survey’s focus in terms of methodology or topic was considered. A work has been included in this survey only if it meets at least one, if not all, of those aspects. E.g. work published by the Association for Computational Linguistics (ACL; Footnote 2), which targets NLP, was considered. Well established journals of different scientific fields such as Nature, which dedicates itself to natural science, were considered as well. Search queries included ‘lexical database’, ‘psychometric’, ‘dream language’, ‘psychology’, ‘mental’, ‘cognitive’ or ‘text’. Fit was judged in terms of topic (e.g. subsequent academic success, mental health prediction or dream language), as well as innovative and novel approaches.

2 Popular Problem Domains, Approaches and Data Sources

This section presents popular NLPsych problem domains and is ordered by descending popularity. Approaches will be briefly explained.

2.1 Mental Health

Mental health is the most common problem domain for approaches that use NLP to characterize psychological traits as some of the following works demonstrate.

Depression Detection Systems. Morales et al. [9] summarized different depression detection systems in their survey and showed that this emerging field of research has matured. Those depression detection systems often are linked to language and therefore have gained popularity in NLP for clinical psychology. Morales et al. [9] described and analyzed utilized data sources as well. The Distress Analysis Interview Corpus (DAIC; Footnote 3) offers audio and video recordings of clinical interviews on depression along with written transcripts. It thus is less suitable for approaches that focus solely on textual data but can be promising when visual and speech processing are included. The DementiaBank database offers different multimedia entries on the topic of clinical dementia research from 1983 to 1988. The ReachOut Triage Shared Task dataset from the SemEval 2004 Task 7 consists of more than 64,000 written forum posts and was fully labeled for containing signs of depression. Lastly, Crisis Text Line (Footnote 4) is a support service, which can be freely used by mentally troubled individuals in order to correspond textually with professionally trained counselors. The collected and anonymized data can be utilized for research.

Suicide Attempts. In their more recent work, Coppersmith et al. [10] investigated mental health indirectly by analyzing social media behavior prior to suicide attempts on Twitter. Twitter (Footnote 5) is a social network, news and microblogging service that allows registered users to post so-called tweets, which were limited to 140 characters before November 2017 and to 280 characters after that date. As before in [11], the Twitter users under observation had publicly self-reported their condition or attempt.

Crisis. Besides depression, anxiety or suicide attempts, there are more general crises as well, which Kshirsagar et al. [12] detect and attempt to explain. For their work they used a specialized social network named Koko and a combination of neural and non-neural techniques in order to build classification models. Koko (Footnote 6) is an anonymous emotional peer-to-peer support network. The dataset originated from a clinical study at MIT, and the service can be implemented as a chatbot. It offers 106,000 labeled posts, some with and some without crisis. A test set of 1,242 posts included 200 crisis labeled entries, i.e. approximately 16%.

Reddit (Footnote 7) is a community for social news rather than plain text posts and offers many so-called sub-reddits, which are sub-forums dedicated to certain, well defined topics. Those sub-reddits allow researchers to collect data purposefully. Shen et al. [13] detected anxiety on Reddit by using depression lexicons and training Support Vector Machine (SVM, Cortes et al. [14]) classifiers, as well as Latent Dirichlet Allocation (LDA, Blei et al. [15]) for topic modeling (for LDA see Sect. 3). Those lexicons offer broad terms that can be combined with e.g. Linguistic Inquiry and Word Count (LIWC, Pennebaker et al. [16]) features in order to identify different conditions and to distinguish between those mental health issues. Shen et al. [13] used an API offered by Reddit in order to access sub-reddits such as r/anxiety or r/panicparty.

Dementia. In their recent work, Masrani et al. [17] used six different blogs to detect dementia with different classification approaches. Among the features, the lexical diversity of language was the most promising.
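Lexical diversity can be quantified in several ways, and the summary above does not state which measure Masrani et al. [17] relied on. As a minimal sketch, the following Python snippet computes the type-token ratio (TTR), one common diversity measure. The tokenizer, the function name and the example sentences are assumptions made for illustration only.

```python
import re

def type_token_ratio(text):
    """Return distinct word types divided by total word tokens (0.0 for empty text)."""
    # Assumption: a simple lowercase regular-expression tokenizer is sufficient here.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented example sentences: one with varied and one with repetitive vocabulary.
varied = "the gardener pruned roses while sparrows circled overhead"
repetitive = "the thing is the thing is the thing"

ttr_varied = type_token_ratio(varied)
ttr_repetitive = type_token_ratio(repetitive)
print(ttr_varied, ttr_repetitive)
```

A lower ratio indicates a more repetitive vocabulary. Note that the TTR depends on text length, so it should only be compared across texts of similar size.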

Multiple Mental Health Conditions. Coppersmith et al. [11] researched the detection of a broad range of mental health conditions on Twitter. They targeted how well the language characteristics of the following conditions can be discriminated: attention deficit hyperactivity disorder (ADHD), anxiety, bipolar disorder, borderline syndrome, depression, eating disorders, obsessive-compulsive disorder (OCD), post traumatic stress disorder (PTSD), schizophrenia and seasonal affective disorder (SAD), all of which were self-reported by Twitter users.

2.2 Language in Dream Narratives

Dream Language. Niederhoffer et al. [3] researched the general language of dreams from a data-driven perspective. Their main targets are linguistic styles, differences between waking narratives and dream narratives, as well as the emotional content of dreams. In order to achieve this, they used a community named DreamsCloud. DreamsCloud (Footnote 8) is a social network community dedicated to sharing dreams in a narrative way, which also offers the use of data for research purposes. There are social functions such as ‘liking’ a dream narrative or commenting on it, as Niederhoffer et al. [3] describe in their work. There are more than 119,000 dream narratives from 74,000 users, which makes this network one of the largest of its kind. Since DreamsCloud is highly specialized, issues such as relevance or authenticity are less crucial than they would be on social networks like Facebook (Footnote 9).

LIWC and Personality Traits. Hawkins et al. [18] focused especially on LIWC characteristics and their correlation with the personality of a dreamer. Data was collected in clinical studies in which Hawkins et al. [18] gathered dream reports from voluntary participants. Their work is more thorough in terms of length, depth and number of conducted experiments on LIWC features. Dreams could be distinguished from waking narratives, but, as of said study, correlations with personality traits could not be found.

2.3 Mental Changes

As shown in Sect. 6, mental changes and mental health problems seem to be connected. However, natural changes such as growth or life-changing experiences can alter the use of language as well.

Data Generation and Life-Changing Events. Oak et al. [19] pointed out that the availability of data in clinical psychology often is a difficulty for researchers. The application scenario chosen for their study on data generation for clinical psychology is life-changing events, for which Oak et al. [19] aimed to use NLP to generate tweets. The BLEU score measures n-gram precision, which can be important for next character or next word predictions, as well as for classification tasks. Another use case of this measure is evaluating the quality of machine translations. Oak et al. [19] used the BLEU score to evaluate the quality of the n-grams produced by their data generation approach. Even though the generated data would not be appropriate for e.g. classification tasks, Oak et al. [19] nonetheless proposed useful application scenarios such as virtual group therapies. 43% of human annotators thought the generated data to be written by real Twitter users.
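To make the notion of n-gram precision concrete, the following Python sketch computes the clipped (modified) n-gram precision that underlies BLEU for one candidate and one reference. It is a simplified illustration and not the evaluation setup of Oak et al. [19]: the full BLEU score additionally combines several n-gram orders and applies a brevity penalty, both of which are omitted here. The function names and example sentences are invented.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

# Invented example: a generated tweet compared with a human-written one.
reference = "i just got married to my best friend".split()
candidate = "i just got married to my friend".split()

p1 = modified_precision(candidate, reference, 1)  # unigram precision
p2 = modified_precision(candidate, reference, 2)  # bigram precision
print(p1, p2)
```

Here every candidate unigram occurs in the reference, while the bigram "my friend" does not, so the bigram precision is lower than the unigram precision.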

Changing Language Over the Course of Mental Illnesses. A study by Reece et al. [1] revealed that language can be a key for detecting and monitoring the whole process from the onset of a mental illness to a peak and a decline as therapy shows positive effects on patients. Participants involved in the study had to prove their medical diagnosis and supply their Twitter history. Different techniques were used to survey language changes, and MTurk was used for labeling the data. Reece et al. [1] were able to show a correlation between language changes and the course of a mental disease. Furthermore, their model achieved high accuracy in classifying mental diseases throughout the course of illness.

Language Decline Through Dementia and Alzheimer’s. It is known that cognitive capabilities decline during the course of dementia. Masrani et al. [17] were able to show that language declines as well. Lancashire et al. [20] researched the possibility of detecting Alzheimer’s disease in the writer Agatha Christie by analyzing novels written at different life stages from age 34 to 82. The first 50,000 words of included novels were analyzed with a tool named TACT, which operates comparably to LIWC (shown in Sect. 3), and showed a decline in language complexity and diversity. During their research, Masrani et al. [17] detected dementia by including blogs from medically diagnosed bloggers with and without dementia. Self-reported mental conditions, as they are often used for research on social networks, are at risk of being incorrect (e.g. pranks, exaggeration or inexperience).

Development. Goodman et al. [8] showed that the acquisition and comprehension of words and lexical categories during the process of growth correspond with frequencies of parental usage, depending on the age of a child. Whilst the acquisition of lexical categories and comprehension of words correlates with the frequency of word usage of parents later on in life, simple nouns are acquired earlier. Thus, whether words were more comprehensible was dependent on known categories and a matter of similarity by the children.

2.4 Motivation and Emotion

Emotions and motivations are less common problem domains. Some approaches aim to detect general emotions, others focus on strongly emotional language such as hate speech, and still others try to provide valuable resources or access to data.

Distant Emotion Detection. In order to better understand the emotionality of written content, Pool et al. [21] used emotional reactions of Facebook users as labels for classification. Facebook offers insightful social measurements such as richer reactions to posts (called emoticons) or numbers of friends, even though most available data is rather general.

Hate Speech. Serrà et al. [4] approached the question of emotional social network posts by surveying the characteristics of hate speech. Since hate speech usually contains many neologisms, spelling mistakes and out-of-vocabulary words (OOV), Serrà et al. [4] constructed a two-tier classification that firstly predicts next characters and secondly measures distances between expectation and reality. Other works on hate speech include [22, 23, 24].

Motivational Dataset. Since data sources for some sub-domains such as motivation are sparse, Pérez-Rosas et al. [25] contributed a novel motivational interviewing (MI) dataset that includes 22,719 utterances from 227 distinct sessions, conducted by 10 counselors. Amazon Mechanical Turk (MTurk) is a crowdsourcing service on which researchers can define manual tasks and quality criteria. Pérez-Rosas et al. [25] used MTurk to have their short texts labeled by crowd workers. They achieved a high Intraclass Correlation Coefficient (ICC) of up to 0.95. MI is a technique in which the topic ‘change’ is the main object of study. Thus, as described in Subsect. 6.3, this dataset could also contribute to early mental disease detection. MI is mainly used for treating drug abuse, behavioral issues, anxiety or depression.

Emotions. Pool et al. [21] summarized in their section on emotional datasets some highly specialized databases on emotions, which the authors analyzed thoroughly. The International Survey on Emotion Antecedents and Reactions (ISEAR; Footnote 10) dataset offers 7,665 labeled sentences from 3,000 respondents on the emotions of joy, fear, anger, sadness, disgust, shame and guilt. Different cultural backgrounds are included. The Fairy Tales (Footnote 11) dataset includes the emotional categories angry, disgusted, fearful, happy, sad, surprised and has 1,000 sentences from fairy tales as the data basis. Since fairy tales usually are written with the intention to trigger certain emotions of readers or listeners, this dataset promises potential for researchers. The Affective Text (Footnote 12) dataset covers news sites such as Google News, NYT, BBC, CNN and was composed for the SemEval 2007 Task 14. It offers a database with 250 annotated headlines on emotions including anger, disgust, fear, joy, sadness and surprise.

2.5 Academic Success

Few researchers in NLPsych have approached a connection between language and academic success. Some challenges are the lack of data and heavy biases, as some might assume that an eloquent vocabulary, few spelling mistakes or a sophisticated use of grammar indicate a cognitively skilled writer. Pennebaker et al. [2] approached the subject in a data-driven and therefore less biased fashion. Data was collected by accessing more than 50,000 admission essays from more than 25,000 applicants. The college admission essays could be labeled with later academic success indicators such as grades. The study showed that rather small words such as function words correlate with subsequent success, even across different majors and fields of study. Function words (also called closed class words) are e.g. pronouns, conjunctions or auxiliary verbs, which tend not to be open for expansion, whilst open class words such as nouns can be added during productive language evolution.

3 Tools

In this section we discuss some broadly used tools for accessing mainly written psychological data. Included frameworks are limited to the programming language Python, since it is well established – especially for scientific computing – and included works mostly use libraries and frameworks designed for Python.

3.1 Word Lists

LIWC. The Linguistic Inquiry and Word Count (LIWC) was developed by Pennebaker et al. [16] for the English language and has been transferred to other languages such as German by Wolf et al. [26]. The tool was psychometrically validated and can be considered a standard in the field. LIWC operates with dictionaries of word lists and computes a vector of approximately 96 metrics (depending on the version and language), such as the number of pronouns or the number of words associated with familiarity, by counting matches in input texts.
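The counting principle behind such word list tools can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the three categories and their word lists are invented toy data, the tokenizer is a simple regular expression, and the function name is hypothetical. The real LIWC dictionaries are far larger and are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical toy word lists; these are NOT the actual LIWC dictionaries.
CATEGORIES = {
    "pronoun": {"i", "me", "my", "you", "we", "they"},
    "negemo": {"sad", "afraid", "angry"},
    "family": {"mother", "father", "sister", "brother"},
}

def category_rates(text):
    """Return, per category, the share of tokens that match its word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in CATEGORIES.items():
            if token in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {name: counts[name] / total for name in CATEGORIES}

rates = category_rates("I was afraid and my mother was sad")
print(rates)
```

The resulting dictionary plays the role of the feature vector: each entry is a relative frequency that can be analyzed statistically or passed on to a classifier.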

CELEX (Footnote 13) is a lexical database that was developed by Baayen et al. [27] and later on enhanced to CELEX release 2. The database contains 52,446 lemmas and 160,594 word forms in English and a number of those in Dutch and German as well. It is regularly used by researchers. For example, Fine et al. [28] used CELEX to research possible induced biases in corpora for predicting human language by measuring proportions of written and spoken English based on CELEX entries.

Kshirsagar et al. [12] used the Affective Norms for English Words (ANEW), which is an inquiry tool like LIWC. Reece et al. [1] used labMT, which is a word list score for sentiment analysis.

3.2 Corpus-Induced Models

LDA. Blei et al. [15] developed a broadly used generative probabilistic model called Latent Dirichlet Allocation (LDA), which models collections of text with a three-level hierarchical Bayesian model in which each document is represented as a mixture over underlying topics.

SRILM. The SRI Language Modeling toolkit (SRILM), produced by Stolcke [29], is a software package that consists of C++ libraries, programs and scripts for building and applying statistical language models, mainly for speech recognition but also for other applications such as text. Oak et al. [19] used SRILM for 4-gram modeling for language generation of life-changing events.

3.3 Frameworks

NLTK. The Natural Language Toolkit (NLTK) is a library for Python that offers functionality for language processing, e.g. tokenization or part-of-speech (POS) tagging. It is a general-purpose toolkit; e.g. Shen et al. [13] use NLTK for POS tagging and collocation extraction.

Scikit-Learn. The tool of choice of Pool et al. [21] and Shen et al. [13] was scikit-learn [30], a freely available and open sourced library for Python. Since scikit-learn is designed to be compatible with other numerical libraries, it can be considered one of the main libraries for machine learning in the field of natural language processing.

3.4 Further Tools

Further tools that are being used in included work in some places are the cross-linguistic lexical norms database (CLEX) [31] for evaluating and comparing early child language, the Berlin affective word list reloaded (BAWL-R) [32] that is based on the previous version of BAWL for researching affective words in the German language and lastly HMMlearn, used by Reece et al. [1], which is a Python library for Hidden Markov Models (HMM).

4 Psychometric Measures

When conducting research on NLPsych, the selection of psychometric measurements is crucial for evaluating given data on psychological effects or for detecting the presence of target conditions before a classification task can be set up. Therefore, in this section we describe broadly used psychometric measurements of included work. Psychometrics can be understood as a discipline of psychology, usually found in clinical psychology, that focuses on ‘testing and measuring mental and psychology ability, efficiency potentials and functions’ [33].

There are measures for machine learning as well, such as the accuracy score, which will not be covered since standard literature on this matter is broadly available.

4.1 Questionnaires

BDI and HAM-D. The Beck Depression Inventory (BDI) and the Hamilton Rating Scale for Depression (HAM-D) are used and described by Morales et al. [9] and Reece et al. [1] for measuring the severity of depression. The HAM-D is a clinician-administered questionnaire that consists of 21 questions, whilst the BDI likewise consists of 21 questions but is self-reported.

CES-D. The Center for Epidemiologic Studies Depression Scale (CES-D) is a questionnaire with which participants keep track of their depression level; it has been used by Reece et al. [1].

4.2 Wordlist Measurements

MITI. The Motivational Interviewing Treatment Integrity score (MITI) measures how well or poorly a clinician is using MI (motivational interviewing), as Pérez-Rosas et al. [25] described in their work. Processes related to ‘change talk’, i.e. the topical focus, are the crucial part of this measurement. The measure is composed of global counts and behavior counts. Words that encode the MITI level are e.g. ‘focus’, ‘change’, ‘planning’ or ‘engagement’.

CDI. The Categorical Dynamic Index (CDI), used by Niederhoffer et al. [3], Jørgensen et al. [31], as well as Pennebaker et al. [2], is described as a bipolar continuum, applicable to any text, that measures the extent to which thinking is categorical or dynamic. Since those two dimensions are said to distinguish between cognitive styles of thinking, the CDI can reveal e.g. whether or not dreamers are the main character of their own dream [3]. The CDI can be measured by analyzing language with tools such as LIWC and weighting the categories.

5 Broadly Used Research Methods

Since there are two main approaches for performing NLPsych, namely data-driven approaches and manual approaches from clinical psychology, this section is divided into those two strands. After a general setup, it covers supervised machine learning approaches (Subsect. 5.2) and then features for characterizing text (Subsect. 5.3). Within those strands, the methods are ordered by their complexity.

5.1 A General Setup of NLPsych

Even though approaches of included works differ in their details, there is a basic schema in the way NLPsych is set up. Figure 1 illustrates a classification setup. Firstly, after having collected data, pieces of information are read and function as input. Different measures or techniques can be applied to the data by an annotator to assign labels to the input. Whether or not annotation takes place depends on the task and origin of the data.

Secondly, after separating training, test, and sometimes development sets, features get extracted from those data items, e.g. LIWC category counts, the ANEW sadness score or POS tags. A feature extractor computes a nominal or numerical feature vector, which will be described in Subsect. 5.3.

Thirdly, depending on the approach, this feature vector is directly used in rule based models such as defined LIWC scores that correlate with dream aspects, as Niederhoffer et al. [3] did. A different approach feeds the feature vector to a machine learning algorithm in order to compute a classifier model that thereafter can be used to classify new instances of information, as Reece et al. [1] demonstrated in their work.

Finally, for both of the approaches, the accuracy of the classification task is determined and researchers analyze and discuss the consequences of their findings, as well as use the models for classification tasks.

Fig. 1. A general setup for classification tasks in NLPsych
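The four steps of this general setup can be sketched as a small, dependency-free Python pipeline. Everything in it is an assumption made for illustration: the texts and labels are invented, the word list is hypothetical, and the "model" is a single threshold learned from the training items rather than any classifier used in the included works.

```python
import re

NEGATIVE = {"sad", "hopeless", "tired", "alone"}  # hypothetical word list

def extract_features(text):
    """Map a text to a numerical feature vector (here: one word-list rate)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1
    return [sum(t in NEGATIVE for t in tokens) / total]

def fit_threshold(examples):
    """Learn a decision threshold as the midpoint of the two class means."""
    pos = [extract_features(t)[0] for t, y in examples if y == 1]
    neg = [extract_features(t)[0] for t, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def classify(features, threshold):
    return 1 if features[0] > threshold else 0

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Step 1: collected and annotated data (invented; 1 = target condition).
data = [
    ("i feel sad and alone today", 1),
    ("we went hiking and had lunch", 0),
    ("my days are hopeless and i am tired", 1),
    ("the meeting starts at noon", 0),
]
# Step 2: separate training and test items and extract features.
train, test = data[:2], data[2:]
# Step 3: compute a model from the training items and apply it to new items.
threshold = fit_threshold(train)
predictions = [classify(extract_features(text), threshold) for text, _ in test]
# Step 4: determine the accuracy on the held-out test items.
test_accuracy = accuracy(predictions, [label for _, label in test])
print(predictions, test_accuracy)
```

In a rule based variant of the same setup, the threshold would be defined manually instead of being learned from the training items.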

5.2 Supervised Machine Learning Approaches

SVM. Support Vector Machines (SVM) are a type of machine learning algorithm that separates instances of different categories by a dividing gap (margin) that is as wide as possible. The so-called support vectors are the training instances that lie closest to this gap and thereby define it. New instances are classified according to the side of the gap on which they fall; the approach can be used for classification as well as regression tasks. This broadly utilized standard method has been used by e.g. Pool et al. [21] for BOW models via scikit-learn.
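The margin idea can be illustrated with a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. This is a hand-written, dependency-free sketch and not the scikit-learn implementation used in the included works; the hyperparameters, function names and the two-dimensional toy data are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Instance lies inside the margin or on the wrong side:
                # move the boundary so that the instance is separated more clearly.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Instance is classified with sufficient margin: only regularize.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Invented two-dimensional feature vectors with labels +1 / -1.
X = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
train_predictions = [predict(w, b, x) for x in X]
print(train_predictions)
```

Only instances with a margin below 1 change the weights, which mirrors the role of support vectors: instances far from the boundary do not influence its position.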

HMM. Reece et al. [1] used Hidden Markov Models (HMM), which are probabilistic models of sequences generated by states that are not directly observable, as well as word shift graphs that visualize changes in the use of language [1].
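How an HMM assigns a probability to an observed sequence without knowing the hidden states can be shown with the forward algorithm. The sketch below is a generic textbook formulation in plain Python, not the hmmlearn-based model of Reece et al. [1]; the two hidden states, the observation symbols and all probabilities are invented for illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summing over all hidden state paths."""
    # Initialization: probability of starting in each state and emitting the first symbol.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Recursion: extend all paths by one time step per observation.
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model: hidden states emit observable word classes.
states = ("healthy", "depressed")
start_p = {"healthy": 0.7, "depressed": 0.3}
trans_p = {
    "healthy": {"healthy": 0.8, "depressed": 0.2},
    "depressed": {"healthy": 0.3, "depressed": 0.7},
}
emit_p = {
    "healthy": {"positive": 0.6, "negative": 0.4},
    "depressed": {"positive": 0.2, "negative": 0.8},
}

likelihood = forward(["negative", "negative"], states, start_p, trans_p, emit_p)
print(likelihood)
```

The hidden states never appear in the input; only the observation sequence does, which is what makes the model suitable for conditions that are not directly observable in text.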

RNN. Recurrent Neural Networks (RNN) are an architecture of deep neural networks that differs from feed forward neural networks by having time-delayed connections to cells of the same layer and thus possesses a so-called memory. RNNs require the input to be numeric feature vectors. Words or sentences typically get transformed into numerical representations by embedding methods (e.g. [34]). Among the authors that use RNNs are Cho et al. [35], who used an encoder and a decoder in order to maximize the conditional probabilities of representations. Kshirsagar et al. [12] used RNNs for word embeddings and Serrà et al. [4] trained character based language models with RNNs.

LSTM. A Long short-term memory neural network (LSTM, Hochreiter et al. [36]) is a type of RNN in which three gates (input, forget and output) in an inner, so-called memory cell are employed to learn the amount of retained memory depending on the input and the inner state. LSTMs are capable of saving information over arbitrary steps, thus enabling them to remember a short past for sophisticated reasoning. LSTMs nowadays are the method of choice for classification on sequences and can be considered an established standard. Long short-term memory neural networks often are used when computing power as well as large amounts of data are available and a memory is needed to train precise models. The latter often is the case when working with psychological data. E.g. Oak et al. [19] used an LSTM for training language models for language production of life-changing events.
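The interplay of the three gates can be made tangible with a single LSTM step on scalar values. The following Python sketch uses the commonly taught gate formulation with untrained, hand-picked weights; the parameter layout and all numbers are assumptions and do not reproduce any model from the included works.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))    # how much new information enters the cell
    f = sigmoid(pre("forget"))   # how much of the old cell state is retained
    o = sigmoid(pre("output"))   # how much of the cell state is exposed
    g = math.tanh(pre("candidate"))
    c = f * c_prev + i * g       # updated memory cell
    h = o * math.tanh(c)         # updated hidden state
    return h, c

# Hand-picked (untrained) weights: forget gate almost open, input gate almost closed.
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5, -1.0, 2.0] * 10:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With these settings the cell state stays close to its initial value over 30 steps, regardless of the inputs. In a trained network the gate weights are learned, so that the amount of retained memory depends on the input and the inner state.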

5.3 Features for Characterizing Text

Features serve as characteristics of texts and are always computable for every text, e.g. the average rate of words per sentence. Some of said features are numerical, some are nominal. Those features usually are stored in a feature vector that serves as input for classifiers but can be used directly, e.g. in order to perform statistics on them and to draw conclusions. Not every presented feature is being used as such. On the one hand, LIWC, tagging and BOWs are used as characteristics of text and thus are classically used as features. On the other hand, LDA targets the data collection process and n-grams, CLMs, as well as next character predictions can be utilized for modeling.

LIWC. In Sect. 3 the LIWC is described as a set of categories for which word lists were collected. The core dictionary and tool with its capability of calculating a feature vector for language modeling is well established and can be categorized as method of choice in psychological language inquiry. LIWC is used in a similar way throughout included works; however, researchers usually focus on some selected aspects of the feature vector in order to grasp psychological effects. Coppersmith et al. [11] used LIWC for differentiating the use of language of healthy people versus people with mental conditions and diseases. Hawkins et al. [18] and Niederhoffer et al. [3] researched the language landscape of dream narratives. Scores such as the LIWC sadness score were the basis of the work of Homan et al. [5] on depression symptoms. Morales et al. [9] also surveyed the broad use of LIWC in depression detection systems. Pennebaker et al. [2], who partly developed LIWC, used the tool to research word usage in connection with college admission essays. Reece et al. [1] captured the general mood of participants by using LIWC and Shen et al. [13] surveyed the language of a crisis with LIWC.

LDA. Latent Dirichlet Allocation (LDA) is a probabilistic model of text corpora that is based on underlying topics in a three-level Bayesian model, as described in Sect. 3. Among the researchers that used LDA are Niederhoffer et al. [3], who applied topic modeling in order to explore the main themes of given texts, and Shen et al. [13], who used LDA to predict class membership from topics.

BOW. Bag of words (BOW) models, sometimes called vector space models, intentionally dismiss information on the order of text segments or tokens, and thus e.g. grammar, by only taking into account which word types occur in a text and how often. Usually, BOW models are used for document representation where neither the order nor the grammar of tokens is crucial but rather their frequency. Shen et al. [13] use so-called continuous bag-of-word models (CBOW, [34]) with a window size of 5 in order to create word embeddings. Homan et al. [5], Kshirsagar et al. [12] and Serrà et al. [4] use BOW for embedding purposes. Tf-idf is a measure of relevance that weights the term frequency (tf) of a word by its inverse document frequency (idf) on the basis of said BOW models [12].
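A BOW representation and a tf-idf weighting can be sketched in plain Python as follows. Several tf-idf variants exist; the unsmoothed form tf * log(N / df) used here is an assumption chosen for brevity and differs from the smoothed defaults of libraries such as scikit-learn. The documents and function names are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Token counts; word order is intentionally discarded."""
    return Counter(text.lower().split())

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        length = sum(bag.values())
        weights.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights

# Invented mini corpus.
docs = ["i feel sad", "i feel fine", "i am here"]
weights = tf_idf(docs)
print(weights[0])
```

The word "i" occurs in every document and therefore receives a weight of zero, while "sad" occurs in only one document and receives the highest weight in the first document.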

Part of Speech Tagging (POS). POS tagging assigns word class information (e.g. noun or verb) to the segmented or tokenized parts of a text. Those tags can be used as labels and hence as additional information for e.g. classification tasks. Some authors that used tagging were Masrani et al. [17] and Reece et al. [1].

N-grams. A contiguous sequence of n tokens of a text is called an n-gram. The higher the chosen n, the more precise language models on the basis of n-grams can be for e.g. classification or language production, while training becomes more expensive with higher n. Some of the authors that use either word-based n-grams or character based n-grams are Kshirsagar et al. [12], Homan et al. [5], Oak et al. [19], Reece et al. [1] and Shen et al. [13].
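Extracting n-grams is straightforward, as the following minimal Python sketch shows for both word-based and character based n-grams. The function name and the example sentence are invented for illustration.

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

tokens = "i could not sleep last night".split()
word_bigrams = ngrams(tokens, 2)      # word-based n-grams with n = 2
char_trigrams = ngrams("sleep", 3)    # character based n-grams with n = 3
print(word_bigrams)
print(char_trigrams)
```

The same function works on token lists and on strings, since both are sequences. The number of possible distinct n-grams grows exponentially with n, which is one reason why training becomes more expensive for higher n.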

CLM. A Character n-gram Language Model (CLM) is closely related to n-grams and denotes language models that use n-gram frequencies of letters for probabilistic modeling. Coppersmith et al. [11] used a CLM to model emotions on the basis of character sequences.

Next character prediction is the prediction of the character that follows a given character sequence on the basis of probabilistic language models. Serrà et al. [4] used it to compare the expected use of language with actual language usage in order to detect hate speech.
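The idea of comparing expected with actual language use can be sketched as follows. The snippet is a deliberately simple stand-in: it predicts the next character from counts of what followed each two-character context in training data and reports how often the prediction misses the actual character. The count-based predictor, the function names and the toy texts are assumptions; they do not reproduce the trained models of Serrà et al. [4].

```python
from collections import Counter, defaultdict


def train_predictor(texts, order=2):
    """Count which character follows each context of `order` characters."""
    table = defaultdict(Counter)
    for text in texts:
        for i in range(order, len(text)):
            table[text[i - order:i]][text[i]] += 1
    return table


def predict_next(table, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = table.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def mismatch_rate(table, text, order=2):
    """Fraction of positions where the prediction differs from the actual character."""
    positions = range(order, len(text))
    if not positions:
        return 0.0
    misses = sum(
        predict_next(table, text[i - order:i]) != text[i] for i in positions
    )
    return misses / len(positions)


# Toy data, invented for illustration only.
neutral = train_predictor(["have a nice day", "have a nice week"])
features = [
    mismatch_rate(neutral, "have a nice day"),
    mismatch_rate(neutral, "h8 u all"),
]
```

With one predictor per class, the per-class mismatch rates of a text could serve as input features for a downstream classifier, which mirrors the two-tier setup in spirit only.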

Table 1. Overview of included works.

6 Findings from Included Works

In the following, we focus firstly on some important findings of the included works with regard to the research questions, and secondly on providing a ‘big picture’ of a possible general connection between language and cognitive processes. An overview of the problem domains (without the approaches and data sources, as they are task specific), tools, psychometric measures and research methods can be found in Table 1.

6.1 Language and Emotions

Hate speech detection has been a popular task ever since the discussion of verbal abuse on social networks began to dominate headlines [4]. Hate speech is especially prone to neologisms, out-of-vocabulary (OOV) words and a lot of noise in the form of spelling and grammar mistakes. Furthermore, a known vocabulary of words that can be considered part of hate speech becomes outdated rapidly. Serrà et al. [4] proposed a promising two-tier approach: they trained next character prediction models for each class and then trained a neural network classifier that takes said class models as input in order to measure the distance between expectation and reality. They achieved an accuracy of 0.951. This suggests that, in order to detect hate speech, it is more important to focus on how people alter their use of language than on the particular words they use.

Dreams. Niederhoffer et al. [3] researched dream language by analyzing the content with an LDA topic model [15], categorizing emotions by the emotional classification model [10] and capturing linguistic style via LIWC [16]. Dreams can be described as narratives that predominantly describe past events from a first person point of view via first person pronouns, with particular attention to people, locations and sensations (e.g. hearing, seeing, the perceptual process of feeling). Since those dream narrations often go beyond what the dreamers can explain (e.g. physical laws that differ from those of the observable world), complex cognitive processes can be assumed. Since lexical categories reveal those connections, it can be concluded that how people express their dreams is more important than what they state.

Distress. In order to detect distress on Twitter, Homan et al. [5] used the so-called sadness score from LIWC together with keywords and could show a direct link to the distress and anxiety of Twitter users. Homan et al. [5] also analyzed the importance of expert annotators and showed that their classifier, trained with expert annotator labels, achieved an F-score of 0.64. This direct link adds to the impression that the way people express themselves is connected to cognitive processes.

Taken together, the above findings and their direct conclusions answer research question (i) on the connection between emotions and language in the affirmative.

6.2 Cognitive Performance and Language

Works on subsequent academic success are often subject to strong biases, such as the intuition that spelling mistakes indicate cognitive performance. The study of language and context undertaken by Pennebaker et al. [2] indirectly tackles those biases, as it targets a connection between the use of language and subsequent academic success by investigating college admission essays with LIWC. Pennebaker et al. [2] could show that cognitive potential was not connected to what applicants expressed but rather to how they expressed themselves in terms of closed class words such as pronouns, articles, prepositions, conjunctions, auxiliary verbs or negations. However, the correlations, measured in each of the four years of college, ranged from \(r = 0.18\) to \(r = 0.2\), which is significant but not very high. The second research question (ii) targets a connection between cognitive performance and the use of language. Closed class words such as function words have shown a connection with subsequent academic success. Therefore, research question (ii) can be answered in the affirmative: cognitive performance can be connected with the use of language.
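To make the notion of measuring closed class words concrete, the following Python sketch computes the share of tokens in a text that are function words. The word list is a small, hand-picked illustration and is not the LIWC dictionary; the function name and the tokenization are likewise assumptions.

```python
# Illustrative closed class word list; NOT the LIWC dictionary.
FUNCTION_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",  # pronouns
    "a", "an", "the",                             # articles
    "in", "on", "of", "to", "with",               # prepositions
    "and", "or", "but",                           # conjunctions
    "is", "are", "was", "have", "do",             # auxiliary verbs
    "not", "no", "never",                         # negations
}


def function_word_rate(text):
    """Share of tokens in `text` that belong to FUNCTION_WORDS."""
    # Strip surrounding punctuation and lowercase each whitespace token.
    tokens = [tok.strip(".,;:!?\"'()").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return 0.0
    return sum(tok in FUNCTION_WORDS for tok in tokens) / len(tokens)


rate = function_word_rate("The cat is on the mat.")
```

Such a rate depends only on how a text is phrased, not on its topic, which is why it can be computed for essays on arbitrary subjects.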

6.3 Changing Language and Cognitive Processes

As Goodman et al. [7] pointed out, many phenomena in natural language processing such as implication, vagueness and non-literal language are difficult to detect. Some aspects of the use of language even stay unnoticed by speakers themselves: at times, the use of language on social media platforms indicates early-stage physical or mental health conditions, even when the speakers themselves are not yet aware of the health decline [1]. This underlines the importance of early detection via the use of language. By using aspects of informed speakers and game theory, Goodman et al. [7] achieved a correlation of \(r = 0.87\).

Reece et al. [1] were able to detect an early onset of dementia through tweets (Twitter posts) up to nine months before the official diagnosis of participants was made (F1 \(= 0.651\)). Moreover, the word shift graphs of Hidden Markov Models (HMM) on time series in a sliding window could show the course of the disease, from early changes in language to stronger changes and a diagnosis, up to a normalization of language use as the condition was treated. This change was detected by the labMT happiness score, a dictionary-based sentiment measurement tool similar to LIWC. Thus, the connection between mental changes and the use of language, the subject of research question (iii), can be confirmed as well.
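The general idea of tracking a dictionary-based score over time can be sketched as follows. The snippet averages per-post scores over a sliding window of time-ordered posts. It is a simplified illustration only: it implements neither the HMM nor the word shift graphs, and the word scores are hypothetical values on a 1 to 9 scale, not the actual labMT entries.

```python
from statistics import mean

# Hypothetical word scores on a 1-9 scale; NOT the actual labMT values.
HAPPINESS = {
    "happy": 8.0, "love": 8.0, "fine": 6.0,
    "tired": 3.0, "sad": 2.0, "alone": 2.5,
}


def post_score(text):
    """Mean score of the scored words in a post, or None if none occur."""
    scores = [HAPPINESS[w] for w in text.lower().split() if w in HAPPINESS]
    return mean(scores) if scores else None


def sliding_means(posts, window=3):
    """Mean post score over each window of consecutive, time-ordered posts."""
    # Posts without any scored word are skipped.
    scores = [s for s in map(post_score, posts) if s is not None]
    return [mean(scores[i:i + window]) for i in range(len(scores) - window + 1)]


# Toy timeline, invented for illustration only.
posts = ["happy day", "love it", "fine", "so tired", "sad and alone"]
trend = sliding_means(posts, window=3)
```

In this toy timeline the windowed score declines steadily, which is the kind of gradual shift that a time series model could then pick up.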

7 Conclusion

Across most of the included works, two main conclusions emerge.

Reduction of Bias. It has been shown that function words can be key factors for grasping the human psyche by surveying the use of language. Fine et al. [28] showed that some corpora unknowingly induce strong biases that reduce the objectivity of said corpora; for example, the Google n-gram corpus over-predicts how fast technological terms are understood by humans. Researchers tend to resort to strong biases when designing e.g. data collection for corpora or classification tasks, since experience seemingly foretells e.g. cognitive performance from biased measures such as the use of complex grammar, eloquence or few spelling mistakes, as explained by Pennebaker et al. [2]. This leads us to the following, second conclusion:

Focus on how People Express Themselves, Rather Than What They Express. The three research questions and their answers have led to a hypothesis based on the findings of the included works: in order to grasp the psyche through the use of language, it is more important to survey how people express themselves than which words are actually used. The most important findings in NLPsych have in common that a possible key for accessing the psyche lies in small words such as function words, or in dictionaries developed by psychologists that focus on cognitive associations with words rather than on their lexical meaning, such as LIWC, labMT or ANEW. Furthermore, function words are more accessible, easier to measure and easier to count than e.g. complex grammar. This leads us to the conclusion that, in order to access the human psyche through written texts, the most promising approaches are data-driven, aware of possible biases and focused on function words rather than on a content-based representation.

8 Future Work

This section discusses some possible next steps for research in NLPsych.

A Connection of Scientific Fields and Sub-fields. As shown in Sect. 2, natural language processing in the sub-field of psychology is mainly about the study of language in clinical psychology and is thus connected to mental conditions and diseases. Findings from other application areas, such as dream language or the connection of language and academic success as an indicator of cognitive performance, could be valuable if connected to, or used in, other domains.

Researchers Should Rely More on Best Practice Approaches. Some included works, such as Reece et al. [1], demonstrate the advantages the sub-field can gain if state-of-the-art methods are used and combined in order to access the full potential of natural language. As Morales et al. [9] pointed out, it is worthwhile to enhance promising research approaches with state-of-the-art and best practice methods, and to connect them to other sub-fields, for the future development of natural language processing.

Use Established Psychometrics Combined with NLPsych. Whilst the suggestions for future work mentioned so far, i.e. the connection of sub-fields and the usage of best practice approaches, are rather natural and known to many researchers, one possible research gap of NLPsych illustrates the potential that the field holds: the operant motive test (OMT), developed by Scheffer et al. [37]. The OMT is a well-established psychometric test that assesses the fundamental motives of humans by letting participants freely associate to usually blurred images. Said images show scenarios in which labeled persons interact with each other. Participants are asked to answer questions on those images.

Trained psychologists do not rely solely on provided word lists but rather develop an intuition for encoding the OMT that enables them to access the psyche, while nonetheless showing high cross-observer agreement. A method that captures this intuition by using best practice approaches and connecting scientific fields has yet to be developed. This way, artificial intelligence might become even better at ‘reading between the lines’.