Introduction

The linguistic characteristics and semantic complexity of scientific discourse can be daunting for automated systems to tackle. Indeed, scientific language spans a diverse vocabulary with unique properties and connotations, often differing significantly from everyday language. This gap is especially noticeable in the distinct vocabularies of scientific disciplines, with a stark contrast between “harder” sciences such as physics and engineering and “softer” sciences like the social sciences, economics, and political science [1, 2]. Understanding these linguistic characteristics is of fundamental importance for various computational linguistic tasks such as information extraction, machine translation, and text classification.

The vocabulary of scientific language can be divided into three categories [3, 4]. The first category includes terms that are also used in everyday language. In the field of social sciences, these could be words like “society”, “work”, “actor” and “education”. The second category comprises terms that are science-specific but do not belong to a particular field, such as “hypothesis”, “comparison”, “proof” or “analysis”. The third category includes terms that are specific to a field, such as “fertility”, “habitus”, “assortativity” and “softmax”. Since the terms in the third category are subject-specific, they rarely or never appear in everyday texts. The terms of the first two categories also occur in everyday language but are often used in a different semantic context in scientific texts. This means, for example, that terms may look the same symbolically but differ in content and context. Research on current neural word embedding algorithms shows that they can describe semantic relations between lexical items [5, 6]. However, these algorithms are mostly trained on domain non-specific text corpora. If we use such domain-unspecific embeddings for machine learning downstream tasks in domain-specific scientific fields, they are not suitable, since domain-specific terms are either missing or used in a different context. Evaluating this type of embedding is difficult since established evaluation techniques are tailored to word embeddings for everyday language. Furthermore, training domain-specific embeddings can be costly, which is why we are looking for a way to use domain knowledge to improve pre-trained models.

Therefore, the paper has three objectives. First, the training of a domain-specific neural word embedding model for soft science disciplines (exemplified by the field of sociology). Second, the development of an intrinsic and extrinsic evaluation strategy for domain-specific vector space models of this field. In that scope, we introduce SociRel-461, a domain knowledge dictionary for intrinsic evaluation, and a multi-label classification dataset for extrinsic evaluation. Third, the creation of an optimization strategy that enhances pre-trained domain-unspecific word embeddings for domain-specific machine-learning tasks.

The nature of scientific and technical vocabulary differs between scientific disciplines. The vocabulary of disciplines classified as harder sciences (e.g., physics, engineering, and medicine) overlaps less with the vocabulary of everyday language [1, 2]. Disciplines categorized as soft sciences (e.g., social sciences, business and economics, or political science) tend to overlap more. Dang [1] has created a hard-science word list (HSWL) and a soft-science word list (SSWL) for second language (L2) learners that include the most important terms for understanding the scientific discourse in hard and soft scientific disciplines.

Terms used in the social sciences are often symbolically identical to terms in everyday language but differ in meaning. Thus, the context in which these terms are used differs, as well as their semantic relations to each other. Consider the terms “actor” and “actress” as examples. The word “actor” could be described with the categorical characteristics “human”, “male”, and “charismatic”; “actress”, on the other hand, would be “human”, “female”, and “charismatic”. If we compare the two terms, they are similar on a symbolic and a semantic level. The word “entertainer”, on the other hand, differs symbolically but is similar on a semantic level. In the sociological vocabulary, the term “actor” can look symbolically identical, but when we refer to the social “actor”, the term describes something semantically completely different.

Defining individual features by hand can be attributed to structuralism in linguistics, and structural linguistics is also the main theory on which distributional semantics is largely based [7].

We aim to utilize distributional semantics to model the scientific vocabulary of the social sciences, which significantly differs from everyday language. Our objective is to learn a representation of scientific terms that facilitates a generalized semantic relational view. The distributional hypothesis [8], suggesting that words in similar contexts possess similar meanings, serves as a theoretical starting point for the goal of deriving such relational meanings.

For example, consider a large volume of scientific literature from the field of sociology. We will recognize that words like ’discrimination’ and ’prejudice’ often appear in similar contexts, whereas words like ’technology’ or ’email’ appear in other contexts. From this spatial proximity in the text, semantic closeness can be derived via the distributional hypothesis. We thus model each word in the social science corpus in a spatial relation to every other word, given the position of all other words in the corpus. Since we cannot manually assign categorical features (male, female) to the individual terms (as in our ’actor’ and ’actress’ example), as structuralism does, we use machine learning algorithms that learn these features directly from natural language. According to Boleda [9], vector similarity correlates with distributional similarity, which in turn correlates with semantic relatedness.
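To make the notion of vector similarity concrete, the following minimal sketch uses made-up toy vectors (not values from our model) to show how cosine similarity captures the intuition that ’discrimination’ and ’prejudice’ lie closer to each other than either does to ’email’:

```python
# Illustrative sketch only: the vectors below are toy values, not taken from our model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for 300-dimensional embeddings.
discrimination = np.array([0.9, 0.1, 0.3, 0.7])
prejudice      = np.array([0.8, 0.2, 0.4, 0.6])
email          = np.array([0.1, 0.9, 0.8, 0.1])

print(cosine_similarity(discrimination, prejudice))  # high: similar contexts
print(cosine_similarity(discrimination, email))      # low: dissimilar contexts
```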

The contribution addresses three questions.

  1. How does our domain-specific model, trained on “soft” science vocabulary, perform against a domain non-specific model in the scope of our intrinsic and extrinsic evaluation?

  2. Does the application of the retrofitting post-processing method [10], using the SociRel-461 domain-knowledge dictionary, result in the enhancement of domain-specific word embeddings for machine learning downstream tasks?

  3. To what extent can the SociRel-461 domain knowledge dictionary enhance the performance of domain non-specific word embeddings for machine learning downstream tasks when used for retrofitting?

Section Related work, following the introduction, delves into the current research landscape surrounding domain-specific and domain-unspecific word embeddings. Section Word Embeddings-State of Research expands upon this by addressing the broader research context of word embeddings. The description of the dataset and the details of the preprocessing pipeline are provided in Sects. Dataset Description and Pre-Processing Pipeline, respectively. Section Model Implementation outlines our implementation of the word2vec algorithm using TensorFlow, including the hyper-parameter settings for training our word embedding models. Section SociRel-461 for Domain-Knowledge Intrinsic Evaluation elucidates the development of SociRel-461 and our intrinsic evaluation strategy, which is specifically tailored for social science texts. Section Domain-Knowledge Extrinsic Evaluation provides a comprehensive explanation of the domain-specific extrinsic evaluation procedures and the architecture of our extrinsic evaluation model. Section Retrofitting discusses the retrofitting method and our specific implementation of it. The outcomes of our experiments are presented and analyzed in Sect. Results. Section Discussion discusses the limitations of our approach, while Sect. Outlook outlines potential directions for future research.

Related work

Neural word embeddings are a computational linguistic method that represents human spoken or written natural language in a machine-comprehensible way. In the case of the word2vec algorithm [5, 6], a shallow neural network converts the symbolic representation of words into a set of real numbers. Each word can be seen as a categorical feature. The advantage of this method is the representation of semantic relations between terms, compared to traditional embedding methods that represent words only as indices of a vocabulary. This technique allowed significant advances in semantic NLP tasks such as Named Entity Recognition (NER) [11, 12], NER in domain-specific fields [13,14,15], and Sentiment Analysis [16,17,18], as well as syntactic NLP tasks such as Part of Speech Tagging (POS) [19,20,21].

With word embeddings based on everyday language, a wide variety of NLP tasks can be accomplished, such as (to name just a few) the analysis of customers’ product reviews [22, 23], Twitter sentiment classification [17, 24], the analysis of Twitter messages [25], web pages [26], and newspaper headlines [27], the detection of fake news [28], or the analysis of gender and ethnic stereotypes [29, 30] in a text. These examples are mostly based on unstructured natural everyday language.

Examples of domain-specific language embeddings include those trained on weblogs [31], medical annotations of proteins [32], diagnostic reports for clinical research [33], financial documents [34], US patents [35], or documents of the Oil and Gas domain [36]. Such domain-specific language differs from everyday language due to the contextual variability of symbolically identical words. This implies that the same word can possess different meanings based on the context in which it is used. Additionally, domain-specific language often incorporates specialized terminology that is typically absent in everyday discourse.

Researchers have trained word embeddings on domain-specific scientific text in various fields. Most of the contributions are based on natural language from hard science disciplines. The more advanced language models are extensions of the BERT model. These include SciBERT [37], FinBERT [38] and BioBERT [39].

The SciBERT corpus was trained on full-text documents from the computer science (18%) and biomedical (82%) fields and thus emphasizes hard science vocabulary. The authors trained their model on full texts only [37]. BioBERT was trained on PubMed abstracts and PubMed Central (PMC) full-text articles. Furthermore, the English Wikipedia and book corpus were included in the BioBERT model [39]. In contrast, news articles from the financial sector form the database for FinBERT. These are financial texts but not scientific publications, distinguishing FinBERT from the other two models. FinBERT uses over 46k documents [38] and is thus significantly smaller than SciBERT and BioBERT. What these models have in common is a strong focus on hard science vocabulary.

In their contribution, Dridi, Gaber, Azad, and Bhogal [40] addressed the issue of hyperparameter tuning in scientific word embeddings. The hyperparameters vector dimensionality, size of n-grams, and size of the sliding window were adjusted, with the result that bigrams and hidden embedding vectors of between 300 and 500 dimensions give the optimal result. The authors used a subset of 2789 full-text articles in the area of machine learning published at NIPS (Neural Information Processing Systems) between 2012 and 2017. The results were evaluated quantitatively (analogy evaluation) and qualitatively (t-SNE).

Heffernan and Teufel [41] trained word embeddings on 22,878 scientific full-texts from computational linguistics and natural language processing to build a scientific problem/solution classifier. The ground truth was defined by the original authors of the articles used in the dataset.

Another contribution focused on identifying the domain-specific functional structure of scientific documents [42]. As a database, 130,000 publications from computer science were used. As an evaluation strategy, the publications’ keywords, assigned by their original authors, served as a gold standard.

Kim, Hullman, Burgess and Adar [43] used scientific word embeddings trained on 500,000 publications listed at the Public Library of Science (PLOS) and PubMed Central (PMC) to develop a strategy for simplifying technical terminology. As in our contribution, the authors used the word2vec algorithm.

Domain-specific word embeddings can be used to discover similarities between texts. In the work of Al-Natsheh and colleagues [44], scientific articles from the ISTEX scientific digital library (SDL) were used to develop a semantic search engine for similar text. Two experts in the discipline evaluated the model by defining a ground truth.

Word embeddings have been successfully trained to analyze rhetorical categories in scientific publications [45]. For this purpose, the sentences of 125 scientific publications of the ACL Anthology were examined and classified into 16 rhetorical categories. The authors used Word2Vec and GloVe models pre-trained on the Google News dataset.

Naili, Chaibi and Ghezala [46] ran experiments on topic segmentation of Arabic and English scientific articles using the LSA, Word2Vec, and GloVe algorithms. The English corpus deals with computer science topics, and the Arabic text corpus includes scientific publications from economics, politics, and life science. The authors showed that word2vec with negative sampling gave better results than LSA. Better results were obtained for the English articles than for the Arabic ones.

In conclusion, domain non-specific embeddings are primarily used for general tasks involving mainly natural everyday language. This includes tasks such as fake news detection [28] or the analysis of newspaper headlines [27]. For the analysis of language and its individual elements, as in part-of-speech tagging, domain non-specific embeddings are also preferred [19,20,21]. However, when discipline-specific machine learning tasks need to be accomplished, as described in the work of Heffernan and Teufel [41], it is worthwhile to train domain-specific embeddings, since the word representations depend on their context. This constitutes a particular problem because the context in which words are used in specific disciplines may differ considerably from everyday language. Most embeddings based on scientific literature (such as the contributions of Lu, Huang, Bu and Cheng [42] or Kim, Hullman, Burgess and Adar [43]) are mainly in the realm of hard sciences, whose vocabulary differs not only from everyday language but also from that of the soft sciences, as described at the beginning of the paper.

Examining the evaluation methodologies, it is evident that most domain-specific studies employed a combination of human-annotated ground truth and quantitative extrinsic methods to assess the quality of their embeddings. Intrinsic evaluation methods like SimLex-999 [47] or WordSimilarity-353 [48] are used less because they have been tailored to domain-unspecific embeddings. The study by Nooralahzadeh and colleagues [36] is particularly relevant, as domain-specific knowledge extracted from a glossary was used to evaluate and later improve the models.

Our contribution ties into this idea by developing an intrinsic evaluation procedure for word embeddings of social science texts that can be used as ground truth.

Following this example, we have developed the dataset SociRel-461 for evaluating and optimizing domain-specific soft scientific word embeddings based on the Open Education Sociology Dictionary.

Word embeddings - State of research

Word embeddings are a computational linguistic method that represents natural language in a machine-comprehensible way. The symbolic representation of words is converted into a set of real numbers in a vector. Each vector element can then be seen as the realization of a latent feature.

In the context of advances in neural networks, the word2vec algorithm [5, 6] received a lot of attention in the mid-2010s. The idea of using neural networks for word embeddings was already adopted by Miikkulainen and Dyer [49] in the early 1990s. Neural word embeddings have the advantage of producing lower dimensional data [50]. According to Altszyler, Sigman, Ribeiro, and Slezak [51], neural word embeddings outperform LSA in terms of memory requirements. As a result, a larger amount of data can be used for model training. Combined with the negative sampling extension [52], the word2vec algorithm is one of the most efficient methods for processing massive volumes of text data.

GloVe (Global Vectors for Word Representation) [53] is an interesting alternative to word2vec, as it outperforms word2vec in four out of five word similarity tasks when trained on a non-domain-specific corpus such as Gigaword5 or CommonCrawl. Pennington and colleagues [53] calculated accuracy improvements on analogy tasks including WordSim-353 [48] by 3%, MC (Miller & Charles 30) [54] by 7.5%, RG (Rubenstein-Goodenough) [55] by 8.1%, and RW (rare words dataset) [56] by 0.9%. For SCWS (Stanford’s contextual word similarities) [57], word2vec outperformed GloVe by 4.2%. In contrast to word2vec, GloVe is a count-based model built on word-word co-occurrences.

Until recent years, using the attention mechanism in neural networks required the implementation of RNN encoder-decoder models. The transformer network architecture [58] therefore constituted a breakthrough, as it could be used in isolation without convolutions or recurrence [59] and has led to several new models. The BERT model simplified the approach and increased the amount of data that could be used for input. Compared to word2vec, the ’standard’ BERT model uses sub-word tokenization (WordPiece tokenization) instead of word-level tokenization for the input sequence. Frequently occurring character sequences are represented as whole word tokens, while rarer or unknown words are split into their sub-word parts [60]. Character pairs are then merged iteratively to maximize the likelihood of the data.
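As a brief illustration of the difference between whole-word and sub-word tokenization, the following hedged sketch assumes the Hugging Face transformers package and the publicly available bert-base-uncased tokenizer, neither of which is part of our pipeline:

```python
# Illustration only: assumes the Hugging Face `transformers` package and the
# public `bert-base-uncased` WordPiece tokenizer (not part of our pipeline).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent everyday words are usually kept whole, while rarer domain terms
# tend to be split into sub-word pieces marked with the "##" continuation prefix.
print(tokenizer.tokenize("society"))  # likely kept as a single token
print(tokenizer.tokenize("habitus"))  # likely split into sub-word pieces
```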

With GPT-2 [61, 62] and GPT-3 [63], the focus has shifted away from machine translation towards text generation. GPT-2 consists of a stack of transformer blocks with masked self-attention, using position embeddings like BERT. The model is trained on an input sequence with the aim of predicting the following word in the sequence. GPT-2 was pre-trained on the WebText dataset [64], which consists of all websites linked in highly rated Reddit posts. Since these models were pre-trained on non-domain-specific texts and their training is very cost-intensive, these approaches are not considered for our question. The algorithm we employed is particularly adept at identifying both taxonomic and thematic relations, as supported by the literature [65, 66], and has demonstrated robust performance in handling large datasets [67]. Our focus on taxonomic relationships and the strengths of the chosen algorithm in this area were decisive for our decision. Furthermore, GPT-2 employs Byte Pair Encoding (BPE), a sub-word tokenization technique [68, 69]. This implies that GPT-2 operates with vectors at the sub-word level, rather than on a whole-word basis. For our research, it is crucial to have vectors that represent entire words. Our project aimed to create a model steeped exclusively in scientific language to ensure the highest relevance and accuracy for our domain-specific applications. While fine-tuning GPT-2 offers remarkable advantages in general contexts by enhancing non-scientific language capabilities, our objective was to develop a model that operates within a strictly scientific text corpus. This focus stems from our desire to minimize the introduction of biases and inaccuracies that could arise from a broader linguistic base not dedicated to scientific discourse.

Most recently, GPT-3.5, a GPT-3 language model variant, was developed. GPT-3.5 is an improved version of GPT-3 that utilizes a larger training dataset and fine-tuning techniques to generate more accurate language outputs. Based on this iteration of the GPT model, the chatbot system ChatGPT has become well known beyond the scientific community. Later in 2023, ChatGPT was enhanced with the even newer GPT-4 model [70]. Particularly interesting here is the approach of incorporating human feedback within the framework of a reward network in order to align the model with its users [63].

In summary, a large text corpus is needed to represent the heterogeneity of a scientific discipline. Therefore, the efficiency of the algorithm is of central importance for our research. Altszyler and colleagues [51] concluded that neural word embeddings outperform LSA for a corpus size of 10 million words or more. Since our database exceeds this limit by far with 250 million words, we decided to use the word2vec algorithm. The precise implementation of the algorithm for GPUs is described in the methods section, as it differs from the conventional solutions.

Dataset description

We decided to use scholarly articles as our primary data source since we assume that these articles’ published content is a proxy of the content and topics of the discipline itself.

The number of published scientific articles has increased exponentially over the years. The number of all scholarly papers even doubles every 24 years [71]. This trend can be observed, with nuances, across many disciplines and countries [72].

Nowadays, publications in scientific journals are essential for sociological career opportunities. It has become, for example, more common to take postdoctoral positions before a tenure track professorship [73, 74]. According to Warren [74], this development increases the pressure to publish more, as the number of publications can also determine future career paths. If all scientists competing for positions publish quantitatively more, the value of the individual publication diminishes proportionally. From 1986 to 2016, the number of sociological scholarly journals listed on the SSCI (Social Sciences Citation Index) nearly tripled from 64 to 163 [74]. The expansion of the SSCI, in particular, is probably due to a higher volume of publications and the progressive digitization of scientific discourse. The number of published articles in sociology increased by 35% from 2011 to 2019, while the number of published books decreased by 23% in the same period [75]. Notably, the number of scholarly articles written per author, as listed in Academic Analytics, LLC (ACA), has increased on average from 4.47 to 6.05 per five years.

As a data basis, we use 94,741 full-text documents (data collection as of 09/22/2020) listed in the SSCI that are classified as articles, are written in English, and originate from sociology. The period covers 51 years, from 1969 to 2020. However, the majority of the scholarly papers in our dataset were published between 1985 and 2020. The period between 1969 and 1985 accounts for only 3% of our text data. Therefore, our analysis considers the published documents from 1985 to 2019 inclusive.

To represent the field as heterogeneously as possible, we considered 200 different journals. Only 12% of the journals have fewer than 300 papers. Due to availability, the number of publications differs between journals. Since the algorithm to be trained performs better with a larger text corpus, we decided to use as many texts as possible without biasing toward one specific journal. The most prominent single journal (Social Indicators Research) accounts for only 4.1% of all text data.

170,025 authors wrote the texts covered by our database, resulting in an average of 1.7946 (std: 1.1456) authors per publication. Our results are consistent with the calculations of Henriksen [76] and of Macfarlane, Devine, Drake, Gilbert, Robinson and White [77]. Henriksen [76] computed a value of \(1.8 - 1.9\) authors for Sociology, Social Issues, and Ethnic Studies. Macfarlane et al. [77] found that 27% of the articles are written without co-authors, 61% have \(2-3\) authors, and 11% have \(4-5\) authors per publication.

Fig. 1

Line graph representing the annual mean count of authors and publications within the Social Sciences Citation Index (SSCI). The blue line represents the mean author count for SSCI publications per year. The area surrounding the blue line represents the standard deviation of the per-year author count. The green line indicates the number of articles listed in the SSCI per year

Figure 1 shows that the number of publications in our text corpus has increased over the years. The same applies to the average number of authors per paper, consistent with Henriksen’s [76] findings. In the 30 years from 1990 to 2020, the number of articles published and listed in the SSCI tripled. We omitted the data points in Fig. 1 for 2020 for authors and publication count since our data collection spanned only the first three months of that year.

Therefore, we take our findings of increasing author numbers per publication and the steadily growing number of papers published and listed in the SSCI per year to indicate that our sample of text data serves as a proxy for the scholarly discourse in sociology. Furthermore, we assume that the natural language used in the SSCI-listed publications is, on average, similar to that used in other databases such as Google Scholar or JSTOR.

Pre-processing pipeline

This section explains how we get from unstructured natural language to a structured data form that a computer can process. No uniform standardized procedure exists, meaning the pre-processing pipelines differ depending on research projects. However, some pre-processing steps are essential for a better model performance later on.

The authors Rameshbhai and Paulose [78] used a multi-stage pre-processing procedure for their paper on opinion mining from newspaper headlines. First, they removed all stop words from the text corpus. Then the headlines were transformed into lowercase. After that, bi-grams were generated and used in a Support Vector Machine (SVM) classification task [27].

Dobrokhotov et al. [32] used another pre-processing approach for the medical annotation of proteins. They first divided the text corpus into smaller segments and then distinguished between biological and general English terms. Following this, they derived and normalized text tokens, removing punctuation and numbers. In the final step, they cleaned the tokens of any remaining letters. For the automated classification of diagnostic reports for clinical research [33], the pre-processing pipeline consists of two modules.

In the basic module, (1) text tokenization, (2) transformation to lowercase, and (3) stop word removal take place. The authors converted all umlauts in this first module, given that the medical diagnostic reports were written in German. In a specific reprocessing module, they checked the text tokens for spelling errors. Moreover, they added the most frequently occurring medical term abbreviations as distinct text tokens.

A six-part pre-processing pipeline was used for the sentiment analysis of financial texts [34]. First (1), the authors removed hyperlinks and numbers from the text corpus. Second (2), abbreviations like “aren’t” were expanded to “are not”. Third (3), punctuation was removed. Fourth (4), negation words were deleted. Fifth (5), pronouns, prepositions, and conjunctions were removed, since only verbs, nouns, adjectives, adverbs, and interjections contain sentiment information [34]. In the last (6) step, the remaining words were lemmatized.

Based on the examples, it becomes clear that different research focuses also entail a different approach to pre-processing.

In order to systematize the pre-processing pipeline for scientific literature, we have divided the process into four modules, which are explained in the following sections: the import module, the regular expression module, the normalization module, and the Skip-Gram module.

In the import module, we process our raw data, which consists of scientific articles available in PDF format. While these documents appear highly structured to human observers, they are not directly suitable for machine learning applications. This is because the algorithms employed in our model are designed to handle numeric data exclusively. A PDF document can contain different data types, including single characters, strings, numeric data, graphics, vector graphics, tables, and bullet lists [79]. These individual parts are then arranged into larger structures: sequences of characters form words, words form sentences, and sentences combine into paragraphs and text blocks. Compared to other text documents, such as Word documents or HTML, a PDF document is better described as a graphical representation of text.

We import PDF documents in parallel batches of 2500 files, leveraging the simultaneous use of 40 CPU cores. To focus on the body of the scientific papers, we discard the first page, which typically contains the abstract, headings, and meta information about the journal. We also remove the abstracts of the publications, as they summarize the respective paper and could introduce redundancy. Our objective is to learn from naturally written language, so we can disregard automatically generated meta-information, such as details about the journals commonly found on the first page.

Additionally, we analyzed each document to determine whether it has a one- or two-column format. In the latter instances, we merged the two columns into one. Concluding the first module, we converted the PDF documents into plain text files, a more space-efficient format due to the omission of text formatting.
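A simplified sketch of this import step is shown below; the PDF library (here PyMuPDF) and the directory layout are illustrative assumptions, and the two-column detection and merging are omitted:

```python
# Hedged sketch of the import module, assuming PyMuPDF (`fitz`); our actual
# implementation may use a different PDF library and additionally detects and
# merges two-column layouts before exporting plain text.
from multiprocessing import Pool
from pathlib import Path

import fitz  # PyMuPDF

def pdf_to_plain_text(pdf_path: str) -> None:
    doc = fitz.open(pdf_path)
    # Discard the first page (abstract, headings, and journal meta information).
    body = "\n".join(doc[i].get_text() for i in range(1, len(doc)))
    Path(pdf_path).with_suffix(".txt").write_text(body, encoding="utf-8")

if __name__ == "__main__":
    pdf_files = [str(p) for p in sorted(Path("corpus/pdf").glob("*.pdf"))]  # hypothetical directory
    # Parallel batches of files; we used 40 CPU cores on our cluster.
    with Pool(processes=40) as pool:
        pool.map(pdf_to_plain_text, pdf_files, chunksize=2500)
```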

The regular expression module of the pre-processing pipeline takes the previously processed plain text files as input. We further condense the information content of the natural language using a series of chained regular expression patterns. Maintaining the order of these chained regular expressions is crucial, as we rely on simpler elements (such as single digits and punctuation) to identify more complex structures (such as headers, footers, and citations). This strategy follows a top-down approach, implying that we should first remove complex text elements, followed by less complex ones.

We implemented the following steps in the second module of our pre-processing pipeline in descending order:

  1. We removed all in-text citations and other literature references.

  2. We deleted words with split ligatures from the corpus.

  3. We concatenated words separated by line-breaking hyphens.

  4. We removed all other non-Unicode characters.

  5. We eliminated single digits and punctuation, which were initially needed to identify more complex text structures such as in-text citations.

  6. We also removed spaces and single letters resulting from previous transformations.

  7. We transformed the text corpus into lowercase, single-word tokens.

  8. We removed tokens that contain three consecutive identical characters.

The output of the second module consists of token arrays representing sentences, nested within arrays that represent documents. We implemented this module in a vectorized batch format and utilized parallel computation to perform the transformations as time-efficiently as possible.
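The following sketch illustrates the chained, top-down ordering of such patterns; the regular expressions themselves are simplified stand-ins, not the exact expressions of our pipeline:

```python
# Hedged sketch of the regular expression module; the patterns are simplified
# stand-ins for the chained expressions we actually used.
import re

# Ordered top-down: complex structures first, simple characters last.
PATTERNS = [
    (re.compile(r"\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}[a-z]?\)"), " "),  # in-text citations, e.g. (Bourdieu, 1984)
    (re.compile(r"\b\w*ﬁ\w*|\b\w*ﬂ\w*"), " "),   # words containing broken ligature glyphs (simplified)
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2"),      # re-join words split by line-breaking hyphens
    (re.compile(r"[^\x00-\x7F]+"), " "),          # remaining non-ASCII artifacts
    (re.compile(r"\d+|[^\w\s]"), " "),            # single digits and punctuation
    (re.compile(r"\b\w\b"), " "),                 # leftover single letters
    (re.compile(r"\s+"), " "),                    # collapse whitespace
]

def clean(sentence: str) -> str:
    for pattern, replacement in PATTERNS:
        sentence = pattern.sub(replacement, sentence)
    return sentence.lower().strip()

tokens = [clean(s).split() for s in ["Habitus shapes practice (Bourdieu, 1984)."]]
print(tokens)  # [['habitus', 'shapes', 'practice']]
```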

In the normalization module, we normalize the text tokens to eliminate redundant information by expanding abbreviations, similar to the approach taken by Sun et al. [34].

We then calculated the tf-idf (term frequency-inverse document frequency) to detect misspelled tokens and other artifacts. When embedding words with the tf-idf method, minor changes in terms result in entirely new word tokens, as described earlier [80]. This fact, which is often perceived as a weakness of the method, is a strength in this case, as it allows us to identify particularly rarely occurring words.

After removing the rarest 1% (and, for a second model variant, 10%) of tokens, we reduced all words to their base form using the spaCy lemmatizer [81].
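A minimal sketch of these two steps follows, assuming scikit-learn for the tf-idf statistics and spaCy's small English model for lemmatization; the toy documents and the cutoff are purely illustrative:

```python
# Hedged sketch of rare-token detection via tf-idf and spaCy lemmatization;
# the corpus, cutoff, and model name are illustrative assumptions.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Social capital shapes educational outcomes.",
             "Habitus and field structure social practice."]  # stand-in corpus

vectorizer = TfidfVectorizer().fit(documents)
idf = vectorizer.idf_                                   # high idf = rare token
vocab = np.array(vectorizer.get_feature_names_out())
cutoff = np.quantile(idf, 0.99)                         # rarest 1% (or 0.90 for 10%)
rare_tokens = set(vocab[idf >= cutoff])                 # candidates for removal
# On the full 250-million-word corpus, only genuinely rare tokens exceed this cutoff.

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
lemmas = [[tok.lemma_.lower() for tok in nlp(doc)
           if tok.is_alpha and tok.text.lower() not in rare_tokens]
          for doc in documents]
print(lemmas)
```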

Since information entropy increases with sentence length but decreases for very long sentences [82], we further deleted sentences consisting of four or fewer word tokens. These short token arrays account for about 4% of all sentences. As an upper bound, we identified and removed the 5% longest token arrays.

As a last step, tokens with low information content were removed by using a stop-word list. The normalization module yields token arrays representing sentences nested in arrays that represent documents (Fig. 2).

Our research focuses on scientific articles written in natural language, specifically English. Natural languages, often characterized by their organic evolution through interpersonal interactions, are readily comprehensible to humans. However, their inherent complexity and variability present unique challenges for computational analysis. Formal languages, on the other hand, were created by scientists without that property and for a specific purpose, such as describing logical relationships in mathematics through chains of symbols or characters. Formal languages differ from natural ones at the semantic level, meaning that formal languages are unambiguous and not open to interpretation [83]. Since we are only interested in natural language, we removed the formal language from our database during the previous three pre-processing steps. Thus, the corpus contains only natural language, with an underlying semantic structure from which to derive thematic content relations.

According to Katz and Fodor [84], semantic structures can be understood as relations between lexical items. In this context, we distinguish between taxonomic relations, dissimilar concepts, and syntagmatic relations [65]. Taxonomic relations are particularly interesting because they describe the similarity of lexical items [54, 85]. These relations are based on lexical-syntactic patterns like “A such as B” [86]. A unique distinction must be made here: taxonomic relations are formed only between concepts with the same inherent properties [87]. Thematic relations additionally describe complementary properties of the concepts [85]. For example, income and revenue could be thematically similar in the scientific discourse, just as capitalism and proletariat are associatively related.

In our Skip-Gram module, we employed the well-known Skip-Gram algorithm with negative sampling [5]. This algorithm is particularly adept at mapping both taxonomic and thematic relations [65, 66], and it also demonstrates robust performance when handling large datasets [67]. Moreover, predictive models outperform count-based models [88]. Therefore, we decided to calculate the vector space representation of the text corpus using a shallow neural network.

Skip-Gram uses a center word \(w_{ce}\) to predict its surrounding context words \(w_{co}\). Since we have to predict multiple context words per center word, this results in a multi-class classification problem. The computation of multinomial classification problems is much more cost-intensive than, for example, binary classification problems, since we have a softmax activation function (Eq. 1) in the second (\(\beta\))-dense-layer within this architecture. The normalization factor in the denominator of Eq. 1 (\(v \in V\)) must be recalculated for each pair of words.

$$\begin{aligned} \begin{aligned} P(w_{co}|w_{ce};\theta ) = \frac{exp(W_{\beta }^{T} \cdot W_{\alpha }^{T}x)}{\sum _{v=1}^{V}exp(W_{\beta _{v}}^{T} \cdot W_{\alpha }^{T}x)} \end{aligned} \end{aligned}$$
(1)

To solve this problem, we change the architecture of the neural network in such a way that we only have to solve a binary classification problem instead of a multinomial one [6, 52]. We achieve this by splitting the text corpus into pairs of a center word \(w_{ce}\) and a context word \(w_{co}\) before feeding them into the final neural network described in Eq. 5. The center word \(w_{ce}\) and context word \(w_{co}\) split can be seen in Fig. 3. The shallow neural network features two input vectors \(x_1\) and \(x_2\).

$$\begin{aligned} \begin{aligned} \underset{\theta }{\textrm{argmax}} \sum _{(w_{ce},w_{co}) \in D} log\ P(D=1|w_{ce},w_{co};\theta ) \end{aligned} \end{aligned}$$
(2)

Our goal is to calculate a parameter \(\theta\) that maximizes the probability \(P(D = 1|w_{ce},w_{co})\) of each word pair appearing together in the text corpus. One row of the parameter \(\theta\) is, in fact, the vector representation of a single word. To do this, we use the maximum likelihood function in Eq. 2.

Since \(W_{\beta }^{T}\) in Eq. 1 can be seen as \(\theta ^{(k)T}\), where k denotes a single word, we can rewrite Eq. 1 to the Objective Function \(L(\theta )\) (3).

$$\begin{aligned} \begin{aligned} L(\theta ) = \frac{1}{D} \ \sum _{d=1}^{D} \sum _{k=1}^{K} \frac{exp(\theta ^{(k)T} \cdot W_{\alpha }^{T}x)}{\sum _{v=1}^{V}exp(\theta ^{(v)T} \cdot W_{\alpha }^{T}x)} \end{aligned} \end{aligned}$$
(3)

\(d \in D\) is one center word and K are all possible context words. In this way, the window around the center word would be as large as the entire text corpus D since \(D=K\).

To add a window size j we substitute k in \(\theta ^{(k)T}\) by \(d+j\) to \(\theta ^{(d+j)T}\). This leads to,

$$\begin{aligned} \begin{aligned} L(\theta ) = \frac{1}{D} \ \sum _{d=1}^{D} \sum _{-c\le j \le c, j \ne 0} \frac{exp(\theta ^{(d+j)T} \cdot W_{\alpha }^{T}x)}{\sum _{v=1}^{V}exp(\theta ^{(v)T} \cdot W_{\alpha }^{T}x)} \end{aligned} \end{aligned}$$
(4)

where c is the window size.

Two hyperparameters were tuned to generate the split center-word-context-word-dataset.

Window Size: We used a window size \(c = 3\). This implies that for each center word, three words before and after it are used as context words. The authors Li et al. [67] used a window size of \(c = 2\) when scaling word2vec on a big corpus.

Negative Sampling Ratio: The negative samples v were drawn from the unigram distribution, following the recommendation of Mikolov et al. [6]. We achieved the best results with a balanced dataset of equal numbers of positive and negative samples (\(Ratio = 1.0\)).

As a final result of the Skip-Gram module, we obtain three datasets: \(x_1\) with numeric representations for all center words \(w_{ce}\), \(x_2\) containing the numeric representations for the context words \(w_{co}\) (positive and negative samples), and Y representing the labels.
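As a sketch of this pair-generation step, the Keras utility tf.keras.preprocessing.sequence.skipgrams can reproduce the window and sampling-ratio settings described above; note that, unlike our implementation, this helper draws negative samples uniformly rather than from the unigram distribution:

```python
# Hedged sketch of the center/context pair generation using a Keras utility;
# our own implementation followed Mikolov et al.'s unigram distribution for
# negative samples, whereas this helper samples them uniformly.
import tensorflow as tf

vocab_size = 25_139                    # vocabulary of the 10%-variant model
sentence = [14, 271, 9, 88, 1053, 4]   # one tokenized sentence as integer ids

pairs, labels = tf.keras.preprocessing.sequence.skipgrams(
    sentence,
    vocabulary_size=vocab_size,
    window_size=3,        # c = 3: three context words before and after the center
    negative_samples=1.0, # ratio = 1.0: one negative sample per positive pair
)

x1 = [center for center, _ in pairs]    # numeric center words  w_ce
x2 = [context for _, context in pairs]  # numeric context words w_co
y = labels                              # 1 = observed pair, 0 = negative sample
print(list(zip(x1, x2, y))[:5])
```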

Model implementation

In this section, we describe our word embedding implementation, for which we mainly followed the equations of Goldberg and Levy [52] and used TensorFlow [89].

The model described in Fig. 3 and Eq. 5 takes two input vectors \(x_1\) and \(x_2\), where \(x_1\) are the center words and \(x_2\) are the context words.

$$\begin{aligned} \begin{aligned}&h=W_{1}^{T}x_1 \\&j=W_{2}^{T}x_2 \\&z = h \cdot j \\&{\hat{y}} = sigmoid(z) \end{aligned} \end{aligned}$$
(5)

The matrix-vector product h is the embedded representation of the center words \(w_{ce}\), and j is the embedded representation of the context words \(w_{co}\). Each row of the embedding matrix h and each row of j represents a word as a 300-dimensional row vector. According to Chungh and colleagues [90], each corpus has its optimal embedding size. We tried 100, 200, and 300 dimensions for the hidden layer, with 300 dimensions giving the best result for a corpus of scientific articles. Since we initially split the data into two vectors, we have to calculate the similarity between the center word embedding h and the context word embedding j. The dot product is a reliable way to calculate the similarity between two vectors. As a result, we obtain a column vector z, which is then passed into a sigmoid function (Eq. 6). In order to extract the final embedding vectors later, the implementation with TensorFlow requires a transformation of h before calculating the dot product \(z = h \cdot j\). This is not apparent in Eq. 5 but is necessary to create a so-called “lookup table” of the vectors, where each row represents a word and each column a parameter. The lookup table can be seen as the third step in Fig. 3, with a dimensionality of (vocabulary size \(\times\) 300).

$$\begin{aligned} \begin{aligned} P(w_{d+j} | w_{d};\theta ) = {\hat{y}} = \frac{1}{1+exp(-z)} \end{aligned} \end{aligned}$$
(6)

Since we are facing a binary problem, we rewrite the MLE in Eq. 2 as the binary cross-entropy equation. The negative log-likelihood equals the cross entropy, and binary cross-entropy is a special case of cross entropy.

$$\begin{aligned} \begin{aligned} J(\theta )&= -\frac{1}{D} log L(\theta ) \\&= - \frac{1}{D} \ \sum _{d=1}^{D} \sum _{-c\le j \le c, j \ne 0} logP(w_{d+j} | w_{d};\theta ) \\&= - \frac{1}{D} \ \sum _{d=1}^{D} \sum _{-c\le j \le c, j \ne 0} y_d \cdot log({\hat{y}}_{d}) \\&\quad +(1-y_d)\cdot log(1-{\hat{y}}_d) \end{aligned} \end{aligned}$$
(7)

Furthermore, the sum \(\frac{1}{D} \ \sum _{d=1}^{D}\) in Eqs. 3, 4 and 7 comprises millions of words and would therefore be very costly to compute. Instead of updating the weights after iterating over all words, we substituted \(\frac{1}{D} \ \sum _{d=1}^{D}\) with mini-batches of size \(b = 1024\), following the recommendation of Li, Drozd, Guo, Liu, Matsuoka and Du [67].

$$\begin{aligned} \begin{aligned} J(\theta ;w^{(b)}) = - \sum _{-c\le j \le c, j \ne 0} y_d \cdot log({\hat{y}}_{d}) +(1-y_d)\cdot log(1-{\hat{y}}_d) \end{aligned} \end{aligned}$$
(8)

We used backward propagation and stochastic gradient descent to update the weight matrix \(\theta\) in \(J(\theta ;w^{(b)})\) in Eq. 8. The weights were initialized using the Glorot uniform initializer.

$$\begin{aligned} \begin{aligned} \theta _{update} = \theta _{old} - \eta \cdot \nabla _{\theta } J(\theta ;w^{(b)}) \end{aligned} \end{aligned}$$
(9)
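A minimal TensorFlow/Keras sketch of the architecture defined by Eqs. 5-9 is given below; layer names, the learning rate, and other details not stated in the text are illustrative assumptions rather than our exact settings:

```python
# Hedged sketch of the two-input skip-gram model (Eqs. 5-9); hyperparameters
# other than those stated in the text (300 dimensions, batch size 1024,
# Glorot uniform initialization) are illustrative.
import tensorflow as tf

vocab_size = 104_637   # vocabulary of the 1%-variant model
embed_dim = 300

x1 = tf.keras.Input(shape=(1,), dtype="int32", name="center_word")   # w_ce
x2 = tf.keras.Input(shape=(1,), dtype="int32", name="context_word")  # w_co

embed_center = tf.keras.layers.Embedding(
    vocab_size, embed_dim, embeddings_initializer="glorot_uniform", name="W1")
embed_context = tf.keras.layers.Embedding(
    vocab_size, embed_dim, embeddings_initializer="glorot_uniform", name="W2")

h = tf.keras.layers.Flatten()(embed_center(x1))    # h = W1^T x1
j = tf.keras.layers.Flatten()(embed_context(x2))   # j = W2^T x2

z = tf.keras.layers.Dot(axes=1)([h, j])             # z = h . j   (Eq. 5)
y_hat = tf.keras.layers.Activation("sigmoid")(z)    # sigmoid(z)  (Eq. 6)

model = tf.keras.Model(inputs=[x1, x2], outputs=y_hat)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.025),  # SGD update (Eq. 9); eta is illustrative
    loss="binary_crossentropy",                              # J(theta)   (Eq. 8)
    metrics=["accuracy"],
)

# model.fit([x1_data, x2_data], y_data, batch_size=1024, epochs=1)
# The final "lookup table": one 300-dimensional row vector per vocabulary entry.
# embeddings = model.get_layer("W1").get_weights()[0]        # shape (vocab_size, 300)
```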

We implemented two variants of the model. The first variant underwent a basic stop-word removal and the removal of the rarest 1% of tokens. This model’s final weight matrix, \(\theta\), comprises 31,391,100 parameters (104,637 \(\times\) 300). Likewise, we applied a basic stop-word removal process for the second variant and eliminated the rarest 10% of tokens. This model’s final weight matrix \(\theta\) consists of 7,541,700 parameters (25,139 \(\times\) 300). As illustrated in Fig. 2, the first model achieved an accuracy of 0.931 (93.1%) after the initial training epoch, which spanned 4 h. The accuracy increased to 0.943 (94.3%) after 40 h of training and to 0.951 (95.1%) after 64 h. Concurrently, the training loss decreased from 0.15 to 0.148 and finally to 0.143. Given the marginal improvement in accuracy with additional training time, we used the result after the first epoch for efficiency. Consequently, our final model predicts the correct context word, given a center word, with an accuracy of 93.1%.

Fig. 2

Model training loss and accuracy graph. The green curve in the graph represents the training loss, while the blue curve signifies the training accuracy. At the zeroth epoch, the model achieved an accuracy of 0.498, which is not surprising given the balanced nature of our dataset. After one epoch of training, the accuracy improved significantly to 0.931. This accuracy saw a slight increase to 0.943 at the 10th epoch and further to 0.951 at the 15th epoch. Concurrently, the training loss, which started at 0.693, decreased to 0.150 after one epoch of training. This loss further reduced to 0.148 after 10 epochs and to 0.143 at the 15th epoch

The model (Footnote 1) was trained on the bwUniCluster (2.0), using four NVIDIA Tesla V100 GPUs with 32 GB VRAM each.

Fig. 3

Embedding model architecture. The figure illustrates our word2vec implementation, moving from left to right. The three inputs are represented by yellow boxes, which include the center words \(x_1\), context words \(x_2\), and the labels Y. The light green boxes represent the embedding layers, processed in batches with a batch size (bs) of 1024. The dark green boxes serve as lookup tables, from which we extract the final word vectors for each batch. The dot product between h and j in Eq. 5 is depicted by the dark green circle. The light blue box represents the dense layer, while the light blue circle signifies the sigmoid activation function as per Eq. 6. The binary cross-entropy loss, calculated as per Eq. 8, is represented by the dark blue box

SociRel-461 for domain-knowledge intrinsic evaluation

Word vector representations are usually evaluated using two different methods. Intrinsic evaluation takes human-annotated word similarities as a baseline and compares them to the calculated word vectors. In extrinsic evaluation, NLP downstream tasks such as text classification, part-of-speech (POS) tagging, or named entity recognition (NER) are performed and assessed for their performance. Intrinsic evaluation can test for thematic relations, taxonomic relations (similarity), or a mixture of thematic and taxonomic (similarity) relations. WordSim-353 [48] consists of two subsets (WS-sim, WS-rel) testing for taxonomic (similarity) and thematic relations. SimLex-999 [47], on the other hand, tests for purely taxonomic (similarity) relations. MTurk-287 [91] and MTurk-771 [92] are benchmarks for thematic relations [93]. Intrinsic evaluation is generally based on the hypothesis that the higher the correlation (or agreement) of the computed model with the word pair relations defined by human raters, the higher the model quality can be considered.

Intrinsic evaluation offers a key advantage over extrinsic evaluation because it is computationally less demanding, making it an ideal choice for hyperparameter tuning. It is a practical method for comparing different model architectures derived from the same corpus. However, its limitation is that it can only be reliably applied to domain non-specific language. Existing benchmarks prove unsuitable for our purposes because we assume that scientific language deviates significantly from non-specific language. To address this issue, we developed SociRel-461, a domain-specific intrinsic evaluation procedure based on the Open Education Sociology Dictionary [94].

The Open Education Sociology Dictionary (OESD), serving as the foundation for SociRel-461 (Footnote 2), comprises 999 core sociological terms. These terms were identified through a meta-analysis of introductory sociological texts [95,96,97,98,99,100,101,102,103,104], conducted by field experts. To draw comparisons between terms, various sociology dictionaries were utilized [105,106,107,108,109,110].

First, we extracted 999 core terms from the OESD and stored them in a dictionary d, composed of key-value pairs (k, v). For instance, the key k could be a term like “monarchy”, with corresponding values v such as “government”, “republic”, or “absolute monarchy”. In our final \(SociRel-461\) dictionary, we aim to avoid duplication of words in the keys and values, as well as between the keys and values. Moreover, the keys should always consist of unique terms. The keys k \(\in\) d represent the core terms, while the values v \(\in\) d constitute a list of terms semantically related to k \(\in\) d. We can then segregate the dictionary d into single-term keys (461) and multi-term keys (538), with the latter further split into single terms. From this second category, we retained only those multi-term keys (409) that consist of precisely two terms for subsequent use. This operationalization step is crucial as each additional term in a multi-term key amplifies the semantic uncertainty of the overall key. If a term from the union of all single keys and multi-keys appears in key k, it is incorporated into a new dictionary, \(d\_sm\). The keys in the \(d\_sm\) dictionary contain terms present in the single-word list s or the multi-word list m. This step allows us to extract keys appearing in the single or multi-term list and append them to the new \(d\_sm\) dictionary. For instance, consider the term ’monarchy’, which appears in the key ’monarchy’ and the key ’absolute monarchy’. Both keys and values are then combined into a single ’monarchy’ key-value pair in the \(d\_sm\) dictionary. Subsequently, duplicate key-value pairs in the \(d\_sm\) dictionary are joined by the key value.

In the final step, we identify key terms that appear in the corresponding value lists and eliminate these entries from the value list. For example, the former ’absolute monarchy’ key previously contained the word ’monarchy’ in its value list v. This procedure ensures that corresponding key-value pairs cannot take on the same term. Ultimately, we incorporate all key-value pairs into the \(d\_all\) dictionary. Refer to Appendix A for a detailed description of this process.
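A simplified sketch of these construction steps is shown below; the toy entries and the condensed merging rules are illustrative, and the authoritative description remains Appendix A:

```python
# Hedged, simplified sketch of the SociRel-461 construction; the OESD entries
# shown are toy examples and the full rules are documented in Appendix A.
from collections import defaultdict

d = {
    "monarchy": ["government", "republic"],
    "absolute monarchy": ["government", "monarchy"],
    # ... remaining OESD key-value pairs
}

single_keys = {k for k in d if len(k.split()) == 1}
multi_keys = {k for k in d if len(k.split()) == 2}  # keep only two-term keys

d_sm = defaultdict(set)
for key, values in d.items():
    if key not in single_keys and key not in multi_keys:
        continue  # drop keys with three or more terms
    # Fold multi-term keys into the single-term key they contain,
    # e.g. "absolute monarchy" is merged into the "monarchy" entry.
    for term in key.split():
        if term in single_keys:
            d_sm[term].update(values)
            break
    else:
        d_sm[key].update(values)

# Remove key terms that appear in their own value lists.
d_all = {k: sorted(v - {k}) for k, v in d_sm.items()}
print(d_all["monarchy"])  # ['government', 'republic']
```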

In the domain of knowledge-based intrinsic evaluation, we examine our assumption that word embeddings trained on our social scientific text corpus generate semantic word pair relations that align closely with the semantic relations defined by field experts. Our approach diverges from methods such as SimLex-999 and MTurk-287 in several ways. Firstly, the semantic relations in our method are defined by discipline experts, as opposed to multiple raters assigning a numerical value to the similarity of the terms. Secondly, our rating scale is different from conventional methods. While SimLex-999 employs a scale ranging from zero to ten and MTurk-287 utilizes a scale from zero to five (with ten and five indicating the highest semantic similarity between terms), our word pairs are binary scaled, indicating the presence or absence of a semantic relation. Lastly, we do not compute correlations or cosine distances between the rating value of individual terms.

Instead, we create an unweighted, undirected bipartite network \(G = (S, V, E)\) for all words \(S = \{s|s \in S\}\) in \(SociRel-461\) and the \(V = \{v|v \in V\}\) n semantically most similar terms in our trained model. All potential edges are defined as \(E = \{e_{(s,v)} | s \in S, v \in V\}\). Realized edges are defined as \(E_{(s,v)} = \{e_{(s,v)} | s \in S, v \in V, s = v\}\), i.e., they occur when a word appears both in \(SociRel-461\) and in the set of the n semantically most similar terms of our model.

Then we calculate the overlap coefficient,

$$\begin{aligned} \begin{aligned} overlap(S,V)= \frac{|S \cap V|}{min(|S|,|V|)} \end{aligned} \end{aligned}$$
(10)

between the node sets S and V.

The overlap coefficient offers an advantage over the Jaccard similarity by accounting for varying numbers of nodes in S and V. This is relevant to our study as the values of the key-value pairs of SociRel-461 possess differing quantities of values. This method allows us to measure the embedding quality of a domain-specific language model derived from the mean overlap coefficient between the SociRel-461 value terms (of varying number) relevant to the sociological discipline and the word similarities predicted by the model for a specific key term. We state that a higher overlap coefficient between the node types S and V for a given key term implies superior embedding quality, as it aligns more closely with the relational word similarities defined by field experts. We compute the overlap coefficient for all 461 terms encompassed in SociRel-461 and subsequently derive the average of these values. The resultant mean overlap coefficient constitutes our final SociRel-461 intrinsic evaluation score. This forms the basis for the following hypothesis.

Hypothesis 1: The domain-specific embedding model shows a higher average overlap coefficient between node sets S and V across all 461 SociRel-461 terms than a domain-unspecific embedding model, indicating a better match of the domain-specific embedding model with the relational word similarities defined by experts in the field.

Fig. 4

In the bipartite graph \(G = (S, V, E)\), ’discrimination’ (left) and ’democracy’ (right) are visualized. For ’discrimination’, twelve SociRel-461 terms (\(S = \{s|s \in S\}\)) are shown as green boxes, and the top ten semantically similar terms (\(V = \{v|v \in V\}\)) from our embedding model as blue circles. Similarly, for ’democracy’, nine SociRel-461 terms are shown as green boxes, and their top ten semantically similar terms as blue circles

In Fig. 4, we can see four realized edges \(E_{(s,v)}\) between the two vertex types S and V for the key term ’discrimination’ on the left side. Calculating the overlap coefficient according to Eq. 10 results in a value of 0.4. This relatively high overlap coefficient indicates that the vector representation of the term ’discrimination’ is close to the meaning defined by the field experts.

In the bipartite graph (on the right side) representing the term ’democracy’ shown in Fig. 4, there are only two realized edges \(E_{(s,v)}\), resulting in a significantly lower overlap coefficient of 0.1.

Comparing the two bipartite graphs in Fig. 4, we see that the number of elements in V (the n most similar terms in the vector space model) remains the same (n = 10), whereas the number of elements in S (the related words defined in SociRel-461) varies between the two examples.

We then calculated the mean overlap coefficient for different set sizes of V \((n=10, 50, 250)\) most similar terms for three embedding models.
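A minimal sketch of this computation is given below, assuming the trained vectors are available as gensim KeyedVectors; the file path and the example dictionary entries are hypothetical placeholders:

```python
# Hedged sketch of the SociRel-461 intrinsic score; the file path and the
# dictionary entries are hypothetical placeholders.
from gensim.models import KeyedVectors

kv = KeyedVectors.load("domain_specific_300d.kv")  # hypothetical path

socirel = {
    "discrimination": ["prejudice", "racism", "segregation"],
    # ... the remaining SociRel-461 key-value pairs
}

def overlap_coefficient(S, V):
    S, V = set(S), set(V)
    return len(S & V) / min(len(S), len(V))        # Eq. 10

def socirel_score(kv, socirel, n=10):
    scores = []
    for key, related in socirel.items():
        if key not in kv:
            continue
        top_n = [word for word, _ in kv.most_similar(key, topn=n)]
        scores.append(overlap_coefficient(related, top_n))
    return sum(scores) / max(len(scores), 1)       # mean overlap coefficient

print(socirel_score(kv, socirel, n=10))
```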

Table 1 SociRel-461 evaluation results

The upper section of Table 1 presents the word vector representation of the domain-specific (d.s.) scientific discourse, using 300 dimensions per token, trained over one epoch, with stop words and the rarest 1% of tokens removed. The computed mean overlap coefficient in Table 1 (for a specific set size of V) denotes the arithmetic mean across all individual overlap coefficients of the 461 sociological terms in SociRel-461. By calculating the mean overlap coefficient between all S and V sets with a size (n most similar words) of 10, we obtain a mean overlap coefficient of 3.51%. With a fixed set size of n = 50, this value increases to 6.31%. The model in which only 1% of the rarest words are removed performs marginally better on smaller set sizes than the one in which 10% of the rarest words are removed. However, starting from a set size of n = 220, the model that removes 10% of the rarest words outperforms the model removing only 1% of the rarest tokens. At first glance, the mean overlap coefficients for both models may appear small. However, when comparing our domain-specific embedding with a non-domain-specific word embedding, the latter only achieves a mean overlap coefficient of 0.78% for a set size of n = 10 and a mean overlap coefficient of 1.24% for a set size of n = 250. With a set size of n = 10, our model thus achieved a 4.5-fold higher overlap between V and S than the domain-unspecific model. For a large set size of n = 250, the match was almost 9.7 times higher. Our vector space model, trained on the socio-scientific discourse, has a significantly higher overlap coefficient than the domain-unspecific embeddings. As depicted in Table 1, the domain-unspecific model also employs 300 dimensions per token. This model was trained on multiple data sources, including WordNet 3.0 [111], OntoNotes 5 [112], ClearNLP, Wikipedia, OpenSubtitles [113], and the WMT News Crawl [113]. The domain-unspecific model will be referred to as “Wordnet 300” for brevity and clarity in this paper.

Based on the results presented in Table 1, we can conclude that the domain-specific embedding model shows a higher average overlap coefficient than a domain non-specific embedding model. This indicates a superior alignment with the relational word similarities defined by experts in the field. The domain-specific model that removes only 1% of the rarest words performs marginally better on smaller set sizes than the model that removes 10% of the rarest words; for larger set sizes, starting from n = 220, the latter outperforms it. The domain-specific model achieves a significantly higher mean overlap coefficient than a non-domain-specific word embedding, indicating a better match with SociRel-461: for a set size of n = 10, the overlap was 4.5-fold higher, and for a set size of n = 250, it was almost 9.7 times higher. Therefore, Hypothesis 1 is supported by the results. The domain-specific embedding model demonstrates a higher average overlap coefficient, indicating a better alignment with the relational word similarities defined by experts in the field.

Domain-knowledge extrinsic evaluation: concepts and model architecture

There are a variety of NLP downstream tasks that can be applied to evaluate word embedding models extrinsically.

Part-Of-Speech (POS) tagging classifies words into their respective grammatical categories, such as nouns, verbs, adjectives, adverbs, particles, and prepositions. This task requires a syntactically labeled dataset and word embeddings as input, as suggested by Liu et al. [114]. Li, Mao, and Wang [115] suggest that Long Short-Term Memory (LSTM) models yield superior results for POS tasks when combined with word embeddings. Utilizing labeled syntactic datasets as a baseline allows for evaluating the effectiveness of the pre-trained word embeddings in supporting the POS task [116]. In light of our focus on domain-specific transfer learning tasks, POS tagging is a less appropriate evaluation method, primarily because it does not inherently incorporate domain specificity.

In contrast, text classification emerges as a superior extrinsic evaluation strategy, primarily due to its adaptability to domain-specific tasks. Numerous scholarly publications have leveraged this method to evaluate domain-specific embeddings. One notable application of text classification is evaluating word vector representations trained on foreign languages, a domain where standardized intrinsic procedures are yet to be established. For instance, Hossain et al. [117] utilized text classification to evaluate embeddings derived from a Bengali corpus encompassing 180 million word tokens. The authors developed their classification model using Convolutional Neural Network (CNN) architecture.

Similarly, Avetisyan and Ghukasyan [118] used text classification to assess the quality of their word embeddings, which were trained on Armenian news articles. The same approach was used to evaluate the Brazilian Portuguese LIWC dictionary for Sentiment Analysis [119].

Another excellent example of the extrinsic evaluation was conducted by Elhadad, Gabay, and Netzer [120]. The authors used text classification to assess the quality of search ontologies within the film domain. Like in our approach, the researchers emphasized the prerequisite of domain-specific or expert knowledge for extrinsic evaluation.

The authors trained a text classification algorithm to automate the evaluation process based on the utilized search ontologies and text documents associated with the films. They then tested whether documents associated with the movies cluster according to the model dimensions trained on the ontologies: the more clearly the documents cluster according to an ontology, the higher that ontology's quality can be assumed to be.

Like Elhadad et al. [120], our goal was to develop an automated domain-specific strategy to evaluate the performance of multiple embedding models. For this purpose, we implemented a multi-label transfer learning text classification procedure to compare the performance of our domain-specific embeddings with pre-trained and baseline models by classifying 27,383 domain-specific scientific abstracts into the 10 classes described in Table 2.

The classification task can be performed without any previously trained language model, based on pure pattern recognition; this serves as our baseline. A second option is to use a language model trained on everyday natural language. Suppose our domain-specific language model better comprehends the language used in the abstracts, leading to a better classification result. In that case, we assume that, ceteris paribus, our trained domain-specific model represents the scientific language more precisely, which for us translates into a higher quality of the embeddings. Based on this idea, we formulate Hypothesis 2.

Hypothesis 2: The domain-specific model classifies the abstracts more clearly (higher accuracy and lower Hamming loss) than the domain-unspecific models under identical conditions.

The ten categories constituting our classification dataset were derived from the keywords assigned by the authors to their abstracts, considering only cases with fewer than 20 keywords per abstract. We then analyzed the co-occurrence of these keywords. To this end, we constructed a co-occurrence matrix over the 1,000 most frequently occurring keywords. Beforehand, the keywords underwent a pre-processing pipeline in which they were tokenized, converted to lowercase, and cleaned of misspellings and stop words. This yielded a co-occurrence matrix C with dimensions \((1000\times 1000)\).
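The following sketch illustrates how such a keyword co-occurrence matrix could be built. The input `abstract_keywords` (a list of already cleaned keyword lists, one per abstract) and the function name are hypothetical, and the counting scheme (one increment per keyword pair per abstract) is our assumption.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def cooccurrence_matrix(abstract_keywords, top_k: int = 1000):
    """Build a symmetric co-occurrence matrix over the top_k most frequent keywords."""
    freq = Counter(kw for kws in abstract_keywords for kw in kws)
    vocab = [kw for kw, _ in freq.most_common(top_k)]
    index = {kw: i for i, kw in enumerate(vocab)}

    C = np.zeros((top_k, top_k), dtype=np.int64)
    for kws in abstract_keywords:
        present = sorted({index[kw] for kw in kws if kw in index})
        for i, j in combinations(present, 2):
            C[i, j] += 1  # count each keyword pair once per abstract
            C[j, i] += 1
    return C, vocab
```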

In the subsequent phase, we applied Principal Component Analysis (PCA) to reduce the co-occurrence matrix C from 1,000 to 10 dimensions, corresponding to the ten principal components. We then extracted the terms with the highest loadings on each component. Since we used a fixed threshold for loading strength, the number of keywords per latent category varied. For instance, the terms 'women' and 'men' both showed substantial loadings on the latent factors 'gender' and 'family'. To ensure disjoint categories, we implemented a qualitative selection procedure that assigned keywords loading on multiple groups to a single category; this selection also involved removing keywords with ambiguous meanings. Ultimately, we generated a dataset split into a training (80%), test (10%), and validation set (10%), each comprising two columns: 'abstract' and 'sociological category'. Table 2 details the categories and keywords. The dataset is approximately balanced, with each category representing 8–12% of the data. Note that abstracts can be associated with multiple categories concurrently: 19,215 abstracts are linked to a single category, 6,927 to two categories, 1,144 to three, 90 to four, and 7 to five.
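A minimal sketch of this dimensionality-reduction step is given below, assuming scikit-learn's PCA. The loading threshold of 0.1 is purely illustrative (the exact value is not reported above), and the subsequent qualitative disambiguation step is not automated here.

```python
import numpy as np
from sklearn.decomposition import PCA


def candidate_categories(C: np.ndarray, keywords: list,
                         n_components: int = 10, threshold: float = 0.1):
    """Return, for each of the 10 principal components, the keywords
    whose absolute loading exceeds the (illustrative) threshold."""
    pca = PCA(n_components=n_components)
    pca.fit(C)                       # C has shape (1000, 1000)
    loadings = pca.components_       # shape (10, 1000)

    categories = []
    for component in loadings:
        idx = np.where(np.abs(component) >= threshold)[0]
        categories.append([keywords[i] for i in idx])
    return categories                # still requires manual disambiguation
```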

Table 2 Keywords assigned by the authors for each Class and Label of the 27,383 abstracts in the domain-knowledge extrinsic evaluation dataset
Fig. 5

Schematic illustration of our CNN architecture used for extrinsic evaluation. From left to right: (1) the embedding layer (with a dimensionality of \(\left(\frac{104{,}636}{\text{batch size}}, 149, 300\right)\)); (2) the convolution block containing six Conv1D layers with filter sizes (1024, 512, 256, 128, 64, 32), followed by (3) their respective max pooling (MaxPooling1D) layers and one global max pooling (GlobalMaxPooling1D) layer; and (4) a fully connected dense layer with 10 output neurons and a softmax activation function

After outlining the process of building the classification dataset, we proceed to provide a concise overview of the architecture of our Convolutional Neural Network (CNN) designed for multi-label text classification.

Figure 5 illustrates our implemented CNN architecture, consisting of an embedding layer, multiple convolution layers (Conv1D) with corresponding max pooling operations (MaxPooling1D), and a final global max pooling layer (GlobalMaxPooling1D) compressing the output to scalar values. Finally, the output of the convolution and pooling block was fed into a fully connected dense layer with 10 nodes and a sigmoid activation function.

For our embedding layer in Fig. 5, we used an input vocabulary of 104,636 tokens, an input sequence length of 149, and an output embedding vector of size 300, resulting in \(104{,}636 \times 300 = 31{,}390{,}800\) non-trainable parameters. The embedding layer with these non-trainable parameters reflects the full vocabulary of the text corpus. The parameters are non-trainable because they are the embedding models themselves, which we want to evaluate rather than change. The input is a word-in-text matrix \(X \in {\mathbb {R}}^{N \times d}\), where N denotes the word sequence length and d the embedding size.

The size and number of convolutional layers were determined by the aim to extract and hierarchically organize textual features at various levels of abstraction. The selection of filter sizes, kernel sizes, and strides for each convolutional block was based on a mix of theoretical considerations, as cited in [121,122,123], and practical experiments. We employed k-fold cross-validation on unseen data to determine filter sizes, feature maps, and pool sizes, as well as their configuration. Our aim was to fine-tune the model to achieve a balance between complexity and performance, ensuring the model remains easily reproducible.

The first (purple) 1D convolution block in Fig. 5 consists of 1,024 filters, a kernel size of 2, and a stride of 1. All further 1D convolution layers use the same kernel size and stride. Taking into account the embedding input dimension of 300, the number of filters, the kernel size, and the bias term of dimension 1,024, we obtain \(1{,}024 \cdot (2 \cdot 300) + 1{,}024 = 615{,}424\) trainable parameters for the first 1D convolution block, which is followed by a Max pooling 1D layer with a pool size of 3.

The second (blue) convolution layer described in Fig. 5 consists of 512 filters resulting in \(512 \cdot ( 2 \cdot 1,024) + 512 = 1,049,088\) trainable parameters, also followed by a Max pooling 1D layer with a pool size of 3.

The third (green) CNN1D Layer in Fig. 5 contains 256 filters resulting in another \(256 \cdot ( 2 \cdot 512) + 256 = 262,400\) trainable parameters. The third layer’s output is pooled by a Max pooling 1D layer with a pool size of 2.

The fourth 1D convolution layer (orange color in Fig. 5) consists of 128 filters resulting in \(128 \cdot ( 2 \cdot 256) + 128 = 65,664\) trainable parameters, followed by a Max pooling 1D layer with a pool size of 2.

In our model, we incorporated two additional 1D convolution layers with 64 and 32 filters, respectively, which are not depicted in the schematic representation in Fig. 5. This results in a model comprising 2,013,482 trainable parameters. The model was trained using a batch size of 1,024, and the ADAM optimizer [124] was employed for optimization. Our final (vanilla) CNN architecture does not incorporate any regularization techniques such as dropout layers or L2 regularization. Early stopping was only utilized for the retrofitted domain-specific model (Table 3).
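A condensed Keras sketch of this architecture is given below. The ReLU activations, the absence of pooling after the 64- and 32-filter layers, and the use of binary cross-entropy as a training surrogate (with Hamming loss evaluated separately on the validation set) are our assumptions where the text does not specify them; the frozen `embedding_matrix` of shape (104,636, 300) is a hypothetical input.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMB_DIM, NUM_CLASSES = 104_636, 149, 300, 10


def build_cnn(embedding_matrix):
    inputs = layers.Input(shape=(SEQ_LEN,))
    # Frozen embedding layer: 104,636 x 300 = 31,390,800 non-trainable parameters
    x = layers.Embedding(
        VOCAB_SIZE, EMB_DIM, trainable=False,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))(inputs)

    # Four Conv1D/MaxPooling1D blocks as described in the text (kernel size 2, stride 1)
    for filters, pool in [(1024, 3), (512, 3), (256, 2), (128, 2)]:
        x = layers.Conv1D(filters, kernel_size=2, strides=1, activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=pool)(x)

    # Two additional convolution layers not shown in Fig. 5
    x = layers.Conv1D(64, 2, activation="relu")(x)
    x = layers.Conv1D(32, 2, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)

    outputs = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)  # multi-label output
    model = models.Model(inputs, outputs)

    # ADAM as in the paper; binary cross-entropy is our assumed surrogate loss,
    # with the Hamming loss tracked separately on the validation data.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```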

As performance measures, we employed validation accuracy and validation Hamming loss. Validation accuracy quantifies the proportion of predicted values that match the actual true values. In contrast to training accuracy, validation accuracy refers to the held-out validation cases, i.e., the cases the model has not seen during training. The validation loss, in turn, quantifies the uncertainty of our predictions based on the extent to which the predictions for the validation dataset deviate from the true values. For our multi-label classification problem, we utilized the Hamming loss [125, 126] as the cost function, as described in Eq. 11.

$$\begin{aligned} \begin{aligned} HammingLoss(n,L)=\frac{1}{n} \sum _{i=1}^{n}\sum _{j=1}^{L}\frac{{\hat{y}}_{j}^{(i)} \oplus y_{j}^{(i)}}{L} \end{aligned} \end{aligned}$$
(11)

Since we have multiple labels per sample, we calculate the false positives and false negatives per validation sample using the logical \(\oplus\) (XOR) operator. In Eq. 11, n is the number of validation samples, L the number of labels, \(y_{j}^{(i)}\) the true label for the \(i\)-th validation sample and \(j\)-th label, and \({\hat{y}}_{j}^{(i)}\) the corresponding predicted label. Using the Hamming loss, every label is weighted equally.
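As a concrete reference, a direct NumPy transcription of Eq. 11 could look as follows, assuming binary indicator arrays of shape (n samples, L labels).

```python
import numpy as np


def hamming_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean fraction of labels per sample where prediction and truth disagree (XOR)."""
    xor = np.logical_xor(y_true.astype(bool), y_pred.astype(bool))
    # Averaging over all n * L entries equals (1/n) * sum_i sum_j (xor_ij / L)
    return float(xor.mean())
```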

Retrofitting

The method of retrofitting, a graph-based approach, refines pre-trained word vector embeddings by incorporating lexical knowledge [10]. Unlike other strategies for optimizing word embeddings, retrofitting is applied as a post-processing step, thus allowing it to be utilized on any pre-existing word vector model. This approach encourages semantically linked words to have similar vector representations, thereby enhancing the semantic richness of the embeddings. The effectiveness of retrofitting has been demonstrated through its superior performance over previous techniques that incorporate semantic lexicons into word vector training algorithms. These evaluations were conducted on various standard lexical semantic tasks across multiple languages, including English, German, French, and Spanish, outperforming other semantic enrichment models, such as those proposed by Yu and Dredze [127] and Xu et al. [128]. Additional analysis suggested that as the dimensionality of word vectors increases, allowing higher orders of semantic information to be captured, the benefits of retrofitting may be less pronounced. Overall, however, the findings affirm the efficacy of retrofitting as a method for refining word vector representations using semantic lexicons, highlighting its broad applicability and superior performance [10].

In Fig. 6 we can’t see an undirected graph object \(G=(Q,E_{(q_i,q_j)})\) where the green \(Q = \{q_1,\ldots ,q_n\}\) vertices, representing terms in SociRel-461. The edges \(E_{(q_i,q_j)} = \{e_{(q_i,q_j)} | q_i \in Q, q_j \in Q, i \ne j\}\) denote the semantic relations in our lexicon. The trained word vectors are stored in the matrix \({\hat{Q}} = \{\hat{q_i},\ldots ,\hat{q_n}\}\), where every row \(\hat{q_i} \in \mathbb {R}^{300}\) will be matched to a \(q_i\) word in SociRel-461.

Faruqui et al. [10] defined an optimization problem in Eq. 12,

$$\begin{aligned} \begin{aligned} \Psi (Q)=\sum _{i=1}^{n}{[\alpha _{i}||q_i-\hat{q_i}||^2+ \sum _{(i,j)\in E}^{}{\beta _{ij}||q_i-q_j||^2]}} \end{aligned} \end{aligned}$$
(12)

where the Euclidean distance between vector pairs that are neighbors in SociRel-461 is minimized by setting the derivative \(\frac{\partial \Psi (Q)}{\partial q_i}\) to zero, which yields the update rule in Eq. 13.

$$\begin{aligned} \begin{aligned} q_i = \frac{\sum _{j:(i,j)\in E}^{} \beta _{ij} q_j + \alpha _{i} {\hat{q}}_i}{\sum _{j:(i,j)\in E}^{}\beta _{ij}+\alpha _{i}} \end{aligned} \end{aligned}$$
(13)

If two words \(q_i\) and \(q_j\) are close to each other in the SociRel-461 dictionary, the Euclidean distance between their computed word vectors is reduced, so their relation is interpreted as semantically more similar. The variables \(\alpha\) and \(\beta\) control the degree of that adaptation.

The authors [10] have published an implementation of the algorithm, which we used in this contribution. The optimization procedure is highly efficient, even for \({\hat{q}}_i \in \mathbb {R}^{300}\) and \(i > 100{,}000\). We retrofitted both our self-trained domain-specific vectors from the field of sociology and the set of domain-unspecific vectors.
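For illustration, a simplified version of the retrofitting update in Eq. 13 is sketched below. The choice of \(\alpha_i = 1\), \(\beta_{ij} = 1/\text{degree}(i)\), and 10 iterations mirrors the defaults reported by Faruqui et al. [10]; `vectors` (word to pre-trained NumPy vector, i.e., \(\hat{q}\)) and `lexicon` (SociRel-461 word to related words) are hypothetical input structures, not the published implementation itself.

```python
import numpy as np


def retrofit(vectors: dict, lexicon: dict, iterations: int = 10) -> dict:
    """Iteratively pull each lexicon word towards its SociRel-461 neighbours (Eq. 13)."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            alpha = 1.0                 # weight of the original vector q_hat_i
            beta = 1.0 / len(nbrs)      # beta_ij = 1 / degree(i)
            # Eq. 13: weighted average of neighbour vectors and the original vector
            numerator = sum(beta * new_vecs[n] for n in nbrs) + alpha * vectors[word]
            denominator = beta * len(nbrs) + alpha
            new_vecs[word] = numerator / denominator
    return new_vecs
```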

Fig. 6

Word graph of closely related words. The blue \(\hat{q}\)-nodes are observed word vectors trained on our sociological text corpus. Green square \(q\)-nodes are lexical items inferred from SociRel-461

This leads us to two further hypotheses, in which we test whether retrofitting improves the performance of the domain-specific and the domain-unspecific models, respectively.

Hypothesis 3: The retrofitted domain-specific model classifies the abstracts more clearly (higher accuracy and lower Hamming loss) than the non-retrofitted domain-specific model under identical conditions, since retrofitting the already domain-specific model with further domain knowledge enhances the classification ability of the overall model.

Hypothesis 4: The retrofitted domain-unspecific model classifies the abstracts more clearly (higher accuracy and lower Hamming loss) than the non-retrofitted domain-unspecific model under identical conditions, since retrofitting the everyday language model with domain knowledge enhances the classification ability of the overall model.

Results

In this research project, we developed both intrinsic and extrinsic evaluation methods for domain-specific word embeddings and further optimized our domain-specific and domain-unspecific models through subsequent retrofitting.

Given its cost-effectiveness in computational time, the intrinsic approach proves particularly advantageous for the hyper-parameter tuning of embedding models. In contrast, the extrinsic approach, while computationally more demanding, allows trained language models to be tested within real-world applications. Given that intrinsic tests for semantic similarity, such as Wordsim-353, rely on domain-unspecific ratings for their ground truth, we needed to develop our own test for thematic similarity derived from domain-specific literature. The word relations tested in our approach are specifically tailored to the discipline, leading us to assert that a high degree of agreement (as indicated by a high overlap coefficient) between the trained model and SociRel-461 signifies a good fit of the model to the domain-specific language. SociRel-461, based on the Open Education Sociology Dictionary [94], encompasses 461 key sociological concepts and their most closely related sets of terms. Experts in the field defined the ground truth within SociRel-461, which indicates a high quality of the content of the selected key-value pairs.

In the scope of the first research question (How does our domain-specific model, trained on “soft” science vocabulary, perform against a domain non-specific model in the scope of our intrinsic and extrinsic evaluation?), we tested the domain-specific vector models against a state-of-the-art, freely available embedding model of everyday language, both intrinsically and extrinsically.

To test Hypothesis 1 (The domain-specific embedding model shows a higher average overlap coefficient between node sets S and V across all 461 SociRel-461 terms than a domain-unspecific embedding model, indicating a better match of the domain-specific embedding model with the relational word similarities defined by experts in the field.), we calculated the mean overlap coefficient for domain-specific and domain-unspecific embedding models.

The mean overlap coefficients, calculated across all 461 bipartite graphs \(G = (S, V, E)\) with a node set V size of the n-most similar words, are shown in Table 1. As previously outlined, for our domain-specific models with only 1% of the rarest tokens removed, we observe a mean overlap value of 3.51% for a set size (n-most similar words) of \(n=10\), a mean overlap value of 6.31% for \(n=50\), and a mean overlap value of 12.12% for \(n=250\). Comparable values are achieved for the domain-specific model with 10% of the rarest tokens removed.

While these may initially appear to be modest overlap values, it is essential to note that the domain-specific model, with only 1% of the rarest terms removed, encompasses a vocabulary of approximately 100,000 unique word tokens. To attain an overlap coefficient of at least 40% for \(n=10\) tokens in a random experiment, we would need to draw four out of the ten most semantically similar words for a single SociRel-461 keyword (for instance, the term 'discrimination') from a vocabulary comprising 100,000 words. The probability of this can be computed with the hypergeometric distribution, \(P(X = k) = \frac{{C(K, k) \cdot C(N-K, n-k)}}{{C(N, n)}}\). In our specific scenario, \(P(X \ge 4) = \sum _{k=4}^{10} \frac{{C(10, k) \cdot C(100000-10, 10-k)}}{{C(100000, 10)}}\), which yields a probability of approximately \(1.06 \times 10^{-14}\). Consequently, an overlap of 40%, as exemplified in Fig. 4 for the term 'discrimination', is a very good result. Upon comparing the domain-specific model with the domain-unspecific model in Table 1, it is evident that the domain-specific model, for a set size of \(n = 10\), achieved an overlap that is 4.5 times higher. For the larger set size of \(n = 250\), the overlap was almost 9.7 times higher, further emphasizing the superiority of the domain-specific model. Hypothesis 1 can therefore be accepted: the domain-specific model has a substantially higher mean overlap coefficient across all node set sizes than the domain-unspecific model.
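The probability quoted above can be reproduced with SciPy's hypergeometric distribution, as in the short sketch below (the variable names are ours).

```python
from scipy.stats import hypergeom

N_VOCAB = 100_000   # vocabulary size of the domain-specific model
K_TARGET = 10       # expert-defined "correct" neighbours for one keyword
N_DRAWS = 10        # randomly drawn candidate neighbours

# P(X >= 4): at least 4 of the 10 random draws hit the 10 target words
p = hypergeom.sf(3, N_VOCAB, K_TARGET, N_DRAWS)
print(f"P(X >= 4) = {p:.2e}")   # approx. 1.06e-14
```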

In addressing the second component of our first research question, we aimed to validate the results of our intrinsic evaluation externally. This would suggest that word embeddings with a higher average overlap coefficient would yield superior validation accuracy and lower validation loss values in an extrinsic evaluation than vector models with a lower average overlap coefficient.

Fig. 7

Illustration of the validation accuracy for each epoch across five distinct models. This figure graphically illustrates the change of accuracy during the training phase for the validation dataset across several models: the baseline model, denoted as Without transfer learning, the domain non-specific model, referred to as Wordnet 300, the domain-specific model, designated as Skip-Gram 300, and the retrofitted models, namely Skip-Gram 300 RetroFitted and Wordnet 300 RetroFitted

The green curve delineates the baseline accuracy that the Convolutional Neural Network (CNN) Model (designated as Without transfer learning in Fig. 7) can achieve. We employ the term transfer learning to denote using pre-trained machine learning models (such as word embeddings) as input in subsequent models. As transfer learning input, we incorporated seven previously developed models detailed in Table 3.

Table 3 Evaluation metrics for domain specific multi-label text classification validation set

As depicted in Table 3 (1st row), the baseline model represents a pure text classification task devoid of any transfer learning embedding model. Our second model (Table 3, 2nd row) utilizes a pre-trained domain non-specific word embedding, which was trained on data from WordNet 3.0 [111], OntoNotes 5 [112], ClearNLP, Wikipedia, OpenSubtitles [113], and the WMT News Crawl [113]. The third model under evaluation (Table 3, 3rd row) is our domain-specific language model, which underwent training for one epoch and had 1% of the rarest tokens removed. The fourth and fifth models (Table 3, 4th and 5th rows) are identical to the third model but underwent an extended training period (10 and 16 epochs, respectively). Model six (Table 3, 6th row) was trained for one epoch and had 10% of the rarest tokens removed, resulting in a significantly smaller model. The final two models (Table 3, 7th and 8th rows) underwent a retrofitting process.

Figure 8 outlines the validation Hamming losses across 15 training epochs. The baseline model underwent a training period of 15 epochs, resulting in a validation accuracy of 0.421 and a validation Hamming loss of 0.111 (Table 3). Notably, the loss value increased after the fourth epoch, potentially indicating overfitting. The baseline model shows the most pronounced increase in loss value with extended training, particularly when compared with models incorporating word embeddings as transfer learning input. The domain-unspecific model, which we refer to as Wordnet 300 in Fig. 7, achieved a maximum validation accuracy of 0.581 and a validation Hamming loss of 0.087 (Table 3). This performance surpasses the baseline model, which does not incorporate word embeddings as transfer learning input. The large domain-specific model, Skip-Gram 300 (with only 1% of the rarest words removed and trained for 1 epoch), reached an accuracy of 0.691 and a validation Hamming loss of 0.068 (Table 3). Longer training of the word embeddings for 10 and 16 epochs resulted in a slight decrease in validation accuracy and an increased validation loss. The smaller domain-specific model, Skip-Gram 300 (with 10% of the rarest words removed and trained for 1 epoch), underperformed compared to the larger model (with 1 epoch of training), as outlined in Table 3.

Fig. 8

Illustration of the validation Hamming loss for each epoch across five distinct models. This figure illustrates the change of Hamming loss during the training phase for the validation dataset across several models: the baseline model, denoted as Without transfer learning, the domain non-specific model, referred to as Wordnet 300, the domain-specific model, designated as Skip-Gram 300, and the retrofitted models, namely Skip-Gram 300 RetroFitted and Wordnet 300 RetroFitted

Additionally, it is noteworthy that the dark blue and green curves, representing the domain-unspecific (Wordnet 300) and the baseline model (Without transfer learning) respectively (Figs. 7 and  8), exhibit a more gradual increase in validation accuracy and a more subtle decrease in validation loss compared to their domain-specific counterparts. Moreover, their loss values increased significantly over time (from Epoch 4 to 15) relative to the domain-specific model.

The red curve, indicative of the best performance across all models, exhibits no tangible overfitting. The results of the retrofitted models are also intriguing. Retrofitting the domain-specific model (orange curve in Figs. 7 and 8) slightly deteriorates the performance of the multi-label classification, with the validation Hamming loss increasing from 0.065 to 0.068 and the validation accuracy decreasing from 0.691 to 0.682. Despite these minor differences, retrofitting did not enhance the performance of the domain-specific model. Conversely, the performance of the domain non-specific model (Wordnet 300 RetroFitted) was substantially improved by retrofitting with SociRel-461 domain knowledge. However, compared to the purely domain-specific model, the retrofitted domain-unspecific model (light blue curve in Fig. 8) tends to exhibit increasing loss values with extended training duration.

What are the implications of these findings for our second research question: “Does the application of the retrofitting post-processing method [10], utilizing the SociRel-461 domain-knowledge dictionary, enhance the performance of domain-specific word embeddings in machine learning downstream tasks?”

Concerning Hypothesis 3 (“The retrofitted domain-specific model classifies the abstracts more accurately (higher accuracy and lower Hamming loss) compared to the non-retrofitted domain-specific models under identical conditions, as retrofitting the already domain-specific model with additional domain knowledge enhances the classification ability of the overall model.”), we aimed to ascertain whether a further retrofitted domain-specific model would lead to a more precise classification of the abstracts according to our defined sociological categories. This hypothesis cannot be confirmed because the validation loss marginally increases after retrofitting the domain-specific model with SociRel-461 domain knowledge. It is evident that our domain-specific model (represented by the red curve in Figs. 7 and 8) could not be further enhanced by retrofitting (as indicated by the orange curve model in Figs. 7 and 8).

Interestingly, the situation differs for the domain-unspecific model, which leads us directly to the third research question: “To what extent can the SociRel-461 domain knowledge dictionary enhance the performance of domain non-specific word embeddings in machine learning downstream tasks when used for retrofitting?” Corresponding to Hypothesis 4, we aimed to determine whether the retrofitted domain-unspecific model classifies the abstracts more accurately (higher accuracy and lower Hamming loss) than the non-retrofitted domain-unspecific model under identical conditions, since retrofitting the general language model with domain knowledge enhances the classification ability of the overall model. Indeed, the retrofitted Wordnet 300 model (light blue curve in Figs. 7 and 8) achieved considerably higher accuracy and considerably lower validation Hamming loss than the non-retrofitted domain-unspecific model. The performance improvement in the classification task achieved by retrofitting the domain-unspecific model is considerably more substantial than, for instance, the minimal performance difference observed between the domain-specific and retrofitted domain-specific models.

Discussion

Domain-specific word embeddings can be used in various applications in social sciences and sociology, with ongoing projects already benefiting from our models. By deriving domain-specific semantic knowledge graphs from these embeddings, we have been able to map the intricate semantic networks within different disciplines. These knowledge graphs, with their inherent graph properties, allow us to analyze structural features and network measures such as density, among others. Employing social network analysis measures enables the examination of paths within these semantic graphs, shedding light on the interconnectedness of various areas within disciplines. Utilizing community detection algorithms reveals clusters of tightly knit concepts, which may denote distinct subfields or ideological schools within the social sciences. Influencer detection, metaphorically speaking, identifies nodes (topics) with substantial connectivity, pinpointing influential concepts or theories pivotal to shaping the discipline. Additionally, these knowledge maps facilitate gap identification by highlighting under-researched areas or topics with scant interconnections within the broader disciplinary framework, thus indicating potential avenues for groundbreaking research.

A practical application in sociology focuses on examining the interconnectivity of different dimensions of inequality, such as income inequality and technological inequality. Domain-specific embeddings enable this analysis by identifying the shortest paths within these semantic subgraphs, thereby elucidating the relationships between various areas of study. Another intriguing application involves constructing base vectors from words with contrasting meanings, such as poverty and wealth, and then calculating the angles between these vectors and different socio-structural elements, offering insights into the interplay between economic statuses and social structural components. Furthermore, the evolution of domain-specific word embeddings over multiple time points presents an opportunity to examine how the language of a discipline changes over time. This temporal analysis, particularly relevant to computational linguistics, can identify emerging key terms within a discipline's subfields or uncover interdisciplinary links.

An additional, more apparent use case lies in transfer learning for machine learning tasks. By utilizing domain-specific embeddings, we can enhance classification systems. This approach has been demonstrably beneficial for various text classification tasks, as evidenced by our findings presented in Sect. Results, Table 3.

We assume that retrofitting an embedding model with further domain knowledge yields diminishing returns in validation accuracy and Hamming loss when that model is already trained on a domain-specific text corpus and has efficiently captured the crucial characteristics and nuances of that domain.

Further addition of domain knowledge might not substantially enhance the model's performance due to its existing optimization for the domain, offering limited avenues for notable improvement. In contrast, domain-unspecific models, pre-trained on extensive, varied datasets not tailored to specific domain nuances, can benefit considerably from retrofitting with domain-specific knowledge, as shown in Table 3. Adding this knowledge allows them to fine-tune their representation of the target domain, leading to notable improvements in validation accuracy and validation loss.

However, the additional benefit of enriching domain-specific models with further specialized knowledge is likely marginal. These models begin with rich domain-relevant information, and retrofitting more domain-specific insights may not markedly enhance their performance. For models already specialized in a particular domain, retrofitting could even introduce an overfitting risk: adding more domain-specific details might constrain the model's generalizability, reducing its efficacy on unseen data. In other words, the embeddings have already encapsulated the domain's semantic space as effectively as possible, and additional information may not alter the semantic representation in a way that improves performance.

We further assume that adding further domain-specific knowledge can introduce noise. As a result of overfitting, subtleties and nuances may be given too much weight, which can reduce the model's ability to generalize or to focus on the features most relevant to its tasks.

Outlook

In this study, we trained a vector space model for a soft scientific discipline and developed an intrinsic and extrinsic evaluation strategy tailored to the social science scientific language. Additionally, we have enhanced the trained models using the retrofitting post-processing method. We introduced SociRel-461, a domain-knowledge dictionary for social sciences, thereby establishing an effective evaluation method for domain-specific word embeddings. This technique facilitated the optimization of pre-trained domain-unspecific word embeddings for domain-specific tasks.

SociRel-461 has two primary applications. Firstly, it can optimize pre-trained domain-unspecific word embeddings using the domain knowledge encapsulated in SociRel-461 during the retrofitting process. In this scenario, the word pairs that the retrofitting algorithm optimizes are those defined by domain experts. This approach enables the domain-specific refinement of domain-unspecific vector models. Secondly, SociRel-461 can be employed in K-fold cross-validation (CV) as a quality metric for domain-specific embeddings trained on social sciences literature. This allows for the testing of various combinations of hyperparameter values, such as sliding window size, embedding dimension, and negative sampling ratio.
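As an illustration of the second use case, the sketch below scores a small gensim Word2Vec hyper-parameter grid with the mean overlap against SociRel-461, reusing the `mean_overlap` helper from the intrinsic-evaluation sketch above. The grid values and `sentences` (a tokenized training corpus) are illustrative, and the K-fold splitting mentioned in the text is omitted for brevity.

```python
from itertools import product

from gensim.models import Word2Vec


def grid_search(sentences, socirel):
    """Score Skip-Gram hyper-parameter combinations by mean overlap with SociRel-461."""
    results = {}
    for window, dim, negative in product([5, 10], [100, 300], [5, 15]):
        model = Word2Vec(sentences, vector_size=dim, window=window,
                         negative=negative, sg=1, epochs=1, workers=4)
        results[(window, dim, negative)] = mean_overlap(model.wv, socirel, n=10)
    best = max(results, key=results.get)
    return best, results
```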

Our extrinsic evaluation results demonstrated that domain-specific word embeddings yield higher accuracy and lower loss values than domain-unspecific word embeddings. While retrofitting did not enhance the performance of our domain-specific model, it significantly improved the performance of the domain-unspecific model, underscoring the potential of retrofitting for transferring domain knowledge to domain-unspecific embeddings.

However, our approach has limitations. In this study, we excluded all forms of graphics and tables from our text corpus, which served as the training data for our domain-specific embeddings. These elements are also part of the text corpus and could contain important information. In-text citations and all meta-data information were also removed from the text corpus. Future work could consider incorporating citation relationships between research papers to enhance embedding models. Another potential limitation is the time span of the text corpus: since it spans more than 20 years, the meanings of words may have evolved over that period. The disadvantage of shorter periods is that fewer data would be available for training the word embeddings.

Future research could explore several aspects to further the domain-specific evaluation and optimization of word embeddings. One potential path is to extend the concept of SociRel-461 to other scientific disciplines, such as psychology, thereby enabling more accurate evaluation and enhancement of word embeddings in those fields. Another approach could involve applying our findings to the fine-tuning and feature-based transfer-learning approaches for BERT models. Feature-based approaches primarily train the last layers of the BERT models while leaving the preceding layers unchanged. This allows the model to adapt to the task via the final classification layers while retaining the pre-trained knowledge in the preceding layers. Conversely, fine-tuning involves training the entire BERT model on a new task, which often requires a substantial amount of labeled data. Although this strategy may enhance performance on the new task, it is typically more computationally intensive and data-demanding. Given our results and the labeled data produced in this study, future projects should consider both strategies.