Introduction

Growth of science, growth of topical difference identification issues?

Classification systems of scientific literature play a central role in bibliometrics (Glänzel and Schubert 2003) and will become more and more important with the exponentially growing amount of scientific literature. From World War II to the early 2000s, the stock of scientific literature is estimated to have doubled about every 9 years (Bornmann and Mutz 2015) and in 2009 amounted to over 50 million publications (Jinha 2010). These growth rates and underlying numbers raise concerns that the large current and future stock of knowledge will become more and more difficult to structure for single scientists (Landhuis 2016) and established databases (Larsen and Von Ins 2010). Traditional classification systems rely on keyword assignments, expert-based classification of subjects, and forward and backward citations to embed a publication in the network of knowledge flows in scientific literature (De Bellis 2009). These methods include high levels of complexity reduction and therefore a loss of knowledge in the content of the scientific publications. Practically, subtle but often decisive differences between two papers on the same topic can therefore hardly be addressed without having expert-level knowledge in the respective scientific field. In the same manner, topical overlaps between loosely related papers cannot be detected without having expert knowledge in both papers’ fields. The addition of more and more papers will eventually constrain the ability of experts to detect differences and similarities between papers. The large-scale quantification and detection of thematic differences in research topics is therefore an open problem in scientometric research. Advances in machine learning, especially in the statistical analysis of large text collections, alleviate these issues under certain circumstances. In this way precise difference detection between scientific texts can be feasible without having deep knowledge in the respective field.

The case of scientific reorientation in East and West Germany

In this paper we therefore develop and apply such a machine learning approach to difference detection based on the case study of dissertation titles written at East and West German universities in economics and business administration and chemistry before and after German reunification. German reunification is especially suited for investigating differences in research topics because the transition of the political system in East Germany went hand in hand with the transition of the scientific system. German reunification led to the dismantling in East Germany of a large number of chairs, institutes and research organizations, as well as a broad institutional restructuring in academia. Reasons included political motives, but in several instances also a mismatch between what had been researched under the old (socialist) system and what was considered interesting in the new one. This change affected social sciences more severely than natural sciences and therefore provided two different structures to investigate thematic differences and topical reorientation. In these two structures, motives and incentives for individual scientists in the two disciplines and parts of Germany to change research topics differed substantially and may have manifested in minor and major thematic differences. The chapter “Historical background” will therefore elaborate on the disciplinary and general historical circumstances before and after the reunification.

Dissertation as a data source

Journal publications and their linked indicators, such as citations, are the main subject of investigation in scientometric research and have contributed to substantial advances in the field (e.g., Garfield 1972; Hirsch 2005). However, under certain historical, institutional and disciplinary circumstances, such as in our case, journal articles are not the best means of inquiry. We therefore use dissertation titles as an alternative source of information to identify and track the differences in the two disciplines in Germany before and after reunification. Dissertation titles offer several potential advantages for our approach and are, despite limited use in scientometrics (Morichika and Shibayama 2016), amply available in Germany. This is because every doctoral student is mandated to send in a copy of his or her dissertation to the German National Library (Deutsche Nationalbibliothek). The German National Library archives the dissertation and stores some basic author and dissertation information in its electronic catalogue. We have access to this catalogue, which provides us with an almost complete list of dissertations that have been submitted in both parts of Germany since 1970. Thus, we have a good picture of the thematic landscape during our period of investigation in Germany. Our work is based on a number of presumptions: First, in Germany the doctoral advisor (often dubbed the “Doktorvater”) has a strong influence on the doctoral student and their choice of research topic. Moreover, the advisor is usually required to hold a chair at a university, as only chair holders are entitled to award PhDs. Therefore, the dissertation topics most likely represent the research topics present at a chair. Second, the title of a dissertation represents its content in a very condensed form. Together, these assumptions lead to the conjecture that the research focus of a chair is reflected in the titles of dissertations submitted at the university with which the chair holder is affiliated. This allows us to draw conclusions on the general thematic landscape of university research in Germany during our period of investigation.

A structural topic model approach to differences in dissertation titles

Our main effort was in applying a probabilistic text model (“structural topic model”) to these dissertation titles, aggregating the outcomes and then incorporating them into a linear regression framework, which allows us to calculate the level of difference between dissertation titles by regional and temporal origin of the dissertation. In this way our approach demonstrates how to identify and track differences between scientific work on the level of individual researchers, but also larger entities of the scientific system, such as different scientific disciplines or parts of a country. In our case study, we find considerable differences between East and West German research topics in economics and business administration before reunification. After reunification, we observe a strong and rapid convergence. In chemistry there are few differences between East and West before reunification. Afterwards, the results suggest a moderate thematic convergence.

Historical background

The scientific system and doctoral education in the German Democratic Republic

Since the birth of the two Germanies in 1949, the intra-German relationship has been characterized by a competition of political (and economic) systems. Walter Ulbricht, prominent veteran socialist politician of the German Democratic Republic (GDR), was renowned for his saying “overtaking without catching up”. The early socialists strove to demonstrate the superiority of socialism over capitalism, with scientific and technological achievements playing a central role. Even the constitution of the GDR (§ 2, Abs. 1) claimed that the foremost aim of a socialist society was to increase the effectiveness of scientific and technological development and labour productivity (Volkskammer der DDR 1976). This orientation of scientific advancement towards aspects of productivity dated back at least to Lenin and had consequences for the academic landscape of the GDR. Industrial application of research findings was heavily emphasized. Basic research was carried out almost exclusively by universities, but free choice of the research subjects was increasingly restricted and almost non-existent beginning in the 1960s (Gruhn and Lauterbach 1977). PhD candidates had minimal freedom in choosing their research subjects. The case of Humboldt University in Berlin shows that roughly two-thirds of dissertation topics followed the five-year research plans of the government (Wollgast 2001). Furthermore, international contact was more or less limited to other socialist states and access to Western world academics and their publications was difficult to gain (Mann 1979). Limited financial resources made internationally competitive research impossible in the majority of scientific fields. However, the conditions of career advancement in academia closely resembled those in West Germany. The average student in the GDR had to complete a basic and an advanced (or specialized) part of their studies to earn a degree. Afterwards, a dissertation (Promotion A) had to be written to obtain the title “Dr.” in a scientific field. In contrast to the Federal Republic of Germany, the GDR had universal requirements for the award of a PhD degree, which included a fair amount of ideology (Deutsche Demokratische Republik 1968, §5, Abs. 1). PhD degrees could be earned through research studies (2–3 years long, similar to a graduate school), employment at a university chair (usually four years’ contract) or distinction in industrial and societal engagement (similar to an external PhD candidate) (Belitz-Demiriz et al. 1990; Guenther 1989).

Unlike in the GDR, the scientific system of West Germany during our period of investigation was (and still is) free of ideological constraints. The constitutional (basic law) “freedom of teaching and research” (§5, Abs. 3) guaranteed vast autonomy for university researchers. Regarding factors that could have implicitly constrained freedom of research in West Germany in the 1980s and early 1990s, Peisert and Framhein (1994) argue that, in the case of third party funding, there was no strong influence from semi-public and public institutions on research topic choices. The systems of doctoral education in East and West Germany closely resembled each other: in both countries, doctoral students were predominantly employed directly at the chairs, and graduate schools played a minor role. However, the level of involvement of ideology in doctoral education clearly distinguished the two.

The transition and political change in Germany in 1990 had a deep impact on academic institutions, most notably in scientific fields that were heavily affected by socialist ideology. The prime example is economics and business administration, which was almost completely dismantled and rebuilt from the ground up, often involving new personnel, structures and research agendas. Kolloch (2001) reports that by 1994, 90% of the holders of economics and business administration chairs at the biggest East German university (HU Berlin) had been replaced with West Germans.

In chemistry the historical preconditions were quite different. In the GDR, the discipline was considered to be a crucial scientific productive force that would directly and indirectly increase economic output. Chemistry and other natural sciences were therefore oriented to the requirements of the local industry (Meske 2004), which led to a much greater focus on applied research in East Germany. GDR policymakers, for example, built a technical college in the centre of the East German chemistry cluster Leuna-Buna-Bitterfeld. The GDR chemical industry and, in consequence, the discipline of chemistry was dependent on crude oil deliveries from the Soviet Union to produce precursors and final chemical products. The GDR, however, used the dominant share of crude oil deliveries from the Soviet Union to refine petrol, which was to a large extent exported in order to bring in much-needed hard, foreign currency. This petrol-focused production caused a shortage in the production of other products based on crude oil (e.g., rubber and plastic). East German chemistry therefore researched non-oil-based ways of producing such goods. Lignite was a viable alternative, since East Germany had large lignite resources and existing processing facilities dating back to World War II. For the scientific discipline of chemistry this lignite based “business model” of the GDR resulted in a strong emphasis on related research problems. Chemistry as a discipline was therefore politically determined, applied and focused foremost on the special demands of East German chemical industry. For West Germany we find no indication of any profound specialization or a general focus on applied topics in chemistry. This may be a consequence of the constitutional right of freedom in teaching and research and a conservative industrial policy.

Data

The two disciplines, economics and business administration and chemistry, and their historical background before and after German reunification are therefore suited for our analysis of identifying research topic differences. They provide two structures: for economics and business administration, a structure with substantial topical heterogeneity before and after reunification; and for chemistry, one with relative topical homogeneity. In the following section, we will describe the processing steps used to obtain the final dataset of dissertation titles (Rehs 2020a).

We use the online catalogue of the German National Library as the basis for our analysis. The catalogue lists the vast majority of PhD dissertations submitted at German universities, including those in the GDR. There are entries for approximately one million PhD dissertations, which are classified by subject. We use this classification to distinguish between economics and business administration and chemistry. Due to the peculiarities of German medical dissertations, we eliminate dissertations which are cross-listed in chemistry and medicine. Furthermore, we employ information on university location (cities, name of university or a combination of both) to separate East from West dissertations. We assume that reorientation of research topics after the reunification continued until 2010. To obtain a picture of the thematic landscape before reunification, we consider the years 1980–1989. The years 1990–1994 are eliminated from our data, since the replacement of East German chairs took several years and the number of observations from East German university dissertations dropped significantly during this time period.

In the next step, we concatenate every dissertation title and subtitle into one string and standardize this string. Our pre-processing includes standard text-mining methodology: transformation to lowercase, removal of punctuation, language detection and removal of non-German titles, stemming, n-gram detection and removal of very frequent words, rare words, stopwords and short titles. Different languages in a text collection can considerably distort the outcomes of the topic modelling algorithm presented below because of problems with token recognition. Although differently spelled words can have exactly the same meaning in two languages, text-mining algorithms treat them as statistically distinct tokens. Solutions based on translation cause more problems than they solve. Our approach is therefore to exclude all titles written in English. We are aware of the downsides of this procedure: we might miss some important dissertations that are addressed to an international audience, and dissertations written in German might also differ in quality. Nevertheless, as our language identification algorithm (Ooms 2018) shows, English titles account for only roughly 10% of the dissertations. Keeping this small number of English titles in the corpus would therefore distort the statistical inference based on topic modelling. All titles identified as neither German nor English are treated as German.
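
The first standardization steps can be illustrated with a minimal Python sketch. It is not the pipeline used for the dataset: the function name `standardise` and the tiny stopword list are hypothetical, and language detection and stemming are left out because they require external tools.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a full
# German stopword dictionary.
STOPWORDS = {"der", "die", "das", "und", "in", "von", "zur"}

def standardise(title, subtitle=""):
    """Join title and subtitle, lowercase, strip punctuation, drop stopwords."""
    text = f"{title} {subtitle}".lower()
    # Replace every character that is neither a word character nor whitespace.
    # In Python 3, \w is Unicode-aware, so German umlauts are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = standardise("Die Preisbildung in der DDR:", "Theorie und Praxis")
```

Here `tokens` holds the lowercase content words of the combined title and subtitle, which is the form assumed by the later processing steps.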

N-grams are used because some words naturally belong together, like “United” and “States”. To improve the performance of the topic model presented below, we want the algorithm to treat such words as one token. Bigrams are sequences of two such bound words and trigrams of three. In both corpora we count the most frequent bi- and trigrams. We assume that only the top bi- and trigrams add relevant context for the subsequent algorithm. For both economics and business administration and chemistry, we set the boundary for relevant n-grams at the top 1%. We proceed by searching for these n-grams in every string. If they occur, we add them to the string and remove the words that composed them.
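
A minimal sketch of this step for bigrams is given below. It assumes that "top 1%" refers to the share of distinct bigrams ranked by frequency, which the text does not state explicitly; the function names and the toy titles are hypothetical.

```python
from collections import Counter

def top_bigrams(titles, share=0.01):
    """Return the most frequent `share` of distinct bigrams (at least one)."""
    counts = Counter()
    for tokens in titles:
        counts.update(zip(tokens, tokens[1:]))
    keep = max(1, int(len(counts) * share))
    return {bigram for bigram, _ in counts.most_common(keep)}

def merge_bigrams(tokens, bigrams):
    """Replace each selected bigram by one joined token, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words that formed the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

titles = [
    ["united", "states", "trade", "policy"],
    ["monetary", "policy", "united", "states"],
    ["united", "states", "labour", "market"],
]
bigrams = top_bigrams(titles)
merged_titles = [merge_bigrams(t, bigrams) for t in titles]
```

In this toy corpus only the pair ("united", "states") is selected, so each title ends up with the single token `united_states` in place of the two separate words. Trigrams could be handled analogously.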

We remove very frequent and very rare words to reduce complexity and because they are of minor relevance for topic modelling. Very frequent words have the same properties as stopwords, but are not included in standard stopword dictionaries since they are dataset specific. They don’t add relevant context; rather, they are commonly used terms within a dataset and identically distributed across all documents (e.g., for dissertation titles, “investigation” or “method” may appear very frequently). We set the threshold for removal at the upper 0.1% limit of the most frequent words. The same holds for very rare words. Because of their low frequency, they don’t add context, and are removed if they appear fewer than 3 times in total.

Finally, we delete very short titles from our data set. Since topic modelling infers the topic distributions by drawing words from each title numerous times, titles consisting of only a few words can be problematic because there is less room for randomness in each title. We therefore exclude titles containing fewer than five words.
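
The vocabulary pruning and the short-title filter can be sketched as follows. This is an illustration under assumptions: the name `prune_corpus` is hypothetical, the 0.1% threshold is read as a share of the distinct vocabulary, and the length filter is applied after word removal. The defaults mirror the thresholds in the text, while the toy call uses looser values so that the effect is visible on four titles.

```python
from collections import Counter

def prune_corpus(titles, top_share=0.001, min_count=3, min_len=5):
    """Drop very frequent and very rare words, then drop short titles."""
    counts = Counter(tok for title in titles for tok in title)
    n_top = int(len(counts) * top_share)  # 0 for very small vocabularies
    too_frequent = {tok for tok, _ in counts.most_common(n_top)}
    keep = {tok for tok, c in counts.items()
            if c >= min_count and tok not in too_frequent}
    pruned = [[tok for tok in title if tok in keep] for title in titles]
    return [title for title in pruned if len(title) >= min_len]

titles = [
    ["study", "market", "price"],
    ["study", "market", "trade"],
    ["study", "price", "trade", "rare"],
    ["study", "rare2"],
]
# Looser thresholds than in the paper, chosen only for this toy example.
pruned_titles = prune_corpus(titles, top_share=0.2, min_count=2, min_len=2)
```

The word "study" is removed as dataset-specific and very frequent, the two singletons are removed as rare, and the last title disappears because nothing is left of it.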

Topic modelling in large-scale text analysis

The latent Dirichlet allocation

To address our research question we use topic modelling, which is a family of probabilistic methods for analysing large text collections. Topic modelling has found various applications in scientometrics, such as in investigating the topics that construct scientific publications (Blei and Lafferty 2007). Any topic modelling algorithm is, in general, a generative model of word counts. In our case that means we define a data-generating process for each dissertation title and then use the data to find the most likely values for the parameters within the model.

The most common topic modelling algorithm is the latent Dirichlet allocation (short: LDA, Blei 2012; Blei et al. 2003). In the LDA algorithm our dissertation titles are represented as mixtures of topics. In these mixtures, each word within a given dissertation title belongs to exactly one topic. Single dissertation titles can therefore be considered as vectors of topic proportions, which indicate the percentage of words belonging to each topic. In the following section we will describe the statistical methodology, following the notation and description of Roberts et al. (2014a, b, 2016).

The generative process in LDA starts by considering each dissertation title (index: Diss) as a distribution over topics (\({\theta }_{Diss}\)), which is drawn from a global prior distribution. In the next step, for each word in the dissertation title (indexed by \(n\)), the LDA algorithm draws a topic (\(z\)) for that word from a multinomial distribution based on its distribution over topics (\({z}_{Diss, n}\) ∼ Mult(\({\theta }_{Diss}\))). Depending on the topic selected, the observed word \({w}_{Diss, n}\) is drawn from a distribution over the vocabulary, \({w}_{Diss, n}\) ∼ Mult(\({\beta }_{{z}_{Diss, n}}\)), where \({\beta }_{k, v}\) is the probability of drawing the \(v\)th word in the vocabulary for topic \(k\).

A hypothetical pre-1990 East German title in economics might therefore be represented as a mixture over 10 topics. Topics are, again, a distribution over words that are more or less likely to be related to that topic (e.g., “Marx”, “worker”, “class” might each have high probability in the same topic). The LDA is completed by assuming a Dirichlet prior for the topic proportions such that \({\theta }_{Diss}\) ∼ Dirichlet(\(\alpha\)). However, there are disadvantages that come along with the application of LDA. The resulting posterior distributions can have many local modes. That means that different initializations can produce different solutions. In order to address this issue, we use the spectral initialization procedure described in Arora et al. (2013), which is also implemented in the R package on structural topic modelling (Roberts et al. 2014a).
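
The generative process described in this section can be simulated in a few lines of Python. The sketch only draws one synthetic title from given parameters; it does not perform inference. The two topics, the vocabulary and the function name `sample_title` are hypothetical.

```python
import random

def sample_title(alpha, beta, vocab, n_words, rng):
    """Draw one synthetic title from the LDA generative process."""
    # theta ~ Dirichlet(alpha), obtained by normalising independent Gamma draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # z ~ Mult(theta)
        w = rng.choices(vocab, weights=beta[z])[0]            # w ~ Mult(beta_z)
        topics.append(z)
        words.append(w)
    return theta, topics, words

vocab = ["marx", "worker", "class", "market", "price", "firm"]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "socialism" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "market" topic
]
rng = random.Random(1)
theta, topics, words = sample_title([0.5, 0.5], beta, vocab, 6, rng)
```

Each sampled word belongs to exactly one topic, and `theta` gives the expected share of words per topic for this title. Fitting the model reverses this process: the words are observed and `theta` and `beta` are estimated.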

Structural topic modelling

Structural topic modelling is an extension of the LDA process described above which allows covariates of interest (such as the temporal origin or university of the dissertation) to be included in the prior distributions for dissertation-topic proportions and topic-word distributions. Thus, the covariates offer a method of “structuring” the prior distributions in the topic model, including additional information in the statistical inference procedure. The topic prevalence (as described in the LDA section) can therefore be influenced by some set of covariates \(X\) through a standard regression model with covariates \(\theta\) ∼ LogisticNormal (\(X\gamma , \Sigma\)). In contrast to the described LDA algorithm, we abolish the assumption that topical prevalence (how much a topic is discussed by a covariate) is constant across all dissertation titles. This is a major improvement in comparison to LDA and allows the parameters that generated the dissertation title to be reconstructed more precisely.
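
How covariates shift the prior over topic proportions can be sketched as follows. The example assumes a diagonal \(\Sigma\) and fixes the last topic as reference category, which is a common identification convention and an assumption of this sketch. The covariate (an East dummy), the coefficient values and the function name `draw_theta` are hypothetical.

```python
import math
import random

def draw_theta(x, gamma, sigma, rng):
    """Draw theta ~ LogisticNormal(x * gamma, diag(sigma^2)) for one title.

    x: covariate values (length P); gamma: P rows of K-1 coefficients;
    sigma: K-1 standard deviations (diagonal covariance, an assumption).
    """
    k_minus_1 = len(gamma[0])
    mean = [sum(x[p] * gamma[p][k] for p in range(len(x)))
            for k in range(k_minus_1)]
    # The last topic serves as reference category with eta fixed at zero.
    eta = [rng.gauss(mean[k], sigma[k]) for k in range(k_minus_1)] + [0.0]
    top = max(eta)
    expo = [math.exp(e - top) for e in eta]  # numerically stable softmax
    norm = sum(expo)
    return [e / norm for e in expo]

gamma = [
    [0.0, 0.0],   # intercept
    [2.0, -1.0],  # hypothetical effect of an East dummy on topics 1 and 2
]
rng = random.Random(0)
theta_west = draw_theta([1.0, 0.0], gamma, [0.0, 0.0], rng)
theta_east = draw_theta([1.0, 1.0], gamma, [0.0, 0.0], rng)
```

With the noise switched off, the West title receives a uniform prior mean over the three topics, whereas the East dummy raises the expected share of the first topic. In LDA, by contrast, every title would share the same prior.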

We use university and year dummies of the dissertation as topical prevalence variables in our structural topic models on chemistry and economics and business administration. We argue that these variables are best suited to capture temporal and university level variation in dissertation titles and are different from the main independent variables in the regression framework to be presented. Year dummies as topic prevalence variables should capture trends and temporarily popular topics in the 25-year span of our investigation. For universities, irrespective of their East or West German background, we presume that there are regionally bound topics. This is because the chairs at universities might have inherent topics that are reflected in the dissertations they produce. Therefore, we include university dummies as the second set of topical prevalence variables in our topic model. In structural topic models, proportions (\(\theta\)) can also be correlated (see also Blei and Lafferty 2007); i.e., in a given dissertation title, the high proportion of a topic that is related to socialism might also increase the likelihood of high proportion of a related topic (e.g., a topic related to Leninism).

In our description of structural topic modelling, we stopped at the point where \(\theta\) can be influenced by some set of covariates \(X\) through a standard regression model with covariates \(\theta\) ∼ LogisticNormal (\(X\gamma , \Sigma\)). The next step in the structural topic model algorithm is described as: “For each word (\(w\)) in the response, a topic (\(z\)) is drawn from the response-specific distribution, and, depending on the topic, a word is chosen from a multinomial distribution over words parameterized by \(\beta\), which is formed by deviations from the baseline word frequencies (\(m\)) in log space (\({\beta }_{k}\) ∝ exp (\(m\) + \({K}_{k}\)))” (Roberts et al. 2014b, p. 4). This distribution can include a second set of covariates that models how word frequencies differ between values of that covariate. Within a “socialistic” topic, this allows GDR dissertations, as indicated by a variable, to use the word “Marx” more frequently than dissertations from West Germany (which might use “Engels” more often instead). Since the version of the R package we used (Roberts et al. 2014a) allows the inclusion of only one variable for such “topical content” and our approach would require several other variables, we do not include any such variable.

When it comes to finally fitting the structural topic model, the major problem is in the mathematically intractable posterior distribution. To solve such a problem, Roberts et al. (2014b, 2016) developed a method for approximate inference based on variational expectation–maximization algorithms (Blei et al. 2017; Dempster et al. 1977) that, upon convergence, give estimates of the model parameters. Convergence is achieved when the change in the approximate variational lower bound between the iterations becomes very small. We accordingly set the value for convergence to 1e−06.

In conclusion, there are two major improvements that structural topic modelling provides for our setting as compared to LDA. First, topics can be correlated, which much better reflects the “true” data-generating process behind dissertation titles and science in general. The second major improvement is that each dissertation title has its own prior distribution over topics defined by covariate \(X\), rather than sharing a global mean.

Topic model application and cosine similarity regression framework

In the next step, we estimate two separate structural topic models: one for economics and business administration and one for chemistry. For both we consider the whole period of investigation from 1980 to 1989 and 1995 to 2010. For each topic model we use 75% of the dissertations to estimate the model parameters. For the remaining 25%, we apply the topic models. This separation of training and test datasets is a standard procedure in machine learning and aims to detect overfitting of our models. Overfitting means that our topic model learns the data generating process of the underlying titles too well. In this way we lose model flexibility, which has negative impacts on the performance of the topic model on new, unseen dissertation titles. The final training and test sets in economics and business administration are a random sample of processed dissertation titles and include 6855 observations for the training and 1767 for the test set. In chemistry, sizes are 10,361 and 2580. East German test titles account for 317 dissertations in chemistry and 338 dissertations in economics and business administration (training and test set). In economics and business administration this broadly reflects the population size of East Germany (about 18% that of West Germany). In chemistry we find no explanation for the proportionally smaller number of dissertations in East Germany (9%).
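
A random 75/25 split of this kind can be sketched as follows. The function name `split_titles` and the seed are hypothetical, and the sketch makes no claim about how the split was drawn for the actual dataset.

```python
import random

def split_titles(titles, train_share=0.75, seed=42):
    """Shuffle a copy of the titles and cut it into training and test sets."""
    shuffled = list(titles)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

titles = [f"title_{i}" for i in range(100)]
train, test = split_titles(titles)
```

The model parameters would be estimated on `train` only, and the fitted model would then be applied to the unseen titles in `test`.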

When finally fitting our topic models, we arrive at 76 topics in chemistry and 69 in economics and business administration. One result of the two topic model applications is that we obtain a topic distribution for every title. Figure 1 illustrates the topic distribution of two titles in our topic model for economics and business administration. Figure 2 shows the top words with the highest \(\beta\) probability of two topics. We choose topics 11 and 40 in economics and show their yearly mean probability across all titles because they show how two probably very antagonistic topics change in prominence over time. While topic 11, which may indicate socialism, loses importance after 1990, topic 40, as a probable proponent of capitalism, on average gains importance. A list of words associated with other topics can be found in “Appendix”. Since every topic is a probability distribution over words, top words may provide some indication of the underlying subject. However, interpretation should be done very cautiously, since the most probable words only represent a small fraction of the probability distribution. Moreover, most probable words are not necessarily the most exclusive words to a topic.

Fig. 1 Two title-topic distributions of the economics and business administration topic model

Fig. 2 Topical prevalence of two economics and business administration topics

Figure 3 is similar to Fig. 2, but shows the distribution of the mean topic probability before and after the reunification for all topics. For economics and business, we can observe a high popularity of a small number of topics in East Germany before the reunification (such as topic 11). After reunification, high mean probabilities for single topics in one part of the country disappear. In direct comparison to economics and business, the mean probabilities for single topics in chemistry are small. However, there are still differences in some topics between East and West before the reunification. Remarkably, topics that weren’t popular before the reunification in one part of the country became popular after the reunification. The popularity of topic 71, for example, increased considerably after the reunification in East Germany.

Fig. 3 Mean topic prevalence before and after German reunification

In order to compare the retrieved topic distributions of the titles, we now use the cosine similarity measure, which has various applications in the comparison of topic model outcomes (see e.g., Ramage et al. 2010). The cosine similarity is the cosine of the angle between two vectors; for vectors with non-negative entries, such as topic proportions, it lies between zero and one, and values towards 1 indicate similarity. As topic proportions per dissertation title are vectors of the same length, the cosine similarity allows a comparison of the topic distribution between two documents. For the two exemplary dissertations we obtain a cosine similarity of 0.14.
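
As a small worked illustration, the cosine similarity of two topic-proportion vectors can be computed as below. The three-topic vectors are hypothetical and unrelated to the two exemplary dissertations.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

title_a = [0.7, 0.2, 0.1]  # mostly topic 1
title_b = [0.1, 0.2, 0.7]  # mostly topic 3
similarity = cosine_similarity(title_a, title_b)
```

For these two vectors the dot product is 0.18 and both norms equal the square root of 0.54, so the similarity is 0.18 / 0.54 = 1/3. Two titles with identical topic proportions would score 1, and two titles without any shared topic would score 0.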

In the next step, we calculate the cosine similarity between all topic-document distribution pairs (see dataset: Rehs 2020b). This means the topic distribution of title 1 is compared to title 2, title 3 and so on. We drop duplicate observations (e.g., when cosine similarity between 2 and 3 is the same as between 3 and 2). Since we know for every observation of the cosine similarity where both dissertation titles were written, we can employ this information in creating variables that can be attached to these similarity pairs (see Table 1 for an illustration of our dataset). We create a dummy diff_part that describes whether the two underlying dissertations for every similarity score are from different parts of Germany. The dummy variable post95 indicates whether a dissertation was written after 1995.

Table 1 Dataset structure.

Finally, we add university dummies to address differences in similarity scores arising at the university level. As the similarity score is calculated between two dissertations that were most often written at different universities, we consequently add dummies for both. The dummy sameuni indicates whether both titles in a pair are from the same university. In order to ease the interpretation of our dataset, we require both titles to be from the same year.
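
The construction of the pairwise dataset can be sketched as follows. The records, the university names and the function names are hypothetical, and the sketch assumes that the year 1995 itself counts as post-reunification, since the data resume in 1995.

```python
import math
from itertools import combinations

def cosine(p, q):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def build_pairs(dissertations):
    """One row per unordered pair of dissertations from the same year."""
    rows = []
    for a, b in combinations(dissertations, 2):  # (2, 3) appears, (3, 2) does not
        if a["year"] != b["year"]:
            continue  # both titles must be from the same year
        rows.append({
            "cosine": cosine(a["theta"], b["theta"]),
            "diff_part": int(a["part"] != b["part"]),
            "post95": int(a["year"] >= 1995),  # assumption: 1995 counts as "post"
            "sameuni": int(a["uni"] == b["uni"]),
        })
    return rows

dissertations = [
    {"part": "East", "uni": "Leipzig", "year": 1985, "theta": [0.7, 0.2, 0.1]},
    {"part": "West", "uni": "Köln", "year": 1985, "theta": [0.1, 0.2, 0.7]},
    {"part": "West", "uni": "Köln", "year": 1985, "theta": [0.1, 0.2, 0.7]},
    {"part": "East", "uni": "Leipzig", "year": 2000, "theta": [0.2, 0.3, 0.5]},
]
pairs = build_pairs(dissertations)
```

The four toy dissertations yield three pairs, all from 1985; the dissertation from 2000 has no partner from the same year and therefore does not enter the dataset.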

In the next step, we build different subsets of the data in order to address the peculiarities of our case study. For dissertations in chemistry, we build five data subsets: dissertations written before 1990 in both East and West Germany, dissertations written after 1990 in both Germanies, all dissertations written in West Germany and all dissertations written in East Germany for the time period studied. For economics and business administration, we proceed accordingly.

These subsets allow us to apply a linear regression framework, where the similarity score for each pair of dissertations is the dependent variable, and diff_part, post95, sameuni and the university dummies are the independent variables. This approach aims to aggregate the cosine similarities in order to demonstrate relationships between the underlying groups of dissertation titles from East and West Germany and different periods. The regression formula is given by (1).

$${Cosine}_{i,j}={\beta }_{0}+{\beta }_{1}{diffpart}_{i,j}+{\beta }_{2}{post95}_{i,j}+{\beta }_{3}\left({diffpart}_{i,j}\times {post95}_{i,j}\right)+{\beta }_{4}{sameuni}_{i,j}+{\beta }_{n}{X}_{i,k}+{\beta }_{n}{X}_{j,k}+{\varepsilon }_{i,j}.$$
(1)

\(j\) = dissertation 1 in pair, \(i\) = dissertation 2 in pair, \(k\)= university.
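
To illustrate how the interaction coefficient can be read, the sketch below considers a reduced version of (1) without sameuni and the university dummies. In that saturated two-dummy case the least-squares coefficients equal differences of group means, so no matrix algebra is needed. The similarity values and the function name `did_coefficients` are hypothetical.

```python
from statistics import mean

def did_coefficients(rows):
    """OLS coefficients of cosine ~ diff_part * post95 via the four cell means."""
    cell = {}
    for d in (0, 1):
        for p in (0, 1):
            cell[(d, p)] = mean(r["cosine"] for r in rows
                                if r["diff_part"] == d and r["post95"] == p)
    b0 = cell[(0, 0)]                       # same part, before
    b1 = cell[(1, 0)] - cell[(0, 0)]        # East-West gap before
    b2 = cell[(0, 1)] - cell[(0, 0)]        # time shift within the same part
    b3 = (cell[(1, 1)] - cell[(0, 1)]) - (cell[(1, 0)] - cell[(0, 0)])
    return b0, b1, b2, b3

# Hypothetical similarity scores, two pairs per cell.
rows = [
    {"diff_part": 0, "post95": 0, "cosine": 0.30},
    {"diff_part": 0, "post95": 0, "cosine": 0.34},
    {"diff_part": 1, "post95": 0, "cosine": 0.10},
    {"diff_part": 1, "post95": 0, "cosine": 0.14},
    {"diff_part": 0, "post95": 1, "cosine": 0.30},
    {"diff_part": 0, "post95": 1, "cosine": 0.30},
    {"diff_part": 1, "post95": 1, "cosine": 0.26},
    {"diff_part": 1, "post95": 1, "cosine": 0.30},
]
b0, b1, b2, b3 = did_coefficients(rows)
```

In this toy example the East-West gap is 0.20 before and 0.02 after, so the interaction coefficient is 0.18. A negative coefficient on diff_part together with a positive interaction therefore corresponds to initial dissimilarity followed by convergence.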

Table 2 and Fig. 4 show descriptive results for the cosine similarity by certain variables. In Fig. 4 we depict the mean similarity (with 95% conf. interval) of diff_part = 0 and diff_part = 1. The graph shows that the convergence in economics and business administration seems to have happened very quickly. In chemistry there was no convergence, as the average dissertation pair similarities by regional origins were never very different in our period of investigation.

Table 2 Cosine similarities by subgroups
Fig. 4 Yearly mean cosine similarity between topic distributions in dissertation pairs

Regarding the mean similarity of diff_part = 1 in Table 2, we observe in economics and business administration a significant increase from before to after 1990 and in chemistry a slight decrease. Chemistry topics were therefore, on average, more similar between East and West before the reunification than after the reunification. Nevertheless, the visual pattern of the mean by single years presented in Fig. 4 does not obviously support this finding. The results for sameuni in Table 2 also deliver interesting insights. Within a university, topics in both disciplines were considerably more similar than topics in different universities.

Table 3 aggregates the chemistry cosine similarities in a linear regression framework. The pre models in both tables (Tables 3 and 4) show the differences between East and West topics before reunification. Both pre models in Table 3 arrive at significantly negative coefficients of the variable diff_part. This indicates lower cosine similarity between two chemistry dissertations written in different parts of Germany. Full period model 1 in Table 3 shows the differences between East and West Germany after reunification. The interaction of diff_part and post95 in Table 3 is positive and statistically significant. This indicates increasing similarity between East and West German chemistry dissertations after the reunification. However, the effect diminishes after including university dummies and the variable sameuni, as shown in full period model 2. The last approach in chemistry concerns the thematic change within East or West German dissertations and is shown in models 5, 6, 7 and 8. The results suggest that there is no thematic change from before to after the reunification in East German chemistry dissertations. For West German chemistry dissertations, surprisingly, there is a negative change. This means that West German dissertations became more dissimilar to one another, while East German ones did not.
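The reading of the interaction term can be made explicit. Writing \({\beta}_{1}\) for the coefficient on diff_part and \({\beta}_{3}\) for the coefficient on the interaction of diff_part and post95, and holding sameuni and the university dummies fixed, the expected similarity gap between East-West pairs and same-part pairs is

$$E\left[Cosine \mid diffpart = 1, post95 = 0\right] - E\left[Cosine \mid diffpart = 0, post95 = 0\right] = {\beta}_{1},$$

$$E\left[Cosine \mid diffpart = 1, post95 = 1\right] - E\left[Cosine \mid diffpart = 0, post95 = 1\right] = {\beta}_{1} + {\beta}_{3}.$$

Under this specification, a negative \({\beta}_{1}\) together with a positive \({\beta}_{3}\) corresponds to an East-West gap that narrows in the later period.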

Table 3 Chemistry OLS regression

For economics and business administration, the results are presented in Table 4. Here, regression results of models 3 and 4 show a large decrease in cosine similarities for topic distributions of dissertation titles written in different parts of Germany before the reunification. Models 5 and 6 of Table 4 present the regression results of topics in economics and business administration before and after reunification in East Germany. Both models yield statistically significant and substantial effects of −0.27 and −0.22, respectively. The last approach, which is presented in full models 1 and 2 of Table 4, shows the similarity between East and West after the reunification. The positive interaction term of diff_part and post95 in full model 2 suggests that there is an increasing similarity, and the coefficient sizes indicate that the effects observed in economics and business administration are of relevant magnitude. This could have been expected, as the discipline underwent a drastic reorientation after German reunification. In chemistry, the statistically significant effects are much smaller. Chemistry may serve as an example of how even minor changes can be detected by our approach.

Table 4 Economics and business administration OLS regression

Discussion and conclusion

In this paper we have shown how scientists’ research problem choices can be detected with a machine learning approach. For this purpose, we investigated the thematic change after an unexpected political transition. We used dissertation titles in the disciplines of economics and business administration and chemistry before and after German reunification in East and West Germany. We applied structural topic modelling combined with cosine similarity-based regression. We found differences between the two parts of Germany in both disciplines before the reunification. These differences decrease somewhat after the reunification. Our results suggest that East German dissertation title topics in the field of economics and business administration are significantly more different before reunification than thereafter.

The substantial differences in economics and business administration before the reunification are likely to be related to politics, and are in accordance with the historical circumstances that we described in the chapter “Historical background”. Economics and business administration as a discipline was extremely important in the ideological framework of the GDR. The research of economists and business administrators, more so than in other disciplines, therefore had to be vetted and brought in line with socialist ideology. Topics related to capitalism, which were researched in western countries like West Germany, were de facto impossible to research in the GDR.

Regarding our findings after the reunification, we again refer to chapter “Historical background”. As described, massive personnel replacement, as well as institutional redirection, took place in East German economics and business administration after the reunification. The vacant chairs were predominantly filled with West German economists and business administration scholars (anecdotal evidence). Consequently, the dissertation topics picked by these new scientists would have been very different from the topics of the dismissed East German scientists and their predecessors. However, within the long time span we investigate after the reunification (15 years), other factors could have also led to declining differences within economics and business administration.

One potential explanation for the small differences in chemistry research topics between East and West Germany before reunification could be the industrial relevance of the discipline, which motivated the GDR government to directly and indirectly interfere with the topic choices of East German scientists. The prime example of direct influence was the official yearly plans for science and technology, which forced chemistry to meet the industry demands of East Germany (Gruhn and Lauterbach 1977). The economic and societal restrictions in the GDR also had an influence on topic choices and therefore on the topics and results that we can observe. Collaboration, for instance, was for East German scientists almost exclusively possible with researchers from other socialist countries (Weingart et al. 1991). This prevented thematic spread that could have resulted from collaboration with West German colleagues. The different characteristics of economic uncertainty of the GDR in comparison to West Germany may also have had an indirect impact on scientific topic choices. The academic field in the GDR was, for instance, fully employed at any point in time, albeit with a considerable hidden unemployment rate, as it was socialist state doctrine to employ everyone (Gutmann 1979). Picking risky research problems was possibly not associated with risky labour market outcomes for East German chemists and scientists in general. Nevertheless, the choice of risky topics was contradicted by the aforementioned science and technology plans, which forced East German researchers to pick applied topics that met industry demands. Lastly, the small differences between East and West Germany in chemistry before the reunification could also be attributed to West German peculiarities.

The method presented and developed in this paper, structural topic modelling combined with a cosine similarity-based regression approach, is its main contribution. It aimed to detect differences in research topics of East and West German scientists before and after German reunification. As demonstrated, this turned out to be successful; our trained model detects reasonable differences in a set of unseen titles. The inclusion of dissertation-level variables, like affiliation to single universities or dissertation year information, in training a topic model can be considered a decisive advantage of our approach. Research problem choice is dependent on various factors, such as regional and temporal origin of the dissertation. In the topic modelling process, which tries to reconstruct the data-generating process behind the dissertation title, these factors should therefore not be considered constant across all dissertations in the training set (as done by the LDA topic model algorithm).

The incorporation of paired cosine similarities into a regression approach has, to our knowledge, never been used before and is therefore a methodological innovation of our paper. The regression framework presented in this paper provides not only an easily interpretable aggregation of the cosine similarities, but also a way to test hypotheses. In this sense, other contexts and datasets in scientometric research could be addressed by our approach, which may deliver new perspectives on thematic and, therefore, scientific change in general.

From the visual inspection of the most probable words in economics and business administration, we conclude that our model was able to discover meaningful relationships. The usage of short documents—in our case, dissertation titles—did not turn out to be a problem. In the application to the unseen documents, which were the basis for validation, our algorithm worked well. As topic modelling does not aim to label the detected topics, we can sometimes only guess what the found differences and their underlying topics most likely refer to. This is a major disadvantage of any sort of topic modelling. The foundation of this problem arises from language as a dynamic, complex and strongly context-related semantic system. Topic models can only find the relations in this system, but not understand and label them accordingly. It is therefore beyond the scope of our paper to find reasonable labels for topics we detected.

The linkage of our data to measures of scientific success and impact could provide interesting further research questions. The topical choices that are associated with academic rewards for PhD students, for example, could be investigated. Also, our method could be promising for the investigation of other types of documents; abstracts and scientific articles may contain document-level information which could shift topic proportions in the same way as the variables in our paper. Because documents are longer in these cases, the topic model algorithms would need considerably more calculation time, but would gain in statistical properties and topic quality. Overall, our method of structural topic modelling combined with a cosine similarity-based regression framework offers potential for applications in scientometrics and higher education research.