1 Introduction

Industry 4.0 aims to create smarter, more efficient and flexible production systems through the integration of advanced technologies into the manufacturing and industrial sectors [1]. The term Industry 4.0 was first introduced in Germany in 2011 and has since been adopted worldwide as a concept for the future of manufacturing [2]. Industry 4.0 is characterized by the integration of cyber-physical systems (CPS), the Internet of Things (IoT), big data analytics, cloud computing and artificial intelligence (AI) into industrial processes. These technologies enable the creation of the smart factory, where machines and processes can communicate with each other in real time, increasing efficiency, productivity and quality [3].

The application areas of Industry 4.0 are vast and include many sectors such as manufacturing, logistics, healthcare, energy and transportation. In the manufacturing sector, Industry 4.0 technologies have been used to automate production processes [4], reduce waste and downtime [5] and enable mass customization [6]. In logistics, Industry 4.0 has been used to optimize supply chain management, reduce costs and increase efficiency [7,8,9]. In healthcare, Industry 4.0 is being used to improve patient care and outcomes using IoT devices and big data analytics [10, 11].

Industry 4.0 has the potential to transform the manufacturing and industrial sectors by increasing productivity, reducing costs and improving quality [12]. Since the emergence of the Industry 4.0 concept in 2011, there has been a significant increase in the number of studies in various research fields, including engineering, computer science and business. This growth necessitates descriptive research that provides a comprehensive view of the field. For this reason, various studies such as content analyses, systematic reviews and bibliometric analyses have been conducted on Industry 4.0 research [13, 14].

The research is structured as follows: Initially, it discusses the literature related to Industry 4.0 and the significance of the study. Subsequently, the topic modeling method used in the study is described, and information about the topic modeling process is provided. The following section presents and discusses the findings of the study. In conclusion, the paper addresses the outcomes, limitations and recommendations of the research.

1.1 Literature review

Descriptive studies provide valuable insights into the field and are conducted frequently. A wide range of topics is covered, especially in bibliometric literature reviews. A common theme in this research is Industry 4.0 [15,16,17] and its impact in various fields such as manufacturing [18,19,20], supply chain management [21], construction [22, 23] and logistics [24]. These reviews aim to provide a comprehensive overview of research trends, key concepts and the development of the field over time. Some of the specific topics addressed include the role of lean principles in Industry 4.0 [25, 26], the impact of Industry 4.0 on sustainability [27, 28] and circular economies [29, 30], machine learning [31], the application of AI in manufacturing [32] and barriers to the adoption of Industry 4.0 technologies [33]. These bibliometric studies are used to identify research gaps and potential areas of collaboration and to inform future research directions.

While bibliometric analyses provide a broad overview of existing research, they often lack insights based on semantic content analysis and have limitations in providing a deeper understanding of the literature [34]. Supplementing bibliometric analysis with topic modeling can achieve a more comprehensive analysis. Topic modeling, a machine learning technique, is an effective method for automatically analyzing large collections of scientific articles in a systematic way [35,36,37,38]. Topic modeling can go beyond bibliometric analysis to reveal themes, research interests and trends in a field more comprehensively [39, 40]. Therefore, topic modeling analysis is a tool for understanding and uncovering the research landscape in any field [41,42,43].

An examination of the literature reveals only a few studies that apply topic modeling to Industry 4.0 research. Jang et al. [44] used Latent Dirichlet allocation (LDA) and centrality analysis techniques to identify the main themes of Industry 4.0-related internet news in Korea. The article analyzed the impact of Industry 4.0 on the Korean economy and the uses of this technology. Janmaijaya et al. [45] used LDA and clustering-based theme identification techniques to identify the main themes in the Industry 4.0 literature. The article analyzed the keywords and themes of different studies on Industry 4.0. Mazzei and Ramjattan [46] presented a systematic review of machine learning topics for Industry 4.0 using deep learning-based topic modeling. This review assembled a corpus of 45,783 pertinent papers from Scopus and Web of Science and analyzed it using BERTopic. The paper discussed different aspects of Industry 4.0 and evaluated the potential of different machine learning techniques to address its use cases. Previous topic modeling studies in the context of Industry 4.0 have thus mainly focused on specific countries, on comparing topic modeling techniques or on machine learning for Industry 4.0.

1.2 Importance of the study and the problem statement

Industry 4.0 is a relatively new area of research, and therefore there is much to learn about its potential impact. This study applies an innovative semantic analysis approach based on topic modeling to provide a comprehensive review of the field. By examining the trend of research on Industry 4.0, it is possible to identify themes and focal points in the field, understand the approaches of different disciplines and identify potential applications. Such analysis also helps bridge the gap between academia and industry, offers a global perspective and identifies opportunities for international collaboration. Overall, research on trends in Industry 4.0 papers is important for understanding the direction of the field, identifying key themes and areas for future research and understanding the practical applications of this emerging technology. This study aims to uncover trends in Industry 4.0-related articles. To achieve this, the study surveyed the scientific literature on Industry 4.0 using articles indexed in the Scopus database up to the present. The study aimed to answer the following research questions in order to describe the studies in detail and to reveal research interests and trends.

RQ1: What are the prominent topics in articles published in the field of Industry 4.0?

RQ2: How do the prominent topics in articles published in the field of Industry 4.0 change over time?

RQ3: What is the relationship between Industry 4.0 and related topics? Which topics are most similar or most different?

2 Materials and methods

This study aimed to perform a topic modeling analysis to uncover hidden semantic patterns in the large text sets of the Industry 4.0-related literature. Topic modeling is a probabilistic method used to uncover hidden semantic patterns, called topics, in a collection of unstructured documents. These topics capture the essence of the semantic structure of the documents. The approach is based on the idea that certain words occur more frequently in a document because of their association with a particular topic [47]. Topic modeling reveals these underlying semantic clusters by identifying groups of words that tend to co-occur in a document. This process involves calculating the probability distribution of each topic and the topic distribution per document, as well as the topic assignments per word in each document [48]. Many different topic modeling algorithms are available for text mining and natural language processing research, such as Latent Dirichlet allocation (LDA), hierarchical Latent Dirichlet allocation (HLDA), hierarchical Dirichlet process (HDP), non-negative matrix factorization (NMF), Dirichlet multinomial regression (DMR), dynamic topic model (DTM) and correlated topic model (CTM) [49]. While algorithms such as NMF, CTM and DMR encounter challenges in determining the optimal number of topics through conventional coherence scores, newer models like HDP and HLDA offer automated mechanisms to ascertain the ideal topic count [49]. In contrast, the LDA model permits manual adjustment of the number of topics, an iterative process that enhances the precision of topic number estimation and their semantic consistency [50]. This flexibility, coupled with robust methods for computing coherence scores, solidifies LDA’s position as a preferred choice across various scholarly fields for its efficacy in semantic content analysis of extensive text collections [51]. Therefore, LDA is frequently preferred in much research and many applications [47].
LDA aims to determine the semantic content of a document by analyzing its hidden semantic structures [52]. It assigns words in a document to random variables and clusters them based on a recurrent probability process using a Dirichlet distribution [48]. As an unsupervised learning approach, LDA does not require any labeling or training set, which makes it efficient to analyze large document collections within a given text corpus to reveal semantic patterns [47]. Figure 1 shows the mathematical and graphical representation of LDA.

Fig. 1
figure 1

LDA mathematical and graphical representation [53]

As depicted in Fig. 1, LDA treats each document as a mixture of topics, where each topic is characterized by a distribution over words. The generative process for each document begins with the selection of topic proportions θd from a Dirichlet distribution with hyperparameter α. For each word in the document, a topic Zd,n is then sampled from these topic proportions θd. Subsequently, a word Wd,n is drawn from the chosen topic’s word distribution βk, which is itself sampled from a Dirichlet distribution governed by hyperparameter η. The joint distribution of the topics and words can be formally expressed as p(β1:K, θ1:D, Z1:D, W1:D; α, η) and factored into the product of the prior and conditional distributions of each variable in the model. This factorization facilitates the use of inference techniques such as collapsed Gibbs sampling, which estimates the hidden topic structure by iteratively sampling the posterior distribution of each topic assignment conditioned on all other current assignments and the observed data [47, 48]. This structured approach allows LDA to efficiently manage large corpora by abstracting the main themes as topics represented as distributions over words, thus providing a powerful tool for exploring hidden thematic patterns in text.
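The generative process described above can be sketched in Python with NumPy. The corpus sizes and hyperparameter values below are illustrative placeholders, not those of the study:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 4, 10   # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)    # symmetric Dirichlet prior on topic proportions (α)
eta = np.full(V, 0.1)      # symmetric Dirichlet prior on topic-word distributions (η)

# Each topic beta_k is a distribution over the vocabulary, drawn from Dirichlet(η).
beta = rng.dirichlet(eta, size=K)          # shape (K, V)

documents = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)         # topic proportions θ_d for document d
    words = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)    # sample a topic assignment Z_{d,n}
        w_dn = rng.choice(V, p=beta[z_dn]) # sample a word W_{d,n} from that topic
        words.append(w_dn)
    documents.append(words)
```

Inference (e.g., collapsed Gibbs sampling) runs this process in reverse: given only the words, it estimates β, θ and the Z assignments.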

2.1 Search strategy and study selection

To collect a comprehensive set of articles relevant to the scope of this study, we used the Scopus database, which includes more than 7000 publishers from around the world [54]. Scopus is the largest abstract and citation database of peer-reviewed literature, including scientific journals, books and conference proceedings. Scopus is widely used in literature review studies [35, 37, 55]. Therefore, the researchers found the Scopus database sufficient for their study. To include the maximum number of articles, Scopus was searched for title, abstract and keywords. The researchers reviewed the literature and consulted two domain experts to create the query.

To maximize the number of articles included in our dataset, we searched for the term “Industry 4.0” in the title, abstract and keywords of journal articles (i.e., research and review articles) written in English and published up to the end of 2022. The following query was used as the search strategy:

TITLE-ABS-KEY (“industry 4.0”) AND (LIMIT-TO (PUBSTAGE, “final”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “re”)) AND (LIMIT-TO ( SRCTYPE, “j” ) ) AND ( EXCLUDE ( PUBYEAR, 2023 ) ) AND ( LIMIT-TO ( LANGUAGE, “English” ) )

We performed this query on February 22, 2023, and retrieved a total of 8584 journal articles from Scopus. We added the title, abstract and author keywords of these articles to our dataset for LDA analysis.

2.2 Pre-processing

Datasets are collected from various sources that may have different formats. Therefore, the data must be pre-processed to ensure consistency in the format of the dataset. This involves removing unnecessary information from the collected data. Pre-processing helps to remove noisy, irrelevant and unwanted data from any corpus and thus improves the quality of the dataset. To achieve this, several steps are taken to remove noisy words and characters, resulting in a more accurate and acceptable corpus [56]. The Natural Language Toolkit (NLTK), a popular Python library for natural language processing tasks [57], was used for data preprocessing.

Initially, all texts were converted to lowercase to ensure uniformity. Web links and publisher information within the dataset were removed to eliminate irrelevant metadata. Subsequently, word tokenization was performed to break down the text into individual words.

Following tokenization, we removed English stop words (such as “and”, “is”, “or”, “the”, “a”, “an” and “for”), numerical expressions, punctuation and symbols to reduce noise in the data. Additionally, generic words frequently found in academic texts, such as “article”, “paper”, “research”, “study” and “copyright”, were filtered out to avoid terms that do not contribute to the formation of semantically meaningful topics [58]. Finally, the resulting words were lemmatized. Lemmatization takes context into account and reduces inflected words to their meaningful base forms, producing the lexical form (lemma) of each word [59].
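These cleaning steps can be sketched as follows. The study used NLTK; for a self-contained illustration, this sketch uses only the standard library, the stop-word and generic-word lists are small illustrative subsets, and the lemmatization step is omitted:

```python
import re

# Illustrative subsets; the study used NLTK's full English stop-word list.
STOP_WORDS = {"and", "is", "or", "the", "a", "an", "for", "of", "in", "to", "this"}
GENERIC_TERMS = {"article", "paper", "research", "study", "copyright"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)    # 2. drop web links
    tokens = re.findall(r"[a-z]+", text)         # 3. tokenize; drops numbers/punctuation
    return [t for t in tokens
            if t not in STOP_WORDS and t not in GENERIC_TERMS]

print(preprocess("This paper is a study of Industry 4.0: see https://example.org"))
# → ['industry', 'see']
```

In the full pipeline, each surviving token would additionally be passed through a lemmatizer (e.g., NLTK's WordNet lemmatizer) before further processing.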

In order to identify high-frequency terms within the Industry 4.0 corpus, a word-level N-gram model was applied to the previously lemmatized texts. This model considers unigrams (single words), providing a view of term frequency patterns within the corpus. Finally, each article in the dataset was transformed into a word vector using the “bag of words” method, which provides a numerical representation of the words in the corpus [48]. These vectors were then used to create a document-term matrix (DTM), a numerical matrix form suitable for topic modeling analysis [60]. This matrix is pivotal for the subsequent application of statistical models to discern the latent topics within the corpus. After all these preprocessing steps, the resulting corpus was used as the final corpus for the LDA analysis.
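The bag-of-words transformation and the resulting document-term matrix can be illustrated with a minimal standard-library sketch (in practice, Gensim's Dictionary and doc2bow fill this role); the three toy documents are hypothetical:

```python
from collections import Counter

docs = [["smart", "factory", "iot"],
        ["iot", "sensor", "iot"],
        ["factory", "automation"]]

vocab = sorted({w for doc in docs for w in doc})   # fixed term order for the columns
index = {w: i for i, w in enumerate(vocab)}

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the term's count in that document.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts.get(w, 0) for w in vocab])

print(vocab)   # → ['automation', 'factory', 'iot', 'sensor', 'smart']
print(dtm[1])  # → [0, 0, 2, 1, 0]  ("iot" appears twice in the second document)
```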

2.3 Data analysis and fitting topic modeling

The implementation of the LDA algorithm was carried out using the Gensim library in Python [61]. First, values were chosen for the parameters that control the model. The default setting (“symmetric”) was used for both α, which determines the distribution of topics in the documents, and β, which determines the distribution of words in the topics. Symmetric LDA is a variation of the LDA model that assumes the given parameter distributions are symmetric. This means that, a priori, each document is assumed to belong to each topic with similar probability; that is, no topic is favored before the data are observed.

An iterative and heuristic process was applied for model fitting [34]. To empirically determine the ideal number of topics (K) for the LDA-based topic modeling analysis, a model was created for every K value between 10 and 40, and a coherence value was calculated for each. The coherence value was used to determine the appropriate number of topics, with the value closest to 0.7 considered optimal [47]. As a result of the analysis, a model with 12 topics was selected, as shown in the number-of-topics versus coherence-value graph in Fig. 2.

Fig. 2
figure 2

Number of topics-coherence score graph

Topic modeling analysis produced the topics and the terms that make up each topic. LDA’s term ranking process starts by determining the probability that each document belongs to each topic. Then, the probabilities of the terms contained in each topic are determined; that is, for each topic, the probabilities of the terms associated with it are calculated. Terms representing each topic were ranked according to their representation rates and then used to label and name the topics [35, 36]. The Python pyLDAvis library [62] was used for these operations. Overall, pyLDAvis is a useful tool for understanding and interpreting the results of topic modeling and can help identify patterns and trends in large text datasets. Topics were named by the researchers with input from two domain experts, who reviewed and finalized the topic names. In addition, the percentage of each topic per document, the word distribution within each topic and the distribution of topics across all articles were calculated. Finally, the top 15 terms with the highest frequency representing each of the 12 topics were identified.
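The term-ranking step (ordering each topic's terms by probability and keeping the top terms for labeling) can be sketched with a hypothetical topic-word table; the two topics and their probabilities below are invented for illustration:

```python
# Hypothetical topic-word probabilities for two small topics (illustrative only).
topic_word = {
    "Topic A": {"cyber": 0.30, "physical": 0.25, "system": 0.20,
                "smart": 0.15, "sensor": 0.10},
    "Topic B": {"energy": 0.35, "data": 0.30, "forecast": 0.20,
                "load": 0.10, "neural": 0.05},
}

TOP_N = 3   # the study kept the top 15 terms per topic

# For each topic, sort terms by probability (descending) and keep the top N.
top_terms = {
    topic: sorted(words, key=words.get, reverse=True)[:TOP_N]
    for topic, words in topic_word.items()
}
print(top_terms)
# → {'Topic A': ['cyber', 'physical', 'system'], 'Topic B': ['energy', 'data', 'forecast']}
```

These ranked term lists are what the researchers and domain experts inspected when assigning names to the 12 topics.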

The percentage change and acceleration values presented in this study were calculated using Microsoft Excel. Percentage change values on a temporal basis show changes in a particular topic or term over time, expressing the change from one period to the next as a percentage. Acceleration values measure changes in the rate of change, that is, whether growth in a topic is speeding up or slowing down. In Excel, the slope of a data series can be calculated using the SLOPE function, which gives the slope of the linear trend in time series data. The formula is as follows:

 = SLOPE(y-values; x-values)

In this study, x-values represent temporal periods (years) and y-values represent the quantities of terms or topics of interest in specific time periods. The calculations assume that the data points fit a linear model well. In this way, the slope of each topic over time was determined, and an acceleration (Acc) value was calculated accordingly. A positive or negative Acc value indicates an increase or decrease in the number of publications on a given topic. Graphs were created to visualize the volume and slope of each topic over time and their relationships with other topics.
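An equivalent of Excel's SLOPE can be written directly from the least-squares formula; the year and publication-count series below are hypothetical, not the study's data:

```python
def slope(y_values, x_values):
    """Least-squares slope, equivalent to Excel's =SLOPE(y-values; x-values)."""
    n = len(x_values)
    x_mean = sum(x_values) / n
    y_mean = sum(y_values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    den = sum((x - x_mean) ** 2 for x in x_values)
    return num / den

years = [2013, 2015, 2017, 2019, 2021]   # period midpoints (illustrative)
counts = [3, 26, 118, 290, 428]          # hypothetical publication counts per period
print(slope(counts, years))              # positive slope → growing interest
```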

3 Results and discussion

3.1 Identifying the topics (RQ1)

In the LDA-based topic modeling analysis, the researchers conducted experiments with different numbers of topics and ultimately decided to use 12 topics. The reason for choosing this number is that increasing the number of topics too much may produce low-frequency topics that stray too far from the research field, while reducing the number of topics risks overlooking major topics in the field. Therefore, the researchers determined that 12 topics would be optimal for their analysis. After deciding on the number of topics, they obtained the terms related to each topic to be named and calculated the total frequency percentage of these terms. Then, they consulted three domain experts to determine appropriate topic names, resulting in Table 1. The related terms for each topic were listed in order of density.

Table 1 Topics, topic terms and frequencies

The resulting Table 1 contains the names of the topics, their related terms and the frequency percentages for each term. The related terms for each topic are listed in order of their density, which means that the most frequently occurring terms for each topic appear at the top of the list.

When Table 1 is examined, the “smart cyber-physical systems” topic ranks first with a share of 50.06%. “Digital Transformation and Knowledge Management” and “Data Science in Energy” are other topics that stand out in terms of density. Understanding how these 12 topics relate to their terms is important for the intelligibility of the study. Accordingly, the topics are described below in order of density.

According to Varadarajan et al. [63], “smart cyber-physical systems are large-scale software-intensive and pervasive systems, which by combining various data sources and applying intelligence techniques, can efficiently manage real-world processes and offer a broad range of novel applications and services”. Moreover, smart cyber-physical systems include anything smart like cities, manufacturing, production, etc. [64].

Although digital transformation and knowledge management are two separate topics, they have started to be used together with Industry 4.0. Digital transformation can be defined as a process that aims to improve an entity by triggering significant changes to its properties through combinations of information, computing, communication and connectivity technologies [65]. On the other hand, according to Koenig [66], “knowledge management (KM) is a business process that formalizes the management and use of an enterprise’s intellectual assets. KM promotes a collaborative and integrative approach to the creation, capture, organization, access and use of information assets, including the tacit, uncaptured knowledge of people”. Currently, some studies use digital transformation and knowledge management together. Some of the topics covered in these studies include the public sector [67], property management [68], the economy [69], impact on organizations [70] and SMEs [71].

Data science is a field that brings together various disciplines such as mathematics, statistics and computer science. Combining the knowledge and skills from these disciplines provides meaningful results that users can understand [72]. Data science studies are used in various fields, including energy, for different improvements. For instance, they can be used to address the energy crisis [73], forecast energy usage [74] and predict load [75]. Naturally, data science methods such as neural networks and deep learning are effectively used in such studies.

Sustainable supply chain management involves integrating environmentally and financially viable practices into the complete supply chain lifecycle, from product design and development to material selection [76].

The Internet of Things (IoT) is a network infrastructure that enables objects to communicate with each other via the internet. This allows objects with specific tasks to exchange information among themselves and with system users [43].

A manufacturing system is the organization and processing of relevant elements to produce a physical product, a service or information [77]. Manufacturing system design, on the other hand, concerns the function, structure and form of workpieces in the manufacturing process [78]. Additionally, it contributes to the creation of equipment selection, job design and standardization [79].

According to Atasoy [80], Education 4.0 can be explained as “the period in which the education settings integrate ICTs to develop instructional, pedagogical and technological processes. It also improves operational processes through new learning and teaching methods, innovative solutions to current and future challenges in society”. Especially in engineering education, Education 4.0 is being effectively used and studies are being conducted [81, 82].

Small and medium-sized enterprises (SMEs) are economic entities defined in regulations as micro-enterprises, small businesses or medium-sized enterprises. Such businesses should have between 10 and 250 employees, and their total turnover should not exceed 50 million euros [83]. SMEs contribute to the progress of entrepreneurship and innovation and prevent monopolization [84].

Sustainable development is the pursuit of progress that satisfies our current requirements while also safeguarding the potential of forthcoming generations to satisfy their own necessities [85]. Sustainable development initiatives are being carried out in various fields, such as health [86, 87], transportation [88, 89], agriculture [90] and energy [91].

Human–robot interaction refers to “interactions between humans and robots by means of cooperation in a task or assisting one another” [92]. Human–robot interaction is interconnected with various scientific fields, including artificial intelligence, robotics and psychology [93].

Quality management involves effectively managing, planning and organizing processes to continuously improve them while maximizing customer satisfaction with the lowest total cost [94].

Forgionne & Russell [95] described the decision-making process as “the process of developing a general problem understanding, formulating the problem explicitly, evaluating alternatives systematically and implementing the choice”.

3.2 Examining the topics by dividing them into periods (RQ1)

To show how topics changed over time, data were divided into five periods, each consisting of two years. The reason for not taking data on an annual basis is that the amount of data in the early years is low. Table 2 shows the distribution of themes according to the relevant periods.

Table 2 Distribution of topics according to periods

Upon examining Table 2, “Smart Cyber-Physical Systems” is the topic with the most studies in all periods; fifty percent of the studies were conducted under this topic. “Digital Transformation and Knowledge Management” and “Data Science in Energy” are the most studied topics after “Smart Cyber-Physical Systems”. Interest in the topics of “Digital Transformation and Knowledge Management”, “Data Science in Energy” and “Sustainable Supply Chain Management” increased from the 2019–2020 period onwards. Likewise, while interest in “Education 4.0 in Engineering” increased in 2019–2020, a slight decrease was observed in the 2021–2022 period. While there were no studies on “Sustainable Supply Chain Management”, “Internet of Things” and “Manufacturing System Design” in the first period, studies on these topics were conducted from the 2015–2016 period onwards. In addition, while there were no studies on “Education 4.0 in Engineering”, “Small and Medium-sized Enterprises (SMEs)”, “Sustainable Development”, “Human–robot Interaction”, “Quality Management” and “Decision-making Process” in the 2013–2014 and 2015–2016 periods, studies on these topics have been conducted since the 2017–2018 period.

On the other hand, the slope value in the table represents the average annual growth of each topic. This value shows how much the number of publications on a topic has increased annually over a certain period; in a way, it can be used to measure interest. For example, the slope value of the topic “Smart Cyber-Physical Systems” is 551.7, showing that the topic is becoming increasingly popular and has aroused serious interest among researchers in the field of Industry 4.0.

On the other hand, the slope value of the “Decision-making Process” topic is 2.1, which shows that the annual increase in publications is quite low. This may indicate that the topic is less researched than other topics and receives less attention among Industry 4.0 researchers.

In general, slope values can be used as an indicator to evaluate the research trend of each topic in the field of Industry 4.0. Higher slope values may indicate topics that are growing faster and are considered more important; lower slope values may reflect less prominent or less popular topics.

In this context, when the relevant table is examined, it is seen that the topics with the highest slope values are Smart Cyber-Physical Systems, Digital Transformation and Knowledge Management and Data Science in Energy.

3.3 Examining the percentage changes of the topics according to the periods (RQ2)

In this part of the study, the aim was to illustrate how each topic changed over time. To achieve this, the percentage of work done on each topic in each period was presented. For example, the work done on “Smart Cyber-Physical Systems” accounted for 0.33% of the total work in the 2013–2014 period, 3.03% in the 2015–2016 period, 13.75% in the 2017–2018 period, 33.72% in the 2019–2020 period and 49.71% in the 2021–2022 period. This way, we can see how the work on each topic developed in each period. Finally, acceleration values are shown in Table 3 to provide an overall picture of how each topic evolved over time.
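The per-period percentage shares of a topic can be computed as follows; the counts below are hypothetical and do not reproduce the study's figures:

```python
# Hypothetical publication counts of one topic across the five periods.
periods = ["2013-14", "2015-16", "2017-18", "2019-20", "2021-22"]
counts = [14, 130, 591, 1449, 2136]

total = sum(counts)
# Each period's share of the topic's total output, as a percentage.
shares = [round(100 * c / total, 2) for c in counts]
print(dict(zip(periods, shares)))
```

Small rounding differences mean the shares may not sum to exactly 100%, which is also why published percentage tables of this kind rarely add up perfectly.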

Table 3 Percentage change and acceleration values of the topics on a temporal basis

According to Table 3, all topics showed substantial development in the 2021–2022 period. Only the topic of “Education 4.0 in Engineering” appears to have been more concentrated in the 2019–2020 period. Moreover, the percentage of work on the “Decision-making Process” was equal in the 2019–2020 and 2021–2022 periods. The volume ratios given in Table 3 are visualized in Fig. 3.

Fig. 3
figure 3

Percentage change values of the topics

Figure 3 shows a significant increase in the number of article publications between 2013 and 2022, especially between 2019 and 2022. This increase spans all twelve topics: Smart Cyber-Physical Systems, Decision-making Process, Human–robot Interaction, Education 4.0 in Engineering, Small and Medium-sized Enterprises (SMEs), Internet of Things, Manufacturing System Design, Digital Transformation and Knowledge Management, Sustainable Supply Chain Management, Sustainable Development, Data Science in Energy and Quality Management.

When the acceleration values are examined, “Quality Management”, “Data Science in Energy” and “Sustainable Development” stand out as the topics with the highest acceleration, in that order. Studies on these topics will cover more ground over time, and scientific research may be directed more toward them. On the other hand, interest in “Smart Cyber-Physical Systems”, “Decision-making Process” and “Human–robot Interaction” may decrease relative to the other topics. It is also worth noting that the differences between the acceleration values are not very large. In this context, visuals of the topics with the highest and lowest acceleration values are given below to aid understanding.

As can be seen in Fig. 4, there are no very significant differences in the acceleration values of the topics. The field is open to development in all topics.

Fig. 4
figure 4

Headings with the highest and lowest acceleration values

3.4 Examining the change of topics relative to each other depending on the periods (RQ2)

In this phase of the study, the researchers analyzed how the different topics related to each other and determined the percentage of research that focused on each topic in each period. For example, during the 2021–2022 period, the breakdown was as follows: 44.53% of the studies were on “Smart Cyber-Physical Systems”, 0.15% on “Decision-making Process”, 0.36% on “Human–robot Interaction”, 0.46% on “Quality Management”, 0.48% on “Small and Medium-sized Enterprises (SMEs)”, 0.53% on “Sustainable Development”, 1.24% on “Education 4.0 in Engineering”, 3.35% on “Internet of Things”, 3.22% on “Manufacturing System Design”, 7.59% on “Sustainable Supply Chain Management”, 18.31% on “Data Science in Energy” and 19.77% on “Digital Transformation and Knowledge Management”. The relevant data are provided in detail in Table 4, allowing for a more effective analysis of changes between topics.

Table 4 Volumes of topics by period

When examining Table 4, the share of total publications held by “Smart Cyber-Physical Systems” has gradually decreased over the periods: while 87.50% of all studies fell in this area in 2013–2014, the rate dropped to 44.53% in 2021–2022. The main reason is the growing number of studies on other topics. Conversely, the shares of “Data Science in Energy”, “Digital Transformation and Knowledge Management” and “Sustainable Supply Chain Management” are gradually increasing. The volumes of the topics across the periods given in Table 4 are visualized in Fig. 5. Several trends in article publications between 2013 and 2022 are notable. Smart Cyber-Physical Systems started high in 2013–2014 but declined in the following years. In contrast, Sustainable Supply Chain Management and Digital Transformation and Knowledge Management saw significant increases in 2019–2022, as did Data Science in Energy. Overall, interest in digital transformation, knowledge management and sustainability is growing. For further clarity, Fig. 6 shows the change in the acceleration values of the topics.

Fig. 5

The volumes of the topics according to the periods

Fig. 6

The acceleration values of the topics

As shown in Fig. 6, the acceleration value is an important indicator of change in the field: it helps identify topics that may become more prominent or less influential. Among the acceleration values, only that of “Smart Cyber-Physical Systems” is negative; all other topics have positive values. This suggests that the relative weight of studies on “Smart Cyber-Physical Systems” will decrease over time, while that of the other topics will increase. In particular, the three topics with the highest acceleration values can be expected to occupy more space in Industry 4.0 studies; the remaining topics may grow less strongly but will continue to progress.

3.5 LDAvis (RQ3)

LDAvis was developed by Carson Sievert and Kenneth E. Shirley in 2014. It is a visualization system that “allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model” [96].

The visualization consists of two parts. The first, on the left side of the screen, provides a global view that answers the questions “How prevalent is each topic?” and “How do the topics relate to each other?” In this view, topics are represented by circles whose centers are computed from the inter-topic distances and projected onto a two-dimensional plane. The area of each circle indicates the prevalence of the topic: the larger the circle, the greater the share of work related to that topic. The second part displays the terms used to determine what the topics are, thereby revealing their meanings; in the earlier sections of this study, the topics were named using these terms. Here, each term's frequency within the selected topic is also shown, overlaid on its overall frequency in the corpus [96].
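The left panel's layout is driven by inter-topic distances; LDAvis computes these (by default) as the Jensen–Shannon divergence between topic-term distributions, then projects them to two dimensions with multidimensional scaling. A minimal sketch of the distance step, using hypothetical term distributions over a four-word vocabulary:

```python
from math import log

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic-term distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute nothing.
        return sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-term distributions:
iot     = [0.50, 0.30, 0.15, 0.05]
energy  = [0.45, 0.35, 0.15, 0.05]  # similar vocabulary -> small divergence
quality = [0.05, 0.10, 0.35, 0.50]  # different vocabulary -> large divergence

print(js_divergence(iot, energy))
print(js_divergence(iot, quality))
```

Topics with small pairwise divergence end up as nearby (possibly overlapping) circles on the intertopic distance map, while dissimilar topics are placed far apart.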

The relevance metric in LDAvis adjusts the weight assigned to a word’s probability of being associated with a particular topic, as noted by Sievert and Shirley [96]. In this study, the lambda value of the relevance metric was set to 0.6 to better assist researchers in interpreting the topics. Shrader et al. [97] state that if λ equals 1, any word that appears in the corpus is deemed relevant, irrespective of the number of topics it relates to or where it appears. The visual below displays the result of LDAvis: the left side shows the topics, while the right side shows the 30 most salient terms in the field.
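Sievert and Shirley define the relevance of word w to topic t as λ·log p(w|t) + (1−λ)·log(p(w|t)/p(w)), so λ < 1 penalizes words that are common across the whole corpus. The probabilities below are hypothetical, chosen only to illustrate the effect of setting λ = 0.6:

```python
from math import log

def relevance(p_w_given_t, p_w, lam=0.6):
    """Sievert & Shirley's relevance:
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * log(p_w_given_t) + (1 - lam) * log(p_w_given_t / p_w)

# Hypothetical probabilities:
# a topic-specific word (frequent in the topic, rare in the corpus) vs.
# a generic word (frequent in the topic, but frequent everywhere).
print(relevance(0.05, 0.001))  # topic-specific term ranks higher
print(relevance(0.08, 0.06))   # generic term is penalized when lam < 1
```

At λ = 1 the second term vanishes and ranking falls back to raw p(w|t), which is why generic words like “system” dominate the term list unless λ is lowered.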

The titles corresponding to the numbers in Fig. 7 are as follows: (1) “Smart Cyber-Physical Systems”, (2) “Internet of Things”, (3) “Digital Transformation and Knowledge Management”, (4) “Sustainable Supply Chain Management”, (5) “Data Science in Energy”, (6) “Manufacturing System Design”, (7) “Sustainable Development”, (8) “Decision-making Process”, (9) “Small and Medium-sized Enterprises (SMEs)”, (10) “Education 4.0 in Engineering”, (11) “Human–robot Interaction” and (12) “Quality Management”. The prevalence of the topics can be seen in the visualization: the size of each circle indicates the prevalence of the corresponding title, as given by the “Marginal topic distribution” legend, which reports each topic’s overall share of the corpus. According to this information, topics 1, 2, 3, 4 and 5 can be considered the prominent topics in the field of Industry 4.0.

Fig. 7

The layout of LDAvis (intertopic distance map and top-30 most salient terms)

The relationships between the titles can also be read from the left side of the visual. (2) “Internet of Things” and (5) “Data Science in Energy” overlap in the region where PC1 is negative and PC2 is positive, while (4) “Sustainable Supply Chain Management” and (9) “Small and Medium-sized Enterprises (SMEs)”, as well as (3) “Digital Transformation and Knowledge Management” and (7) “Sustainable Development”, overlap where both PC1 and PC2 are positive. Because the axes are produced by scaling the inter-topic distances, the quadrant in which a pair falls carries no intrinsic meaning; what matters is proximity. Overlapping or nearby circles indicate that the LDA model has learned topics with similar words and terms, often representing different aspects of the same subject area within the field. Conversely, the distance between circles on the intertopic distance map reflects the dissimilarity of the topics learned by the LDA model: distinct vocabularies and terminologies distinguish a topic from the others on the map.

According to the analysis of the most frequent terms in the field, the top 10 terms are “system”, “technology”, “manufacturing”, “data”, “process”, “model”, “digital”, “production”, “development” and “management”. These terms have been used to name the topics.
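Extracting such a top-terms list from a preprocessed token stream is a simple frequency count. A minimal sketch with a hypothetical token list (not the study's corpus):

```python
from collections import Counter

# Hypothetical token stream from preprocessed article abstracts:
tokens = ["system", "technology", "system", "data", "manufacturing",
          "system", "data", "technology", "process", "model"]

# Count occurrences and take the most frequent terms.
top_terms = Counter(tokens).most_common(3)
print(top_terms)  # [('system', 3), ('technology', 2), ('data', 2)]
```

In practice the counts would be taken over the full corpus after stop-word removal and lemmatization, and the resulting high-frequency terms guide the naming of the LDA topics.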

4 Conclusions

4.1 Implications of this study

The topic modeling analysis identified “Smart Cyber-Physical Systems”, “Digital Transformation and Knowledge Management” and “Data Science in Energy” as the most extensively studied topics with an increasing number of publications over time.

The percentage distribution of the topics across the years and the acceleration of their development over time were analyzed. According to the results of the research, “Smart Cyber-Physical Systems” is the most studied topic in all periods, while interest in “Data Science in Energy”, “Digital Transformation and Knowledge Management” and “Sustainable Supply Chain Management” has increased in recent years. “Quality Management”, “Data Science in Energy” and “Sustainable Development” have been identified as having the highest acceleration values and are expected to attract more attention in the coming years. However, as the differences between the acceleration values are not large, it can be concluded that the field has room for improvement in all topics.

To show how the topics have changed over time, the data were divided into five periods of two years each. From the 2019–2020 period onward, interest in “Digital Transformation and Knowledge Management”, “Data Science in Energy” and “Sustainable Supply Chain Management” increased. While there were no studies on “Sustainable Supply Chain Management”, “Internet of Things” and “Manufacturing System Design” in the first periods, studies on these topics began to appear from the 2015–2016 period.

Furthermore, analysis of the topics with the highest increase in research intensity compared to other topics reveals that “Digital Transformation and Knowledge Management”, “Data Science in Energy” and “Sustainable Supply Chain Management” are the top three topics. It is worth monitoring these topics in the coming years to follow their growth and development.

With the continuous development and evolution of technology, industries are undergoing constant change and seeking to leverage these technological advancements. Naturally, they are undergoing a continual transformation process toward greater intelligence, efficiency and sustainability. As new technologies such as the internet of things, artificial intelligence, cloud computing, big data, augmented reality and virtual reality are being adopted, industrial processes operate with increased efficiency. Additionally, areas such as sustainability and data management have now become more crucial for industries. This dynamic structure brought about by technological progress presents significant opportunities for both society and factories. However, its potential to create challenges in the future, especially for members of society, should not be overlooked.

In addition, machine learning is one of the areas with the highest potential in industrial processes. Improvements are needed in many areas, such as extracting meaningful information from industrial data, optimizing production processes, and error and quality control. In particular, machine learning techniques, together with the digitalization and automation processes that lie at the core of Industry 4.0, can enable the development of smarter and more adaptive production systems and significantly increase efficiency. They can also help businesses stand out in the industry by providing a competitive advantage. Therefore, it is recommended to focus more on machine learning technologies for the successful adoption and implementation of Industry 4.0.

4.2 Limitations and future works

This study aimed to examine the Industry 4.0 field from 2013 to the end of 2022. Using topic modeling analysis, the study identified the research interests and trends of the field and revealed its current development. The innovative aspect of the study is the use of topic modeling analysis, which takes bibliometric studies one step further. However, the study also has some limitations. First, the dataset is limited to peer-reviewed articles from the Scopus database, which excludes other databases and document types; future work could address this by expanding the corpus to a wider range of databases and document types. Second, to ensure the quality of the research, only journal articles were considered, which may have excluded other relevant document types; future studies could include a wider range of document types for a more comprehensive analysis. Third, the study used the widely accepted LDA algorithm for topic modeling, but future comparative studies with different algorithms could provide additional insights. Fourth, periodic replication of such studies on the Industry 4.0 field’s most voluminous or fastest accelerating topics or subtopics could provide valuable insights into how trends change over time. Finally, such insights could be crucial for directing future research toward areas that could potentially yield high impacts, such as specific technological integrations, policy implications or industry applications that are under-researched. This approach not only helps in advancing the understanding of Industry 4.0 but also aligns academic and practical efforts to address the most pressing and relevant issues as this industrial revolution evolves. Additionally, programs such as Biblioshiny, VOSviewer, SciMat and CiteSpace can be used to uncover significant topic headings in various fields. In our next study, we plan to conduct a comparative analysis of prominent topics in Industry 4.0 using these software packages.