1 Introduction

Not only is there growing interest in Artificial Intelligence (AI) in academic research, as illustrated by the quadrupling of peer-reviewed AI-related journal publications over the last two decades [1, 2], but technological developments in the field of AI also have an enormous impact on everyday life and industry [3]. This process is driven by advances in digitized data acquisition, computing infrastructure, and Machine Learning (ML) [4]. One of the areas in which AI is increasingly being applied is healthcare [5]. Rajpurkar and colleagues [6] declare AI ready to transform medicine sustainably and broadly, improving the experience of both patients and clinicians. In recent years, medical AI algorithms have improved step by step and reached a new level of maturity, for example in disease detection using medical images [6, 7]. The challenges on the way to the successful adoption of AI now lie more in its application in routine clinical care and are linked to the safety and effectiveness of AI [6, 8, 9]. Healthcare experts from China and Germany named insufficient traceability and causality of AI decision-making processes, but also reliability in terms of AI accuracy and the required supervision, as complex topics to address [10].

Particularly in view of these challenges, and with the goal of a broader implementation of AI in the healthcare environment, digital media such as podcasts are an increasingly sought-after option for disseminating relevant information and current research findings related to healthcare, making them accessible to a broad audience in a relatively easy yet understandable way [11, 12]. The rising popularity of the podcast medium in general can be observed in the exponential growth in the number of new podcasts and episodes per year over the last decade, especially from 2015 onwards. From around 23,000 new podcasts and almost a million new episodes in 2010, the numbers increased tremendously to more than 223,000 published podcasts and more than 26 million new episodes across all genres in 2022 [13]. Among the most popular genres, the platform Listen Notes lists Health & Fitness alongside Society & Culture and Business [13, 14]. In the United States, an annual growth rate of 17 percent in monthly listeners can be observed between 2019 and 2023, and monthly listeners are expected to reach 164 million next year [15].

Previous research in the field of AI mostly addresses industrial applications and technological progress in healthcare. In the literature, AI trend research mainly works with academic data sources. We argue that in order to investigate an increased interest in AI within society, it is necessary to use digital media, like podcasts, as sources that are accessible to a broader audience. To fill this gap, we propose a web mining approach to create a novel data set based on podcasts and illustrate a data-driven rather than a methodological approach. In this study, we collect data from a total of 29 English-language healthcare podcasts. Within this work, we address the following research questions (RQ):

  • RQ1: Are podcasts a suitable research medium for trend detection in general as well as related to the field of AI, especially in healthcare?

  • RQ2: Are we able to detect an AI trend and examine its development in healthcare podcasts between 2015 and 2021?

  • RQ3: Can we identify unknown topics within the multiple podcast data sources using topic clustering?

  • RQ4: Is it possible to detect the speakers’ sentiments towards specific AI-related keywords by applying sentiment analysis?

In order to address the research questions, we evaluate and select Speech-to-Text APIs to process the data and transcribe the audio files to text. AI-related buzzwords are extracted from multiple sources like glossaries and hype cycles. We utilize these buzzwords for trend detection and trend analysis on the collected data by applying machine learning-based approaches. We further employ state-of-the-art algorithms based on Deep Learning (DL) to perform topic clustering and sentiment analysis. In addition, contextualized embeddings from a pre-trained transformer model based on the BERT architecture were extracted from the healthcare podcast data. We used OCTIS, an open-source framework built on the Hugging Face BERT models, to build the topic clustering pipeline [16]. In the following, we give an overview of the main contributions of this study:

  • We describe a web mining approach that was used to create a novel data set including 29 healthcare podcasts (in total 3449 episodes from 2015 until 2021).

  • We identify 102 AI-related buzzwords and use them to successfully detect an AI trend and analyze its development in healthcare.

  • We identify unknown topics in healthcare based on podcasts as data sources.

  • We exemplify how the novel data set can be used for trend detection beyond the field of AI and illustrate the transferability of the proposed approach for future research using podcasts.

In addition, we show that podcasts, both from the healthcare environment and in general, are an informative and, from our perspective, highly relevant emerging research medium in data mining, specifically in the field of web mining.

Reproducibility: The code for the crawlers used to collect the data for the experiments (see Sect. 4.1) is available in the GitHub repository at https://github.com/mad-lab-fau/trend-detection-in-healthcare-podcast-data-set (see Sect. 5). The transcribed data set will be made available upon request.

2 Related work

In the following section, we review the current literature on web data mining and AI perception in general and take a deeper look at text mining in healthcare-related data sources. We further present existing research in the fields of podcasts, trend detection, as well as topic clustering and sentiment analysis.

In the literature, we find only a few studies and research projects that look specifically at the podcast medium. According to MacKenzie [17], podcasts have developed into a decentralized medium for science communication to the public since 2004. The author presented one of the first large-scale quantitative studies looking at the production and dissemination of science podcasts, identifying a total of 952 English-language podcasts from 2004 until 2018. Due to the lack of a centralized database for podcast series, the study used the iTunes podcast directory with over 200,000 podcasts and looked specifically at the category Natural Sciences. One limitation of the study was that this podcast category depends entirely on the assignment by the podcast producers themselves. Only online textual and visual metadata (e.g. social media content, websites or descriptions) of the podcasts were analyzed. Audio data or the underlying text data of the podcasts were not part of the investigation because of the impracticability and challenges that go along with the transcription and processing of a large amount of audio data. In summary, the study reports a linear increase in the total number of science podcasts between 2004 and 2010, which was replaced by exponential growth between 2010 and 2018 [17].

Not only as a result of the increase of podcasts in science but also due to generally higher interest in the podcast medium, Vartakavi and colleagues [18] proposed a system called PodSumm for automatically generating audio summaries of podcasts to support the discovery of new content and to allow listeners to preview an episode. They applied automatic speech recognition (ASR) to transcribe the audio data, then processed and finally summarized the text. For the transcription, they used AWS Transcribe [19]. To test their pipeline, they created a podcast data set by collecting 309 episodes (188 h of audio in total) from 19 podcast series across different genres [18].

In the case of Crosscast, a system proposed by Xia and colleagues [20], the goal is not to automatically summarize podcasts but to automatically add visual data to audio travel podcasts. Within their study, they analyzed around 300 episodes from travel shows, documentaries and podcasts. They first attempted to use an ASR tool for the transcription process but decided against it due to errors in their practical tests, transcribing the audio data instead with the crowd-sourced transcription service rev.com [21]. Within their pipeline, they determine keywords and geographic locations in the text data by applying natural language processing (NLP) and text mining. This information is used for the automated selection of images from online sources, which are finally matched with the audio commentary [20].

Looking beyond the online medium of podcasts, we find multiple studies that scrape and mine web data. Fast and Horvitz [22] analyzed New York Times articles between 1986 and 2016 in order to reveal trends as well as positive and negative sentiments towards the subject area of AI. Due to the lack of a universal and professional definition of AI, automatic sentiment analysis was not feasible. For this reason, manual annotation was performed at paragraph level by engagement, optimism vs. pessimism, concerns for AI and hopes for AI. In general, they observed an increasing number of reports linked to the field of AI. They also recognized specific trends regarding the opportunities of AI in the areas of healthcare and education. Even without an automated sentiment analysis, they were able to evaluate attitudes toward AI, such as ethical issues or a possible loss of control, which tended to be perceived negatively by the public.

In terms of the methodological approach, their study could be described as an extension of fundamental sentiment analysis, because it includes sentiments as well as emotions. This work shows the potential of trend detection and trend prediction for AI-related topics and associated emotions based on structured and annotated text data [22]. A similar methodological procedure using topic clustering and sentiment analysis with the aim of detecting future trends was chosen by Aghababaei and Makrehchi [23], who used different data sources, such as Twitter posts and local crime rates, for their analysis.

Table 1 Literature analysis on podcasts as a research medium, trend detection & analysis, as well as topic clustering & sentiment analysis

Particularly with regard to healthcare, research projects focus on AI trend detection and analysis. These studies are concerned with the healthcare sector as a whole as well as with individual disciplines or application areas such as telemedicine [39, 40]. An example of the former is the work of Jiang and colleagues [39], who examined the status of AI applications in healthcare with a particular focus on ML and NLP. By investigating data from PubMed, they show that the number of articles about DL had already been increasing since 2013 and more than doubled from 2015 to 2016. NLP is used for the identification of keywords, e.g. related to diseases, to support the clinical decision-making process and to assist physicians with treatment suggestions. From their point of view, the successful use of AI requires NLP to support the mining of unstructured text data as well as ML methods for handling structured data such as images. An example of the latter is telemedicine, which allows physicians to examine or treat patients at a distance: Pacis and colleagues [40] describe four different trends in this area and discuss the development regarding the application of AI in intelligent assistance diagnosis, information analysis collaboration, patient monitoring and healthcare IT.

Based on this review of the wide-ranging research in the fields of web scraping and web data mining as well as AI trend analysis in general and in healthcare, we conducted an extended literature analysis, presented in Table 1.

In summary, we found multiple studies focusing on either web data mining, the field of AI, or specific domains such as healthcare, but we identify a lack of research regarding the specific combinations of those focus areas.

We found multiple studies mining and analyzing structured as well as unstructured data. Data sources such as newspaper articles [22], dictionaries [29] and public (document) databases [31, 32, 34] can be assigned to the first area. The latter area includes research that uses (large amounts of) social media data, for example from Twitter [30, 33, 38] or different kinds of blog systems [35, 37]. This is also where the present study can be methodologically allocated.

In the field of trend detection and trend analysis, data from public (social media) platforms is collected and used as a common source. Nevertheless, podcasts, in general as well as in healthcare, can still be considered a significantly less used data source, due among other things to the obstacles involved in transcribing and processing large amounts of audio data [17].

3 Methodology

3.1 Data collection

In a first step, healthcare-related podcasts were collected, resulting in an initial list of 45 healthcare podcasts [41, 42]. We restricted our analysis to English-language podcasts because English is the language best supported by Speech-to-Text APIs. The list of podcasts was further evaluated according to four criteria: 1) the Listen Score & Global Rank [43], 2) the overall number of episodes, 3) the availability of all episodes within a Really Simple Syndication (RSS) feed, and 4) the involvement of relevant guests and experts.

The popularity of podcasts is quantified by the metrics Listen Score and Global Rank, which are provided by the podcast search engine Listen Notes [14]. The Listen Score estimates the popularity of a podcast compared to all other RSS-based public podcasts worldwide on a scale from 0 to 100. Only the top 10 percent of podcasts receive a Listen Score; podcasts without a score were therefore sorted out.

Second, we looked at the number of published episodes and filtered out podcasts with fewer than 25 episodes. Given the monthly release schedule of most podcasts, this ensured at least two years of data from each podcast. Third, it is essential that the number of item tags inside the RSS feed matches the total number of published episodes; otherwise, the crawler would fail to download all episodes. As the last criterion, we looked at the podcast guests and selected only podcasts with invited participants such as C-level executives, entrepreneurs or scientists. This procedure allowed us to sort out healthcare podcasts that address a general audience discussing, for example, health or fitness education, instead of focusing on state-of-the-art and innovative technological developments in healthcare.

This resulted in a selection of 30 healthcare podcasts in total. One was excluded due to incorrect publishing dates discovered after the crawling process described below. Table 2 gives an overview of the final set of 29 podcasts.

Table 2 Overview of the 29 selected healthcare podcasts in the data set including their names, abbreviations, number of episodes and starting year 

In the data collection process, an RSS feed crawler was implemented to download the audio files and associated metadata published by the individual podcasts. The crawler was built using Python 3 and takes one or more feed links as input. The pipeline consists of two stages: crawl and convert. In the first stage, all episodes referenced in an RSS feed are downloaded one after the other. In the second stage, non-MP3 file formats are converted into MP3. As shown in Table 2, each podcast received an abbreviation of its name for better readability. After downloading, the metadata of all episodes was parsed and stored in a central CSV file.
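
As an illustration, a minimal sketch of such a two-stage crawler is shown below, assuming the feedparser and pydub libraries; the feed URL, directory names and metadata fields are hypothetical and simplified compared to the actual implementation.

```python
import csv
import os
import feedparser  # pip install feedparser
import requests
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def crawl(feed_url, out_dir):
    """Stage 1: download every episode enclosure referenced in the RSS feed."""
    os.makedirs(out_dir, exist_ok=True)
    feed = feedparser.parse(feed_url)
    episodes = []
    for entry in feed.entries:
        audio_url = entry.enclosures[0].href  # the episode's audio file
        path = os.path.join(out_dir, os.path.basename(audio_url).split("?")[0])
        with open(path, "wb") as f:
            f.write(requests.get(audio_url, timeout=60).content)
        episodes.append({"title": entry.get("title", ""),
                         "published": entry.get("published", ""),
                         "file": path})
    return episodes

def convert(episodes):
    """Stage 2: convert non-MP3 audio files (e.g. M4A, WAV) to MP3."""
    for ep in episodes:
        if not ep["file"].lower().endswith(".mp3"):
            mp3_path = os.path.splitext(ep["file"])[0] + ".mp3"
            AudioSegment.from_file(ep["file"]).export(mp3_path, format="mp3")
            ep["file"] = mp3_path

# Hypothetical usage: crawl one feed and store metadata in a central CSV file.
episodes = crawl("https://example.com/podcast/rss", "data/OR")
convert(episodes)
with open("data/metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "published", "file"])
    writer.writeheader()
    writer.writerows(episodes)
```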

3.2 Data processing

The data processing consists of two parts: the evaluation and selection of Speech-to-Text APIs, and the transcription of the audio files to text files. We started with a primary search for Speech-to-Text APIs and found the following speech recognition engines supported by the SpeechRecognition Python library [73]:

  • CMU Sphinx [74]

  • Google Web Speech [75]

  • Google Cloud Speech API [76]

  • Houndify API [77]

  • IBM Watson Speech-to-Text [78]

  • Microsoft Bing Voice Recognition [73]

  • Snowboy Hotword Detection [79]

  • Wit.ai [80]

This list was reduced step by step due to additionally required libraries (CMU Sphinx), a poor fit for the use case (Wit.ai), or tools that had become obsolete and been switched off by their service providers (Microsoft Bing Voice Recognition, Snowboy Hotword Detection). After an extended literature search, the three Speech-to-Text APIs DeepSpeech [81], Microsoft Azure [82] and Vosk [83] were added, resulting in a final list of seven paid as well as open-source Speech-to-Text APIs that were tested and evaluated (initially based on the versions from October 2020):

  • DeepSpeech [81]

  • Google Web Speech [75]

  • Google Cloud Speech API [76]

  • Houndify API [77]

  • IBM Watson Speech-to-Text [78]

  • Microsoft Azure Speech [82]

  • Vosk [83]

Before the transcription tests, snippets of one episode per podcast (less than one minute long) were created. As a reference, each snippet was transcribed manually. Afterward, each audio file was transcribed into multiple text files using the different speech recognition engines. For the evaluation and comparison of the Speech-to-Text APIs, the word error rate (WER) was used [84, 85]. The WER compares a reference with a hypothesis and is defined as [86]:

$$WER = \frac{S + I + D}{N} = \frac{S + I + D}{S + D + C}$$
(1)

where

  • S is the number of substitutions,

  • I is the number of insertions,

  • D is the number of deletions,

  • N is the total number of words in the reference (N = S + D + C),

  • C is the number of correct words.

In the following experiment, the manually transcribed transcript is the ground truth and each of the API transcripts serves as one hypothesis. The program loops over all text files; their content is read into variables and pre-processed, for example by replacing non-alphanumeric characters and lowercasing. The text is used as input for calculating the distance matrix, which in turn serves as the basis for the WER calculation. The WER values were saved in a CSV file and used for the median WER calculation as well as the value normalization. In previous literature, Microsoft Azure and Google Cloud had the lowest WER among their peers [87, 88]. Within our test on 30 samples, Microsoft Azure again had the lowest median WER of 4.1. Due to limited financial resources, we restricted the final choice to the free Speech-to-Text APIs Google Web Speech, Vosk and DeepSpeech, tested with both WAV and MP3 input, and decided to use DeepSpeech with MP3 files as input.
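
As a sketch of this evaluation step, the WER can be computed from the Levenshtein distance matrix between the pre-processed reference and hypothesis; the following self-contained example assumes simple whitespace tokenization and is not the exact evaluation script.

```python
import re

def preprocess(text):
    """Lowercase, replace non-alphanumeric characters and tokenize on whitespace."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate via the Levenshtein distance matrix (Eq. 1)."""
    ref, hyp = preprocess(reference), preprocess(hypothesis)
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substitution in five reference words yields 0.2.
print(wer("the patient data is ready", "the patient data was ready"))
```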

During the lifetime of the project, we re-ran the API evaluation of the two open-source APIs DeepSpeech and Vosk with the versions from October 2020 (DeepSpeech 0.8.2, Vosk 0.3.15) and additionally with the latest versions from February 2022 (DeepSpeech 0.9.3, Vosk 0.3.32), following the same procedure as described before. Beyond that, we added randomness to the WER calculation to improve the robustness of the evaluation. One random episode was taken from each of the 30 podcasts at this stage. For each of these episodes, three 30-second snippets were chosen at random positions in order to avoid, for example, snippets with the same introductory or farewell sentences of a podcast. Each snippet was guaranteed to have a unique word sequence. Again, all snippets were manually transcribed and served as ground-truth labels. Both APIs performed roughly equally in terms of mean WER. Nevertheless, we conducted the experiments with Vosk 0.3.32, because DeepSpeech 0.9.3 had a larger standard deviation and required substantial GPU power for the podcast transcription process.

3.3 Buzzword identification and selection

Within this study, the following glossaries and dictionaries were used as sources for the creation of the dictionary list as well as the English buzzword (also referred to as keyword) list, which is used for the detection of AI-related keywords in the podcast transcriptions:

  • Accenture Applied Intelligence Glossary [89],

  • Gartner Hype Cycles 2015 - 2020 [90,91,92,93,94,95,96,97,98,99],

  • Gartner Information Technology Glossary [100],

  • Github Machine Learning Glossary [101],

  • Google Machine Learning Glossary [102],

  • Microsoft Machine Learning Glossary [103],

  • Oxford Dictionary of Computer Science [104],

  • Stanford Machine Learning Glossary [2].

For the buzzword collection, the scraping process of the glossaries is divided into four steps. First, the HTML data is pulled from the website using the Python library requests [105]. In the second step, the HTML blob is transformed into a traversable data structure using the parsing library BeautifulSoup (BS4) [106]. The next step consists of traversing the HTML and selecting the relevant tags that surround the AI terms. In the final step, the terms are saved line-wise in a text file inside the keyword directory. After checking possible overlaps between the glossaries, the aggregated list contained 761 keywords, but still included general terms like action or step that needed to be filtered out. In preparation for the filtering process, the dictionary list was compiled from computer science-related terms in the Oxford Dictionary and the multiple data science and AI hype cycles. The only difference in the scraping process compared to the glossaries was the usage of Selenium [107] instead of the requests library [105]. In further pre-processing steps, all characters were lowercased, keywords in parentheses were extracted and duplicate entries were removed from both aggregated lists. Subsequently, all terms on the aggregated keyword list were checked against the aggregated dictionary list and removed if they could not be found there. Acronyms were eliminated from the list. In the analysis, terms like AI and artificial intelligence were counted together. In the end, the buzzword list contains 102 keywords (see Appendix A).
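
The four scraping steps can be sketched as follows; the glossary URL and the CSS selector for the term tags are hypothetical, since the relevant tags differ between the glossaries and have to be identified manually for each source.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical glossary URL and tag structure.
GLOSSARY_URL = "https://example.com/machine-learning-glossary"

# Step 1: pull the HTML data from the website.
html = requests.get(GLOSSARY_URL, timeout=30).text

# Step 2: transform the HTML blob into a traversable data structure.
soup = BeautifulSoup(html, "html.parser")

# Step 3: traverse the HTML and select the tags surrounding the AI terms
# ("h3.term" is a placeholder selector).
terms = [tag.get_text(strip=True).lower() for tag in soup.select("h3.term")]

# Step 4: save the terms line-wise in a text file inside the keyword directory.
with open("keywords/example_glossary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(terms))
```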

3.4 Topic clustering

A collection of content centered around a common theme is described as a topic cluster [108]. The automatic extraction of topic clusters from data by applying machine learning methods is called topic clustering. Latent Dirichlet allocation (LDA), a Bayesian model, was introduced in 2003 as one of the first approaches for topic clustering. It involves reducing the dimensionality of the text representations, grouping words into clusters and distinguishing between various topics [109].

In the framework of this research project, state-of-the-art algorithms based on DL were applied to perform topic clustering. In particular, we used a pre-trained BERT model for extracting contextualized word embeddings. We did not fine-tune the weights of the BERT model for the respective task. Instead, we used these contextualized word embeddings to train a Cross-lingual Contextualized Zero-shot Topic Model (CTM) for topic modeling [110].

Contextualized Topic Models (CTMs) are an extension of topic models that include contextual information in their topic representations. The primary innovation of CTMs compared to traditional topic models like LDA is that they can take advantage of contextualized word embeddings, such as those produced by BERT. This approach allows us to train only a relatively small CTM for topic modelling. Here, we followed the work of Bianchi and colleagues [110] and trained a Neural-ProdLDA [111], which has been shown to obtain good results for zero-shot topic modelling. The Neural-ProdLDA is based on the Variational AutoEncoder (VAE) proposed by Srivastava and Sutton [111]. The model consists of two components: first, an encoder network that takes the contextualized word embeddings as input and transfers them to a latent representation by generating a mean and standard deviation; second, a decoder network that samples from the latent space using a Gaussian distribution parameterized by the encoded mean and standard deviation. For more details refer to Kingma and Welling [112] and Srivastava and Sutton [111].
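
A minimal PyTorch sketch of this encoder-decoder structure is shown below. It is simplified compared to the Neural-ProdLDA of Srivastava and Sutton [111] (e.g. it assumes a standard normal prior instead of the Laplace-approximated Dirichlet prior); the class name and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualizedProdLDA(nn.Module):
    """Simplified VAE topic model: contextualized embeddings in,
    bag-of-words reconstruction out (hypothetical class name)."""

    def __init__(self, embed_dim=768, vocab_size=2000, n_topics=14, hidden=100):
        super().__init__()
        # Encoder: contextualized embedding -> mean and log-variance
        # of the latent topic distribution.
        self.encoder = nn.Sequential(nn.Linear(embed_dim, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        # Decoder: latent sample -> distribution over the vocabulary
        # (the unnormalized topic-word matrix beta, as in ProdLDA).
        self.beta = nn.Linear(n_topics, vocab_size, bias=False)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample from the Gaussian parameterized
        # by the encoded mean and standard deviation.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        theta = F.softmax(z, dim=-1)                    # document-topic mix
        word_dist = F.softmax(self.beta(theta), dim=-1) # vocabulary distribution
        return word_dist, mu, logvar

def elbo_loss(bow, word_dist, mu, logvar):
    """Reconstruction term plus KL divergence to a standard normal prior."""
    recon = -(bow * torch.log(word_dist + 1e-10)).sum(dim=-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (recon + kl).mean()
```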

We expect the results to be influenced by the choice of the model used for extracting the contextualized word embeddings. We plan to address this in more detail in future research. For this study, we chose BERT, as previous work showed that it obtains competitive results on feature extraction for the English language [113].

To train a topic clustering model, the number of topics has to be predefined. The ideal number of topics for the data set was estimated using the metrics Coherence and NPMI, which have already been applied in the existing literature [114]. Both metrics were calculated iteratively over the range of 10 to 30 topics with a step size of 2. To identify the number of topics with the highest value, the product of both metrics was calculated and multiplied by 10 for visibility in Fig. 1. The optimal number of clusters was 14, which was used for the final training and inference process (a sketch of this selection loop follows Fig. 1).

Fig. 1: Selection of the optimal number of clusters (red highlighted area)
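
The selection loop can be sketched as follows, assuming gensim's CoherenceModel for both metrics (with C_V taken as the coherence measure, an assumption on our part) and a hypothetical train_topic_model function standing in for the CTM training step.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def select_n_topics(tokenized_docs, train_topic_model):
    """Pick the number of topics maximizing Coherence (C_V) * NPMI."""
    dictionary = Dictionary(tokenized_docs)
    best_n, best_score = None, float("-inf")
    for n in range(10, 31, 2):  # range of 10 to 30 with a step size of 2
        # train_topic_model returns the top words per topic (hypothetical).
        topics = train_topic_model(tokenized_docs, n_topics=n)
        cv = CoherenceModel(topics=topics, texts=tokenized_docs,
                            dictionary=dictionary,
                            coherence="c_v").get_coherence()
        npmi = CoherenceModel(topics=topics, texts=tokenized_docs,
                              dictionary=dictionary,
                              coherence="c_npmi").get_coherence()
        score = cv * npmi  # product of both metrics (scaled by 10 in Fig. 1)
        if score > best_score:
            best_n, best_score = n, score
    return best_n
```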

The model outputs, for each topic, a string of the 10 most contributing words. Based on these words, we manually defined a generic term for each of the 14 topics (see Appendix C).

Appendix B provides more information about the data source and data set construction, the text cleaning pipeline, and the training, optimization and inference steps related to the topic clustering.

3.5 Sentiment analysis

In another experiment, we performed a sentiment analysis following a methodological approach similar to that of the topic clustering, but without the use of an additional tool such as OCTIS [114]. The Hugging Face platform provides different fine-tuned sentiment analysis models. In our case, the transformer needs to be trained on the English language. Therefore, we chose distilbert-base-uncased-finetuned-sst-2-english, which is based on the distilbert-base-uncased model pre-trained on Wikipedia data. Via the Hugging Face transformers package, the sentiment analysis model can be accessed and downloaded using the pipeline module [16].

Following this procedure, the sentiment was calculated on the transcribed podcast text at episode level. Based on the episode-level results, the overall sentiment for each podcast and topic was calculated (see Eq. 2).

$$Score = \frac{positive}{positive + negative}$$
(2)

Here, the number of positive sentiments is divided by the total number of positive and negative sentiments. A score of 0 indicates an entirely negative sentiment, whereas a score of 1 represents an entirely positive sentiment.
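
A minimal sketch combining the Hugging Face pipeline with the score from Eq. 2 could look as follows; the transcript list is hypothetical, and long transcripts are truncated to the model's maximum input length.

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Hypothetical episode-level transcripts belonging to one podcast and topic.
transcripts = [
    "We are excited about the potential of AI in patient care.",
    "The privacy risks of sharing health data are concerning.",
]

# Each result carries a POSITIVE/NEGATIVE label and its likelihood.
labels = [r["label"] for r in classifier(transcripts, truncation=True)]

positive = labels.count("POSITIVE")
negative = labels.count("NEGATIVE")
score = positive / (positive + negative)  # Eq. 2: 0 = negative, 1 = positive
```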

4 Data set, results and discussion

In this section, we present the novel data set and illustrate the development of the podcast medium over the years 2015 until 2021. In addition, we perform a baseline analysis of the proposed data set. For this purpose, we normalized the buzzword occurrences to the total number of words in the data sources. In further experiments, we describe the relative total occurrence of AI-related buzzwords.
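
This normalization can be sketched as follows; the function and variable names are hypothetical and simplified compared to the actual analysis code.

```python
import re
from collections import Counter

def relative_occurrence(transcripts, buzzwords):
    """Relative occurrence of each buzzword: count / total number of words."""
    total_words = 0
    counts = Counter()
    for text in transcripts:
        tokens = re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()
        total_words += len(tokens)
        joined = " ".join(tokens)
        for bw in buzzwords:
            # Count multi-word buzzwords such as "machine learning" as phrases.
            counts[bw] += len(re.findall(rf"\b{re.escape(bw)}\b", joined))
    return {bw: counts[bw] / total_words for bw in buzzwords}

# Hypothetical usage on a tiny example corpus.
rel = relative_occurrence(
    ["Machine learning improves data quality."],
    ["data", "machine learning", "bias"],
)
print(rel)  # {'data': 0.2, 'machine learning': 0.2, 'bias': 0.0}
```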

4.1 Data set analysis

The 29 healthcare podcasts selected in the data collection process were already presented in Sect. 3.1. The distribution of the total of 3449 episodes across the different podcasts is relatively balanced, with the exception of the Outcomes Rocket (OR) podcast. OR has approximately four times more episodes than the healthcare podcast with the second most episodes. This is a result of episodes being published almost daily, even though the podcast was only launched in 2017, and could lead to a potential data imbalance towards topics discussed in OR.

To investigate the hypothesis that podcasts are an emerging medium and to address RQ1, we looked at the development of the number of new podcast episodes between 2015 and 2021 for the previously discussed healthcare podcasts (see Fig. 2).

Fig. 2: Development of the number of podcast episodes from 2015 until 2021

In the first years of the period under review, up to 2017, there was rather little growth in newly published episodes. In the following years, a gradual increase can be observed, which continues over the remaining years. While 2019 and 2020 again show further gains, we see the biggest jump towards the end of 2021, when the number of episodes more than doubled compared to the previous year. Due to this development, podcasts, in our case related to healthcare, can be called an emerging medium. Furthermore, this supports the assumption regarding the increased public perception and popularity of podcasts in general.

4.2 AI trend detection and development

In order to answer RQ1 and RQ2, we wanted to detect a possible AI trend in healthcare podcasts and follow its development over the period from 2015 until the start of 2022. As mentioned in Sect. 3.3, we identified 102 unique English AI-related buzzwords, which make up the final keyword list used for this analysis (see again Appendix A).

At first, we analyzed the trend of the most frequently occurring buzzwords in the proposed podcast data set. In Fig. 3, the relative occurrence of the top ten AI-related buzzwords (data, artificial intelligence, software, metric, cloud, transparency, bias, algorithm, machine learning and noise) is visualized over the whole period under consideration.

Fig. 3: Top 10 AI-related buzzwords (2015–2021)

The buzzword data clearly stands out with a relative occurrence of more than 0.16% and is well ahead of ai at just over 0.1% and software at just below 0.1%. The other keywords in the top 10 follow at a considerable distance, in the range of 0.02% to 0.06%. The clear distance between data and the second most common key term ai can be explained by the fact that data is not only an AI-related buzzword but also a general term used in a wide range of subject areas, not only in healthcare.

In the next step, we not only look at the top keywords over the entire period but also detect the trend and analyze the development of the Top 7 AI-related buzzwords from 2015 to the beginning of 2022 (see Fig. 4).

Fig. 4: Trend development: relative occurrence of the Top 7 AI-related buzzwords over the analyzed period (2015–2021)

In the first two years, up to the end of 2016, only slight fluctuations can be observed and the relative buzzword occurrence of these seven terms always remains below 0.002%. From 2017 onwards, an increase in all buzzwords is visible, especially for the term data, which reaches a relative buzzword occurrence of just under 0.004% for the first time. After a short stagnation until the beginning of 2018, a continuously increasing trend can be observed for the four buzzwords bias, cloud, metric and transparency until the end of the observation period.

In 2018, the term data still rises almost in parallel with software and ai, but from 2019 onwards it permanently outstrips them, with a relative buzzword occurrence of more than 0.012% at its peak. As explained earlier, this is due not only to the increasing trend in the AI context but also to the general use of the term data. In contrast, the value of the buzzword artificial intelligence doubled from 0.002% to 0.004% for the first time at the beginning of 2018. In the following two years, the trend rises moderately, similar to the majority of the other AI-related buzzwords in this top-seven evaluation. This changes again in 2020, when ai reaches its maximum value of almost 0.009% and keeps it relatively constant until the end of the observation period at the beginning of 2022. Based on the podcast data set, these results show an increasing or persistently high trend of the topic area AI in healthcare.

4.3 Topic clustering

With regard to RQ3, we applied topic clustering in order to detect topics within the selected healthcare podcasts in the created data set and to further investigate the trend development over time.

4.3.1 Cosine similarities

To find out which podcasts or podcast episodes talk about the same or at least very similar topics, we calculated the cosine similarity between the individual embedding vectors z of these data sources (a sketch of this computation is given after Table 3). Since it was not feasible to visualize all 14 identified topics for all podcasts in the data set, we first assigned the most dominant topic to each of the 29 healthcare podcasts. An overview of the resulting seven assigned topics is presented in Table 3.

Table 3 Assignment of each healthcare podcast to its most dominant topic
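
As a sketch, assuming the embedding vectors z are available as numpy arrays:

```python
import numpy as np

def cosine_similarity(z_a, z_b):
    """Cosine similarity between two embedding vectors z."""
    return float(np.dot(z_a, z_b) / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))

def average_similarity(embeddings_a, embeddings_b):
    """Average cosine similarity between all episode pairs of two podcasts
    (inputs: arrays of shape [n_episodes, embedding_dim])."""
    sims = [cosine_similarity(a, b) for a in embeddings_a for b in embeddings_b]
    return float(np.mean(sims))
```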

We see that healthcare innovation, appearing in 13 podcasts as the dominant topic, plays a very strong role in projects and discussions related to healthcare. However, there are other topics, focusing on specific guest speakers from the healthcare environment, home care, or startup acceleration, that are likewise the focus of several podcasts. The content of the latter in particular is very close to that of healthcare innovation. Nevertheless, there are individual podcasts that differ from the broader mass in their choice of discussion topics and deal more intensively with data privacy (HTP), hospital pricing (HCR) or vaccination (CoHC).

In order not only to look at the most prominent topic of each podcast individually but to better evaluate the similarity of the podcasts, Fig. 5 visualizes the average cosine similarity of each podcast over the entire period. As scale, we chose a colour bar with a range from 0 to 1: the closer a value is to 1, the stronger the similarity. In particular, the four topic clusters healthcare innovation, guest speakers, home care and startup acceleration, which are the focus of discussion in several of the healthcare podcasts, are clearly visible here and show a very high value close to 1.

Fig. 5: Average cosine similarity for the most dominant topic of each healthcare podcast

In addition to the clear recognizability of these clusters, the similarity of other podcasts (see, for example, the lower center of the heat map) is also apparent, which are assigned to healthcare innovation and startup acceleration with regard to their dominant topic. Here, too, the average cosine similarity is in the upper range of the scale, close to 1.

4.3.2 Topic and trend development over time

After the identification of the most dominant topics and the evaluation of the podcasts' similarity, we conducted an additional analysis step to investigate the topic and trend development. For this experiment, the four podcasts Healthcare Triage Podcast (HTP), Medtech Talk (MT), PopHealth Week (PW) and This Just In (TJI) were selected from the data set. This subset was chosen because all four podcasts started at the beginning of 2015 and could therefore be evaluated over the whole period until the end of 2021. The topic changes of these podcasts over time, based on the episode-level transcriptions, are visualized in Figs. 6 and 7. The podcasts were grouped into yearly bins (from 2015 until 2021); each year, as well as the average over the respective years, is presented in an individual heat map. Over the years, the number of identified topics varied considerably, between six and ten per year. The topic ai is continuously present from 2015 until 2018, as shown in Fig. 6. Nonetheless, it remains a topic of lesser relevance in this part of the evaluation. At least in these four healthcare podcasts, the topic ai could no longer be clearly detected from 2019 onward. It should be noted, however, that in this case the ai topic contains only very specific terms that one would rather expect in a research context. Therefore, we also direct the view to the multiple topics closely related to the field of AI, such as data privacy, cloud architecture or etl pipeline, which were detected over almost the entire period under review.

Fig. 6: Cosine similarities for the topics of the podcasts HTP, MT, PW and TJI (2015–2018)

Fig. 7: Cosine similarities for the topics of the podcasts HTP, MT, PW and TJI (2019–2021) and the overall average cosine similarities (2015–2021)

Furthermore, we observe topics such as data privacy that are continuously demonstrable and increased especially in the first years. In addition, we find topics like healthcare innovation that tended to be intensely discussed especially in 2015 and 2016, but subsequently lost some of their importance (at least in terms of their presence in the discussions) until 2021 and settled at a stable level (see again Figs. 6 and 7). The trend development is quite different in the case of home care. While it was still one of the dominant topics in 2015, its presence in the four podcasts under consideration initially declined rapidly in the following years, before a slight increase became apparent again in 2018, which was maintained in the subsequent years. In Fig. 7d, the average value across all years shows that guest speakers, cloud architecture, data privacy, home care, and startup acceleration are the five most dominant topics in these four healthcare podcasts between 2015 and 2021.

4.4 Sentiment analysis

In an additional experiment, we conducted a sentiment analysis for all 29 healthcare podcasts over the period from 2015 until 2021. We targeted the visualization of each evaluated podcast's sentiment towards each of the 14 identified topics and wanted to answer RQ4, i.e. whether sentiment analysis can reveal further insights for the detection of past and current trends in healthcare. As illustrated in Sect. 3.5, the applied sentiment transformer from Hugging Face [16, 115] provides a value between 0 and 1 as well as either a positive or negative label, where the value represents the likelihood of the label. Within this subsection, the sentiment score (for Figs. 8, 9 and 10) was calculated as the relative share of positive among positive and negative labels, within the range of 0 (entirely negative) to 1 (entirely positive), to quantify the respective sentiments.

In Fig. 8, we illustrate the averaged topic sentiments for all 29 healthcare podcasts over time. The darker the green color of a field in the heat map, the more positive the sentiment towards the respective topic, for example for the topics ai, healthcare innovation or home care in 2021. Overall, the evaluated healthcare podcasts show a positive or very positive sentiment towards most of the identified topics. Only in 2015 and 2016 can we observe a neutral sentiment, for example regarding hospital pricing and vaccination. However, it must be taken into account that the number of available episodes was lower in those years than in the rest of the studied period and that some of the podcasts went on air later.

Fig. 8: Averaged topic sentiments for all 29 healthcare podcasts per year (2015–2021); grey color indicates the non-presence of a topic

To better assess the contribution of an individual podcast to the average sentiment of all podcasts, we look in more detail at exemplary podcasts with either an overall very negative or very positive sentiment toward the topics under consideration. First, we look at podcasts with a very positive sentiment on average. Figure 9 shows the topic sentiments over the years for the podcast HealthChangers [53], which was first published in 2017. On the one hand, there are topics like heart monitoring or startup acceleration that were not only identified in multiple years but are also linked with a positive sentiment over time. On the other hand, the topic ai could only be identified in this specific podcast in 2019 and was associated with a rather negative sentiment.

Fig. 9: Yearly sentiments of the topics from the podcast HealthChangers [53]; grey color indicates the non-presence of a topic

In comparison, the Healthcare Triage Podcast [57], which has been on air since 2015, shows on average a very negative sentiment towards the investigated topics. In Fig. 10, we observe a very negative overall sentiment especially in the years 2015 until 2017. Nevertheless, there are topics like data privacy that play a significant role within this podcast and carry a rather positive overall sentiment. Even before the COVID-19 pandemic, the topic vaccination was continuously discussed, and we can see a shift in sentiment toward the positive especially in 2020 and beyond, years that were heavily influenced by the pandemic in the public discourse [116].

Fig. 10: Yearly sentiments of the topics from the podcast Healthcare Triage Podcast [57]; grey color indicates the non-presence of a topic

In our study, we chose sentiment analysis as a tool to visualize the sentiment over time towards specific topics and to investigate a trend of positive 'hype' or 'fear' towards them. We were able to detect the speakers' sentiments toward the topics identified in the episode transcriptions. In total, we computed the sentiment for all 14 topics in each of the 29 healthcare podcasts and observe that the evaluated podcasts have an overall more positive than negative sentiment. However, we could not identify a correlation between the respective topics and the sentiment.

5 Conclusion, limitations and outlook

In previous research, a growing interest in AI could be observed during the last decades, especially regarding technological developments and application areas of AI such as healthcare. In addition, digital media are described as an increasingly important option for disseminating research findings to a broader audience and for contributing to technological adoption. So far, academic publications, newspapers or social media have been used as sources to detect trends. Here, we fill the gap in trend research by going beyond those data sources and creating a novel data set. To enable other researchers to recreate the data set, we publish the code on GitHub (see Sect. 5). Within this research study, we use the data set to investigate the suitability of podcasts as a research medium for trend detection in general and conduct a proof-of-concept study with a focus on the field of AI in healthcare. We propose a web mining approach to obtain and analyze the data from 29 healthcare podcasts between 2015 and 2021. Based on the 102 identified AI-related buzzwords, we are able to successfully detect an AI trend and make its development visible. We look beyond the topic area of AI and exemplify the possibilities of a podcast-based data set for trend analysis in healthcare. Using a machine learning-based topic clustering approach, we extract the most dominant topics and track their development over time. In an additional sentiment analysis, we visualize the sentiment of the podcasts towards the 14 identified topics and detect a more positive sentiment of the speakers towards them. Our methodological approach is transferable to future research working with the same data set on any kind of topic in healthcare besides AI, but is also applicable in additional industries working with podcasts as a research medium for trend research in general.

One limitation of this study comes with the selection of an open-source API for the transcription of the podcasts. According to our evaluation, Microsoft Azure performed best among the tested APIs; nevertheless, we had to choose a non-commercial transcriber. In addition, the overall data set size is still limited, even though we selected 30 healthcare podcasts and used 29 of them in the final analysis. An extended data set would further reduce the influence of individual podcasts on the results. Within our data set, the podcast Outcomes Rocket has four times more episodes than the podcast with the second most episodes, which could lead to a potential data imbalance. Standardizing the number of episodes used per podcast as well as applying techniques such as undersampling, oversampling or cost-sensitive learning could be suitable approaches to address the data imbalance in a future revision of the project. However, care has to be taken that the selected techniques do not influence the trend analysis results. Therefore, we recommend including more podcasts but fewer episodes per selected podcast in the data set.

In future work, the data set should be extended by selecting further healthcare podcasts and adding their respective episodes. Furthermore, it would be valuable to enrich the data set with newly published episodes to extend the period under consideration and further investigate the podcast and AI trend development in healthcare in the coming years. In order not only to further validate the findings of this work but also to close a still-existing research gap in trend detection research, multiple data sources (in the form of textual and non-textual data) should be investigated. Additional data sources could also look beyond the healthcare domain by addressing further industries or politics, making a comparison of trend developments possible.

In the selection process of the podcasts (see Sect. 3.1), four criteria were considered. This procedure could be extended in future studies by looking more specifically at numbers such as audience size or ratings. In addition, the expertise of the hosts, not only of the guests, should be reviewed. As the study depends, among other things, on transcription accuracy, the selection of the Speech-to-Text API is a crucial aspect. Therefore, open-source as well as commercial transcribers should be used in order to extend the API evaluation process and to compare the respective analysis results.

Regarding the methodological approach, all-mpnet-base-v2 was used as the BERT model, which has an impact on the feature extraction from the text and, as a consequence, on the topics found by the topic model. Therefore, medicine-specific or multilingual models could be chosen and may lead to a different list of identified topics. The experiments regarding sentiment analysis showed the successful identification of positive and negative sentiments among podcast speakers. Within this study, we did not differentiate between multiple speakers even though multiple people were recorded. Accordingly, speaker diarization [117] would be an approach for future research, separately investigating the transcribed text of podcast hosts and guest speakers.