Abstract
Over the past few decades, the topic of artificial intelligence (AI) has gained considerable attention in both research and industry. In particular, the healthcare sector has witnessed a surge in the use of AI applications as these methods have matured. However, as the use of machine learning (ML) in healthcare continues to grow, we believe it will become increasingly important to examine public perceptions of this trend to identify potential impediments and future directions. Current work focuses mainly on academic data sources and industrial applications of AI. However, to gain a comprehensive understanding of the increased societal interest in AI, digital media such as podcasts should be consulted, as they are accessible to a broader audience. To examine this hypothesis, we investigate the development of the AI trend in healthcare from 2015 until 2021. In this study, we propose a web mining approach to collect a novel data set consisting of 29 healthcare podcasts with 3449 episodes. We identify 102 AI-related buzzwords that were extracted from various glossaries and hype cycles. These buzzwords were used to conduct an extensive trend detection and analysis study on the collected data using machine learning-based approaches. We successfully detect an AI trend and follow its evolution in healthcare podcasts over several years. Besides the focus area of AI, we are able to detect 14 topic clusters and visualize the trending or declining dominant topics over the whole period under consideration. In addition, we analyze the sentiments in podcasts towards the identified topics and deliver further insights for trend detection in healthcare. Finally, the collected data set can be used, via topic clustering, for trend detection beyond AI-related topics.
1 Introduction
Not only is there a growing interest in Artificial Intelligence (AI) in academic research, as illustrated by the quadrupling of peer-reviewed AI-related journal publications in the last two decades [1, 2], but the technological developments in the field of AI also have an enormous impact on everyday life and industry [3]. This process is driven by advances in digitized data acquisition, computing infrastructure, and Machine Learning (ML) [4]. One of the areas in which AI is increasingly being applied is healthcare [5]. Rajpurkar and colleagues [6] declare AI ready to transform medicine sustainably and broadly, improving the experience of patients and clinicians alike. During the last few years, medical AI algorithms improved step by step and reached a new level of maturity, for example in disease detection using medical images [6, 7]. The challenges on the way to the successful adoption of AI now lie more in its application to routine clinical care and are linked to the safety and effectiveness of AI [6, 8, 9]. Healthcare experts from China and Germany named the insufficient traceability and causality of AI decision-making processes, as well as reliability concerns related to AI accuracy and the required level of supervision, as complex topics to address [10].
Particularly in view of these challenges, as well as the goal of a broader implementation of AI in the healthcare environment, digital media such as podcasts are an increasingly sought-after option for disseminating relevant information and current research findings related to healthcare, making them accessible to a broad audience in a relatively easy yet understandable way [11, 12]. The rising popularity of the medium podcast in general can be observed in the exponential growth in the number of new podcasts and episodes per year over the last decade, especially from 2015 onwards. From around 23,000 new podcasts and almost a million new episodes in 2010, the numbers increased tremendously to more than 223,000 published podcasts and more than 26 million new episodes across all genres in 2022 [13]. Among the most popular genres, the platform Listen Notes lists Health & Fitness alongside Society & Culture and Business [13, 14]. In the United States, an annual growth rate of 17 per cent in monthly listeners can be observed between 2019 and 2023, and monthly listeners are expected to increase to 164 million next year [15].
Previous research in the field of AI addresses mostly industrial applications and technological progress in healthcare. In the literature, AI trend research mainly works with academic data sources. We argue that in order to investigate an increased interest in AI within society, it is necessary to use digital media, like podcasts, as sources that are accessible to a broader audience. To fill this gap, we propose a web mining approach to create a novel data set based on podcasts and illustrate a data-driven rather than a methodological approach. In this study, we collect data from a total of 29 English-language healthcare podcasts. Within this work, we address the following research questions (RQ):
-
RQ1: Are podcasts a suitable research medium for trend detection in general as well as related to the field of AI, especially in healthcare?
-
RQ2: Are we able to detect an AI trend and examine its development in healthcare podcasts between 2015 and 2021?
-
RQ3: Can we identify unknown topics within the multiple podcast data sources using topic clustering?
-
RQ4: Is it possible to detect the speakers’ sentiments towards specific AI-related keywords by applying sentiment analysis?
In order to address the research questions, we evaluate and select Speech-to-Text APIs to process the data and transcribe the audio files to text data. AI-related buzzwords are extracted from multiple sources like glossaries and hype cycles. We utilize these buzzwords for the purpose of trend detection and trend analysis on the collected data by the application of machine learning-based approaches. In the further course of the study, we employ state-of-the-art algorithms based on Deep Learning (DL) to perform topic clustering and sentiment analysis. In addition, a pre-trained transformer model based on the BERT architecture was fine-tuned on the healthcare podcast data. We used OCTIS, an open-source technology based on the Hugging Face BERT models, to build the topic clustering pipeline [16]. In the following, we give an overview of the main contributions of this study:
-
We describe a web mining approach that was used to create a novel data set including 29 healthcare podcasts (in total 3449 episodes from 2015 until 2021).
-
We identify 102 AI-related buzzwords and use them to successfully detect an AI trend and analyze its development in healthcare.
-
We identify unknown topics in healthcare based on podcasts as data sources.
-
We exemplify how the novel data set can be used for trend detection besides the field of AI and illustrate the transferability of the proposed approach for future research using podcasts.
In addition, we show that podcasts, both from the healthcare environment and in general, are an informative and, from our perspective, highly relevant emerging research medium in data mining, specifically in the field of web mining.
Reproducibility: The code for the crawlers to collect the data, that were used in the experiments (see Sect. 4.1), is available in the GitHub repository at https://github.com/mad-lab-fau/trend-detection-in-healthcare-podcast-data-set (see Sect. 5). The transcribed data set will be made available upon request.
2 Related work
In the following section, we investigate the current literature on web data mining and AI perception in general and take a deeper look at text mining in healthcare-related data sources. We further present the existing research in the fields of podcasts, trend detection, topic clustering and sentiment analysis.
In the literature, we find only a few studies and research projects that look specifically at the medium of podcasts. According to MacKenzie [17], podcasts have developed as a decentralized medium for science communication to the public since 2004. The author presented one of the first large-scale quantitative studies looking at the production and dissemination of language science podcasts, identifying a total of 952 English-language podcasts from 2004 until 2018. Due to the lack of a centralized database for podcast series, he used the iTunes podcast directory with over 200,000 podcasts and looked specifically at the category Natural Sciences. One limitation of the study was that this podcast category depends entirely on the assignment by the podcast producers themselves. Only online textual and visual metadata (e.g. social media content, websites or descriptions) of the podcasts were analyzed. Audio data or the underlying text data of the podcasts were not part of the investigation because of the impracticability of and challenges associated with transcribing and processing a large amount of audio data. In summary, the study presents a linear increase in the total number of science podcasts between 2004 and 2010, which was replaced by exponential growth between 2010 and 2018 [17].
Not only as a result of the increase of podcasts in science but due to general higher interest in the medium podcast, Vartakavi and colleagues [18] proposed a system called PodSumm for automatically generating audio summaries of podcasts to support the discovery of new content and to allow listeners to get an episode preview. They applied automatic speech recognition (ASR) to transcribe the audio data, then process and finally summarize the text. For the transcription, they used AWS Transcribe [19]. To test their pipeline, they created a podcast data set by collecting 309 episodes (in total 188 h of audio) from 19 podcast series from different genres [18].
In the case of Crosscast, a system provided by Xia and colleagues [20], the goal is not to automatically summarize podcasts but to automatically add visual data to audio travel podcasts. They transcribed the audio data to text using the crowd-sourced transcription service rev.com [21]. Within their study, they analyzed around 300 episodes from travel shows, documentaries and podcasts. They first attempted to use an ASR tool for the transcription process, but due to errors in their practical tests, they decided against it. Within their pipeline, they determine keywords and geographic locations in the text data by applying natural language processing (NLP) and text mining. This information is used for the automated selection of images from online sources, which are matched at the end with the audio commentary [20].
Looking beyond the online medium podcast, we find multiple studies that are scraping and mining web data. Fast and Horvitz [22] analyzed the New York Times articles between 1986 and 2016 in order to reveal trends as well as positive and negative sentiments towards the subject area of AI. Due to the lack of a universal as well as professional definition of AI, automatic sentiment analysis was not feasible. For this reason, manual annotation was performed based on paragraphs by engagement, optimism vs. pessimism, concerns for AI and hopes for AI. In general, they observed an increasing number of reports linked to the field of AI. However, they also recognized specific trends regarding the opportunities of AI in the areas of healthcare and education. Even without an automated sentiment analysis, they were able to evaluate attitudes toward AI, such as ethical issues or a possible loss of control, which tended to be perceived negatively by the public.
When looking at the methodological approach, their study could be described as an extension to the fundamental sentiment analysis, because it includes sentiments as well as emotions. This work shows the possibilities in trend detection and trend prediction of AI-related topics towards certain emotions based on structured and annotated text data [22]. A similar methodological procedure using topic clustering and sentiment analysis with the aim of detecting future trends was chosen by Aghababaei and Makrehchi [23], who used different data sources, such as Twitter posts and local crime rates, for their analysis.
Particularly with regard to healthcare, research projects are focusing on AI trend detection and analysis. These studies are concerned with the healthcare sector as a whole as well as with individual disciplines or application areas such as telemedicine [39, 40]. An example of the former is the work of Jiang and colleagues [39], who examined the status of AI applications in healthcare, with a particular focus on ML and NLP. By investigating data from PubMed, they show that the number of articles about DL has been increasing since 2013 and more than doubled from 2015 to 2016. NLP is used for the identification of keywords, e.g. related to diseases, to support the clinical decision-making process and to assist physicians with treatment suggestions. From their point of view, the successful use of AI requires NLP to support the mining of unstructured text data as well as ML methods for handling structured data such as images. An example of the latter is telemedicine, which allows physicians to examine or treat patients at a distance. In this area, Pacis and colleagues [40] describe four different trends and discuss the development of AI applications in intelligent assistance diagnosis, information analysis collaboration, patient monitoring and healthcare IT.
Based on the review of the wide-ranging research from the fields of web scraping and web data mining, AI trend analysis in general as well as in healthcare, we conducted an extended literature analysis, presented in Table 1.
In summary, we found multiple studies focusing on either web data mining, the field of AI, or specific domains such as healthcare, but we identify a lack of research regarding the specific combinations of those focus areas.
We found multiple studies mining and analyzing structured as well as unstructured data. Data sources such as newspaper articles [22], dictionaries [29] and public (document) databases [31, 32, 34] can be assigned to the first area. The latter area includes research that uses (large amounts of) social media data, for example from Twitter [30, 33, 38] or different kinds of blog systems [35, 37]. This is also where this study can be methodically allocated.
In the field of trend detection and trend analysis, data from public (social media) platforms is collected and used as a common source. Nevertheless, podcasts, in general as well as addressing healthcare, can still be considered as a significantly less used data source due, among other things, to the obstacles to overcome in the transcription and processing of large amounts of data [17].
3 Methodology
3.1 Data collection
In a first step, healthcare-related podcasts were collected, which resulted in an initial list of 45 healthcare podcasts [41, 42]. We restricted our analysis to English-spoken podcasts because English is the most supported language by Speech-to-Text APIs. The list of podcasts was further evaluated according to four criteria: 1) the Listen Score & Global Rank [43], 2) the overall number of episodes, 3) the availability of all episodes within a Really Simple Syndication (RSS) feed, 4) and the involvement of relevant guests and experts.
The popularity of podcasts is quantified by the metrics Listen Score and Global Rank, which are provided by the podcast search engine Listen Notes [14]. The Listen Score estimates the popularity of a podcast relative to all other public RSS-based podcasts worldwide on a scale from 0 to 100, and only the top 10 percent of podcasts receive one. Therefore, podcasts without a score were sorted out.
Secondly, we looked at the number of published episodes and filtered out podcasts with fewer than 25 episodes. Given the monthly release schedule of most podcasts, this ensured at least two years of data from each podcast. Third, it is essential that the number of item tags inside the RSS feed matches the total number of published episodes; otherwise, the crawler would fail to download all episodes. As the last criterion, we looked at the podcast guests and only selected shows with invited participants such as C-level executives, entrepreneurs or scientists. Through this procedure, we were able to sort out healthcare podcasts that address a general audience, discussing, for example, health or fitness education, instead of focusing on state-of-the-art and innovative technological developments in healthcare.
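The four selection criteria can be illustrated as a simple filter over podcast metadata. This is a minimal sketch; the field names and example records are hypothetical, not the study's actual data:

```python
# Sketch of the podcast selection step; metadata fields and example
# values are hypothetical, not the study's actual records.
podcasts = [
    {"name": "Podcast A", "listen_score": 55, "episodes": 120, "rss_items": 120, "expert_guests": True},
    {"name": "Podcast B", "listen_score": None, "episodes": 300, "rss_items": 300, "expert_guests": True},
    {"name": "Podcast C", "listen_score": 48, "episodes": 12, "rss_items": 12, "expert_guests": True},
    {"name": "Podcast D", "listen_score": 61, "episodes": 80, "rss_items": 75, "expert_guests": False},
]

def keep(p):
    return (
        p["listen_score"] is not None        # 1) only the top 10% receive a Listen Score
        and p["episodes"] >= 25              # 2) at least ~2 years of episodes
        and p["rss_items"] == p["episodes"]  # 3) RSS feed lists every episode
        and p["expert_guests"]               # 4) relevant guests and experts
    )

selected = [p["name"] for p in podcasts if keep(p)]
print(selected)  # only "Podcast A" passes all four criteria
```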
This procedure resulted in a selection of 30 healthcare podcasts in total. One was excluded due to incorrect publishing dates discovered after the crawling process described below. Table 2 gives an overview of the final 29 podcasts.
In the data collection process, an RSS feed crawler was implemented to download the audio files and associated metadata published by the individual podcasts. The pipeline was built using Python 3 and takes one or more feed links as input. It includes two stages: crawl and convert. In the first stage, all episodes listed in an RSS feed are downloaded. In the second stage, non-MP3 file formats are converted into MP3. As shown in Table 2, each podcast received an abbreviation of its name for better readability. After downloading, the metadata of all episodes was parsed and stored in a central CSV file.
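The crawl stage essentially amounts to parsing the item tags of an RSS feed and collecting each episode's audio enclosure. A minimal sketch using only the Python standard library (the feed content below is a hypothetical example, not real data):

```python
import xml.etree.ElementTree as ET

# Minimal sketch of the crawl stage: parse the <item> tags of an RSS
# feed and collect episode title and audio URL from the enclosure tag.
FEED = """<rss><channel>
  <item><title>Episode 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/></item>
  <item><title>Episode 2</title>
    <enclosure url="https://example.com/ep2.m4a" type="audio/mp4"/></item>
</channel></rss>"""

root = ET.fromstring(FEED)
episodes = [
    {"title": item.findtext("title"),
     "url": item.find("enclosure").get("url")}
    for item in root.iter("item")
]
# Non-MP3 formats (e.g. .m4a) would be converted to MP3 in the second stage.
print(len(episodes))  # 2
```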
3.2 Data processing
The data processing consists of two parts: the evaluation and selection of Speech-to-Text APIs and the transcription of the audio files to text files. At first, we started with a primary search for Speech-to-Text APIs and found the following supported speech recognition engines of the SpeechRecognition Python library [73]:
-
CMU Sphinx [74]
-
Google Web Speech [75]
-
Google Cloud Speech API [76]
-
Houndify API [77]
-
IBM Watson Speech-to-Text [78]
-
Microsoft Bing Voice Recognition [73]
-
Snowboy Hotword Detection [79]
-
Wit.ai [80]
This list was reduced step by step because of additionally required libraries (CMU Sphinx), a poor fit for the use case (Wit.ai), or tools that had been discontinued by their service providers (Microsoft Bing Voice Recognition, Snowboy Hotword Detection). After an extended literature search, the three Speech-to-Text APIs DeepSpeech [81], Microsoft Azure [82] and Vosk [83] were added to the list, which in the end comprised a total of seven paid as well as open-source Speech-to-Text APIs that were tested and evaluated (at first based on the versions from October 2020):
-
DeepSpeech [81]
-
Google Web Speech [75]
-
Google Cloud Speech API [76]
-
Houndify API [77]
-
IBM Watson Speech-to-Text [78]
-
Microsoft Azure Speech [82]
-
Vosk [83]
Before the transcription tests, snippets of one episode per podcast (less than one minute long) were created. As a reference, each snippet was transcribed manually. Afterward, each audio file was transcribed to multiple text files using the different speech recognition engines. For the evaluation and the comparison of the Speech-to-Text APIs, the word error rate (WER) was used [84, 85]. The WER compares a reference with a hypothesis and is defined as [86]:

$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$

where
-
S is the number of substitutions,
-
I is the number of insertions,
-
D is the number of deletions,
-
N is the total number of input words,
-
C is the number of correct words.
In the following experiment, the manually transcribed transcript serves as the ground truth and each API transcript as one hypothesis. The program loops over all text files; their content is read into variables and pre-processed, for example by replacing non-alphanumeric characters and by lowercasing. The text is used as input for calculating the distance matrix, which in turn serves as the basis for the WER calculation. The WER values were saved in a CSV file and used for the median WER calculation as well as the value normalization. In previous literature, Microsoft Azure and Google Cloud had the lowest WER among their peers [87, 88]. In our test on 30 samples, Microsoft Azure again had the lowest median WER of 4.1. Due to limited financial resources, we instead chose a free Speech-to-Text API from among Google Web Speech, Vosk and DeepSpeech, tested with both WAV and MP3 input, and decided to use DeepSpeech with MP3 files as input.
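The WER computation itself can be sketched as a word-level edit-distance dynamic program. This is a minimal illustration of the metric, not the study's exact implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance on the word level."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER = 1/4
print(wer("the cat sat down", "the cat sat up"))  # 0.25
```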
During the lifetime of the project, we re-ran the API evaluation of the two open-source APIs DeepSpeech and Vosk with the versions from October 2020 (DeepSpeech 0.8.2, Vosk 0.3.15) and additionally with the latest versions (DeepSpeech 0.9.3, Vosk 0.3.32) from February 2022, following the same procedure as described before. Beyond that, we added randomness to the WER calculation to improve the robustness of the evaluation. To this end, one random episode was taken from each of the 30 podcasts at this stage. For each of these episodes, three 30-second snippets were chosen at random positions in order to avoid, for example, snippets containing the same introductory or farewell sentences of a podcast. Again, all snippets were manually transcribed and served as ground-truth labels. Each snippet was guaranteed to have a unique word sequence. Both APIs performed roughly equally in terms of mean WER. Nevertheless, we conducted the experiments with Vosk 0.3.32, because DeepSpeech 0.9.3 showed a larger standard deviation and required substantial GPU resources for the podcast transcription process.
3.3 Buzzword identification and selection
Within this study, the following glossaries and dictionaries were used as sources for the creation of the dictionary list as well as the English buzzword (sometimes called keyword) list, which is used for the detection of AI-related keywords in the podcast transcriptions:
-
Accenture Applied Intelligence Glossary [89],
-
Gartner Hype Cycles 2015 - 2020 [90,91,92,93,94,95,96,97,98,99],
-
Gartner Information Technology Glossary [100],
-
Github Machine Learning Glossary [101],
-
Google Machine Learning Glossary [102],
-
Microsoft Machine Learning Glossary [103],
-
Oxford Dictionary of Computer Science [104],
-
Stanford Machine Learning Glossary [2].
For the buzzword collection, the scraping process of the glossaries is divided into four steps. First, the HTML data is pulled from the website using the Python library requests [105]. In the second step, the HTML blob is transformed into a traversable data structure using the parsing library BeautifulSoup (BS4) [106]. The third step consists of going through the HTML and selecting the relevant tags that surround the AI terms. In the final step, the terms are saved line by line in a text file inside the keyword directory. After checking possible overlaps between the glossaries, the aggregated list contained 761 keywords, but it still included general terms like action or step that needed to be filtered out. In preparation for the filtering process, the dictionary list was compiled from computer science-related terms from the Oxford Dictionary and the multiple data science and AI hype cycles. The only difference in the scraping process compared to the glossaries was the use of Selenium [107] instead of the requests library [105]. In the further pre-processing steps, all characters were lowercased, keywords in parentheses were extracted and duplicate entries were removed in both aggregated lists. Subsequently, all terms on the aggregated keyword list were checked against the aggregated dictionary list; terms not found there were removed accordingly. Acronyms were eliminated from the list, and terms like AI and artificial intelligence were counted jointly in the analysis. In the end, the buzzword list contains 102 keywords (see Appendix A).
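The pre-processing and dictionary filtering described above can be sketched in a few lines; the two term lists below are hypothetical examples, not the study's actual glossary contents:

```python
# Sketch of the keyword filtering: lowercase, deduplicate, and keep only
# terms that also appear in the aggregated dictionary list.
glossary_terms = ["Machine Learning", "Action", "Neural Network",
                  "machine learning", "Step"]           # hypothetical
dictionary_terms = ["machine learning", "neural network", "algorithm"]

keywords = {t.lower() for t in glossary_terms}    # lowercase + dedupe
dictionary = {t.lower() for t in dictionary_terms}
buzzwords = sorted(keywords & dictionary)         # dictionary filter

print(buzzwords)  # ['machine learning', 'neural network']
```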
3.4 Topic clustering
A collection of content centered around a particularly common theme is described as a topic cluster [108]. Furthermore, the automatic extraction of topic clusters from data by applying machine learning methods is called topic clustering. The latent Dirichlet allocation (LDA), which is based on a Bayesian model, was introduced in 2003 as one of the first approaches for topic clustering. In detail, this involves reducing the dimensionality of word embeddings, grouping words into clusters and distinguishing between various topics [109].
In the framework of this research project, state-of-the-art algorithms based on DL were applied to perform topic clustering. In particular, we used a pre-trained BERT model for extracting contextualized word embeddings. We did not fine-tune the weights of the BERT model for the respective task. Instead, we used these contextualized word embeddings to train a Cross-lingual Contextualized Zero-shot Topic Model (CTM) for topic modeling [110].
Contextualized Topic Models (CTMs) are an extension of topic models that include contextual information in their topic representations. The primary innovation of CTMs compared to traditional topic models like Latent Dirichlet Allocation (LDA) is that CTMs can take advantage of contextual word embeddings, such as those produced by BERT. This approach allows us to train only a relatively small CTM for topic modelling. Here, we followed the work of Bianchi and colleagues [110] and trained a Neural-ProdLDA [111], which has been shown to obtain good results for zero-shot topic modelling. The Neural-ProdLDA is based on the Variational AutoEncoder (VAE) proposed by Srivastava and Sutton [111]. The model consists of two components: first, an encoder network that takes the contextualized word embeddings as input and maps them to a latent representation by generating a mean and a standard deviation; second, a decoder network that samples from a Gaussian-distributed latent space parameterized by the encoded mean and standard deviation. For more details refer to Kingma and Welling [112] and Srivastava and Sutton [111].
We expect the results to be influenced by the choice of the model used for extracting the contextualized word embeddings. We plan to address this topic in more detail in future research. For this study, we chose BERT, as previous work showed that it obtains competitive results on feature extraction for the English language [113].
To initiate the training phase of a topic clustering model, the number of topics must be predefined. The ideal number of topics for the data set was estimated using the metrics Coherence and NPMI, which have already been applied in the existing literature [114]. Both metrics were calculated iteratively for topic counts from 10 to 30 with a step size of 2. To identify the number of topics with the highest value, the product of both metrics was calculated and multiplied by 10 for visibility in Fig. 1. The optimal number of clusters was 14, which was used for the final training and inference process.
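The selection of the topic count can be sketched as an argmax over the product of both metrics. The coherence and NPMI values below are hypothetical, not the study's measurements; they merely illustrate the selection rule:

```python
# Hypothetical coherence and NPMI values per candidate topic count
# (range 10..30, step 2), chosen to peak at k = 14 for illustration.
candidates = range(10, 31, 2)
coherence = {10: 0.42, 12: 0.45, 14: 0.47, 16: 0.46, 18: 0.44,
             20: 0.43, 22: 0.41, 24: 0.40, 26: 0.39, 28: 0.38, 30: 0.37}
npmi = {10: 0.06, 12: 0.07, 14: 0.08, 16: 0.07, 18: 0.06,
        20: 0.06, 22: 0.05, 24: 0.05, 26: 0.04, 28: 0.04, 30: 0.04}

# Product of both metrics (the study scales it by 10 only for plotting).
product = {k: coherence[k] * npmi[k] for k in candidates}
best_k = max(product, key=product.get)
print(best_k)  # 14
```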
The model suggests strings consisting of 10 words, which in turn form one of the extracted topics. Based on these most contributing words, we defined a generic term for each of the 14 topics ourselves (see Appendix C).
In Appendix B, more information about the data source and data set building, the text cleaning pipeline and the training, optimization as well as inference steps related to the topic clustering is presented.
3.5 Sentiment analysis
In another experiment, we performed a sentiment analysis following a methodological approach similar to that of the topic clustering, but without an additional tool such as OCTIS [114]. The Hugging Face platform provides various fine-tuned sentiment analysis models; in our case, the transformer needs to be trained on the English language. We therefore chose distilbert-base-uncased-finetuned-sst-2-english, which is based on the distilbert-base-uncased BERT model trained on Wikipedia data. Via the Hugging Face GitHub package transformers, the sentiment analysis transformer can be accessed and downloaded using the pipeline module [16].
Following this procedure, the sentiment was calculated on the transcribed podcast text for the episode-level data set. To investigate the sentiment on the episode level, the overall sentiment for each podcast and topic was calculated (see Eq. 2):

$$\mathrm{score} = \frac{n_{\mathrm{pos}}}{n_{\mathrm{pos}} + n_{\mathrm{neg}}}$$

Here, the number of positive sentiments is divided by the total number of positive and negative sentiments. A score of 0 indicates an entirely negative sentiment, whereas a score of 1 represents an entirely positive sentiment.
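One plausible reading of this score, consistent with 0 denoting an entirely negative and 1 an entirely positive sentiment, is the share of positive sentiments among all classified sentiments. A minimal sketch under this assumption:

```python
def sentiment_score(n_positive: int, n_negative: int) -> float:
    """Share of positive sentiments: 0 = all negative, 1 = all positive."""
    return n_positive / (n_positive + n_negative)

print(sentiment_score(3, 1))  # 0.75
```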
4 Data set, results and discussion
In this section, we present the novel data set and illustrate the development of the medium podcast over the years from 2015 until 2021. In addition, we perform a baseline analysis of the proposed data set. To this end, we normalized the buzzword occurrences by the total number of words in the data sources. In the subsequent experiments, we report the relative total occurrence of AI-related buzzwords.
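The normalization can be sketched as counting a buzzword's occurrences and dividing by the total word count. The transcript snippet below is hypothetical, and multi-word buzzwords would additionally require n-gram matching:

```python
from collections import Counter

# Hypothetical transcript snippet; real transcripts are full episodes.
transcript = "data drives machine learning and data quality shapes ai outcomes"
tokens = transcript.split()
counts = Counter(tokens)

def relative_occurrence(term: str) -> float:
    """Occurrences of a (single-word) buzzword per total words."""
    return counts[term] / len(tokens)

print(relative_occurrence("data"))  # 2 occurrences in 10 words -> 0.2
```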
4.1 Data set analysis
The 29 healthcare podcasts that were selected in the data collection process of this study were already presented in Sect. 3.1. The distribution of the total of 3449 episodes with respect to the different podcasts is relatively balanced, with one exception being the Outcomes Rocket (OR) podcast. OR has approximately four times more episodes than the healthcare podcast with the second most episodes. This is a result of the episodes appearing almost daily, even though it was launched only in 2017, and could lead to a potential data imbalance towards topics in OR.
To investigate the hypothesis that podcasts are an emerging medium and to address RQ1, we looked at the development of the number of new podcast episodes between 2015 and 2021 for the previously discussed healthcare podcasts (see Fig. 2).
In the first years of the period under review, up to 2017, there was little growth in newly published episodes. In the following years, a gradual increase can be observed. While 2019 and 2020 show further gains, we see the biggest jump towards the end of 2021, when the number of episodes more than doubled compared to the previous year. Given this rising development, podcasts, in our case related to healthcare, can be called an emerging medium. Furthermore, this supports the assumption regarding the increased public perception and popularity of podcasts in general.
4.2 AI trend detection and development
In order to answer RQ1 and RQ2, we wanted to detect a possible AI trend in healthcare podcasts and follow its development over the period from 2015 until the start of 2022. As mentioned in Sect. 3.3, we identified 102 unique English AI-related buzzwords, which make up the final keyword list used for this analysis (see again Appendix A).
At first, we analyzed the trend of the most frequently occurring buzzwords in the proposed podcast data set. In Fig. 3, the relative occurrence of the top ten AI-related buzzwords (data, artificial intelligence, software, metric, cloud, transparency, bias, algorithm, machine learning and noise) is visualized over the whole period under consideration.
The buzzword data clearly stands out here with a relative occurrence of more than 0.16% and is well ahead of ai and software at just over and just below 0.1%, respectively. The other keywords in the top ten follow at a considerable distance, in the range of 0.02% to 0.06%. The clear gap between data and the second most common key term ai can be explained by the fact that data is not only an AI-related buzzword but is also used as a general term across a wide range of subject areas, not only in healthcare.
In the next step, we not only look at the top keywords over the entire period in the data set but also detect the trend and analyze its development based on the top seven AI-related buzzwords from 2015 to the beginning of 2022 (see Fig. 4).
In the first two years, up to the end of 2016, only slight fluctuations can be observed, and the relative buzzword occurrence of these seven terms always remains below 0.002%. From 2017 on, an increase in all buzzwords is visible, especially for the term data, which for the first time reaches a relative buzzword occurrence of just under 0.004%. After a short stagnation until the beginning of 2018, a continuously increasing trend can be observed for the four buzzwords bias, cloud, metric, and transparency until the end of the observation period.
In 2018, the term data still rises almost in parallel with software and artificial intelligence, but from 2019 onwards it permanently outstrips them, with a relative buzzword occurrence of more than 0.012% at its peak. As explained earlier, this is due not only to the increasing trend in the AI context but also to the general use of the term data. In contrast, the value of the buzzword artificial intelligence doubled for the first time, from 0.002% to 0.004%, at the beginning of 2018. In the following two years, the trend rises moderately, similar to the majority of the other AI-related buzzwords in this top seven evaluation. This changes again in 2020, when artificial intelligence reaches its maximum value of almost 0.009% and remains relatively constant at this level until the end of the observation period at the beginning of 2022. Based on the podcast data set, these results show an increasing, or at least persistently high, trend for the topic area of AI in healthcare.
4.3 Topic clustering
With regard to RQ3, we applied topic clustering in order to detect topics within the selected healthcare podcasts in the created data set and to further investigate the trend development over time.
4.3.1 Cosine similarities
To find out which podcasts or podcast episodes talk about the same or at least very similar topics, we calculated the cosine similarity between the individual embedding vectors z of these data sources. Since it was not feasible to visualize all 14 identified topics for every podcast in the data set, we first assigned the most dominant topic to each of the 29 healthcare podcasts. An overview of the seven topics assigned in total is presented in Table 3.
We see here that healthcare innovation, the dominant topic in 13 podcasts, plays a very strong role in projects and discussions related to healthcare. However, other topics, focusing on specific guest speakers from the healthcare environment, home care, or startup acceleration, are likewise the focus of several podcasts. The content of the latter in particular is very close to that of healthcare innovation. Nevertheless, there are individual podcasts that differ from the majority in their choice of discussion topics and deal more intensively with data privacy (in HTP), hospital pricing (in HCR) or vaccination (in CoHC).
In order to not only look at the most prominent topic of each podcast individually but to better evaluate the similarity of the podcasts, Fig. 5 visualizes the average cosine similarity between the podcasts over the entire period. As scale, we chose a colour bar ranging from 0 to 1, where values closer to 1 indicate stronger similarity. In particular, the four topic clusters healthcare innovation, guest speakers, home care and startup acceleration, which are the focus of discussion in several of the healthcare podcasts, are clearly visible here and show very high values close to 1.
In addition to the clear recognizability of these clusters, the similarity of further podcasts is also apparent (see, for example, the lower center of the heat map), namely those assigned to healthcare innovation and startup acceleration with regard to their dominant topic. Here, too, the average cosine similarity lies in the upper range of the scale, close to 1.
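The pairwise similarities underlying Fig. 5 follow the standard cosine similarity definition. A minimal sketch, with toy three-dimensional vectors standing in for the sentence-transformer embeddings z used in the study:

```python
import math

def cosine_similarity(z1, z2):
    """Cosine similarity between two embedding vectors z1 and z2."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    return dot / (norm1 * norm2)

# Toy episode embeddings; in the study, z is produced by a sentence
# transformer, so the real vectors are much higher-dimensional.
a = [0.2, 0.8, 0.1]
b = [0.25, 0.75, 0.05]
sim = cosine_similarity(a, b)
```

Averaging these pairwise scores per podcast pair over all episodes yields one cell of the heat map in Fig. 5.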
4.3.2 Topic and trend development over time
After identifying the most dominant topics and evaluating the podcasts' similarity, we conducted an additional analysis step to investigate the topic and trend development. For this experiment, the four podcasts Healthcare Triage Podcast (HTP), Medtech Talk (MT), PopHealth Week (PW) and This Just In (TJI) were selected from the data set. This subset was chosen because all four podcasts started at the beginning of 2015 and could therefore be evaluated over the whole period until the end of 2021. The topic changes over time for these podcasts, based on the episode-level transcriptions, are visualized in Figs. 6 and 7. The episodes were grouped into yearly bins from 2015 until 2021, and each year, as well as the average over all years, is presented as an individual heat map. The number of identified topics varied considerably, between six and ten topics per year. The topic ai is continuously present from 2015 until 2018, as shown in Fig. 6. Nonetheless, it remains a topic of lesser relevance in this part of the evaluation. At least for the four healthcare podcasts chosen in this experiment, the topic ai could no longer be clearly detected from 2019 onward. It should be noted, however, that in this case the ai topic contains only very specific terms that one would rather expect in a research context. Therefore, we also draw attention to the multiple topics closely related to the field of AI, such as data privacy, cloud architecture or etl pipeline, which were detected over almost the entire period under review.
Furthermore, we observe topics such as data privacy that are continuously present and increased especially in the first years. In addition, we find topics like healthcare innovation that tended to be intensely discussed in 2015 and 2016, but subsequently lost some of their importance (at least in terms of their presence in the discussions) until 2021 and settled at a stable level (see again Figs. 6 and 7). The trend development is quite different in the case of home care. While it was still one of the dominant topics in 2015, its presence in the four podcasts under consideration initially declined rapidly in the following years, before a slight increase became apparent again in 2018, which was maintained in subsequent years. In Fig. 7d, the average across all years shows that guest speakers, cloud architecture, data privacy, home care, and startup acceleration are the five most dominant topics in these four healthcare podcasts between 2015 and 2021.
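The yearly binning behind the heat maps in Figs. 6 and 7 can be sketched as follows. The data format (per-episode topic distributions as dictionaries) is a hypothetical simplification for illustration, not the study's actual data structure:

```python
from collections import defaultdict

def yearly_topic_shares(episodes):
    """Aggregate per-episode topic distributions into yearly averages.

    `episodes` is a list of (year, {topic: probability}) pairs; the
    result maps each year to the mean probability per topic, i.e. one
    heat-map row per year.
    """
    bins = defaultdict(list)
    for year, dist in episodes:
        bins[year].append(dist)
    shares = {}
    for year, dists in bins.items():
        topics = {t for d in dists for t in d}
        shares[year] = {
            t: sum(d.get(t, 0.0) for d in dists) / len(dists)
            for t in topics
        }
    return shares

# Hypothetical episode-level topic distributions:
data = [
    (2015, {"home care": 0.6, "ai": 0.1}),
    (2015, {"home care": 0.2, "data privacy": 0.4}),
    (2016, {"ai": 0.3}),
]
shares = yearly_topic_shares(data)
```

Averaging over all yearly bins in the same way yields the across-years summary shown in Fig. 7d.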
4.4 Sentiment analysis
In an additional experiment, we conducted a sentiment analysis for all 29 healthcare podcasts over the period from 2015 until 2021. We targeted the visualization of each evaluated podcast's sentiment towards each of the 14 identified topics and aimed to answer RQ4, i.e., whether sentiment analysis can reveal further insights for the detection of past and current trends in healthcare. As described in Sect. 3.5, the applied sentiment transformer from Hugging Face [16, 115] provides a value between 0 and 1 together with either a positive or a negative label, where the value represents the likelihood of the label. Within this subsection, the sentiment score (for Figs. 8, 9 and 10) was calculated as the relative share between positive and negative within the range of 0 (totally negative) to 1 (totally positive) to quantify the respective sentiments.
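The described mapping from the classifier output (label plus likelihood) to a single score between 0 (totally negative) and 1 (totally positive) can be sketched as follows; the function name and example values are illustrative, not taken from the study's code:

```python
def to_sentiment_score(label, prob):
    """Map a classifier's (label, probability) output to a [0, 1] score,
    with 0 totally negative and 1 totally positive.

    With a Hugging Face pipeline, such a pair would come from e.g.
    pipeline("sentiment-analysis")(text)[0]; only the mapping itself
    is sketched here.
    """
    return prob if label == "POSITIVE" else 1.0 - prob

# Hypothetical classifier outputs for two utterances:
positive_score = to_sentiment_score("POSITIVE", 0.93)  # stays 0.93
negative_score = to_sentiment_score("NEGATIVE", 0.80)  # mapped near 0.2
```

Averaging such scores per topic and year produces the values visualized in the heat maps of Figs. 8, 9 and 10.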
In Fig. 8, we illustrate the averaged topic sentiments for all 29 healthcare podcasts over time. The darker the green color of a field in the heat map, the more positive the sentiment towards the respective topic, for example for the topics ai, healthcare innovation or home care in 2021. Overall, the evaluated healthcare podcasts show a positive or very positive sentiment towards most of the identified topics. Only in 2015 and 2016 can we observe a neutral sentiment, for example regarding hospital pricing and vaccination. However, it must be taken into account that the number of available episodes was lower in those years than in the rest of the studied period and that some of the podcasts went on air later.
To better assess the contribution of an individual podcast to the average sentiment of all podcasts, we take a closer look in the following at exemplary podcasts that have either an overall very negative or very positive sentiment toward the topics under consideration. At first, we look at podcasts with a very positive sentiment on average. Figure 9 shows the topic sentiments over the years for the podcast Health Changers [53], which was first published in 2017. On the one hand, there are topics like heart monitoring or startup acceleration that were not only identified within multiple years but are also linked with a positive sentiment over time. On the other hand, the topic ai, for example, could only be identified in this specific podcast in 2019 and was associated with a rather negative sentiment.
In comparison, the Healthcare Triage Podcast [57], which has been on air since 2015, shows on average a very negative sentiment towards the investigated topics in healthcare podcasts. In Fig. 10, we observe a very negative overall sentiment, especially in the years 2015 until 2017. Nevertheless, there are topics like data privacy that play a significant role within this podcast with a rather positive overall sentiment. Even before the COVID-19 pandemic, the topic vaccination was continuously discussed, and we can see a shift in sentiment toward the positive, especially in 2020 and beyond, years in which the public discourse was heavily influenced by the pandemic [116].
In our study, we chose sentiment analysis as a tool to visualize the sentiment over time towards specific topics and to investigate trends of 'hype' or 'fear' towards them. We were able to detect the speakers' sentiments toward the topics that we identified in the episode transcriptions. In total, we computed the sentiment for all 14 topics in each of the 29 healthcare podcasts and observe that the evaluated podcasts show an overall more positive than negative sentiment. However, we could not identify a correlation between the respective topics and the sentiment.
5 Conclusion, limitations and outlook
In previous research, a growing interest in AI could be observed over the last decades, especially regarding technological development and application areas of AI such as healthcare. In addition, digital media is described as an increasingly important channel for disseminating research findings to a broader audience and for contributing to technological adoption. So far, academic publications, newspapers or social media have been used as sources to detect trends. Here, we fill the gap in trend research by going beyond those data sources and creating a novel data set. To enable other researchers to recreate the data set, we publish our code on GitHub (see the availability of data and materials section). Within this research study, we use the data set to investigate the suitability of podcasts as a research medium for trend detection in general and conduct a proof-of-concept study with a focus on the field of AI in healthcare. In this work, we propose a web-mining approach to collect and analyze the data from 29 healthcare podcasts between 2015 and 2021. Based on the 102 identified AI-related buzzwords, we are able to successfully detect an AI trend and make its development visible. We look beyond the topic area of AI and exemplify the possibilities of a podcast-based data set for trend analysis in healthcare. Using a machine learning-based topic clustering approach, we extract the most dominant topics and track their development over time. In an additional sentiment analysis, we were able to visualize the sentiment of podcasts towards the 14 identified topics, revealing an overall more positive sentiment of the speakers towards them. Our methodological approach is transferable to future research working with the same data set on any kind of healthcare topic besides AI, and it is also applicable in other industries that use podcasts as a research medium for trend detection in general.
One limitation of this study comes with the selection of an open-source API for the transcription of the podcasts. According to our evaluation, Microsoft Azure performed best among the tested APIs; nevertheless, we had to choose a non-commercial transcriber. In addition, the overall data set size is still limited, even though we selected 30 healthcare podcasts and ultimately used 29 of them in the analysis. An extended data set would further reduce the influence of individual podcasts on the results. Within our data set, the podcast Outcomes Rocket has four times more episodes than the podcast with the second most episodes, which could lead to a potential data imbalance. Standardizing the number of episodes used per selected podcast, as well as applying techniques such as undersampling, oversampling or cost-sensitive learning, could be suitable approaches to address this imbalance in a future revision of the project. However, care must be taken when selecting such techniques so as not to distort the trend analysis results. Therefore, we recommend including more podcasts but fewer episodes per selected podcast in the data set.
In future work, the data set should be expanded by selecting further healthcare podcasts and adding their respective episodes. Furthermore, it would be valuable to enrich the data set with newly published episodes to extend the period under consideration and to further investigate the podcast and AI trend development in healthcare in upcoming years. Not only to further validate the findings of this work, but also to close a still-existing research gap in trend detection research, multiple data sources (in the form of textual and non-textual data) should be investigated. Additional data sources could also look beyond the healthcare domain by addressing further industries or politics, making a comparison of the trend development possible.
In the selection process of the podcasts (see Sect. 3.1), four criteria were considered in the evaluation. This procedure could be extended in future studies by looking more specifically at metrics such as audience size or ratings. In addition, the expertise of the hosts, not only of the guests, should be reviewed. As the study depends, among other factors, on transcription accuracy, the selection of the Speech-to-Text API is a crucial aspect. Therefore, open-source as well as commercial transcribers should be used in order to extend the API evaluation process and to compare the respective analysis results.
Regarding the methodological approach, the all-mpnet-base-v2 model was used as the underlying BERT model, which affects the feature extraction from the text and, consequently, the topics found by the topic model. Therefore, medicine-specific or multilingual models could be chosen and may lead to a different list of identified topics. The sentiment analysis experiments showed the successful identification of positive and negative sentiments among podcast speakers. Within this study, we did not differentiate between multiple speakers, even though multiple people were recorded. Accordingly, speaker diarization [117] would be a promising approach for future research, allowing the transcribed text of podcast hosts and guest speakers to be investigated separately.
Availability of data and materials
The code for the crawlers is included in the GitHub repository: https://github.com/mad-lab-fau/trend-detection-in-healthcare-podcast-data-set. For more information about the data crawling procedure of the podcast data sources and the transcribed data set, please contact the corresponding author via philipp.dumbach@fau.de.
References
Ongsulee P (2017) Artificial intelligence, machine learning and deep learning. In: 15th International Conference on ICT and Knowledge Engineering (ICT&KE), pp 1–6. https://doi.org/10.1109/ICTKE.2017.8259629
Perrault R, et al (2019) The AI index 2019 Annual Report (AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA, 2019). https://hai.stanford.edu/sites/default/files/ai_index_2019_report.pdf
Nguyen A et al (2021) System design for a data-driven and explainable customer sentiment monitor using IoT and enterprise data. IEEE Access 9:117140–117152. https://doi.org/10.1109/ACCESS.2021.3106791
Yu K-H, Beam AL, Kohane IS (2018) Artificial intelligence in healthcare. Nat Biomed Eng 2(10):719–731. https://doi.org/10.1038/s41551-018-0305-z
Dicuonzo G, Donofrio F, Fusco A, Shini M (2023) Healthcare system: moving forward with artificial intelligence. Technovation 120:102510. https://doi.org/10.1016/j.technovation.2022.102510
Rajpurkar P, Chen E, Banerjee O, Topol EJ (2022) Ai in health and medicine. Nat Med 28(1):31–38. https://doi.org/10.1038/s41591-021-01614-0
Hannun AY et al (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med 25(1):65–69. https://doi.org/10.1038/s41591-018-0268-3
Schwinn L et al (2021) Identifying untrustworthy predictions in neural networks by geometric gradient analysis. In: de Campos C, Maathuis MH (eds) Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:854–864. https://proceedings.mlr.press/v161/schwinn21a.html
Schwinn L et al (2022) Improving robustness against real-world and worst-case distribution shifts through decision region quantification. In: Chaudhuri K et al (eds) Proceedings of the 39th International Conference on Machine Learning, PMLR 162:19434–19449. https://proceedings.mlr.press/v162/schwinn22a.html
Dumbach P, Liu R, Jalowski M, Eskofier BM (2021) The adoption of artificial intelligence in SMEs—a cross-national comparison in German and Chinese healthcare. In: Joint Proceedings of the BIR 2021 Workshops and Doctoral Consortium, CEUR Workshop Proceedings 2991:84–98. https://ceur-ws.org/Vol-2991/paper08.pdf
Casares DR (2020) Embracing the podcast era: trends, opportunities, and implications for counselors. J Creat Ment Health 17(1):123–138. https://doi.org/10.1080/15401383.2020.1816865
King L (2022) Benefits of podcasts for healthcare professionals. J Child Health Care 26(3):341–342. https://doi.org/10.1177/13674935221116553
LISTEN NOTES (2023) Podcast stats: how many podcasts are there? https://www.listennotes.com/podcast-stats/
LISTEN NOTES (2022) Listen notes: the best podcast search engine. https://www.listennotes.com
Götting MC (2023) Number of monthly podcast listeners in the United States from 2013 to 2023. https://www.statista.com/statistics/786826/podcast-listeners-in-the-us/#statisticContainer
Wolf T et al (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
MacKenzie LE (2019) Science podcasts: analysis of global production and output from 2004 to 2018. R Soc Open Sci 6(1):180932. https://doi.org/10.1098/rsos.180932
Vartakavi A, Garg A, Rafii Z (2021) Audio summarization for podcasts. In: 2021 29th European Signal Processing Conference (EUSIPCO), pp 431–435. IEEE
Amazon Web Services (2020) Amazon transcribe: automatically convert speech to text. https://aws.amazon.com/transcribe/
Xia H, Jacobs J, Agrawala M (2020) Crosscast: adding visuals to audio travel podcasts. In: Iqbal S, MacLean K, Chevalier F, Mueller S (eds) Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp 735–746. ACM, New York, NY, USA
rev.com. rev (2020) How to transcribe audio to text. https://www.rev.com/blog/resources/how-to-transcribe-audio-to-text
Fast E, Horvitz E (2017) Long-term trends in the public perception of artificial intelligence. In: Proceedings of the AAAI Conference on Artificial Intelligence 31(1). https://ojs.aaai.org/index.php/aaai/article/view/10635
Aghababaei S, Makrehchi M (2018) Mining twitter data for crime trend prediction. Intell Data Anal 22(1):117–141. https://doi.org/10.3233/IDA-163183
Johnson L, Grayden S (2006) Podcasts — an emerging form of digital publishing. Int J Comp. Dent 9:205–218. http://www.quintpub.com/userhome/ijcd/ijcd_2006_03_s0205.pdf
Bonini T (2015) The ‘second age’ of podcasting: reframing podcasting as a new digital mass medium. Quad CAC 41(XVIII):21–30. https://www.cac.cat/sites/default/files/2019-01/Q41_Bonini_EN_0.pdf
Berry R (2016) Podcasting: considering the evolution of the medium and its association with the word ‘radio’. Radio J Int Stud Broadcast & Audio Media 14(1):7–22. https://doi.org/10.1386/rjao.14.1.7_1
Clifton A et al (2020) 100,000 podcasts: a spoken English document corpus. In: Scott D, Bel N, Zong C (eds) Proceedings of the 28th International Conference on Computational Linguistics, pp 5903–5917. International Committee on Computational Linguistics, Stroudsburg, PA, USA
Valero FB, Baranes M, Epure EV (2022) Topic modeling on podcast short-text metadata. In: Hagen M et al (eds) Advances in Information Retrieval, ECIR 2022, vol 13185, pp 472–486. Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-030-99736-6_32
Cornwall A (2007) Buzzwords and fuzzwords: deconstructing development discourse. Dev Pract 17(4–5):471–484. https://doi.org/10.1080/09614520701469302
Budak C, Agrawal D, El Abbadi A (2011) Structural trend analysis for online social networks. Proc VLDB Endow 4(10):646–656. https://doi.org/10.14778/2021017.2021022
Caled D, Beyssac P, Xexéo G, Zimbrão G (2016) Buzzword detection in the scientific scenario. Pattern Recognit Lett 69:42–48
Holzinger A, Kieseberg P, Tjoa AM, Weippl E (eds) (2018) Machine learning and knowledge extraction lecture notes in computer science. Springer International Publishing, Cham
Fedoryszak M, Frederick B, Rajaram V, Zhong C (2019) Real-time event detection on social data streams. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining pp 2774–2782. https://doi.org/10.1145/3292500.3330689
Mühlroth C, Grottke M (2022) Artificial intelligence in innovation: how to spot emerging trends and technologies. IEEE Trans Eng Manag 69(2):493–510. https://doi.org/10.1109/TEM.2020.2989214
Nikolenko SI, Koltcov S, Koltsova O (2015) Topic modelling for qualitative studies. J Inf Sci 43(1):88–102. https://doi.org/10.1177/0165551515617393
Reagan AJ, Danforth CM, Tivnan B, Williams JR, Dodds PS (2017) Sentiment analysis methods for understanding large-scale texts: a case for using continuum-scored words and word shift graphs. EPJ Data Sci. https://doi.org/10.1140/epjds/s13688-017-0121-9
Zakkar MA, Lizotte DJ (2021) Analyzing patient stories on social media using text analytics. Healthc Inform Res 5(4):382–400. https://doi.org/10.1007/s41666-021-00097-5
Sanders AC et al. (2021) Unmasking the conversation on masks: natural language processing for topical sentiment analysis of covid-19 twitter discourse. AMIA Jt Summits Transl Sci Proc vol 2021, pp 555–564. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8378598/
Jiang F et al (2017) Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2(4):230–243. https://doi.org/10.1136/svn-2017-000101
Pacis DMM, Subido EDC, Bugtai NT (2018) Trends in telemedicine utilizing artificial intelligence. AIP Conf Proc 1933(1):040009. https://doi.org/10.1063/1.5023979
Turea M (2020) The 19 healthcare podcasts you should be listening to in 2020. Healthcare tech. https://healthcareweekly.com/best-healthcare-podcasts/
FeedSpot (2020) 70 best healthcare industry podcasts by health professionals. https://blog.feedspot.com/healthcare_podcasts/
LISTEN NOTES (2022) Listen score: how popular a podcast is?. https://www.listennotes.com/listen-score/
Shankar, V (2017) 15 minutes with the doctor: learn from healthcare entrepreneurs and innovators. https://podcasts.apple.com/gb/podcast/15-minutes-with-the-doctor-learn-from/id1231946185
Becker’s Healthcare (2021) Becker’s healthcare podcast. https://podcasts.apple.com/us/podcast/beckers-healthcare-podcast/id1452376188
Masselli M, Flinter M (2020) Conversations on health care. https://podcasts.apple.com/us/podcast/conversations-on-health-care/id1139815935
Inside Digital Health (2018) Data book: chief healthcare executive. https://podcasts.apple.com/us/podcast/data-book/id1365789336
Kendall, D (2016) Digital health today. https://podcasts.apple.com/us/channel/digital-health-today/id6442486438
Zajc T (2017) Faces of digital health. https://podcasts.apple.com/us/podcast/faces-of-digital-health/id1194284040
GeekWire (2017) Geekwire health tech. https://podcasts.apple.com/us/podcast/geekwire-health-tech/id1243992489
Harlow D (2017) Harlow on healthcare. https://www.healthcarenowradio.com/programs/harlow-health-care/
Marchica J (2018) Health care rounds. https://podcasts.apple.com/us/podcast/health-care-rounds/id1380448243
Cambia Health Solutions (2017) Healthchanger. https://podcasts.apple.com/us/podcast/healthchangers/id1215167172
Lynn J, Hung C (2019) Healthcare it today. https://podcasts.apple.com/us/podcast/healthcare-it-today/id1449044715
Johnson J, Ismail Z (2019) Healthcare rap: shift forward health. https://podcasts.apple.com/us/podcast/healthcare-rap/id1367047468
Xtelligent Healthcare Media (2019) Healthcare strategies. https://podcasts.apple.com/us/podcast/healthcare-strategies/id1485735357
Carroll A (2015) Healthcare triage podcast. https://podcasts.apple.com/us/podcast/healthcare-triage-podcast/id999134849
Arsene C, Reddy M (2019) Healthcare weekly: at the forefront of healthcare innovation. https://podcasts.apple.com/us/podcast/healthcare-weekly-at-the-forefront-of/id1454446734
FAH’s Chip K (2018) Hospitals in focus: federation of American hospitals & voxtopica. https://podcasts.apple.com/us/podcast/hospitals-in-focus/id1438138193
Virsys12 (2019) How i transformed this: success stories of transformation in healthcare. https://podcasts.apple.com/us/podcast/how-i-transformed-this/id1476745436
Pardo G (2015) Medtech talk: healthegy. https://podcasts.apple.com/us/podcast/medtech-talk/id978000677
Marquez S (2017) Outcomes rocket. https://podcasts.apple.com/us/podcast/outcomes-rocket/id1246067757
Cerner (2018) Perspectives on health and tech. https://podcasts.apple.com/us/podcast/perspectives-on-health-and-tech/id1450841795
Goldstein F, Masters G (2015) Pophealth week. https://podcasts.apple.com/de/podcast/pophealth-week/id1293846845
Kyeremanteng K (2019) Solving healthcare: with dr. kwadwo kyeremanteng. https://podcasts.apple.com/ca/podcast/solving-healthcare-with-dr-kwadwo-kyeremanteng/id1478899917
Birch P (2018) Talking healthtech: digital health and healthcare technology podcast. https://podcasts.apple.com/au/podcast/talking-healthtech-digital-health-and-healthcare/id1451558982
Lee D, Shah S (2017) The #hcbiz show! https://podcasts.apple.com/us/podcast/the-hcbiz-show/id1223753364
Change Healthcare (2018) Changing healthcare: a podcast about accelerating transformation. https://podcasts.apple.com/us/podcast/changing-healthcare-a-podcast-about-accelerating/id1440326284
van Terheyden N (2018) The incrementalist. https://www.healthcarenowradio.com/programs/incrementalist/
Wharton Digital Health (2019) The pulse by Wharton digital health. https://podcasts.apple.com/us/podcast/the-pulse-by-wharton-digital-health/id1442422790
Tate J (2019) The tate chronicles: dispatches from the frontline of health it. https://podcasts.apple.com/us/podcast/the-tate-chronicles-amit-trivedi-director-of/id1301407966?i=1000578259478
Barnes J (2015) This just in. https://www.healthcarenowradio.com/programs/this-just-in/
Zhang A (2017) Speech recognition (version 3.8.). https://github.com/Uberi/speech_recognition#readme
CMUSphinx (2017) Cmusphinx documentation. https://cmusphinx.github.io/wiki/
Google (2021) Google web speech. https://www.google.com/intl/en/chrome/demos/speech.html
Google Cloud (2021) Google cloud speech-to-text. https://cloud.google.com/speech-to-text
SoundHound Inc (2015) Houndify documentation. https://www.houndify.com/signup
IBM (2021) Watson speech to text. https://www.ibm.com/de-de/cloud/watson-speech-to-text
Kitt AI (2016) Snowboy. https://github.com/Kitt-AI/snowboy/
wit.ai. (2021) Build natural language experiences. https://wit.ai/
DeepSpeech (2020) Deepspeech. https://github.com/mozilla/DeepSpeech
Microsoft Azure (2021) Speech services pricing. https://azure.microsoft.com/en-us/products/cognitive-services/speech-services/
Alpha C (2021) Vosk api: Vosk speech recognition toolkit. https://github.com/alphacep/vosk-api
Park Y, Patwardhan S, Visweswariah K, Gates SC (2008) An empirical analysis of word error rate and keyword error rate. Proc Interspeech 2008:2070–2073. https://doi.org/10.21437/Interspeech.2008-537
Errattahi R, El Hannani A, Ouahmane H (2018) Automatic speech recognition errors detection and correction: a review. Procedia Comput Sci 128:32–37. https://doi.org/10.1016/j.procs.2018.03.005
Zechner K, Waibel AH (2000) Minimizing word error rate in textual summaries of spoken language. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics pp 186–193
Këpuska V (2017) Comparing speech recognition systems (microsoft API, google API and CMU sphinx). IJERA 07(03):20–24. https://doi.org/10.9790/9622-0703022024
Kim JY, et al (2019) A comparison of online automatic speech recognition systems and the nonverbal responses to unintelligible speech. arXiv:1904.12403
Accenture (2021) The applied intelligence glossary. https://www.accenture.com/gb-en/insights/applied-intelligence/artificial-intelligence-glossary
Linden A (2015) Hype cycle for advanced analytics and data science. https://www.gartner.com/en/documents/3087721
Hare J, Linden A, Krensky P (2016) Hype cycle for data science. https://www.gartner.com/en/documents/3388917
Krensky P, Hare J (2017) Hype cycle for data science and machine learning. https://www.gartner.com/en/documents/3772081
Krensky P, Hare J (2018) Hype cycle for data science and machine learning. https://www.gartner.com/en/documents/3883664
Vashisth S, Linden A, Hare J, Krensky P (2019) Hype cycle for data science and machine learning, 2019. https://www.gartner.com/en/documents/3955984
Vashisth S, Linden A, Hare J, den Hamer P (2020) Hype cycle for data science and machine learning. https://www.gartner.com/en/documents/3988118
Austin T, Brant K (2017) Hype cycle for artificial intelligence. https://www.gartner.com/en/documents/3770467
Sicular S, Brant K (2018) Hype cycle for artificial intelligence. https://www.gartner.com/en/documents/3883863
Sicular S, Hare J, Brant K (2019) Hype cycle for artificial intelligence. https://www.gartner.com/en/documents/3953603
Sicular S, Vashisth S (2020) Hype cycle for artificial intelligence. https://www.gartner.com/en/documents/3988006
Gartner Inc (2021) Gartner glossary: Information technology glossary. https://www.gartner.com/en/information-technology/glossary
Fortuner B (2017) Ml glossary on github. https://github.com/bfortuner/ml-glossary/blob/master/docs/glossary.rst
Google (2021) Machine learning glossary. https://developers.google.com/machine-learning/glossary
Microsoft Corporation (2021) Machine learning glossary of important terms. https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/glossary
Butterfield A, Ngondi GE, Kerr A (2016) A dictionary of computer science, 7th edn. Oxford University Press, New York, NY
Reitz K (2021) Requests: HTTP for humans. https://requests.readthedocs.io/en/master/
Richardson L (2020) Beautiful soup: beautiful soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Selenium (2022) Selenium. https://www.selenium.dev/documentation/
Vayansky I, Kumar SA (2020) A review of topic modeling methods. Inf Syst 94:101582. https://doi.org/10.1016/j.is.2020.101582
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. JMLR 3(Jan):993–1022
Bianchi F, Terragni S, Hovy D, Nozza D, Fersini E (2021) Cross-lingual contextualized topic models with zero-shot learning. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics pp 1676–1683. https://aclanthology.org/2021.eacl-main.143.pdf
Srivastava A, Sutton C (2017) Autoencoding variational inference for topic models. In: Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). https://openreview.net/forum?id=BybtVK9lg
Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv:1312.6114
Jayanthi SM, Embar V, Raghunathan K (2021) Evaluating pretrained transformer models for entity linking in task-oriented dialog. arXiv:2112.08327
Terragni S, Fersini E, Galuzzi BG, Tropeano P, Candelieri A (2021) Octis: comparing and optimizing topic models is simple! In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Systems Demonstrations pp 263–270
Hugging Face (2022) The AI community building the future. https://huggingface.co/
Böhmer MM et al (2020) Investigation of a COVID-19 outbreak in Germany resulting from a single travel-associated primary case: a case series. Lancet Infect Dis 20(8):920–928. https://doi.org/10.1016/S1473-3099(20)30314-5
Park TJ et al (2022) A review of speaker diarization: recent advances with deep learning. Comput Speech Lang 72:101317. https://doi.org/10.1016/j.csl.2021.101317
Sbalchiero S, Eder M (2020) Topic modeling, long texts and the best number of topics. Some problems and solutions. Qual Quant 54(4):1095–1108. https://doi.org/10.1007/s11135-020-00976-w
Schmiedel T, Müller O, vom Brocke J (2018) Topic modeling as a strategy of inquiry in organizational research: a tutorial with an application example on organizational culture. Organ Res Methods 22(4):941–968. https://doi.org/10.1177/1094428118773858
Acknowledgements
Bjoern M. Eskofier gratefully acknowledges the support of the German Research Foundation (DFG) within the framework of the Heisenberg professorship program (grant number ES 434/8-1).
Funding
Open Access funding enabled and organized by Projekt DEAL. Bjoern M. Eskofier gratefully acknowledges the support of the German Research Foundation (DFG) within the framework of the Heisenberg professorship program (grant number ES 434/8-1).
Author information
Authors and Affiliations
Contributions
PD conceived the proposed project and method, conducted the experiments (partly in form of a master thesis supervised by the first author) and wrote the initial draft of the paper. LS, TL and PLD discussed the results and reviewed the paper. All authors analyzed the results and contributed to the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Buzzword list
Appendix B Topic clustering pipeline
1.1 Data source
The scraped data sources used in this study were introduced in Sect. 3.1. In total, 30 healthcare-related podcasts were selected that focus on guest speakers from technological, especially AI-related, backgrounds; 29 of these podcasts were ultimately included in the analysis. After running the RSS download pipeline, a total of 3449 episodes from the years 2015–2021 were collected. The episodes were stored as MP3 files and account for a total of 1724 h of speech, corresponding to 1.23 terabytes for all episodes in this data set.
1.2 Data set building
After scraping the data in MP3 format, multiple processing steps were performed to build the final data set. The speech-to-text API Vosk was identified as the best-fitting open-source API for this study. An episode has an average length of 30 min. All episodes were transcribed on an NVIDIA RTX 3080 Ti with 10,240 CUDA cores, resulting in 3449 text files with an average length of 5000 words per episode. It is common practice to use texts with a short total word length for topic clustering, because an increase in corpus length can adversely affect finding the optimal number of topics [118]. In line with existing research [22], using the full episode transcription of about 5000 words as a single training sample proved not to be ideal. Therefore, a different strategy was chosen: the transcriptions were divided into smaller text chunks.
1.3 Text cleaning
Topic clustering techniques use word embeddings as input to their training and inference process. Different words with similar meanings still result in embedding vectors pointing in a similar direction. A text cleaning pipeline is an important step before the actual topic clustering, because removing unnecessary words such as stopwords can improve the quality of the results [119]. The step-by-step cleaning process (see Fig. 11) was performed for every episode transcription.
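A cleaning step of this kind can be sketched as follows. This is a minimal illustration, not the exact pipeline from Fig. 11; the stopword list here is a small assumed example, whereas the study's pipeline may use a fuller set and additional steps.

```python
import re

# Illustrative stopword list; the actual pipeline (Fig. 11) may use a fuller set.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "it"}

def clean_transcript(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords from a raw transcript."""
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)  # keep word-like tokens only
    kept = [t for t in tokens if t not in STOPWORDS]
    return " ".join(kept)

print(clean_transcript("The model is trained on a large corpus of speech."))
# -> "model trained on large corpus speech"
```

Applied to each episode transcription, this yields a reduced token sequence whose embeddings carry less noise into the clustering step.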
1.4 Keyword data set
The keyword list (see Appendix A) was used to find syntactically small but meaningful text chunks in the episodes' raw text. As mentioned before, words and abbreviations with the same meaning were merged into a single entry in the keyword list. All episodes were iteratively scanned for the listed keywords. For each keyword occurrence in the raw episode text, the 200 words before and after it were extracted; this window is called a text chunk throughout this study.
The fixed number of 200 was chosen by comparing the development of two metrics. The first metric is the relative proportion of overlapping text chunks, which reveals how large the percentage of overlapping text chunks is within one episode. The closer this number gets to 1, the worse the selected fixed word count k. The second metric is the relative proportion of used words, i.e., how many of the total words per episode are contained in that episode's text chunks. This metric continuously approaches 1 as k grows, because the chunks eventually cover the total text length. Therefore, k was selected to be as small as possible while still preserving as many words as possible. Scanning the 3449 episode texts for the keywords revealed that, on average, approximately 10 keywords were found in each text. This resulted in 35,816 text chunks of length 401, as each keyword is surrounded by 200 words on either side.
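The two metrics can be sketched as follows. This is one possible reading of the description above, assuming the transcript is a list of tokens: "overlap" is the fraction of chunk-covered positions that fall into more than one chunk, and "used" is the fraction of all episode words covered by at least one chunk.

```python
from collections import Counter

def chunk_metrics(words, keywords, k):
    """For a half-width k, return (overlap proportion, used-word proportion).

    One possible implementation of the two metrics described in the text:
    - overlap: fraction of covered word positions belonging to >1 chunk,
    - used: fraction of all episode words covered by at least one chunk.
    """
    coverage = Counter()
    for i, w in enumerate(words):
        if w in keywords:
            # The chunk spans k words before and after the keyword occurrence.
            for j in range(max(0, i - k), min(len(words), i + k + 1)):
                coverage[j] += 1
    covered = len(coverage)
    if covered == 0:
        return 0.0, 0.0
    overlapping = sum(1 for c in coverage.values() if c > 1)
    return overlapping / covered, covered / len(words)

# Toy episode: 20 words with the keyword "ai" at positions 5 and 8, k = 2.
words = ["w"] * 20
words[5] = words[8] = "ai"
print(chunk_metrics(words, {"ai"}, k=2))  # -> (0.25, 0.4)
```

Sweeping k and plotting both proportions reproduces the trade-off described above: overlap rises toward 1 with larger k, while word usage also approaches 1, motivating the smallest k that still preserves most words.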
1.5 Training
The BERT architecture is the foundation of the model training. The topic clustering pipeline was built with OCTIS [114], an open-source framework available on GitHub that builds upon the Hugging Face models [16]. The model all-mpnet-base-v2 was used as the base transformer, as it provides the best performance for language feature extraction in English [113]. The advantage of OCTIS is that the coherence and NPMI metrics can be computed out of the box without any additional implementation. Topic clustering requires the number of topics to be set as a hyperparameter before training starts.
1.6 Optimization
The ideal number of topics is a crucial hyperparameter because it determines how many topics will be found in the training data. On the one hand, if the number is too low, the context of the training data cannot be captured correctly, and desired topics, especially AI-related ones, could accidentally be omitted by the model. On the other hand, if the number of topics is set too high, many overlapping topics, and overlapping words within these topics, appear, and the topics-over-time analysis would not work correctly either. Therefore, the OCTIS metrics [114] coherence and NPMI were used to compute the ideal number of topics. The metrics were iteratively calculated over the range [10, 40] with a step size of two, and the number of topics maximizing the product of the two metrics was selected as ideal, which was 14 in this case.
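The selection procedure reduces to a grid search over candidate topic counts. The sketch below assumes a hypothetical `score_fn` that trains a model with n topics and returns its (coherence, NPMI) pair; in the study these scores come from OCTIS, while the values used here are made-up stand-ins for illustration only.

```python
def select_num_topics(score_fn, low=10, high=40, step=2):
    """Grid-search the topic count, maximizing coherence * NPMI.

    score_fn(n) is assumed to train a topic model with n topics and return
    a (coherence, npmi) tuple, as computed by OCTIS in the study.
    """
    best_n, best_score = None, float("-inf")
    for n in range(low, high + 1, step):
        coherence, npmi = score_fn(n)
        score = coherence * npmi
        if score > best_score:
            best_n, best_score = n, score
    return best_n

# Illustrative stand-in scores (NOT the study's real metric values):
fake_scores = {n: (0.5, 0.15 if n == 14 else 0.10) for n in range(10, 41, 2)}
print(select_num_topics(lambda n: fake_scores[n]))  # -> 14
```

With the real OCTIS scores, this loop yields 14 as the topic count used for the final training run.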
1.7 Inference
After training the model with the hyperparameters from the optimization function and the ideal number of 14 topics, a large dictionary was returned. Trained on the 35,816 × 401 keyword data set, it contains the following entries:
-
topics: (14 x 10) the list of the most significant words for each topic (list of lists of strings),
-
topic-word-matrix: (14 x 24,072) matrix of weights with 14 as the number of topics and 24,072 as the vocabulary length,
-
topic-document-matrix: (14 x 35,816) matrix of weights with 14 as the number of topics and 35,816 as the number of documents in the corpus.
There are ten words listed for each topic in string format. The naming convention was to connect the first five words of each topic with underscores, so that a single string describes a topic uniquely. Each of the 14 topics is thus defined by the ten words contributing most to that topic, but carries no generic term, which was therefore self-defined for each topic (see Appendix C). The embedding vectors of the topic clustering are necessary to calculate the cosine similarities. The embedding data sets have the same length as the training data, and the columns correspond to the number of topics, 14 in this case.
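The cosine similarities mentioned above can be computed directly on rows of the topic-embedding data. As a minimal sketch, assuming two hypothetical 14-dimensional topic-weight vectors (one per document, one weight per topic):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 14-dimensional topic-weight vectors for two documents,
# both dominated by the same (14th) topic:
doc_a = [0.1] * 13 + [0.9]
doc_b = [0.1] * 13 + [0.8]
print(round(cosine_similarity(doc_a, doc_b), 3))  # -> 0.999
```

Documents dominated by the same topic yield a similarity close to 1, which is what makes the embedding rows usable for comparing episodes across topics.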
Appendix C Topic strings and generic terms
The following Table 5 shows the mapping from each topic string to the self-defined generic term describing the 14 topics that were identified in the healthcare podcast data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dumbach, P., Schwinn, L., Löhr, T. et al. Artificial intelligence trend analysis on healthcare podcasts using topic modeling and sentiment analysis: a data-driven approach. Evol. Intel. 17, 2145–2166 (2024). https://doi.org/10.1007/s12065-023-00878-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s12065-023-00878-4