
1 Introduction

The history of automated annotation of textual documents dates back to the 1960s, when Borko and Bernick [1] applied exploratory factor analysis to the unsupervised classification of scientific publication abstracts. Nowadays, dozens of models have been developed and applied to extract topics from texts [2, 3]. In tourism, and in the social sciences in general, the most popular approach [4] is Latent Dirichlet Allocation (LDA), developed by Blei et al. [5]. However, LDA has important restrictions, which are usually ignored by authors. First, LDA relies on estimating the parameters of the document-topic and topic-word distributions, which requires documents long enough to contain a diverse mixture of topics. Second, the LDA algorithm requires a substantial corpus of textual data to estimate the underlying topic distributions precisely. Lastly, discordant or extraneous documents in the corpus, which are common in social media, degrade the quality of the inferred topics. Even when all these assumptions are met, LDA topic models are criticized for inherent instability and for the challenge of defining the “optimal” number of target topics.

In the past few years, a new crop of large language models (LLMs) such as Google’s BERT [6] has become increasingly popular, owing their success to the ability to capture context instead of considering document words in isolation. In the tourism domain, the domain-specific TourBERT model was pre-trained on tourist reviews and descriptions of tourist services, attractions, and sights [7], though we are not aware of any publication in tourism journals that utilizes it.

The explosive development of the LLM field, which drew public attention after ChatGPT became freely available through a web-based interface, has led to the exploration of LLMs’ capability to extract topics by following a set of instructions (prompts). A new discipline known as prompt engineering explores the ability of LLMs to learn new tasks from instructions and examples provided as input (prompts). The key concepts of prompt engineering are precise setting of the context, such as providing relevant facts; giving elaborate instructions; conditioning LLM behavior, e.g., by providing examples; controlling for data biases; iteratively refining LLM responses; and, finally, validating the results [8, 9].
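To make these concepts concrete, the snippet below sketches a prompt that combines several of them (context, elaborate instructions, a conditioning example, and a self-validation request). It is our own illustration; the wording and the {reactions} placeholder are assumptions, not taken from [8, 9].

```python
# A minimal illustrative prompt (our own sketch, not quoted from the literature)
# combining context, instructions, a conditioning example, and self-validation.
prompt_template = """Context: you are analyzing reader reactions to travel videos.
Instructions: assign each reaction below to exactly one topic label.
Example: "The scenery is so peaceful" -> topic: Beauty of nature
Reactions:
{reactions}
Before answering, re-check that every reaction received exactly one label."""
```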

Emerging studies hint at the feasibility of using LLM prompt engineering for topic modeling [10,11,12]. In this respect, LLMs have numerous advantages over the previous generation of topic models: they leverage the general knowledge obtained during pre-training to infer the comments’ topics, even when the data is incomplete or ambiguous; they can infer the topic of short comments by transferring knowledge from similar domains; and they are robust to noise in the data. They can handle misspellings, grammatical errors, and inconsistent punctuation, which are common in noisy documents, by capitalizing on the surrounding context and their understanding of language patterns [8, 9].

This paper is, to the best of our knowledge, the first attempt to apply an LLM (GPT-3) to the extraction of topics from a set of online feedback (reactions) from blog readers. A typical reaction is short (one sentence) and noisy (contains cultural references, slang, and typos), which makes topic extraction with traditional methods challenging. We compare the extracted topics with the results of a traditional LDA model trained on the same dataset.

2 Data and Methodology

The specific setting is online reactions to videos of the famous Chinese social media influencer Li Ziqi, who holds a Guinness World Record for the “most subscribers for a Chinese language channel on YouTube”. The focus of Li Ziqi’s videos is on rural China; their depiction of a simple yet beautiful traditional way of life evidently impacts potential tourists wishing to “visit LIZIQI’S world”. We collected all Weibo and YouTube reactions to the four most popular Li Ziqi videos, reflective of her areas of interest: rural way of life; traditional self-made culture; food and cooking; and China’s contribution to world civilization. The collected data was cleaned, and short reactions (fewer than three words) were removed. In total, 1,852 reactions in English were collected on YouTube. On Weibo, 2,980 reactions in simplified Chinese were collected and translated to English with Google Translate. The quality of the translation was verified by a native speaker.
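A minimal sketch of the cleaning step, assuming the reactions are held as plain strings; the three-word threshold comes from the text, while the whitespace tokenization is our assumption:

```python
def clean_reactions(reactions):
    """Drop reactions shorter than three words, as described in the paper."""
    cleaned = []
    for r in reactions:
        r = r.strip()
        if len(r.split()) >= 3:  # keep only reactions of three or more words
            cleaned.append(r)
    return cleaned
```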

The collected data was then processed in batches of circa 2,000 words to fit GPT-3 limits, using the following prompt: “Find the most common and prominent topics covered in the {text}. For each topic that you find print the number of occurrences of this topic.” Here, {text} represents a block of reactions. The identified topics were then merged using GPT-3, resulting in 18 major topics. Finally, the reactions were mapped back to the topics following prompt engineering best practices (abridged below; a sketch of the full pipeline follows the list):

  • goal = “match review to the best fitting review topic from a list of topics”

  • steps = “1. Break the list of reviews into separate reviews; 2. For each review find two best matching review topics from the list of review topics separated by the ‘;’ sign; 3. When there are no well-matching topics, assume that the topic is ‘Other’; 4. Print the review followed by the best matching topics”

  • actAs = “a classifier assigning a class label to a data input”

  • format = “a table with reviews in the first column …”

  • prompt = “Your goal is to {goal}, acting as {actAs}. To achieve this, take a systematic approach by: {steps}. Present your response in markdown format, following the structure: {format}. The list of review topics are as follows: {topics_str}”.

  • The prompt was followed by: “The list of reviews is as follows: {text}”
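A minimal sketch of this pipeline is given below, under stated assumptions: the paper does not name a client library or an exact GPT-3 model, so the legacy openai Completion endpoint, the model name text-davinci-003, the greedy packing heuristic, and all function names are our assumptions.

```python
import openai  # assumes the legacy openai-python Completion API used for GPT-3

def batches(reactions, max_words=2000):
    """Greedily pack reactions into blocks of roughly 2,000 words (our heuristic)."""
    block, count = [], 0
    for r in reactions:
        n = len(r.split())
        if block and count + n > max_words:
            yield "\n".join(block)
            block, count = [], 0
        block.append(r)
        count += n
    if block:
        yield "\n".join(block)

# The topic-discovery prompt quoted in the paper; {text} is a block of reactions.
TOPIC_PROMPT = ("Find the most common and prominent topics covered in the {text}. "
                "For each topic that you find print the number of occurrences of this topic.")

def extract_topics(block):
    resp = openai.Completion.create(
        model="text-davinci-003",  # our assumption; the paper specifies only "GPT-3"
        prompt=TOPIC_PROMPT.format(text=block),
        max_tokens=512,
        temperature=0,
    )
    return resp.choices[0].text

# Assembling the classification prompt from the components listed above.
goal = "match review to the best fitting review topic from a list of topics"
actAs = "a classifier assigning a class label to a data input"
steps = ("1. Break the list of reviews into separate reviews; "
         "2. For each review find two best matching review topics from the list of "
         "review topics separated by the ';' sign; "
         "3. When there are no well-matching topics, assume that the topic is 'Other'; "
         "4. Print the review followed by the best matching topics")
fmt = "a table with reviews in the first column ..."  # abridged in the paper

def classification_prompt(topics_str, text):
    return (f"Your goal is to {goal}, acting as {actAs}. "
            f"To achieve this, take a systematic approach by: {steps}. "
            f"Present your response in markdown format, following the structure: {fmt}. "
            f"The list of review topics are as follows: {topics_str}\n"
            f"The list of reviews is as follows: {text}")
```

In this sketch, extract_topics would be called once per batch, the per-batch topic lists merged with a further GPT-3 prompt as described above, and classification_prompt then applied to each batch of reactions.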

For comparison, we used the identical set of reactions to extract topics with LDA. The data was pre-processed following the best practices of topic modeling: stop word removal, bigram tokenization, and lemmatization. Then, LDA topic modeling was performed for the number of topics varying from 5 to 25. A 13-topic solution was selected for its best interpretability.
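A minimal sketch of this baseline, assuming gensim (the paper does not name a library); lemmatization is applied in the paper but omitted here for brevity, and the tokenizer and hyperparameter choices are our assumptions:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(docs):
    # Stop word removal on whitespace tokens, followed by bigram tokenization.
    tokenized = [[w for w in d.lower().split() if w not in STOPWORDS] for d in docs]
    bigram = Phrases(tokenized, min_count=5)
    return [bigram[d] for d in tokenized]

def fit_lda(reactions, num_topics):
    docs = preprocess(reactions)
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                     random_state=0, passes=10)
    return model, dictionary

# Solutions were fitted for 5..25 topics; the 13-topic model was kept for
# interpretability (`reactions` holds the cleaned strings from the sketch above).
candidates = {k: fit_lda(reactions, k) for k in range(5, 26)}
lda, dictionary = candidates[13]
```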

3 Results

Table 1 presents the LLM topics together with the validation outcomes. The quality of topic modeling was validated by a bilingual expert on a stratified random sample of 360 reactions (20 per topic). The overall accuracy of topic modeling, as conducted by the LLM, was found to be 97.7%. The most important reason for the high accuracy is improved recognition of short texts. Note that 30% of reviews were classified into the “Other” category and were not rated. In a similar way, we validated the LDA topics (Table 2). For each document, LDA returns a mix of topics; we validated the topic with the highest probability, and only when this probability exceeded 0.5. One can interpret this decision as assigning documents not strongly related to any topic to the category “Other” (42% of the dataset) and removing them from the validation process. The overall accuracy of topic assignment was 58%.
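A minimal sketch of that assignment rule, reusing the gensim model from the previous sketch; the function name, threshold parameterization, and simplified tokenization are our assumptions:

```python
def assign_topic(lda, dictionary, reaction, threshold=0.5):
    # Keep the most probable LDA topic only when its probability exceeds 0.5;
    # otherwise the document falls into the "Other" category, as described above.
    # (Tokenization here is simplified; it should match the training preprocessing.)
    bow = dictionary.doc2bow(reaction.lower().split())
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_id, prob = max(topics, key=lambda tp: tp[1])
    return topic_id if prob > threshold else "Other"
```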

Table 1. Topic validation outcomes, LLM.
Table 2. Topic validation accuracy, LDA.

4 Discussion

Given that social media reactions tend to be short, it is not surprising that the LDA topic modeling accuracy was moderate (58%); in comparison, the LLM accuracy was excellent (98%). Meanwhile, even though LDA performance in terms of assigning documents to specific topics was unimpressive, the overall set of topics is similar between LDA and the LLM. It includes themes related to Chinese culture, crafts, the beauty of living with nature, pets, and variations of expressions of praise towards the influencer. Note that the LLM-derived topics are much more specific and easy to comprehend, and did not require a tedious interpretation process.

To the best of our knowledge, this is the first attempt to use an LLM for topic modeling in the tourism domain; a much wider effort is needed to draw solid conclusions about the best practices and limitations of the methodology, as the field of prompt engineering has existed for only about a year. However, in our view, the application of LLMs to topic modeling in the tourism domain has very high potential. Our next plans are to explore LLM capabilities in the analysis of textual and pictorial tourism data, with the goals of understanding the limitations and formulating best practices.