Introduction

Cities worldwide are grappling with a core challenge: the transition towards more sustainable mobility. Current car-centred transport is linked to a number of negative externalities, including greenhouse gas emissions, air pollution, congestion, accidents, and noise pollution (Parry et al. 2007; Small and Verhoef 2007). In Europe, transport is responsible for 27% of greenhouse gas emissions, with road transport representing the greatest share of these emissions (72% in 2019) (European Environment Agency 2022). Cities are therefore rethinking their urban mobility systems, with car use declining in advanced cities that promote sustainable alternatives to the car (Jones 2014). However, we often see that the transformation process towards sustainable mobility is met with protests. This was the case in Barcelona, where the superblock model aims to decrease through-traffic (O’Sullivan 2017), or for the expansion of London’s congestion charge (Milmo 2007). More recently, this has also been the case in Brussels, Belgium, where the phased implementation of several low-traffic neighbourhoods (LTNs) resulted in sometimes violent protests (The Brussels Times 2022).

Although the preconditions for a sustainable mobility transition are known and well documented, mobility behaviour is still dominated by car use (Haustein and Kroesen 2022), and attempts at transitioning towards sustainable mobility are often undermined by public resistance. Such protests are amplified by media coverage, raising awareness of the issues at hand (Jennings and Saunders 2019). One possible explanation is that forcing a system to change can create a backlash (Rotmans et al. 2012). Zipori and Cohen (2015), for example, mention that changes must be implemented in ‘gentle’ ways to avoid resistance, but that such an approach has its limitations.

Data collection for mobility planning is still dominated by traditional methods, such as surveys, to estimate travel demand and transport supply. These methods are expensive and time-consuming (Zannat and Choudhury 2019). New demands are therefore being placed on data in terms of the amount required, as well as its accuracy and completeness (Stopher and Greaves 2007). Additionally, public participation and involvement have become core elements in transport planning, but can be challenging to achieve (Evans-Cowley and Griffin 2012). Developments in big data analysis can provide opportunities for mobility planning by complementing these traditional methods (Pucci and Vecchio 2019). One advantage of big data is that the sample size analysed can be larger than with traditional survey methods.

A particularly interesting direction to gain the necessary insights is through user-generated content (UGC), which can complement traditional data-collection methods (Martin-Domingo et al. 2019). UGC is content that individuals can make widely available without needing to go through a publisher. Through social media, this has become possible for almost anyone (Wyrwoll 2014). UGC can be analysed using text mining techniques, one of which is sentiment analysis, where positive or negative opinions about a subject are analysed (Quan and Ren 2016). Using social media data is a relatively new development in transport planning, but it shows great potential (Nikolaidou and Papaioannou 2018).

In this paper, we therefore ask the following question: “Can sentiment analysis through pre-trained language models improve our understanding of public perception of mobility measures and interventions?” This paper seeks to provide policymakers and practitioners with an understanding of alternative tools for mobility planning that provide a broader understanding of public sentiment. Our analysis focuses on Brussels, Belgium, where the recent implementation of the regional mobility plan rerouting and restricting car traffic led to opposition in several neighbourhoods. We perform sentiment analysis using two different Transformer-based pre-trained language models: XLM-T (Barbieri et al. 2022), an encoder-based model fine-tuned on collected data, and GPT3.5/4 (OpenAI 2023), a decoder-based model employed in a zero-shot manner.

This paper is structured as follows: the next section provides some background on UGC and sentiment analysis. In the subsequent section, we introduce the mobility interventions that served as the subject of our sentiment analysis in Brussels, Belgium, and we explain the methodology employed. The penultimate section presents our results, and the last section provides a discussion and some concluding remarks.

Literature Review

User-Generated Content in Transport

Understanding the sentiments of the public can be a difficult task. In recent years, UGC started playing an important role across politics, business, and entertainment (Gal-Tzur et al. 2014). The availability of UGC allows for sentiment analysis in different areas, such as the harvesting and analysis of opinions and product trends (Tuarob and Tucker 2015), or political orientations (Maynard and Funk 2012).

In transport planning, travel surveys have historically been used to collect data and guide decision making. Surveys are useful for obtaining socio-demographic information, but are labour-intensive and therefore costly. These higher costs lead to smaller sample sizes, as well as a lower update frequency of the data. Additionally, data quality issues can arise (Serna et al. 2017; Zannat and Choudhury 2019). Transport planning also faces difficulties with regard to public participation (Evans-Cowley and Griffin 2012), which is a necessary precondition for achieving sustainable mobility (Lindenau and Böhler-Baedeker 2014). UGC can provide an interesting complement to the traditional data-collection methods currently used in transport planning, as this type of data offers a high level of accuracy at a lower cost (Zannat and Choudhury 2019). As such, UGC has been used to analyse the experience of transportation services (Collins et al. 2013) and the reporting of heavy traffic (Endarnoto et al. 2011). Serna et al. (2017) employ UGC to identify sustainability issues related to urban mobility.

Yet the full potential of UGC for the transport sector has not yet been reached (Gal-Tzur et al. 2014), and planners should further develop the use of social media as a data source (Lock and Pettit 2020). According to Kuflik et al. (2017), UGC has the potential to complement, enrich, or even replace traditional data collection in the transport sector. The integration of big data in the planning process can help reduce the duration of the planning cycle (currently ranging anywhere from 5 to 20 years (Khan et al. 2014)), as well as result in more informed and agile decision-making (Semanjski et al. 2016). The use of UGC to improve transport decision making by understanding the public’s feelings towards mobility policies therefore offers an interesting avenue of research.

Sentiment Analysis on UGC

Sentiment analysis is a natural language task that analyses individuals' opinions, attitudes and emotions towards entities such as products, services, organisations, locations and events (Liu 2015). Sentiment analysis can encompass many approaches. In this work, we focus on simply classifying the polarity (i.e., positive, neutral, or negative) of text. It should be noted that there appears to be a ‘negativity bias’ within UGC, with social media being a sharing arena that reflects negative emotions (Jalonen 2014).

Various domains have successfully applied sentiment analysis to Twitter data, from understanding the public’s sentiment towards the COVID-19 pandemic (Naseem et al. 2021), to extracting trends in food consumption across the United States (Widener and Li 2014), or even predicting stock market movements (Pagolu et al. 2016). In the transport sector, Twitter-based sentiment analysis has been used to evaluate the satisfaction of transit service users in Los Angeles (Luong and Houston 2015) and Chicago (Collins et al. 2013). Collins et al. (2013) find that users are more likely to express a negative sentiment. Lock and Pettit (2020) use Twitter data to evaluate public transport performance in Sydney, Australia, and find no clear majority of either positive or negative sentiments being expressed. However, they also report that sarcasm was often not picked up, with sarcastic tweets often labelled as positive. In their study, they compare two different models to perform the sentiment analysis, and conclude that the use of multiple models adds confidence to the interpretation of their results.

Sentiment Analysis Using Pre-trained Language Models

The Transformer architecture (Vaswani et al. 2017) has been a major advance for the application of deep learning to Natural Language Processing (NLP) tasks. Pre-trained Language Models (LMs) based on the Transformer architecture, such as OpenAI's GPT4 (OpenAI 2023), Google's PaLM 2 (Google 2023), BERT (Devlin et al. 2019) or RoBERTa (Liu et al. 2019), are trained to create contextual word embeddings using large amounts of unlabelled training data. Once pre-trained, these models can be fine-tuned for specific NLP tasks, which can be mono- or multilingual. State-of-the-art performance on multilingual tasks has been pushed by pre-trained multilingual models such as mBERT (Devlin et al. 2019), XLM (Lample and Conneau 2019) or XLM-R(oBERTa) (Conneau et al. 2020).

Using social media data, specifically Twitter data, for NLP tasks suffers from drawbacks due to its uncurated nature (Derczynski et al. 2013). Tweet brevity incentivises users to compress their messages, omitting possible contextualising words (Derczynski et al. 2013). Additionally, the widespread use of slang and neologisms means Twitter data contain peculiarities which are generally not included in the training corpora of language models (Camacho-Collados et al. 2020). Emojis also play an essential role in understanding social media data, as they carry a non-negligible semantic load (Barbieri et al. 2018) and are omnipresent (Barbieri et al. 2017). This means that an NLP task such as sentiment analysis needs to consider this additional source of information when making predictions. Felbo et al. (2017) showed that training models on emoji prediction tasks improved their performance on other tasks such as sentiment analysis or sarcasm detection.

Although LMs are pre-trained on a large corpus of data, our topic-specific task poses some challenges. Judging the sentiment of a tweet requires knowledge about the domain of the subject, which is why models are further fine-tuned for a specific task and topic. Performance can be improved in different ways. A first approach consists of continued pre-training: Gururangan et al. (2020) show that further pre-training on domain-specific and task-specific data offers performance gains. In the same direction, Rietzler et al. (2019) demonstrate that the performance of BERT for Aspect-Target Sentiment Classification (ATSC), which combines aspect extraction and sentiment polarity detection, can be improved by further pre-training on domain-specific data and then fine-tuning the model on task-specific data. This combination of task and domain knowledge enhancement also works when using domain knowledge for further pre-training but fine-tuning on (out-of-domain) task data (Xu et al. 2019), as demonstrated using BERT on tasks such as Review Reading Comprehension, Aspect Extraction and Sentiment Analysis. Further training can also be done purely through fine-tuning, whether at the task level (e.g. GLUE benchmark tasks (Liu et al. 2019), intent detection/classification (Zhang et al. 2021) or text classification (Howard and Ruder 2018)) or at the domain-specific level (Araci 2019).

Enabling a modal shift depends on changes in policy and planning. However, the developments described above demonstrate that traditional data collection methods alone are no longer sufficient for transport planning. Developments in big data, and UGC in particular, can be a valuable complement to inform decision makers about public sentiment, but there is a need to understand how these methods can best support them.

Materials and Methods

Research Context

Brussels, Belgium, is home to 1.2 million inhabitants (IBSA 2022). It is historically a very car-oriented city. The World Expo of 1958 provided a push to modernize the city, resulting in modern road infrastructure to accommodate cars (Hubert 2008). In recent years, there has been a trend to reclaim some of the urban space, with the most notable project being the conversion of one of the city’s central car arteries into a pedestrian area in 2015 (Hubert et al. 2017). The city also adopted Good Move, its regional mobility plan, in 2020, which was the result of a four-year participatory process. The plan won the 2020 SUMP Award for its ambition (Bruxelles Mobilité, 2020b). It includes the implementation of one of Europe’s largest 30 km/h zones, as well as the elimination of through-traffic in multiple neighbourhoods (Bruxelles Mobilité, 2020a). In the context of the COVID-19 pandemic, which hit Europe and Belgium in the spring of 2020, changes with regard to urban mobility in the city were also accelerated, with 40 new bike lanes deployed faster than anticipated and streets closed to cars (see for example Bruzz 2020a). However, the temporary closures to cars during COVID-19 sparked some backlash (Bruzz 2020b; Macharis et al. 2021), as did the phased implementation of the LTNs planned in the context of the Good Move plan (The Brussels Times 2022). These latest protests resulted in multiple municipalities delaying or cancelling the implementation of their LTNs. However, it is not actually clear how many people objected to or supported the plans, since no surveys were carried out.

Methodological Approach

There are two main steps in our approach to sentiment analysis of UGC data in the context of mobility changes in Brussels. First, we comb through Twitter to obtain relevant tweets, i.e. tweets whose subject is one of the (future) mobility interventions in Brussels included in the regional mobility plan Good Move. In Northern Europe, 81% of the population is an active social media user (i.e., a user logging in within a 30-day period) (We Are Social & Meltwater 2023). When compared to survey data, UGC can provide complementary, faster, and specific information about a topic (Endarnoto et al. 2011). Several aspects motivate using Twitter specifically as a source for sentiment analysis. Sentiment analysis on Twitter is more straightforward than on other platforms because posts are limited to a maximum length (Nikolaidou and Papaioannou 2018). Other platforms such as Instagram, Flickr or Foursquare do not offer the possibility of opinion sharing. Facebook allows users to share their opinions, but its data have been found to be messy and poorly structured for sentiment analysis. Due to its low cost, ease of access and the presence of (public) leaders (Naseem et al. 2021), Twitter offers an approachable way to voice concerns or praise about policies. Twitter data also offer spatiotemporal information about users sharing their opinion, as tweets are tagged with their time of posting and possibly a geotag, offering additional dimensions for analysis. Lastly, Twitter is the sixth most visited website worldwide (We Are Social & Meltwater 2023).

Once we collect the tweets, we use two pre-trained language models (XLM-T and GPT4) to analyse the sentiment of those tweets. When processing textual data, LLMs are used to create contextual numerical representations, i.e. embedding vectors, of the sentences using the transformer architecture (Vaswani et al. 2017). This embedding can then be used in different ways, depending on the model architecture (see Fig. 1; Sect. 3.2.2 provides more details on our specific use of the models).

Fig. 1

Schematic view of sentiment analysis using XLM-T and GPT

For XLM-T, an encoder-based model, tweets are passed as input and the model outputs a probability for each of the three possible sentiments; the sentiment with the highest probability is then chosen as the final label. In the case of the generative language model GPT4 (OpenAI 2023), which is a decoder-based model, the output is a text constructed by predicting the most likely next word, given the input sentence and the subsequently generated words. By adding instructions to the input tweet, the model can be conditioned to output the sentiment of the given tweet. Conditioning models in this way is referred to as “prompting” (Liu et al. 2021). Besides their architecture, these models also differ in their training objective as well as the data used during pre-training. One of the largest differences is the use of Reinforcement Learning from Human Feedback (Christiano et al. 2017), which is added to fine-tune GPT4 after pre-training.
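To make the encoder-based route concrete, the following minimal sketch shows how class probabilities are obtained and the highest-probability sentiment selected. It assumes the publicly released XLM-T sentiment checkpoint on the Hugging Face Hub; the example tweet is invented.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public XLM-T checkpoint fine-tuned for tweet sentiment (negative/neutral/positive).
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

tweet = "Enfin une ville pour les gens, pas pour les voitures !"  # invented example
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # one logit per sentiment class
probs = torch.softmax(logits, dim=-1).squeeze()
label = model.config.id2label[int(probs.argmax())]  # class with the highest probability
print(label, probs.tolist())
```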

For our analyses, we first collected the tweets relevant to mobility changes in Brussels (see Sect. 3.2.1), which we then cleaned and labelled (see Sect. 3.2.2). Finally, we fine-tuned the XLM-T model and prompted GPT4 for the sentiment analysis tasks (see Sect. 3.2.3).

Tweet Corpus Creation

We collected tweets through the academic research access of the Twitter API. Academic research access allows users to perform “Full Archive Searches”, making it possible to retrieve any tweet posted since 2006. Using this access, we collected Twitter data between July 18th 2019 at 00:00 (forming of the last Brussels regional government) and December 31st 2022 at 23:59 (starting date of the analyses), Brussels local time (GMT + 1). Within this timeframe, five major mobility policy changes took place (in chronological order): (i) the regional Good Move plan came into force, (ii) the Brussels region became a 30 km/h zone, (iii) the LTN in the city centre was announced, (iv) the LTN in the city centre was implemented, and (v) LTNs in three other municipalities were implemented.
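For illustration, a full-archive collection step along these lines could be sketched with the tweepy library as below. The bearer token and query are placeholders and do not reproduce the exact queries summarised in Table 1; note also that this academic access tier has since been discontinued by Twitter/X.

```python
import tweepy

# Placeholder credentials; the real queries are summarised in Table 1.
client = tweepy.Client(bearer_token="BEARER_TOKEN", wait_on_rate_limit=True)

query = '("good move" OR goodmove) (bruxelles OR brussel) -is:retweet'
for page in tweepy.Paginator(
    client.search_all_tweets,                  # full-archive search (academic access)
    query=query,
    start_time="2019-07-18T00:00:00+01:00",    # forming of the regional government
    end_time="2022-12-31T23:59:00+01:00",      # start of the analyses
    tweet_fields=["created_at", "lang", "author_id", "geo"],
    max_results=500,
):
    for tweet in page.data or []:
        print(tweet.created_at, tweet.text)
```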

Due to the important multilingual element of Brussels, we performed searches for tweets in three languages: French, Dutch (official languages of the region) and English. To cast the widest net possible while limiting the collection of irrelevant tweets, we performed our search in three steps. Starting from a combination of baseline keywords relating to mobility changes, we combined these with three additional search criteria: person/official instance-based (i.e. people implementing or facilitating the changes), location-based (i.e. the places where change happens; choosing municipalities instead of only Brussels casts a wider net) and keyword-based. The specifics of each search can be found in Table 1. In our search, we did not select tweets based on geolocation, as previous research showed only around 0.85% of tweets are geo-tagged (Sloan et al. 2013), limiting the pool of potential data. Additionally, since residents of Brussels are not the only ones expressing opinions about mobility changes in the city, filtering based on keywords offers a broader view. As our focus lies on the opinion of users, we excluded tweets created by accounts of media institutions, automated accounts, and known parody accounts, which were identified based on preliminary searches.

Table 1 Overview of tweet queries

Data Cleaning and Labelling

After collecting 2425 tweets, two researchers of the team independently and manually labelled them over the course of approximately 30 h. This labelling occurred on the uncleaned tweets. Each tweet was attributed a label from the following possibilities: Negative (0), Neutral (1), Positive (2) and Irrelevant (3). Tweets were deemed irrelevant and removed from the dataset if their subject was not related to mobility changes in Brussels. For each tweet, we labelled two sentiments: the general sentiment of the tweet and its sentiment towards the planned or implemented mobility changes.

The availability of correctly labelled data is a crucial aspect of the performance of supervised machine learning across a wide range of domains (Halevy et al. 2009; Northcutt et al. 2021). Although the traditional learning problem setting works on the assumption of noiseless and correct labels (Bootkrajang and Kaban 2011), annotation inconsistencies can occur even when labelling is performed by field experts (Sylolypavan et al. 2023). To minimize the impact of variability between the two independent annotators, we first manually labelled 300 random tweets from the dataset and computed Cohen's kappa (Cohen 1960), which measures the agreement between annotations. A Cohen's kappa value between 0.41 and 0.60 denotes moderate agreement, while a value between 0.61 and 0.80 shows substantial agreement between the annotators (Viera and Garrett 2005). Table 2 contains the values obtained for both the general sentiment and the sentiment towards mobility changes. The moderate kappa values obtained indicated a difference in labelling. To remedy this, a discussion concerning the discrepancies and labelling strategies was held, focussing on the strategy for sarcasm and for tweets containing media titles. To evaluate this adapted strategy, we again labelled 150 random tweets. The kappa values for this second round showed more substantial agreement between the annotators for both categories. Notably, the annotators labelled only one tweet with completely opposite sentiments (i.e. positive and negative), indicating that the disagreement stemmed primarily from one annotator labelling tweets as neutral while the other assigned positive or negative labels. The remaining 2275 uncleaned tweets were then annotated independently by both researchers.

Table 2 Cohen's kappa inter-annotator agreement before (run 1) and after (run 2) the debrief, for the general sentiment label (\(\kappa\)) and the mobility change sentiment label (\(\kappa_{MC}\))
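For reference, the agreement statistic can be computed with scikit-learn; the annotator labels below are invented for illustration only.

```python
from sklearn.metrics import cohen_kappa_score

# Invented annotations: 0 = negative, 1 = neutral, 2 = positive, 3 = irrelevant.
annotator_a = [0, 1, 2, 2, 1, 0, 3, 1]
annotator_b = [0, 1, 2, 1, 1, 0, 3, 2]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.41-0.60 moderate, 0.61-0.80 substantial agreement
```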

Pre-trained Language Models

After cleaning the dataset, we fine-tune XLM-T (Barbieri et al. 2022), a pre-trained multilingual LLM based on XLM-R(oBERTa) checkpoints. We use XLM-T due to its multilingual capabilities and because it has been further pre-trained on 198 M multilingual tweets and fine-tuned for sentiment analysis. Given the limited amount of data we collected relative to the size of XLM's original training dataset, using XLM-T is essential, as it has already been trained on the intricacies of Twitter data. To analyse the sentiment of our dataset, we accessed the XLM-T (Barbieri et al. 2022) checkpoints through the Huggingface API, which we then further fine-tuned for our particular task during 10 epochs, with a batch size of 64 and a learning rate of \(5 \cdot 10^{-6}\). We also employ a polynomial learning rate scheduler, activating it after 50 warmup steps. In order to train and evaluate our model, we split the dataset into a train (75%), validation (15%) and test (15%) set, using the latter to report performances.
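A minimal fine-tuning sketch with the Hugging Face Trainer, using the hyperparameters reported above, could look as follows. The checkpoint name is the public XLM-T sentiment model; `train_ds` and `val_ds` stand in for our tokenized splits and their construction is not shown.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # public XLM-T checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

args = TrainingArguments(
    output_dir="xlmt-goodmove",
    num_train_epochs=10,
    per_device_train_batch_size=64,
    learning_rate=5e-6,
    lr_scheduler_type="polynomial",   # polynomial decay after warmup
    warmup_steps=50,
    evaluation_strategy="epoch",
)

# train_ds / val_ds: tokenized datasets with a "labels" column (construction not shown).
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```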

In addition to fine-tuning an encoder-based LLM, we also test the capabilities of GPT4 by OpenAI (OpenAI 2023), one of the most powerful LLMs in existence at the time of writing. Instead of being fine-tuned, GPT4's success is often attributed to the fact that it can be conditioned for a task either by being presented examples and instructions of the task (few-shot) or by only adding instructions (zero-shot) (Kojima et al. 2023). Few-shot prompting relies on the model’s capability for few-shot learning (Brown et al. 2020), where the model is fed instructions in natural language together with a (small) number of examples. However, the potential of zero-shot methods, where the model is only fed instructions about the task, has recently been demonstrated (Kojima et al. 2023), in particular in the context of reasoning tasks. For our task, we employ a zero-shot prompting method for text classification presented in Sun et al. (2023).
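The zero-shot setup can be sketched as follows. The prompt is a simplified CARP-style instruction and does not reproduce the exact wording of Sun et al. (2023); the openai client usage reflects the library at the time of writing.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Simplified CARP-style instruction; not the exact prompt from Sun et al. (2023).
PROMPT = (
    "You will be given a tweet about mobility changes in Brussels (Good Move plan).\n"
    "1. List clues (keywords, tone, sarcasm, references) about the sentiment expressed\n"
    "   towards the mobility changes themselves, not the general mood of the tweet.\n"
    "2. Reason step by step from these clues.\n"
    "3. End with exactly one label: positive, neutral or negative."
)

def classify(tweet: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce output variability for reproducibility
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": tweet}],
    )
    return response.choices[0].message.content
```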

Results

After cleaning and labelling, we obtained a total of 1998 tweets, originating from 895 unique users. This means that, even with careful planning of data collection using the Twitter API, 16% of tweets were deemed irrelevant. Other research using Twitter data in various fields has also found post-collection processing to be necessary: Xia et al. (2021) found 56.6% of collected tweets to be irrelevant when gauging perceptions of the USA election, Dahal et al. (2019) found 7% when studying climate-change-related tweets, and Wan and Gao (2015) found 24.8% when measuring sentiment towards airline services. This already indicates that pipelines implementing an analysis of UGC related to specific themes require manual controls. Figure 2 shows that our dataset is heavily skewed towards tweets posted between August and October 2022, correlating with high-impact mobility changes, which generated considerable commotion (The Brussels Times 2022).

Fig. 2

Distribution of the obtained tweets by date of creation and relative proportion of the labels for each time slice (months). The column for July 2019 is narrower to reflect that data collection started on the 18th rather than covering a full month

From the distribution of labels of the sentiment regarding the mobility changes, we can see that the increase in tweets posted is not due to an increase in tweets with only a negative sentiment. Another way of looking at our dataset is by analysing the evolution of the sentiment distribution over time. Figure 2 also shows that the distribution remains largely stable, except between August and October 2022, where the share of tweets containing negative sentiments increases. For other notable interventions, such as the implementation of a general 30 km/h zone (January 2021) and the announcement/approval of an LTN in the centre of Brussels (October 2022), we see that tweets with negative sentiments towards mobility changes do not dominate the increase in absolute counts. Looking at the percentage distribution, we can also notice that the announcement of an LTN (event number 4 in Fig. 2) generated fewer negative responses than the implementation of that LTN (event number 5).

An essential aspect of our labelling is the difference between a tweet’s sentiment and the sentiment expressed towards mobility changes in the tweet, which do not necessarily correspond. This is the case in around 30% of our dataset. A possible example of such a situation is a tweet which expresses joy when mobility changes are halted or rolled back. In this case, although the tweet sentiment is positive, it represents a negative sentiment towards the (planned) mobility changes. Two concrete examples of such tweets are displayed in Table 3, and Table (SM) 1 in the appendix contains an example for each quadrant of the matrix displayed in Fig. 3. Figure 3 shows that, for our dataset, this mismatch occurs most often when the tweet’s sentiment is negative. In contrast, positive tweets correlate more often with a positive sentiment towards mobility changes. These discrepancies between labels form an additional difficulty when using language models for automatic labelling, as the sentiment towards the mobility changes is the most relevant for policymakers. However, correctly classifying the tweets based on this label requires implementing an understanding of the context in the model.

Table 3 Example tweets for which the sentiment (S) and the mobility changes sentiment (SMC) do not match
Fig. 3

Confusion matrix for the labelled tweet sentiments and sentiment expressed towards the mobility changes

General Sentiment

First, we analysed the model’s performance when classifying the general sentiment of the tweets. The results of the different optimisation methods are shown in Table 4. Even though XLM-T is a model which has been pre-trained and fine-tuned for a sentiment analysis task, we see that the performance when no domain-specific fine-tuning has occurred is fairly low. The confusion matrices in Fig. 4 show that this poor performance is due to the model classifying tweets as neutral more often than our ground truth labelling, which occurs to a greater extent for tweets labelled as negative.

Table 4 Performance scores of XLM-T for the three different tasks, in a zero-shot evaluation as well as after domain-specific fine-tuning
Fig. 4

Confusion matrices of the true and predicted labels for the test set: (a) when applying the model in a zero-shot way; (b) after fine-tuning the entire XLM-T model

Once we trained the model on domain-specific data, we obtained an accuracy of 0.67. Together with the zero-shot performance, this indicates that although XLM-T has been fine-tuned for sentiment analysis on tweets, it tends to label tweets as neutral when it has not been presented with domain-specific training data.

Using the F1 score of the model, we can compare its performance with that of XLM-T on a general multilingual benchmark dataset, where an average F1 score of 69.35 is obtained (Barbieri et al. 2022). Our results in Table 4 indicate that our model and training procedure perform within expectations, considering the domain-specific and multilingual aspects of our task.
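The reported metrics follow standard definitions; with scikit-learn they can be reproduced as below, where `y_true` and `y_pred` stand in for the gold and predicted test-set labels (assumed given).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# y_true / y_pred: gold and predicted labels on the held-out test set (assumed given).
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # comparable to Barbieri et al. (2022)
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))      # rows: true, columns: predicted
```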

Mobility Change Sentiment

In a second phase, we applied XLM-T to classify the sentiment of the same tweets, but this time specifically towards the mobility changes, using the same training procedure as before. To correctly identify this sentiment in the tweets, contextual knowledge is essential. Results obtained when applying XLM-T in a zero-shot approach confirm this, as the accuracy is close to random guessing (see Table 4). Even after training, the model does not reach the same performance as on the original sentiment task. If we only consider the tweets in the test dataset whose sentiments do not match, XLM-T reaches an accuracy of 37%, far below its accuracy on the entire dataset. The greatest difficulty for our pre-trained model when labelling is thus inferring the implicit context of a tweet.

Additionally, some tweets in the dataset were attributed a ‘context’ label during manual annotation, indicating the presence of a URL, image or other external addition, which was deemed necessary to correctly label the sentiment of the tweet towards the mobility changes. An example of this is an image showing a street containing heavy traffic, where the tweet text is (paraphrased) “Thanks GoodMove!”. To assess the dependency of the model on this category of tweets, we repeated the process of fine-tuning and evaluating while removing these tweets from the dataset. The model's marginal performance increase (see Table 4) indicates that the challenges with labelling are not solely attributable to tweets requiring contextual information.

Sentiment Analysis Using GPT

As GPT is based on a decoder architecture, it can generate any text as a response to a task, in contrast with XLM-T, which outputs only class labels. Using this capability, we labelled the tweets using a novel approach. Instead of classifying the tweets into three distinct categories (positive, neutral and negative), we prompt GPT3.5 and GPT4 to attribute to each tweet a score between − 1 and 1. This score reflects how negative (when closer to − 1) or positive (when closer to 1) the tweet’s sentiment towards the mobility plan is. An important remark is that these scores do not represent a confidence level in the sentiment, but a value quantifying the intensity of the sentiment expressed. Adding this dimension offers a more nuanced and less binary classification when performing sentiment analysis, which is more in line with how (dis)satisfaction is expressed by citizens. To compare the performance of the models, we then translated these scores into labels. The classification task we perform is enhanced using Clue And Reasoning Prompting (CARP) (Sun et al. 2023), which yielded state-of-the-art performance on text-classification benchmarks. CARP enhances the ability of the model by asking it to construct an answer containing clues and a line of reasoning, on which it then bases its sentiment prediction. We also explicitly mention in the prompt that the model should classify the tweets based on the opinion they express towards the mobility changes.

Our results (see Table 5) show that GPT4 largely outperforms GPT3.5 and XLM-T when classifying sentiments towards the mobility changes, obtaining an accuracy of 0.66. This difference in performance is even more pronounced when compared to the zero-shot performance of XLM-T, which obtained an accuracy of 0.39. An accuracy of 0.66 also comes close to the accuracy of XLM-T when classifying the general sentiment (Table 4), an easier task for which it was fine-tuned (as demonstrated by the accuracy of 0.58 XLM-T obtains when classifying the mobility sentiment). This demonstrates the potential of GPT4 for classifying implicit sentiment, a crucial aspect of UGC data related to transport and mobility changes.

Table 5 Accuracy scores for the three models XLM-T, GPT3.5-Turbo and GPT4 on the mobility sentiment classification task. Mismatched tweets are tweets whose intrinsic sentiment and sentiment towards the mobility changes do not match

Additionally, the accuracy of GPT4 on tweets with a mismatched sentiment indicates that it might be better suited to extracting the implicit sentiment expressed in a text. This is illustrated by the response GPT4 provided when correctly labelling a mismatched tweet, shown in Table 6. Although the model recognizes the negative emotions expressed in the tweet, it correctly identifies the underlying positive sentiment towards the mobility changes and attributes a positive score. This capability to reason over contextualised information is in line with the proposition that GPT4 shows “sparks of general intelligence” (Bubeck et al. 2023).

Table 6 Tweet and response of GPT4 after CARP prompting. Note the reasoning of GPT4 where it classifies the tweet as positive towards the mobility changes, even though the language used initially leaned towards a negative sentiment

Looking at the score distribution (Fig. 5), we observe that both GPT3.5 and GPT4 underuse the ranges [− 0.4, − 0.1] and [0.1, 0.4] when scoring the sentiment. Instead, neutral tweets were characterised solely by a score of 0. We note that this behaviour originated purely from the models themselves, as the prompt instructed them to use the entire range. Due to this behaviour, we translated negative scores into negative labels, zero scores into neutral labels and positive scores into positive labels. Finally, we can also see a tendency of GPT3.5 and GPT4 to over- or underuse certain scores, such as − 0.9, which GPT3.5 overused while GPT4 avoided it and its positive equivalent. From a machine learning perspective, this raises questions about a possible underlying bias of these LLMs. This phenomenon could indicate limitations in the models' capacity to capture or interpret certain sentiments, and further investigation could prove beneficial for future tasks.
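The score-to-label translation described above amounts to simple sign-based binning, sketched here:

```python
def score_to_label(score: float) -> str:
    """Translate a GPT-assigned score in [-1, 1] into a polarity label."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

assert score_to_label(-0.9) == "negative"
assert score_to_label(0.0) == "neutral"
assert score_to_label(0.4) == "positive"
```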

Fig. 5

Distribution of scores when classifying the sentiment of tweets with respect to the mobility changes by (a) GPT3.5-Turbo and (b) GPT4. Tweets with a score of 0 were considered neutral, scores in [0.1, 1] as positive and scores in [− 1, − 0.1] as negative

Discussion and Conclusions

Through our research, we aimed to explore the usability of sentiment analysis through deep learning methods in transport planning, in order to incorporate the views of the population into decision-making.

A first important observation can be made with regard to our results. After the press coverage of the vocal negative reactions and the loud protests in the fall of 2022, more than one municipality in Brussels paused the implementation of the regional mobility plan (BRUZZ 2022). However, when looking at the data from Twitter users, we see that the overall sentiment of our Twitter population is predominantly positive towards the implementation of the sustainable mobility plan, which provides a more nuanced perspective than the press coverage. This is contrary to our expectations, as other studies have found that social media is mainly used to express negative sentiment (Jalonen 2014). Although the sentiment of Twitter users is certainly not representative of the sentiment of the whole Brussels population, the implicit assumption that citizens are against the Good Move mobility plan because there were public outcries cannot be verified and should be nuanced. Our results also show a correlation between the number of tweets available and the mobility interventions in the city, which shows that UGC can be an interesting and relevant complementary source of data for policymakers. For an analysis similar to ours, the initial fine-tuning of the XLM-T model does require time and effort, but once trained, it can be reused on multiple occasions. Although our analysis was limited in the number of tweets, we demonstrated its feasibility by comparing the models' output to a ground truth originating from manual labelling. Future work can replicate this in contexts where manual labelling is not possible, i.e. with larger datasets, as we have shown that current LLMs are already quite powerful.

From our results, we see that GPT offers a good alternative for providing an analysis without the need for training, since GPT4 obtained the highest accuracy when classifying the sentiment of tweets with regard to mobility changes in a zero-shot way. This can make this type of analysis more accessible in the context of policy making, as it removes the need for experience in training and fine-tuning models. Sentiment analysis using models like GPT can therefore be implemented more easily for policy making, as it removes a time-consuming and costly aspect of using other pre-trained models that still require fine-tuning. Apart from practical considerations, GPT4's outperformance of the fine-tuned XLM-T model for detecting implicit sentiment shows that such decoder-based models are naturally better suited for this task. With well-thought-out prompts, decoder models also provide more information than a simple classification, rendering the output more transparent, as demonstrated in Table 6. Finally, we also showed that these models can be used to attribute scores to text when performing sentiment analysis, introducing a novel dimension to these kinds of analyses.

Importantly, from our results, we can say that some level of local knowledge is needed to obtain relevant content. In the tweet selection process, for example, we used the names of some politicians, as well as specific geographic locations in Brussels. If this type of analysis is to become relevant for policymaking, enough time should be spent on the inclusion criteria for the data to be used. Selection criteria can also introduce a bias into the UGC collected, as the inclusion or exclusion of certain keywords can skew the data towards expressing more of a certain sentiment. For policymakers, local knowledge is therefore crucial in the data selection process to provide a holistic view of a problem.

It should also be noted that, although social media provides access to a larger dataset than could be collected through traditional sources, it does tend to exclude some users, e.g., older people (Nikolaidou and Papaioannou 2018) and people with low digital skills or no access to the internet. Additionally, the sentiments expressed on Twitter are limited to social media users and may not fully represent the broader public, since social media users are not a randomised sample of the population. It is therefore important to complement social media data with other data sources, both other types of UGC and data collected using traditional methods, to ensure broad representativeness for policymaking purposes.

A peculiar aspect of our dataset was the presence of two distinct sentiments in the tweets: one inherent to the vocabulary used in the tweet, the other the implicit sentiment expressed towards specific mobility changes. While the majority of the tweets collected had matching sentiments, around 30% expressed a different sentiment towards the mobility changes than one would extract from the vocabulary used. These tweets formed a difficult hurdle for all models. GPT4 obtained an accuracy moderately above random chance on those tweets, in contrast with its performance on the total test dataset. XLM-T obtained similar performance on these types of tweets, but only after fine-tuning the model on domain-specific data. Our results therefore show that detecting contextual sentiment expressed in a text is a task for which pre-trained language models still require improvement. Future work focusing on this specific type of data could therefore yield important benefits when using NLP methods in societal contexts, where there is often a mismatch between inherent and implied sentiment. By combining these advancements in language models with the effective integration of UGC data, policymakers can attain a more comprehensive understanding of public sentiment, thereby facilitating the shift towards sustainable mobility.