1 Introduction

Recommendation systems (RSs) are tools that understand and learn users’ behaviours and preferences from the analysis of historical data, with the aim of providing tailored and relevant suggestions about a collection of items. RSs are becoming an essential tool to guide users through huge catalogues of items. Indeed, the growing proliferation of big datasets has highlighted the pressing need for intelligent techniques able to guide exploration and sense-making, turning data access from a burden into an opportunity. For this reason, RSs have been successfully applied in many different application domains, from e-commerce and on-demand TV shows to touristic scenarios.

The collection and analysis of historical data about user behaviour is the core of any recommendation system, which needs to understand the preferences and tastes of users in order to suggest to them the best next item in a collection. Many solutions in the literature demonstrate that the integration of contextual information can improve the relevance of recommendations, both for single users and for groups of users, since their behaviours and preferences are often affected by the situation they are acting in Adomavicius and Tuzhilin (2015); Chen and Chen (2015); Migliorini et al. (2022); Villegas et al. (2018); Zheng et al. (2021). At the same time, RSs could be profitably adopted by service providers or suppliers not only to increase their knowledge about users, but also to guide users towards particular items in their preference list. In the tourist domain, the ability to guide user choices toward specific attractions and prevent overcrowding has become particularly important in recent years. Indeed, the need to restrict the number of people that can access the same PoI at the same time has become prominent, both for epidemic prevention and control and for social security. A similar requirement can also be found in TV on-demand platforms, where, in order to optimize the broadcast service or to promote particular items, there is a need to suggest new and lesser-known shows instead of more popular ones to create balance. Therefore, a natural extension of currently available RSs is the ability to predict the level of crowding of a given item and to redirect people to other attractions, which are equally appreciated but less in demand at that moment (Belussi et al., 2022; Migliorini et al., 2018, 2021).

The aim of this paper is twofold: (a) firstly, we investigate the impact of contextual information on both the formulation of user preferences and the crowding forecast. (b) Secondly, we propose the architecture of a Context-Aware Recommendation System with Crowding Forecasting, called ARTEMIS, which is able to produce tailored suggestions by also considering the expected level of occupation of each PoI. ARTEMIS includes two main components, a crowding forecaster and a preference model, both of which are able to exploit rich contextual information. The aim of this second contribution is to define a generalized architecture that can be used in several application domains characterized by heterogeneous features, different amounts of available historical information, and different availability of contextual information. Throughout the paper, we use the tourism scenario, in which the system suggests the next PoI to visit. However, the system can be easily extended or adapted to other real-world domains by properly customizing the notion of crowding and the kind of collected historical data.

This paper extends the work published in Belussi et al. (2022), which focused on a crowding forecaster component, in two main directions: it analyses the role of context also in the formulation of user preferences and it provides a complete architecture for a context-aware recommendation system. More specifically, as regards the first aspect, in this paper we consider anonymous users, namely users for which we do not know the exact identity or past tastes, but for which we define a notion of similarity with other users based on the similarity between the choices they made in the same context. Moreover, with regard to the second aspect, a prototypical implementation of ARTEMIS as a web application is also presented in Section 7 to demonstrate the possible interaction with the end-user.

Fig. 1 ARTEMIS architecture

ARTEMIS architecture – The general architecture of the proposed solution is illustrated in Fig. 1 and is articulated into three main components. The first component, C1, is the crowding forecaster: in the first phase, represented by box B1, the historical user data and the related contextual data are integrated and processed to produce an enhanced dataset of integrated contextual historical data (operation \(op_1\)). This new dataset is used as a training input for the crowding forecaster. The result of this phase is a model M1 which is able to produce an estimation of the level of occupation of a given PoI in a given context. The trained model M1 is then used in a second phase, box B2, which, when given a query represented by a PoI p and a context \(\gamma \), produces an estimation of the level of occupation of p in \(\gamma \). In detail, at specific time instants, such level of occupation is computed (or updated) in the background and stored in a database also containing the description of the PoIs (see T-DB in component C3). The second component C2 is the contextual preference estimator: in the first phase, represented by box B3, the log data regarding past visiting experiences performed by tourists are enriched with contextual information (operation \(op_3\)) and then this enhanced dataset is used for training a preference model M2. The trained model M2 is used in a second phase, box B4, to produce an estimated contextual preference for a user u located in a certain position l in a certain context \(\gamma \). The third component C3 combines the information produced by the contextual crowding forecaster and the contextual preference estimator in order to produce the final set of recommendations which will be returned to the user through a user interface or application (box B5).
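The interaction among the three components can be summarized by the following minimal Python sketch. All class, method, and variable names (CrowdingForecaster, PreferenceEstimator, recommend, ctx, etc.) are illustrative and are not part of the actual ARTEMIS implementation; the combination of the two scores anticipates the weighting formalized in Section 3 (Eq. 5).

```python
class CrowdingForecaster:          # component C1 (trained model M1)
    def predict_occupancy(self, poi_id, ctx):
        """Expected occupancy of `poi_id` in context `ctx`, normalized to [0, 1]."""
        raise NotImplementedError   # backed by the ML/DL models of Section 5

class PreferenceEstimator:         # component C2 (trained model M2)
    def predict_preference(self, location, poi_id, ctx):
        """Estimated preference of an anonymous user at `location` for `poi_id` in `ctx`."""
        raise NotImplementedError   # backed by the classifier of Section 6

def recommend(location, ctx, pois, forecaster, estimator, n=5):
    """Component C3: rank PoIs by contextual preference weighted by expected crowding."""
    scores = {
        p: estimator.predict_preference(location, p, ctx)
           * (1.0 - forecaster.predict_occupancy(p, ctx))   # anticipates Eq. 5
        for p in pois
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]
```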

Table 1 Comparison of the related works considered in this paper

The remainder of this paper is organized as follows: Section 2 discusses the state-of-the-art use of context in recommendation systems and crowding forecasting, while Section 3 formalizes the considered problem. Section 4 analyses the role played by the notion of context in defining dynamic user preferences and PoI occupation levels. Section 5 compares different machine learning and deep learning solutions for implementing the crowding forecaster, while Section 6 performs a similar comparison for the preference estimator; Section 7 then combines these two components and presents the architecture of the ARTEMIS framework. Finally, Section 8 concludes the work and discusses some possible future extensions.

2 Related Work

This section summarizes some related work in the field of recommendation systems, with particular attention to those developed in the tourist domain and those considering the level of occupation of attractions. Table 1 summarizes their characteristics with respect to the considered contextual dimensions, the introduction of the notion of crowding, and the kind of approach.

Tourist RS – The development of RSs for the tourist domain has received a lot of attention in recent years, and therefore many surveys are available in this field. In Islam et al. (2020) the authors summarize the solutions based on deep learning techniques for suggesting the next PoI to visit. In detail, the authors evaluate the performance of different solutions and study the factors that mainly influence the recommendation of a given PoI. The compared neural networks are the traditional feed-forward neural network, the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), Attention Mechanisms, and the Generative Adversarial Network. The datasets used are taken from check-ins collected through many different LBSNs (Location-Based Social Networks), like Instagram and Twitter. Similarly, in Borrás et al. (2014), the authors compare many previous works on the use of Artificial Intelligence techniques in the tourist domain, considering not only the techniques and algorithms used but also the user interfaces and the interaction with the final user.

The use of Reinforcement Learning (RL) techniques for producing recommendations in the touristic domain is proposed in Jiang et al. (2021), where RL is used to determine the correlation between PoIs and satisfy some predefined constraints. Conversely, in Wang et al. (2020) the use of RL is proposed to study the behaviour of tourists with the aim of replicating their choices and generating predictions about their future actions.

The approach proposed in this paper differs from the existing deep learning and RL solutions in the attention and importance given to the notion of spatio-temporal context in determining both the level of crowding and the user preferences.

Contextual RS – Context-aware recommender systems have received a lot of attention in recent years since their aim is to decrease recommendation errors and improve the system’s quality by enriching available item and user information with contextual data (Adomavicius and Tuzhilin, 2015; Chen and Chen, 2015; Rahmani et al., 2022; Zheng et al., 2021). In Migliorini et al. (2022, 2019) the notion of context is used to infer more precise preferences by suggesting the best sequence of items to a group of users whilst considering the current context they are acting in. More specifically, in Migliorini et al. (2019) the recommendation process is formalized as a multi-objective optimization problem and solved with a MapReduce application of the Multi-Objective Simulated Annealing (MOSA) technique. This solution is further extended in Migliorini et al. (2022) to deal with the dynamic evolution of groups of users and the balancing of preferences inside groups.

In the urban tourist scenario, and in particular in PoI recommendation, multiple factors may be considered, although most of the approaches developed in the past mainly exploit three contextual dimensions, namely time, geolocation, and social conditions (e.g. PoIs visited alone, in groups, with kids) (Li et al., 2015; Cheng et al., 2012; Ye et al., 2011). In general, additional factors may be included in the recommendation process; indeed, in Massimo and Ricci (2021); Trattner et al. (2016) the authors consider weather conditions in PoI recommendations. They propose novel recommender systems to suggest the next PoIs to visit that match users’ preferences (what to visit) and the specific context of the visit (how to visit each PoI). However, the contextual information is quite limited and restricted to only a weather feature in Trattner et al. (2016), and to an hourly weather summary (e.g. cloudy), temperature (e.g. cold), and temporal information such as the time interval related to the visit (e.g. evening) in Massimo and Ricci (2021). Another very recent proposal that does consider weather conditions as an important contextual feature to refine a dynamic and fine-grained user preference and PoI popularity model for recommending sequences of PoIs is Chen et al. (2023).

Although these proposals deem the notion of context important, they do not consider other external information like the occupancy rate of each PoI, the day of the week, the presence of holidays, or other important events in the considered city. Since our main aim is to prevent a high level of crowding in each PoI and consequently suggest interesting but less crowded PoIs to tourists, we need to be able to consider all the contextual features that may influence PoI visits.

In Yuan et al. (2013) the authors propose a methodology for suggesting the next PoI to visit which considers both the specific moment of the day in which the visit will be performed and the location of the attraction. The datasets used are extracted from the main available LBSNs, and the obtained results provide an improvement of 37% in accuracy with respect to the baseline techniques which do not consider the temporal context. Another work that takes advantage of datasets collected via LBSNs to provide recommendations is Hu et al. (2021). The proposed model considers various factors to provide personalized suggestions, including user preferences, geographical influence, and social influence. However, the contextual dimensions used in both (Yuan et al., 2013; Hu et al., 2021) are very limited with respect to the ones considered in this paper, wherein the temporal characterization is enriched with other semantic information, like the presence of holidays, and accompanied by other knowledge, like the weather conditions, the degree of crowding, and the user’s spatial location.

In Zhou (2020) the authors deal with the problem of generating recommendations in situations where both users and items change dynamically and continuously, so recommendations need to adapt to new conditions as new information arrives. The proposed solution makes use of a deep neural network, and some of its insights are also used in this paper to manage unknown users for whom we do not have much information about tastes.

An innovative proposal is Li et al. (2020), a model that integrates the user’s sentiment information with spatial-temporal contexts into a neural network to improve the accuracy of PoI recommendations. User sentiment in the tourism domain has also been considered in Xing et al. (2019), where a Convolutional Neural Network for PoI review texts is introduced and integrated into the recommendation system to obtain user sentiment indications, user preferences, and PoI properties. Once again, semantically richer contextual features may interact with tourists’ sentiments and their preferences and should be investigated.

Crowd-aware RS – Tourism occupancy is a research topic that has been investigated in recent years. In Migliorini et al. (2021, 2018) the prediction of PoI occupancy is obtained only on the basis of historical accesses, without enriching the data with contextual features, and it is then used to balance travelers without considering their personal preferences. The authors in Zheng et al. (2021) include spatial factors, besides fine-grained temporal information, to forecast tourism demand towards multiple attractions.

In Shao et al. (2021) the authors predict parking occupancy in areas where the availability of parking data produced by sensors is limited and needs to be integrated with heterogeneous contextual information, like the presence of PoIs. However, the authors do not include the analysis of user contextual preferences to suggest free parking slots. In Alvarez-Lozano et al. (2015) the authors predict the location of mobile users in the near future under a strong assumption, i.e. that users exhibit a different mobility pattern for each day of the week. We will demonstrate that the mobility patterns in the tourism domain are influenced by external contextual factors, and not only by the day of the week and the time slot. A similar problem regarding the contemporaneous visit of the same PoI by different users has also been considered in Kong et al. (2022), where the authors propose a multi-agent reinforcement learning algorithm with dynamic reward, which is able to evenly distribute tourists among the various PoIs. Different machine learning models have been tested in Bollenbach et al. (2022) to predict the occupancy of a single semi-open PoI under different factors, i.e. weather conditions, holidays, and temporal data. The generality of our approach allows us to discover the possible correlations between contextual factors and PoI categories and can consequently be applied to provide suggestions, based on the category, also for PoIs not considered during the training phase.

3 Problem Formulation

This paper considers the touristic domain as the application scenario for our context-aware recommendation system. Therefore, the formulation that follows is tailored for that scenario, but can be easily extended to other domains. Table 2 summarizes all the symbols used in this formalization.

Definition 1

(Touristic Visit) Given a set of PoIs \(\mathcal {P}\) and a set of users \(\mathcal {U}\), a visit performed by a user u is represented by a tuple:

$$\begin{aligned} v = \langle u, p, t, g \rangle \end{aligned}$$
(1)

where: \(u \in \mathcal {U}\) is a user identifier, \(p \in \mathcal {P}\) identifies the PoI, t is a timestamp representing the date and time of the visit, and g is the spatial position where the PoI is located.

A spatial position g can be represented in different ways, for instance as a pair \((lat, long) \in \mathbb {R} \times \mathbb {R}\) containing the latitude and longitude of the location, respectively. In the following, the symbol \(\mathcal {L}\) will be used to denote all possible spatial locations in the chosen reference system, while the set of all visits performed by users in \(\mathcal {U}\) will be denoted as \(\mathcal {V}\).

With reference to our considered dataset, containing the visits performed by tourists in Verona through a city pass called VeronaCard, a touristic visit is represented as in Eq. 1 with the tuple \(v = \langle vc39201234, \text {Arena Amphitheatre}, \text {14-02-2021 10:30}, (45.4392, 10.9943) \rangle \).
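As a minimal illustration, a visit record as in Def. 1 could be represented in Python as follows; the class and field names are illustrative and not taken from the actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Visit:
    """A touristic visit v = <u, p, t, g> as in Def. 1 (field names are illustrative)."""
    user: str                 # u: user identifier
    poi: str                  # p: PoI identifier
    timestamp: datetime       # t: date and time of the visit
    position: tuple           # g: (latitude, longitude) of the PoI

v = Visit("vc39201234", "Arena Amphitheatre",
          datetime(2021, 2, 14, 10, 30), (45.4392, 10.9943))
```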

Table 2 Summary of symbols used in the formal notation

Historical data about past touristic visits can be enriched with some contextual information, better characterizing the conditions in which the visit has been performed.

Definition 2

(Contextual information) Given a visit \(v \in \mathcal {V}\), we define its context \(\gamma \) as a tuple of values for some relevant dimensions \(\delta = \langle d_1, \dots , d_n \rangle \) as follows:

$$\begin{aligned} \gamma = \langle c_1,\dots ,c_n \rangle \end{aligned}$$
(2)

where each \(c_i\) is the value of a contextual dimension \(d_i\) characterizing the problem at hand. The set of all possible contexts with dimensions \(\delta \) will be denoted as \(\mathcal {C}\).

In our specific scenario regarding the touristic domain, we consider as relevant the contextual dimensions of the tuple:

$$ \delta = \langle ts, doy, dow, hol, pres, wind, rain, temp, hum \rangle $$

where ts is a predefined timeslot inside the day, doy is the day of the year, dow is the day of the week, hol is a boolean value representing whether the visit is performed on a public holiday and/or during a weekend, pres is the atmospheric pressure, wind is the wind speed, rain is the amount of precipitation, temp is the temperature, and hum is the percentage of humidity. Going back to the visit example introduced previously, the context \(\gamma \) for this visit is \(\langle \text {Morning}, 348, \text {Sunday}, true, 1021.0, 2.03, 1.54, 9.18, 80.68\rangle \), where “Morning” denotes the first timeslot in the day (we initially consider three timeslots of 4 hours each).
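A possible Python representation of the contextual tuple, mirroring the dimensions in \(\delta \), is sketched below; the class and field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Context gamma over delta = <ts, doy, dow, hol, pres, wind, rain, temp, hum>."""
    ts: str        # timeslot within the day (e.g. "Morning")
    doy: int       # day of the year
    dow: str       # day of the week
    hol: bool      # public holiday and/or weekend
    pres: float    # atmospheric pressure
    wind: float    # wind speed
    rain: float    # amount of precipitation
    temp: float    # temperature
    hum: float     # humidity percentage

gamma = Context("Morning", 348, "Sunday", True, 1021.0, 2.03, 1.54, 9.18, 80.68)
```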

The choice of these dimensions has been motivated by an extensive analysis performed on the available data and reported in Section 4. Clearly, the specific set of dimensions could depend on the data and problem at hand, but the provided definitions allow us to easily accommodate any scenario.

The notion of context will be used to characterize both the visits performed by users and the preferences of users with respect to specific PoIs. A context-aware tourist visit essentially combines both the information regarding the visit with the context in which that visit has been performed.

Definition 3

(Context-aware touristic visit) Let \(v = \langle u, p, t, g \rangle \) be a visit performed by a user u in a specific context \(\gamma = \langle c_1,\dots ,c_n \rangle \), where \(\forall i\in \{1,\dots ,n\}\) \(c_i\) is the actual value of the contextual dimension \(d_i\); then the corresponding context-aware touristic visit is defined as:

$$\begin{aligned} cv = \langle u, p, t, g, c_1, \dots , c_n \rangle \end{aligned}$$
(3)

where the tuple v representing the touristic visit is enriched with the contextual values in \(\gamma \).

Our recommendation system also assumes that user preferences are affected by the context, leading to the notion of contextual preference.

Definition 4

(Contextual Preference) A contextual preference is a function: \(\pi : \mathcal {U} \times \mathcal {L} \times \mathcal {P} \times \mathcal {C} \rightarrow \mathbb {R}\) that returns a real value representing the preference of a user \(u \in \mathcal {U}\) located in a position \(l \in \mathcal {L}\) for a PoI \(p \in \mathcal {P}\) in the context \(\gamma \in \mathcal {C}\).

A preference estimator is a system that, once trained with historical data about context-aware touristic visits, is able to produce an estimate of the preference of a user u for a PoI p when u is located at l during context \(\gamma \). From the definition it is clear that the preference of a user for a given PoI p can change based on both their current location, due to the travel time or effort needed to reach the next PoI, and the context, since, as we will see in the following section, some attractions can be preferred during some periods of the year (e.g. summer vs winter) or under specific weather conditions (e.g. rainy or sunny days). Several different examples are discussed in the following section.

In this paper, we consider the case of anonymous users, namely users for which we do not know the exact identity or previous tastes. Therefore, preferences can be influenced only by the context in which the visit is performed and the current location of the user. This is a typical situation in touristic scenarios, where recommendations have to be provided to occasional users who rarely return to the same location in a brief interval of time. This definition of contextual preference essentially uses a collaborative filtering technique (Schafer et al., 2007), in which the notion of similarity between users is given by the fact that they want to perform a visit starting from the same location and in the same temporal and weather conditions. However, if more information about the specific users is available, it can be straightforwardly included in the previous definition, using a hybrid approach to define the preference notion.

The other essential ingredient for the development of ARTEMIS is a crowding forecaster. A crowding forecaster is a system that, once trained with historical data about context-aware tourist visits, is able to produce an estimation of the level of occupation for a PoI \(p \in \mathcal {P}\) in a context \(\gamma \in \mathcal {C}\).

Definition 5

(Contextual PoI occupancy) Given a PoI \(p \in \mathcal {P}\) and a context \(\gamma \in \mathcal {C}\), the occupation of p in the context \(\gamma \) is given by a function:

$$\begin{aligned} \sigma : \mathcal {P} \times \mathcal {C} \rightarrow \mathbb {N} \cup \{0\} \end{aligned}$$
(4)

which returns the number of tourists present in p in the context \(\gamma \).

In other words, a crowding forecaster, once trained with historical data about context-aware touristic visits, estimates the function \(\sigma \) for a PoI p in a context \(\gamma \).

A context-aware recommendation system enriched with forecasting capabilities is a recommendation system able to produce useful recommendations by considering the role of context in both determining users’ preferences and also the expected level of crowding in the considered set of PoIs during the visit.

Definition 6

(Context-aware Recommendation) A context-aware recommendation is a function \(\rho : \mathcal {U} \times \mathcal {L} \times \mathcal {C} \rightarrow \mathcal {P}^{n}\) that, given a user \(u \in \mathcal {U}\) located at \(l \in \mathcal {L}\) in a context \(\gamma \in \mathcal {C}\), returns a sequence of n PoIs \(\langle p_{1}, \dots , p_{n} \rangle \) sorted in descending order w.r.t. the contextual preferences of u and the expected level of crowding of each \(p_i\) in the given context \(\gamma \). More specifically, the value associated with each PoI \(p \in \mathcal {P}\) is computed by a function \(\omega : \mathcal {U} \times \mathcal {L} \times \mathcal {P} \times \mathcal {C} \rightarrow \mathbb {R}\):

$$\begin{aligned} \omega (u, l, p, \gamma ) = \pi (u,l,p,\gamma )\cdot (1- \sigma (p,\gamma )) \end{aligned}$$
(5)

The size of the sequence \(n \le |\mathcal {P}|\) can be chosen based on the system requirements.

From Eq. 5 it is clear that the value associated with each PoI p in a given context \(\gamma \) decreases as the level of crowding of p in \(\gamma \), here expressed as a value between 0 and 1 (e.g. by normalizing \(\sigma \) w.r.t. the PoI capacity), increases, whereas it increases as the preference of the user u for that PoI increases. In Section 7 we will see how the two components, namely the contextual level of crowding and the contextual preference, can influence the final set of recommendations provided to users. Indeed, in some cases, even if the preference of the user is high for a given PoI p, the system can provide as the first suggestion another PoI \(p'\) which has a lower preference but is less crowded in the considered context. It may seem that in extreme cases the most famous PoIs, which can be subject to more overcrowded situations, may never be suggested to users. However, the level of crowding is just used to weight the preference, and the formula can be adjusted appropriately based on domain-specific considerations about popular PoIs. In this paper, we focus on preventing situations in which users with limited time for a trip waste most of it in a queue, waiting to access an attraction. At the same time, our crowding forecaster provides a detailed prediction and is able to identify other time slots on the same day, or other days in the same period, when a particular attraction can be visited without problems.
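To make the trade-off concrete, the following tiny example (with made-up preference and crowding values) shows how a less crowded PoI can overtake a more preferred but congested one under Eq. 5.

```python
# Hypothetical values: PoI A is highly preferred but very crowded,
# PoI B is less preferred but almost empty.
omega_A = 0.9 * (1 - 0.8)   # = 0.18
omega_B = 0.6 * (1 - 0.3)   # = 0.42
# PoI B is ranked first, even though the user's preference for A is higher.
```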

4 The Role of Context in PoI Crowding and User’s Preferences

In this paper, we use a real-world dataset containing the visits performed by tourists to PoIs located in Verona, a city in Northern Italy. This dataset contains about 2.1 million records spanning 6 years (i.e., from 2014 to 2019) and regarding about 500,000 tourists and 9 different PoIs. Each of these records reports an anonymised user identifier, the PoI identifier and category, and the timestamp of the visit. The information contained in the dataset has been enriched with some contextual information regarding the weather conditions and some semantic temporal information, like the presence of holidays. In general, such features could be extended in order to include the presence of touristic events in the analyzed area or spatial information (e.g. PoIs in immediate proximity). Table 3 provides an overview of the dataset statistics.

Table 3 Summary of the PoIs information contained in the considered dataset

On the available dataset, a preliminary set of analyses has been performed to (a) confirm the role of the context in influencing both the crowding level of the considered PoIs and the behaviour or preference of tourists relative to the next PoI to visit, (b) correctly identify the contextual dimensions to consider in the experimental section, and (c) better understand the obtained results. The following two sub-sections illustrate some results of this analysis.

4.1 Context and PoI Crowding

This section reports the analysis performed on the considered real-world dataset to identify which contextual dimensions most influence PoI crowding.

Dependency on the day of the week – PoI occupancy generally has a strong dependency on the day of the week; in particular, visits tend to concentrate around the weekend and to be very low in the middle of the week. However, if we enrich the information regarding the temporal context, for instance by distinguishing some anniversaries or holiday periods, we are able to better capture the behaviour of tourists and forecast the number of visits. In particular, for the attraction known as Juliet’s house, also known as the House of Lovers, Valentine’s day is a very important anniversary. Indeed, we observed that even when this anniversary falls on a Wednesday, which is usually a very quiet day, the actual number of visits during this day triples and matches the number of visits on the busiest days.

Dependency on external events – The analysis of the data also reveals that some events can have a great influence on the number of visits: they can change tourist behaviour depending on the period in which they happen. For instance, the Easter holidays are an annual event whose date varies, falling mainly in March or April. In our dataset, we can notice that March is usually a month with few visits compared to others, but when the Easter holidays fall in that month, the number of visits drastically increases, whereas when Easter is in April the peak moves to that month.

Dependency on the time slot – As discussed in the introduction, the goal of crowding forecasting could also be to drive the choices of tourists, for instance by suggesting a different time slot for their visit. Indeed, even on the busiest dates, the number of tourists present in a given PoI is not constant. We have observed different trends in the distribution of the number of visits: for some PoIs the majority of them are concentrated in the morning, while for others in the afternoon. In this case, a smart recommendation system can suggest performing the visit at another time, anticipating or postponing it, or swapping the order of two consecutive PoI visits.

Dependency on weather conditions – Finally, another important contextual feature that can influence the number of visits, independently of the considered day, is represented by the weather conditions. Figure 2 illustrates how, on the same day of the year, the number of visits can depend on whether it is raining or not. We can notice that some PoIs, like 42 and 202, are greatly influenced by this condition because they are outdoor attractions and in the presence of rain there are no visits. Other PoIs, like 49, are influenced to a lesser extent, and finally, other attractions (like 52 and 58) can even benefit from adverse weather conditions, for instance because they are indoor attractions where tourists can spend some hours on a rainy day.

Fig. 2 Number of visits on the same day of the year in the presence of different weather conditions: the blue line refers to a rainy day, the orange line to a sunny day

Fig. 3 Number of transitions from PoIs 49, 61, and 59 to all the other PoIs

4.2 Context and User’s Preferences

Similarly to what has been done for the crowding level, this section analyses the influence of context in determining users’ preferences with reference to the considered real-world dataset.

Dependency on the current user’s location – In the dataset at our disposal, we do not have information about the position l of a user u before visiting a specific PoI p, but we know the location of the PoI \(p'\) that user u visited immediately before p. Therefore, we consider the location of \(p'\) as the current position l of u and we analyse all possible pairs \((p',p)\) to check the distribution of \(p \in \mathcal {P}\). In other words, we check whether, given a current location l, there is a set of preferential PoIs chosen by users located in l. Figure 3 reports the frequency of target PoIs taking the locations of PoIs 49, 61, and 59 as source positions. As you can notice, starting from each of them there is on average a preferred target PoI: for PoI 49 it is clearly PoI 61, for PoI 61 it is PoI 59, while for PoI 59 there are two PoIs with a high preference, namely 54 and 61.
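The transition counts behind Fig. 3 and Table 4 can be derived from the visit log with a few lines of pandas; the file and column names below (veronacard_visits.csv, user, poi, timestamp) are hypothetical.

```python
import pandas as pd

# One row per visit; column names are illustrative.
visits = pd.read_csv("veronacard_visits.csv", parse_dates=["timestamp"])

# Order each user's visits in time and pair every PoI with the previously visited one.
visits = visits.sort_values(["user", "timestamp"])
visits["prev_poi"] = visits.groupby("user")["poi"].shift(1)

# Count transitions (p', p): how often PoI `poi` is chosen right after `prev_poi`.
transitions = (visits.dropna(subset=["prev_poi"])
                     .groupby(["prev_poi", "poi"])
                     .size()
                     .unstack(fill_value=0))
print(transitions)
```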

Table 4 Statistics about the number of pairs \((p_1, p_2) \in \mathcal {P} \times \mathcal {P}\) where \(p_1\) is the source PoI and \(p_2\) is the target one

Table 4 reports, for each source PoI, the number of transitions found in the data towards each of the other PoIs. As you can notice, for each source there are one or at most two preferred targets, whose corresponding cells have been highlighted with a gray background.

Dependency on the day of the week – Given the results obtained in the previous paragraph about the average dependency on the current user’s location, we analyse whether there is also a dependency on the current day of the week. In other words, we have verified that the preference for a transition from a PoI \(p'\) to another PoI p is not constant, but changes when different days of the week are considered. For instance, taking as source location the position of PoI 59, from Table 4 it appears that on average in 15% of cases tourists choose PoI 58 as target, while on average in 24.6% of cases they choose PoI 61. However, when we analyse the detailed behaviour on each day of the week, we see that on Mondays tourists choose PoI 58 after PoI 59 in only 2% of cases, clearly below the average value, while on all the other days the percentage is around the average.

Dependency on the time slot – As regards the temporal contextual information, besides the day of the week, the time slot inside the day also plays a role in the choice of the next PoI to visit. This can depend on several factors, for instance the proximity to restaurants or accommodation facilities, whether the visit is performed near lunch or dinner time, or the opening hours of the attractions.

Dependency on weather conditions – Finally, we consider the influence of weather conditions on the decision about the next PoI to visit. Figure 4 illustrates the choices performed starting from PoI 59 and PoI 202 on May 30th, 2016 and 2017. More specifically, May 30th, 2017 was a sunny day, while the same day of the previous year was a rainy day. As you can notice, starting from PoI 59 the main preference was towards PoI 54 in 2017, while the most chosen PoI in 2016 was 61. Similarly, starting from PoI 202, on the sunny day the most preferred target was PoI 61, while on the rainy day it was PoI 71.

Fig. 4 Number of transitions from PoIs 59 and 202 to other PoIs on 30/05/2017, a sunny day, and on 30/05/2016, a rainy one

5 Forecasting PoI Occupation with a ML/DL Model

In this section, we discuss how the crowding forecaster can be modelled with a machine learning (ML) or a deep learning (DL) approach. In general, the problem can be formulated as a regression problem where the value to be estimated is the expected crowding level of a PoI p in a given context \(\gamma \). More specifically, we start by evaluating and comparing the forecast capabilities of a simple Random Forest (RF) and a Dense Neural Network (DNN) when trained with only raw historical records about the performed visits (see Def. 1), obtaining a simple PoI occupancy estimation. Then, given the obtained results, we try to improve them by training the same models while also including the contextual information presented in Def. 2, showing how the presence of context can increase the forecast capabilities. Finally, we consider the use of a Recurrent Neural Network (RNN), in the form of a Long Short-Term Memory (LSTM), to evaluate not only the role of the current context, but also the role of the previous contextual PoI occupancies in defining the crowding trend.

Several configurations of the RF, DNN and RNN models have been tested in order to determine the best one and study their behaviour in the presence of PoIs with different characteristics. For each PoI we try to estimate the level of occupancy at different moments of the day by considering three different time slots (i.e., morning, noon, and evening); this minimum granularity is further refined for the RNN model, given the good results obtained. As regards the level of occupancy, we start with the forecast of aggregate percentages, in place of the exact number of tourists, and then we show how this aggregation level can also be refined up to forecasting the precise number of tourists.
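As an illustration, the regression target can be built from the visit log by counting visitors per PoI, day, and timeslot; the sketch below assumes a pandas DataFrame with hypothetical column names (poi, timestamp) and illustrative slot boundaries.

```python
import pandas as pd

# One row per visit; column names and slot boundaries are illustrative.
visits = pd.read_csv("veronacard_visits.csv", parse_dates=["timestamp"])

def timeslot(ts):
    # Three coarse slots per "touristic" day; the actual boundaries are an assumption.
    if ts.hour < 12:
        return "morning"
    elif ts.hour < 16:
        return "noon"
    return "evening"

visits["date"] = visits["timestamp"].dt.date
visits["slot"] = visits["timestamp"].apply(timeslot)

# Regression target: number of visitors per PoI, day and timeslot
# (these counts can then be bucketed into aggregate percentage levels).
occupancy = (visits.groupby(["poi", "date", "slot"])
                   .size()
                   .rename("n_visits")
                   .reset_index())
```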

The quality of the evaluated models will be measured in terms of MAPE (Mean Absolute Percentage Error):

$$\begin{aligned} \text {MAPE} = \dfrac{1}{n} \cdot \sum _{i=1}^{n} \dfrac{\mid \hat{y}_i - y_i\mid }{y_{i}} \end{aligned}$$

where \(y_{i}\) is the actual value and \(\hat{y}_{i}\) is the forecasted value. However, since the actual value could be zero when we try to estimate the exact number of tourists, we also adopt another accuracy metric, the WMAPE (Weighted Mean Absolute Percentage Error):

$$\begin{aligned} \text {WMAPE} = \dfrac{\sum _{i=1}^{n} \mid \hat{y}_i - y_i\mid }{\sum _{i=1}^{n} |y_i|} \end{aligned}$$

The implementation of the models has been done in Python using the Tensorflow (Abadi et al., 2015) and Keras (Gulli and Pal, 2017) libraries.
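The two error metrics above translate directly into a few lines of NumPy; this is a minimal transcription of the formulas, not the authors' evaluation code.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (undefined when some y_true is zero)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / y_true)

def wmape(y_true, y_pred):
    """Weighted MAPE: robust to zero values in y_true."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_pred - y_true)) / np.sum(np.abs(y_true))
```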

Table 5 Results obtained by applying a RF model on raw data

5.1 Forecasting with Only Raw Data

As regards the estimation of the number of tourists in each PoI with only the raw historical records described in Def. 1, we initially use a RF model and then a DNN model, obtaining the results reported in Tables 5 and 6, respectively. For the RF we tried different configurations characterized by different numbers of trees. As you can notice, the error obtained for the various PoIs is quite different and it increases as the amount of available training data decreases: for the default forest with 100 trees, the MAPE spans from about 31.4% in the best case to 50.8% in the worst one, which is associated with PoI 202, namely the one with the smallest amount of historical records. Conversely, the error of the model trained and tested with all the PoIs together (i.e., raw ALL) is about 38% and is not substantially affected by the number of trees in the forest.
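A minimal sketch of such an experiment is shown below using scikit-learn's RandomForestRegressor; the paper does not specify the RF implementation, and the feature matrix here is random dummy data standing in for the encoded (raw) visit features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Dummy stand-ins: X = one row per (PoI, day, timeslot) with encoded raw features,
# y = observed occupancy level for that row.
rng = np.random.default_rng(42)
X = rng.random((1000, 6))
y = rng.integers(1, 400, size=1000).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)  # default forest: 100 trees
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print("MAPE:", np.mean(np.abs(pred - y_test) / y_test))
```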

Table 6 Results obtained by applying a DNN model on the raw data
Fig. 5 Architecture of the DNN model for crowding forecasting: an input layer, a dense layer with n nodes and ReLU activation function, a dropout layer with rate DP, another dense layer with n nodes and ReLU activation function, followed by another dropout layer with the same rate DP, and finally an output layer with one node

In the case of the DNN model we tried different values for the following hyperparameters that control the architecture or topology of the network: the number of nodes, the number of epochs, and the dropout rate. The last two parameters are used to approximate the best solution without falling into overfitting. The architecture of the DNN is illustrated in Fig. 5: it includes an input layer, a dense layer with n nodes and ReLU activation function, a dropout layer with rate DP, another dense layer with n nodes and ReLU activation function, followed by another dropout layer with the same rate DP, and finally an output layer with one node and a linear activation function. Notice that in Tensorflow the Dropout layer randomly sets, at each step during training, a fraction DP of the input units to zero; in this way a certain percentage of randomly chosen units (i.e. neurons) is ignored during training. The percentage of units dropped out in the considered DNN model is reported in column DP of Table 6, while the first column represents the number of nodes and column EP indicates the number of epochs. For each considered PoI, the corresponding cell reports the MAPE obtained with the current configuration. In the cells we distinguish the following cases: (1) an overfitting (i.e., when the validation error is significantly greater than the training error) is identified by the presence of an “*” symbol before the percentage, (2) a potentially good model is identified by a cell with a gray background, (3) the best found model has a gray background and a bold MAPE value, and finally (4) an unknown fit (i.e., when the validation error is significantly smaller than the training error) is represented by a cell without a background color. If we consider the best DNN models for each PoI or for all PoIs together in Table 6, they provide a smaller error w.r.t. the corresponding RF in Table 5. In this case, the errors span from about 24.7% to 49.4%, while the error for the global network decreases to about 36%. As you can notice, there is not a single best configuration for all PoIs; each one could require a different model.
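The DNN of Fig. 5 can be expressed in Keras as follows; the optimizer and loss are assumptions (the paper does not state them), while n_nodes and dropout_rate correspond to the number-of-nodes and DP hyperparameters of Table 6.

```python
import tensorflow as tf

def build_dnn(n_features, n_nodes=128, dropout_rate=0.2):
    """DNN of Fig. 5: two dense+dropout blocks and a single linear output node."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(n_nodes, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(n_nodes, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    # Optimizer and loss are illustrative choices, not taken from the paper.
    model.compile(optimizer="adam", loss="mean_absolute_percentage_error")
    return model

# Example usage: model = build_dnn(n_features=X_train.shape[1], n_nodes=256, dropout_rate=0.3)
#                model.fit(X_train, y_train, epochs=100, validation_split=0.2)
```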

However, the obtained results and the error improvements provided by the DNN models are not satisfactory. Therefore, we evaluate in the following section the introduction of contextual information together with the raw historical data about visits, in order to improve the forecasting capabilities of both the RF and the DNN models.

Table 7 Results obtained by using a RF on raw and contextual information together

5.2 Forecasting with Contextual Information

In this section we try to improve the results obtained in the previous one by also considering contextual information during training. We continue to compare the behaviour of a RF model with that of a DNN. The obtained results are reported in Tables 7 and 8, respectively.

With the addition of contextual information during training, the error of the RF model decreases with respect to the one reported in Table 5. In this case, the MAPE spans from about 22.7% to 41.3%. In particular, if we consider the global model trained with all the PoIs together, the error decreases from about 38% to 32.6%. Moreover, the RF is able to produce a better prediction also with respect to the best DNN in Table 6 for the ALL case: indeed, in Table 6 the best model for ALL has a MAPE of 35.8%, while in Table 7 it has a MAPE of 32.6%. Finally, if we consider how the accuracy changes with the forest dimension (i.e., the number of trees), we can notice that also in this case decreasing or increasing the number of trees w.r.t. the default one (i.e., 100) does not produce any evident effect.

As regards the DNN model, we vary the network parameters as in the previous case and we report the obtained results in Table 8. In this case, the MAPE error spans from about 20.7% to 37.0%, with the worst case represented by PoI 202, which is the one with the lowest number of historical records. In particular, if we consider the global network trained with all the PoIs together, we obtain a decrease of the error from about 35.8% to 31.3%.

Table 9 compares the best results achieved with the four models: RF and DNN trained with only raw data, and RF and DNN trained with both raw and contextual data. As you can notice, the best accuracy values are obtained with the last model (i.e., DNN trained with raw and contextual data together). The use of a DNN model allows us to register an initial improvement also on raw data: for instance, if we consider PoI 49 (the one with the greatest number of training data points), just switching from a RF to a DNN model decreases the error from 31.4% to 24.7%. However, even without changing the model, but only including contextual information during the training phase, we obtain an important decrease in the error, from 31.4% to 22.7%, confirming the central role of contextual information in crowding forecasting. This behaviour is confirmed also for the models trained with all PoIs together (row ALL), where an initial error rate of 37.9%, obtained with a RF trained only with raw data, decreases to 31.3% when a DNN model is trained with both raw and contextual data. For these four models, we also evaluate the percentage of error on 95%, 99%, and 100% of the test cases, obtaining the results in row ALL*. These values highlight that in the majority of cases the obtained forecasts are even more accurate.

Table 8 Results obtained by applying a DNN model on raw and contextual information together
Table 9 Comparison of the best values obtained by the four models. Row ALL* reports the MAPE values for 95/99/100% of the test data
Fig. 6 Architecture of the RNN model for crowding forecasting: two LSTM hidden layers, each one followed by a dropout layer, and a final output layer with only one node providing the regression value

Table 10 Results obtained by applying a RNN model on contextual data with an aggregated level of occupations and temporal granularity of 4 hours

5.3 Forecasting with Contextual Time Series

The results obtained in the previous section confirm that the use of contextual information and the production of contextual PoI occupancy estimations increase the accuracy with respect to the use of only raw data. However, to further improve the obtained results, we also try a solution based on RNN. The experiments performed in this case can be grouped into three main categories: (a) TS1, the same tests performed with the RF and DNN about the aggregate percentage of occupancy and temporal granularity of 4 hours, (b) TS2, tests about the exact number of tourists and temporal granularity of 4 hours, and (c) TS3, tests about the exact number of tourists and temporal granularity of 1 hour.

The architecture of the considered RNN is illustrated in Fig. 6. It is composed of two hidden LSTM layers, each one followed by a dropout (DP) layer, and finally an output layer producing the final estimate. Several values have been tested for the following hyperparameters: the number of nodes in each LSTM layer, the number of epochs, the dropout value, and the number of time-steps composing each time series.

LSTMs (Hochreiter and Schmidhuber, 1997) present several advantages with respect to traditional RNN models: firstly, they better handle long-term dependencies between input variables; secondly, they are less susceptible to the vanishing gradient problem; and finally, they are very efficient at modeling complex sequential data. The main characteristic of an LSTM is the presence of both a long- and a short-term memory cell, which allows it to automatically discard some information from past time-steps or maintain it when needed, based on the evolution of the input time series.
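In Keras, the stacked LSTM of Fig. 6 can be sketched as follows; as for the DNN, the optimizer and loss are illustrative assumptions, and time_steps is the length of the contextual occupancy time series (3 or 6 in the experiments).

```python
import tensorflow as tf

def build_rnn(time_steps, n_features, n_nodes=256, dropout_rate=0.5):
    """RNN of Fig. 6: two LSTM layers, each followed by dropout, and one output node."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(n_nodes, return_sequences=True),  # feed the full sequence forward
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.LSTM(n_nodes),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1),                               # regression output
    ])
    # Optimizer and loss are illustrative choices, not taken from the paper.
    model.compile(optimizer="adam", loss="mean_absolute_error")
    return model
```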

Table 11 Results obtained by applying a RNN model on contextual data with the exact number of tourists and temporal granularity of 4 hours
Fig. 7 Percentage of cases in which the error in the estimated values is below a given number of tourists, and average error expressed in number of tourists, for TS2

5.3.1 Test TS1: aggregate occupations and temporal granularity of 4 hours

As the first set of experiments with the RNN, we consider the same configuration used in the previous subsection for the RF and DNN. The obtained results are illustrated in Table 10, where the various tested hyperparameters are reported in the first three columns: the number of nodes (first column), the dropout value (column DP), and the number of epochs (column EP). As regards the length of the time series, we set it equal to 3, so that each time series contains the data of an entire “touristic” day (4 hours \(\times \) 3 \(=\) 12 hours). As for the previous cases, for each of the considered configurations the neural network has been trained on each PoI alone and the obtained results are reported in the column labelled with the corresponding PoI identifier, while column ALL reports the results obtained by training the network with all PoIs together. The final row contains the corresponding best value achieved with the DNN in the previous subsection.

From the error values reported in Table 10, we can notice that, with reference to the models trained on a single PoI, the best results obtained are better than the best ones of the DNN model (reported in the last row) in almost all cases. The only exception is represented by PoIs 49 and 52, for which the obtained MAPE is slightly worse (by about one percentage point) with respect to the DNN results. Conversely, if we compare the best result obtained for the ALL configuration, the improvement is very high, about 9%. This great improvement is clearly obtained with the highest number of nodes and epochs, since the network requires high capacity to learn the different behaviours of all the PoIs together. As regards the time required by the learning process, we can observe that for the networks trained on a single PoI the required time is about 4 minutes, while the time for the ALL configuration goes from 1 to 1.5 hours depending on the number of nodes and epochs. Comparing these results with the ones in Table 9, we can conclude that while for the single PoI the improvement is smaller and the required training time is similar, for all PoIs together the improvement is greater but the amount of required time also increases considerably. Finally, also in this case there is no single configuration that is best for all PoIs; at least three of them have to be considered. However, the performance of the ALL network is almost as good as that of the best single-PoI networks.

Table 12 Results obtained by applying a RNN model on contextual data with the exact number of tourists and temporal granularity of 1 hour

5.3.2 Test TS2: exact number of tourists and temporal granularity of 4 hours

Given the good results obtained in the previous section on the use of a RNN for forecasting the aggregated level of occupation with a temporal granularity of 4 hours, we take a step forward by trying to estimate the exact number of people with a granularity of 4 hours. We essentially use the same RNN architecture based on two LSTM hidden layers illustrated in Fig. 6, and we vary the same hyperparameters: the number of epochs, the number of nodes in the hidden layers, and the dropout factor. Moreover, we consider two different lengths for the time series: 3 and 6 steps (column TS). We perform the experiments for three of the most important PoIs, namely the ones with the greatest number of historical contextual visits, and for the ALL configuration. During the experiments, we observe that when the exact number of tourists has to be estimated, we can encounter situations in which the actual value is equal to zero. Therefore, in this case we use the WMAPE accuracy measure illustrated at the beginning of Section 5, instead of the MAPE.

Table 11 reports the obtained results. The first aspect to be noticed is that increasing the length of each time series does not lead to an improvement in the obtained results: indeed, the best values are all obtained with a number of time steps equal to 3. For evaluating the errors obtained in this case, we compute two baselines: the WMAPE of the RNN trained in the previous test set (TS1) and the WMAPE of a DNN trained to estimate the exact number of tourists. As regards the amount of time required for the training, we need about 10-20 minutes for the models trained with only one PoI at a time, and from about 1.5 to 2.5 hours for the models trained with all PoIs together.

Fig. 8 Percentage of cases in which the error in the estimated values is below a given number of tourists, and average error expressed in number of tourists, for TS3

We can observe that in this case, even if the amount of error increases with respect to the best values obtained with the RNN in TS1, the obtained errors are lower than those obtained with the DNN model trained with the same data about the exact number of tourists. This is particularly true for the ALL configuration. Going into the details of the obtained results and considering PoI 49, we can observe that the maximum capacity of this PoI appearing in TS2 is equal to 409, so an absolute error of 10-15 people is reasonable. Therefore, with the aim of better understanding the forecasting capabilities of the network, we have also collected a set of statistics about the average error and the percentage of test cases in which the difference in the number of estimated tourists is below a given value. These results are reported in Fig. 7.

Let us consider for instance the forecast regarding the Arena Amphitheatre (PoI 49), which has a WMAPE of 31.2% in the best case. From the chart in Fig. 7, we can observe that the average forecast error is actually about 19 tourists with a maximum occupation of 470 people. Moreover, the percentage of forecasts with an error of less than 10 people is about 47%, while in 68% of cases the error is less than 20 people. Therefore, considering absolute values, the obtained results are still quite good and useful. The results for Lamberti’s Tower (PoI 59) are even better: the average error is about 14 people on a maximum occupancy of 317, while the number of cases in which the error is less than 10 people is about 60% and those below 20 people are about 80%. Finally, the best results are obtained for the model trained with all PoIs together. In this case, the average error is about 10 people, while in 47% of cases the error is below 5 people, in 68% of cases it is below 10 people, and in 87% of cases it is below 20 people.

5.3.3 Test TS3: exact number of tourists and temporal granularity of 1 hour

Finally, in this last set of experiments, TS3, we use the RNN to estimate the exact number of tourists with a minimum granularity of 1 hour. As done in the previous section, we use the three most representative PoIs as examples together with the ALL configuration, and we consider time series of both length 3 and 6. Notice that since the temporal granularity is finer than in the previous case, the number of records in the input dataset is greatly increased, which consequently increases the amount of time needed for the training. More specifically, the training on a single PoI now takes about 30 minutes, while the training on the ALL configuration requires at least 5 hours.

Table 12 shows the results of the various experiments with their configurations, as well as the WMAPE values obtained with the two considered baselines: the best RNN trained in the previous case (TS2) and the DNN trained for estimating the exact number of tourists with a granularity of one hour. Similarly to the previous set of experiments, even if the errors are greater than the ones obtained with the RNN in TS2, we can see a slight improvement over the DNN on the individual PoIs and a marked improvement on the overall dataset, this time of about 15%. Regarding the effect of using longer time series, again the results are quite similar with both 3 and 6 time steps. Therefore, time series of length 3 still remain preferable, as the time required to complete the training is significantly smaller.

From the analysis of the best configurations reported in Table 12, it also emerges that in this case it is possible to identify a good configuration for all test cases, namely the one with 3 time steps, 256 nodes, 200 epochs and dropout fixed at 0.5. In Fig. 8 the results of this configuration are analyzed in more detail.

In the case of the Arena Amphitheatre (PoI 49) the average forecast error is around 6 people per hour. Moreover, about 62% of the cases have an error under 5 people, 80% of the forecasts have an error under 10 people, and the cases with an error under 15 people are 90% of the forecasts. For Lamberti’s Tower (PoI 59), the average absolute error is even better, around 4 people, with 71% of the forecasts not exceeding an error of 5 people. Finally, the best results are obtained with the training on all PoIs together, as the average error is about 3 people. The percentage of predictions with an error under 5 people is 76%, while the portion of results with an error under 10 people is around 91%.

Table 13 Evaluation of the generalization capabilities for the global networks trained with all PoI data

5.4 Generalization Capabilities

In this section, we evaluate the capability of the networks proposed in the previous subsections to generalize what they have learned to other different situations.

In particular, we consider the four global networks denoted as ALL and we assume we have a recently added PoI for which we have collected only a small amount of historical information (i.e., only 6 months of the last year), and we try to estimate the level of crowding in months never seen before. The tests are performed with reference to test case TS1, namely an aggregated level of occupancy and a temporal granularity of 4 hours.

We report the results obtained for 3 different PoIs in Table 13. If we compare the results in Table 13 with the ones in Table 9, we can observe an increment in the error values. However, in general, such an increment is acceptable and the use of contextual information is able to improve the generalization capabilities of the network. For instance, for PoI 59, the increment of the error is about 8% for the DNN trained with only raw data and about 4% when the DNN is trained with both raw and contextual information. This confirms the assumption that the use of a richer contextualized training dataset not only improves the accuracy of the model but also increases its generalization capability. An interesting case is represented by PoI 61, for which the RF trained with only raw data has an error lower than the one reported in Table 9 for the same PoI (but trained with only the data points of that specific PoI). In this case, the presence of data points related to other PoIs decreases the error of the model, confirming the generalization capabilities of the models.

Finally, regarding the performance of the RNN, if we compare the results in the last column of Table 13 with the ones reported in Table 10 for each single PoI, we can observe that the errors are essentially the same, or even slightly better by about 1%. This suggests that the use of contextual occupancy time series better captures the behaviour of tourists, and that training with all PoIs together helps in this process.

5.5 Discussion

In this section we have explored the capabilities of different ML and DL techniques in producing good estimations of the level of crowding in different PoIs. The obtained results confirm the role of context in producing good estimations and the need to introduce several contextual dimensions during the training process: up to a 10% improvement is obtained with each technique simply by introducing contextual dimensions during training. However, the use of simple ML techniques, like Random Forest, appears not to be enough for this purpose, and the use of at least a Dense Neural Network appears to be necessary. The change from an RF to a DNN model produces a variable percentage of improvement, from 1% to 6%, depending on the specific PoI.

These experiments also reveal that each single PoI can require a different RF or DNN architecture to achieve the best results; this can be explained by the different amounts of training data available for each of them and by the different dependency of their occupancy behaviour on the various contextual dimensions. However, the training of the RF and DNN with all PoIs together yields acceptable results: the obtained errors get worse in some cases, but improve in others. Furthermore, a detailed analysis reveals that by considering only 95% or 99% of the test data, the results improve considerably, confirming that, despite the presence of some isolated outliers, the prediction capabilities of the DNN are good.

With the aim of improving the results obtained with the DNN trained with contextual data, we also experiment with an RNN, in particular a model composed of two LSTM layers, to study the effect of contextual occupancy time series on the forecasting. The comparison between the results obtained with the RNN and the corresponding DNN model reveals the efficacy of the RNN-based approach in almost all cases, with an additional improvement of about 10% for the network trained with all PoIs together. Given such good results, we push the test further by reducing the occupancy aggregation level to a single tourist and the temporal granularity to 1 hour. Clearly, in these cases, the error increases with respect to the more aggregated tests, but the obtained results are still acceptable, in particular when the average behaviour is considered.

Finally, the trained models, and in particular the RNN, seem to be able to generalize what they have learned to different situations. More specifically, they are able to identify similarities between PoI behaviours in order to compensate for the lack of information about recently added attractions, producing estimations with reasonable errors also for them.

As a general consideration, we conclude that the use of context is decisive in correctly forecasting the level of occupation of a PoI. Moreover, the obtained results can be further improved by considering not only the current context, but also the previous ones, in the form of a contextual occupancy time series processed by an RNN.

6 Forecasting User’s Preferences with a ML/DL Model

In Section 4.2 we have shown how the context can influence the preference of users towards the next PoI to visit. In this section, we discuss how a context-aware preference estimator can be modeled with a machine learning or a deep learning approach. We start from the set of records describing historical visits performed by users, as formalized in Def. 1, and we extract a set of pairs \((p_{1},p_{2}) \in \mathcal {P} \times \mathcal {P}\), where \(p_{2}\) is the PoI visited by a user u immediately after another PoI \(p_{1}\), together with the timestamp of that visit. In other words, we model the current position of u as the first PoI \(p_{1}\) in the pair, while the next visited PoI is represented by \(p_{2}\). Given this, we compare the estimation of user preferences obtained by considering only such raw tuples with the one obtained from tuples enriched with the contextual information described in Def. 2. In both cases (raw and contextual) we compare the accuracy of a machine learning model, represented by a Random Forest (RF), with the accuracy of a Deep Neural Network (DNN). The estimation can be modelled as a multi-class classification problem, where the different classes are represented by the different possible target PoIs \(p'\): given a source PoI p and a context \(\gamma \), we want to estimate a class represented by the preferred target PoI \(p'\). However, we are in the presence of an imbalanced classification problem since, as can be noticed in Table 4, for each source PoI the distribution of the target PoIs (i.e., the classes) is very skewed; in the extreme case the network may learn only the most frequent target, discarding contextual information. For this reason, we apply some appropriate correction strategies for dealing with imbalanced classification: for the RF we use specific class weights, while for the DNN we apply oversampling of the minority classes during the training phase. Moreover, the transition to a classification problem also requires changing the evaluation metric; due to the imbalanced nature of the dataset, we choose the F1-score in place of accuracy:

$$\begin{aligned} \textsf {f1} = 2 \cdot \dfrac{\textsf {prec} \cdot \textsf {recall}}{\textsf {prec} + \textsf {recall}} \end{aligned}$$

where

$$\begin{aligned} \textsf {prec} = \dfrac{\text {TP}}{\text {TP}+\text {FP}} ; \textsf {recall} = \dfrac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$

TP is the number of true positives, \(\text {TN}\) is the number of true negatives, \(\text {FP}\) is the number of false positives, and \(\text {FN}\) is the number of false negatives. The function prec measures the precision of the estimation as the portion of positive indications that are actually correct, while recall measures the portion of actual positives that have been correctly classified. Since we are in the presence of a multi-class classification, we adopt a one-vs-rest approach in defining the above metrics. More specifically, the metrics are computed for each target class X, considered as the positive case (denoted as 1), with respect to the union of all the other classes considered as the negative one (denoted as 0). Given the value of these metrics for each target class X, the corresponding overall value is computed by averaging the single ones. Based on this, we define a baseline F1-score that our models need to exceed.
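For reference, a minimal sketch of this one-vs-rest evaluation is shown below, assuming scikit-learn is used and that the overall value is obtained as a macro average over the target-PoI classes (the exact averaging scheme is a plausible reading of the description above, not a detail stated in the text):

    # Sketch of the one-vs-rest F1 computation, assuming a macro average over
    # the target-PoI classes; y_true and y_pred hold target-PoI labels.
    from sklearn.metrics import f1_score, precision_score, recall_score

    def overall_scores(y_true, y_pred):
        prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
        rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
        f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
        return prec, rec, f1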

Definition 7

(F1-score baseline) Let us assume marginal probabilities \(p(y=1) = r\) and \(p(y=0) = (1-r)\), where 1 denotes the positive case (actual class) and 0 denotes the negative case (all the other classes). Due to the applied oversampling technique, this marginal probability r becomes equal to \(1/\# classes = 1/8\). We identify the following two dummy predictors for the computation of the F1-score baseline values:

  • Always true: we assume a predictor that always returns the positive class X. In this case, the precision corresponds to r and the recall becomes equal to 1. Therefore, the first baseline value becomes:

    $$\begin{aligned} \textsf {f}1_{b_1} = 2 r / (r+1) = 0.22 \end{aligned}$$
  • Predict 1 with probability \(q = r\): in this case F1-score becomes equal to r:

    $$\begin{aligned} \textsf {f}1_{b_2} = 2 r^{2} / 2r = r = 0.125 \end{aligned}$$

Note that the F1-score is a utility function that has to be maximized, unlike a cost or error function, such as the WMAPE used in the previous section, which should be minimized.
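For concreteness, the two baseline values of Definition 7 can be reproduced with a few lines of Python (a simple numerical check, not part of the original experiments):

    # Numerical check of the two dummy baselines of Definition 7 with r = 1/8.
    r = 1 / 8
    f1_b1 = 2 * r / (r + 1)       # "always positive" predictor: precision = r, recall = 1
    f1_b2 = 2 * r ** 2 / (2 * r)  # random predictor with q = r: simplifies to r
    print(round(f1_b1, 2), f1_b2)  # 0.22 0.125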

As done in the previous section, for each kind of network, we test several configurations in order to determine the best one in the presence of different contextual information. The implementation of the models has been done in Python by using the TensorFlow (Abadi et al., 2015) and Keras (Gulli and Pal, 2017) libraries.

Table 14 Results obtained by applying a RF model on raw data about the past transition between PoIs
Table 15 Results obtained by applying a DNN model on raw data about past transitions between PoIs

6.1 Preference Estimation with Only Raw Data

Similarly to what we have done for forecasting PoI occupations, we initially use the set of raw tuples as input data for training a RF model and then a DNN model, obtaining the results in Tables 14 and 15, respectively.

As reported in Table 14, we tried three different configurations for the RF, varying the number of trees, but we do not observe an improvement in the F1-score by increasing or decreasing the number of trees w.r.t. the standard configuration (i.e., 100 trees), while the training time increases considerably. Therefore, we keep as a reference the default architecture with 100 trees. The achieved F1-score spans from 0.31 in the best case to 0.14 in the worst case, with a global value of 0.15 for the network trained with all pairs together (see row ALL). The obtained values are all above the lowest baseline value \(\textsf {f}1_{b_2}\), but only a few of them are above the other baseline \(\textsf {f}1_{b_1}\). Moreover, the F1-score obtained for the ALL case is also not satisfactory, so we try to train a DNN model with the same raw input dataset.
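A minimal sketch of such an RF classifier is reported below; the hyperparameters and the use of balanced class weights (one possible realization of the class weights mentioned at the beginning of this section) are assumptions:

    # Sketch of the reference RF configuration (100 trees) with class weights to
    # counteract the skewed distribution of target PoIs; hyperparameters are assumed.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=100,          # default configuration kept as reference
        class_weight="balanced",   # one possible choice of per-class weights
        random_state=42,
    )
    # X_raw contains the raw features (source PoI, timestamp-derived fields),
    # y contains the target-PoI class labels.
    # rf.fit(X_raw, y)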

As regards the DNN model, we use the architecture reported in Fig. 9, which is very similar to the one used for forecasting PoI occupations, except for the last activation function (softmax in place of ReLU) and the loss function. Table 15 summarizes the results obtained by modifying parameters like the number of nodes (first column) and the dropout frequency (column DP). In this case, 200 epochs are enough for obtaining good results, and by increasing them we do not obtain any relevant change in accuracy. Therefore, in contrast to Table 6, in Table 15 we fix the number of epochs to 200 and do not mention them in the table. For each source PoI, the table reports the F1-score value obtained with the corresponding configuration. We distinguish the obtained results with the following convention: an “*” symbol before the percentage value identifies overfitting, white cells contain an unknown fit, while gray cells identify potentially good models. Finally, gray cells with a bold value are the ones representing the best model for the given source PoI.
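A minimal Keras sketch of the architecture of Fig. 9 is shown below; the number of nodes, the dropout rate, and the categorical cross-entropy loss are assumptions based on the ranges explored in Table 15:

    # Sketch of the preference-estimation DNN: two dense layers separated by a
    # dropout layer, with a softmax output over the target-PoI classes.
    import tensorflow as tf

    def build_preference_dnn(n_features, n_classes, nodes=256, dropout=0.5):
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(n_features,)),
            tf.keras.layers.Dense(nodes, activation="relu"),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",  # assumed multi-class loss
                      metrics=["accuracy"])
        return model

    # Training on the (oversampled) pairs with one-hot encoded target classes:
    # model = build_preference_dnn(n_features=X_train.shape[1], n_classes=8)
    # model.fit(X_train, y_train_onehot, epochs=200, validation_split=0.2)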

The use of a DNN significantly improves the F1-score only for the training with all PoIs together with respect to the RF. Indeed, the obtained F1-score is 0.32 instead of 0.15, outperforming both baseline values. However, if we consider the F1-scores obtained by the models trained on each single PoI alone, they are generally worse than the ones obtained with the RF: they span from 0.13 to 0.20, without outperforming the baseline \(\textsf {f}1_{b_1}\). We can conclude that the more complex DNN is able to better capture the transition behaviour when the training is performed with all source PoIs together, while the raw data alone appear not to be enough when the training is performed on each single PoI. In the following section, we introduce contextual information in the training data in order to see its effect on the estimation capabilities.

Fig. 9

Architecture of the DNN model for preference estimation. It includes two dense layers separated by a dropout layer, with a softmax activation function at the end, which returns the membership probability of each target class

Table 16 Results obtained by applying a RF model on contextual data about transition preferences between PoIs

6.2 Preference Estimation with Contextual Information

Given the results obtained in the previous section, we try to improve them by also considering the contextual information specified in Def. 2 together with the raw pairs of Table 4. We start again with a RF model and then we move to a DNN model. For the RF we obtain the results reported in Table 16, while for the DNN we obtain the results in Table 17.

For the RF model, we again check three configurations: with 10 trees, 100 trees, and 1000 trees. Similarly to what has been observed in the case of raw data training, the accuracy is not substantially improved by increasing the number of trees, while the training time increases considerably. Therefore, we consider the default model as the reference one. We can notice that the introduction of contextual information is able to improve the F1-score of essentially all PoIs, which now spans from 0.19 in the worst case to 0.40 in the best one. Moreover, with the exclusion of PoI 49, all obtained values outperform both baselines \(\textsf {f}1_{b_{1}}\) and \(\textsf {f}1_{b_{2}}\). However, the values in row ALL are much lower than the individual ones and are even worse than the ones obtained for the RF trained with only raw data.

Therefore, we try to use a DNN to see if this kind of model is able to capture the role of context in determining the preferred transition from a source PoI \(p_1\) to a target PoI \(p_2\). The obtained results are reported in Table 17, where, as in Table 15, the number of epochs is fixed to 200, while we vary the number of nodes (column Nodes) and the dropout frequency (column DP). In this case, the F1-score spans from 0.18 in the worst case to 0.24 in the best case, with an F1-score of 0.33 for the global network trained with all the PoIs together (column ALL). These results confirm the behaviour observed for the DNN trained with only raw information: the more complex model is able to better capture the global behaviour given by the training of all PoIs together, but performs worse in the case of individual training.

Table 18 compares the best results obtained with the four models: RF and DNN trained with both raw and contextual data about transition preferences between PoIs. We can notice that, for the same model (RF or DNN), the introduction of contextual information allows us to improve the performance obtained for each single source PoI alone. However, if we consider the training with all PoIs together, the RF has poor performance with both raw and contextual training. The best values for the ALL configuration are obtained with the DNN trained with contextual information. An F1-score of 0.33 outperforms not only both baseline values but also the values obtained for each individual PoI in all the considered models (except for source PoI 202 with an RF trained with contextual information). We will take this as the reference model for preference estimation in the following sections.

Table 17 Results obtained by applying a DNN model on contextual data about transition preferences between PoIs

6.3 Discussion

The preference estimator is modelled as a multiclass classification problem where, given the current position and the desired context, the model predicts the preferred next PoI to visit. The experimental results confirm both the role of context in improving the obtained accuracy and, most importantly, the need for a DL model in place of a simple ML model in order to correctly capture the behaviour of tourists. Indeed, the shift from an RF to a DNN alone more than doubles the accuracy of the model trained with all PoIs together. The improvement due to the introduction of context is instead less noticeable. This may be due to the fact that the current position of the user can itself be considered as a contextual dimension, which is inevitably used also in the raw tests.

Finally, differently from the results obtained in the previous section about the level of occupation, in this case we can identify a configuration that performs well in almost all cases, both for the training performed on single PoIs and for the training with all PoIs together.

7 The ARTEMIS Architecture and Design

This section illustrates in detail each component of the ARTEMIS framework reported in Fig. 1.

7.1 Contextual Data Enrichment

The initial operation performed by the ARTEMIS framework is the production of the contextual historical data which will be used to train the two models M1 and M2 in Fig. 1. For both models the raw historical data are derived from the logs of the visits performed by tourists in the past, as mentioned in Section 4. In particular, from these records we compute: (a) the degree of crowding for each PoI in the past at specific timeframes, which is needed for model M1, and (b) the pairs \((p_1,p_2) \in \mathcal {P} \times \mathcal {P}\) representing a transition from PoI \(p_1\) to PoI \(p_2\) during a visit, which is required by model M2.
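A hedged sketch of these two derivations is shown below, assuming the visit logs are available as a pandas DataFrame with columns user_id, poi_id, and timestamp (illustrative names, not the actual schema):

    # Sketch of the two derivations from the raw visit logs: (a) crowding per PoI
    # and timeframe, and (b) consecutive PoI transition pairs per user.
    import pandas as pd

    def crowding_per_timeframe(visits: pd.DataFrame, freq: str = "4h") -> pd.DataFrame:
        # (a) number of visitors per PoI in each timeframe (4-hour granularity assumed)
        return (visits
                .groupby(["poi_id", pd.Grouper(key="timestamp", freq=freq)])
                .size()
                .rename("occupancy")
                .reset_index())

    def transition_pairs(visits: pd.DataFrame) -> pd.DataFrame:
        # (b) pairs (p1, p2) of consecutively visited PoIs within each user's history
        ordered = visits.sort_values(["user_id", "timestamp"]).copy()
        ordered["next_poi"] = ordered.groupby("user_id")["poi_id"].shift(-1)
        pairs = ordered.dropna(subset=["next_poi"])
        return pairs[["user_id", "poi_id", "next_poi", "timestamp"]]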

Given these raw data, it is necessary to identify the sources of information that represent the relevant context for the problem at hand. Based on the analysis performed in Section 4, we consider historical data about past touristic visits and we enrich them by deriving some semantic temporal information from the timestamp and by adding information about the weather condition in each specific visit interval. Weather conditions are extracted through the OpenWeather API at regular and configurable intervals. Clearly, operations \(op_1\) and \(op_3\) can be customized and enriched with other sources of information based on the problem at hand.
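An illustrative enrichment step is sketched below; the weather DataFrame and its columns are assumptions, standing in for the observations periodically retrieved from the OpenWeather API:

    # Sketch of the contextual enrichment: semantic temporal features derived from
    # the timestamp plus a join with periodically collected weather observations.
    import pandas as pd

    def add_context(records: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
        out = records.copy()
        ts = out["timestamp"]
        out["weekday"] = ts.dt.dayofweek
        out["is_weekend"] = ts.dt.dayofweek >= 5
        out["month"] = ts.dt.month
        out["hour_slot"] = ts.dt.hour // 4   # assumed 4-hour temporal granularity
        # align each record with the closest preceding weather observation
        return pd.merge_asof(out.sort_values("timestamp"),
                             weather.sort_values("timestamp"),
                             on="timestamp", direction="backward")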

The integration of the raw historical data and the historical contextual information produces the crowding contextual historical data and the visiting contextual historical data, which are reported in boxes B1 and B3 of Fig. 1 as the input for training models M1 and M2, respectively.

Table 18 Comparison of the best value obtained by the four models
Fig. 10

ARTEMIS web application

7.2 Training of a Crowding Forecasting Model

The crowding forecaster is implemented as an RNN model trained with the contextual historical data produced in the previous section. In Section 5, we tried different machine learning and deep learning models, together with many different configurations. In this section, we assume that the best RNN model identified in Table 9 for case ALL is used in box \(B_1\).
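A minimal Keras sketch of such an RNN is shown below; the two-LSTM-layer structure follows the description in Section 5, while the layer sizes, dropout, and loss are assumed values:

    # Sketch of the crowding forecaster (box B1): an RNN with two LSTM layers fed
    # with short contextual occupancy time series (length 3 assumed).
    import tensorflow as tf

    def build_crowding_rnn(time_steps=3, n_features=16, units=256, dropout=0.5):
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(time_steps, n_features)),
            tf.keras.layers.LSTM(units, return_sequences=True),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.LSTM(units),
            tf.keras.layers.Dense(1, activation="relu"),  # predicted occupancy level
        ])
        model.compile(optimizer="adam", loss="mae")  # assumed regression loss
        return model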

The trained model is then used to forecast the occupation of each PoI in a given context. More specifically, at specific intervals, the future context is retrieved (weather conditions and temporal characterization), and the model is queried in order to produce an estimation of the future level of occupation of each PoI. With reference to box B2 in Fig. 1, given the PoIs and the desired context, operation \(op_2\) is responsible for combining them in order to define a query for the forecast model M1. The execution of model M1 produces a collection of occupation forecasts, one for each PoI in the given context, which is stored in the database T-DB together with other information, like the PoI descriptions and the user preferences. For the PoI occupation, the database T-DB is filled not only with the obtained forecasts, but it is also updated with data about the actual occupancy. This information can be made available to users, but it is also useful for evaluating the accuracy of previous forecasts and for possibly determining the need for a new training of M1, when the estimation becomes inaccurate.
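Schematically, this periodic refresh can be pictured as follows; the query encoding and the T-DB interface are placeholders, not the actual ARTEMIS API:

    # Schematic periodic step for box B2 (placeholder interfaces): op2 combines
    # each PoI with the future context, model M1 is queried, and the forecast is
    # stored in T-DB.
    def refresh_forecasts(m1, pois, future_context, t_db):
        for poi_id in pois:
            query = {"poi_id": poi_id, **future_context}   # op2: PoI + context
            expected_occupancy = m1.forecast(query)        # placeholder M1 call
            t_db.store_forecast(poi_id, future_context, expected_occupancy)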

7.3 Training of a Preference Estimator Model

Similarly to the previous model, the preference estimator is implemented as a DNN model trained with the contextual historical data previously produced. Given the experiments performed in Section 6, we assume that the best model identified in Table 18 for the case ALL is used in box \(B_3\).

The trained model is then used to estimate the preference of a user located in a certain position and in a certain context. More specifically, given the current user location and the current context, a real-time query is built and submitted to model M2 (see box B4). The user preference is then returned to the recommendation system for producing the final recommendation.

7.4 Context-Aware Recommendation

At its core, the ARTEMIS framework combines in real time the information produced by the occupation forecaster with that produced by the preference estimator. The final context-aware recommendation is obtained by applying the equation in Def. 6. In Fig. 1, box B5 contains different kinds of applications, called context-aware recommendation apps, which return the computed set of recommendations as the best n PoIs.
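Purely as a schematic illustration (the actual scoring is the equation in Def. 6, which is not reproduced here), the combination can be pictured as a preference score down-weighted by the expected occupancy; the model interfaces and the penalty are placeholders:

    # Schematic combination of preference estimation (M2) and crowding forecast (M1)
    # to produce the best-n recommendation; the down-weighting is illustrative only.
    def recommend(user_pos, context, pois, m1, m2, n=3):
        scored = []
        for poi_id in pois:
            preference = m2.estimate_preference(user_pos, context, poi_id)  # placeholder
            occupancy = m1.forecast(poi_id, context)  # expected occupancy in [0, 1], placeholder
            scored.append((poi_id, preference * (1.0 - occupancy)))
        scored.sort(key=lambda item: item[1], reverse=True)
        return [poi_id for poi_id, _ in scored[:n]]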

To demonstrate the potential of ARTEMIS, we implement the recommendation application illustrated in Fig. 10. Each authenticated user can see his/her position on the map and can obtain a suggestion for the next PoI to visit, based on his/her contextual preference and the expected level of crowding.

Table 19 PoI occupancy rates for two contexts

The web application illustrated in Fig. 10 allows us to demonstrate the potential of the ARTEMIS framework and, in particular, to compare the suggestions produced by using the contextual preference estimator and the contextual crowding forecaster provided by models M2 and M1 with the static suggestions obtained by considering only raw information during the training phase. More specifically, the general interface of the application consists of a web map where the user’s position is highlighted with a blue marker, while the location of each PoI is denoted with a marker whose color indicates its level of occupation: green means quite empty, yellow means normally occupied, while red denotes an overcrowded situation. At the top of the map, some input fields allow the user to specify the spatial and temporal context of the visit: the user identifier and position, and the desired date for the visit. The user position can be specified either textually or by performing a selection on the map. Finally, the button Suggest computes the suggestions, both without considering contextual information and by using Def. 6. The system returns two lists with the three best PoIs for the user: each row has a color depending on the estimated level of occupation of the PoI on the given date, and it reports the preference of the user as a set of stars, from 1 to 3, as well as the distance of the PoI from the current user position.

As an example, we report in Table 20 the different suggestions produced by ARTEMIS in two different contexts C1 and C2. More specifically, for C1 we choose a sunny day during the weekend, while for C2 we consider a sunny Thursday, which is one of the quieter days in terms of tourist visits. The estimated level of occupation of each PoI in the two contexts is described in Table 19. As can be noticed, in C1 there are some PoIs, like P7 and P13, that are overcrowded, while others, like P2 and P11, have an average occupation. Conversely, in C2 all occupation rates are below 50%.

Table 20 Result of recommendations

Table 20 shows the different raw and contextual recommendations suggested to two users located in two different positions \(l_1\) and \(l_2\) in contexts C1 and C2. Besides the three suggested PoIs, it reports the distance of each PoI from the user position and the user preference for the PoI (3 is the maximum value and 1 is the minimum value). We can notice that in context C1, PoI P7 has a level of occupation close to saturation; therefore, even if its preference is very high, the system suggests P2 in place of P7. Similarly, starting from position \(l_2\), even if P7 has a high estimated preference due to its spatial proximity to the user’s position, it does not appear in the contextual suggestions. Conversely, in context C2, PoIs are generally not crowded and the set of suggestions is the same for both the static and dynamic approaches.

8 Conclusion

The paper proposes a framework called ARTEMIS which combines a context-aware crowding forecaster and a context-aware preference model. Starting from a real-world scenario represented by data collected from 2014 to 2019 about the touristic visits performed in Verona (Italy), we have first exemplified the role of context in influencing both aspects of the recommendations and then tested different machine and deep learning models for estimating them. In ARTEMIS the crowding forecaster, and thus the prevention of overcrowded situations, addresses a requirement that has become essential in recent years for epidemic prevention and control and for social security. This requirement can appear as a limitation for the final user, since the most famous PoIs, which are more prone to overcrowding, might in extreme cases never be suggested. However, in real-world cases, where tourists have a limited time to spend for a trip or visit, the ability to prevent situations where the majority of such time is wasted in a queue, waiting to enter an attraction, can be considered very important. At the same time, our crowding forecaster provides a detailed prediction and is able to identify other time slots on the same day, or other days in the available period, in which a particular attraction can be visited without any problem. For this reason, as a future extension, we plan to plug the crowding forecaster into a system able to produce sequences of recommendations instead of instantaneous ones, and to respond to queries like “When is the best time to visit the Arena Amphitheatre?”.

The proposed models suggest that the use of contextual information can increase the accuracy of both crowding forecasting and preference estimation, although more elaborate models are required to properly capture it. As mentioned in the introduction, the proposed framework can be easily adapted to other application domains, like TV on-demand services, movie recommendations, and other forms of tourist attractions, like amusement parks or restaurants. Moreover, several other contextual dimensions can be added to determine their influence either in the crowding or in the preference inference steps. Finally, a prototypical implementation of a web application based on the ARTEMIS framework has been presented to show a real-world application scenario and lay the basis for future extensions to other domains.

As future work we plan to further study the problem, in particular by considering three important extensions: (i) the suggestion of sequences of PoIs, instead of single ones, (ii) the consideration of groups of tourists, in place of individual ones, and (iii) the analysis of the effect induced by the provided suggestions on the following ones. As regards the first extension, in the tourism domain there is a common need to suggest a complete itinerary instead of only the next PoI; this can provide greater value to users who want to schedule a complete vacation. Moreover, tourists usually plan their activities in a group, where the tastes and needs of all members have to be properly considered and balanced to obtain the final suggestion. Finally, it is important to also consider the possible effects of the provided suggestions on the future behaviours of users and, most importantly, on the future level of crowding, as in a multi-agent system. In this regard, the inclusion of sentiment analysis techniques can help in evaluating the effect of the suggestions and better calibrating them.