1 Introduction

Data-driven decision making is an area of crucial importance in the digital transformation in which many organizations are immersed. These decision-making methodologies allow organizations to be truly market-oriented, enabling them to focus on customers in order to build customer loyalty in a more cost-effective way (Moreno et al., 2019). Of course, the tourism sector is no exception to this situation and is undergoing a major transformation in which this type of management will be predominant (Camilleri, 2020).

The raw material for these data-driven strategies is data, and its correct management, storage and use within an organization is crucial. The aim of this study is to propose a conceptual architecture for a data platform for an organization in the tourism sector that helps companies to better manage and use data in order to implement data-driven strategies. There are studies in the literature that propose a conceptual architecture for market-oriented organizations (Moreno et al., 2019); in our case, the architecture focuses on the types of data and the analysis requirements of organizations in the tourism sector. In the field of tourism itself, we find studies that propose data architectures of different types and scope in terms of the variety of data used (Navarro & Rubio, 2000; Abdulaziz et al., 2015; Bustamante et al., 2020). In our study we extend this scope to the entire spectrum of possible types of data in the tourism sector. Due to the digitization of products and services, the types of data handled in the tourism sector have grown significantly in variety and quantity. A modern architecture such as the one we propose, in line with current standards in data platforms, is required for their management.

2 Conceptual Architecture for a Tourism Organization Data Platform

As mentioned above, the objective of this study is to propose a conceptual architecture for the data platform of an organization in the tourism sector. A good starting point is the architecture proposed by Moreno et al. (2019) for a market-oriented organization, which comprises several layers:

  • Data Sources

  • Data Management for analytics

  • Analytical techniques and Business Intelligence

  • Business insights

In our approach we will start with an initial analysis of the types of data sources currently existing in the tourism sector and, based on these types, the following layers of information management and analysis will be proposed, reaching a final layer in which applications are proposed for the most common analysis needs of the tourism sector. In the following sections, the characteristics of each layer are detailed and, finally, the complete conceptual architecture is proposed.

2.1 Data Sources

Data sources are the raw material for all the analysis that enables data-driven decisions. There is a wide variety of data sources, and the different types should be well identified, as their typology determines how they are stored, managed and analysed. In the tourism sector, a very useful classification is made by Li et al. (2018), in which three large blocks are defined, as can be seen in Fig. 1:

Fig. 1
A classification chart of data sources in the tourism sector, which includes users using social networks and search engines via devices like mobile, GPS, and IoT for operations such as hotel bookings, payments, and others.

Data sources classification in tourism research. Adapted from Li et al. (2018)

  • Operations: data from transactions or operations such as hotel bookings, payments, transport (flights, cruises, rail transport), website visits, etc.

  • Devices: data coming from devices: mainly mobile data, but also IoT (Internet of Things) data from sensors or other devices.

  • Users: data generated by tourists themselves: comments on social networks, online booking platforms, search engines, virtual communities, co-creation of tourist experiences, etc. This type of data is often referred to as User Generated Content (UGC).

There are interesting application cases in the literature for each of these types of data sources. If we focus on operations, we see applications in payment data (Ramos & Murta, 2022) or air flights (Gallego & Font, 2020). In terms of data from devices, we find applications in mobile data (Zaragozi et al., 2021), generally with georeferenced information, as well as data from IoT (Cha et al., 2017). In the third block, concerning UGC-type data, we also find applications of social network data (Gunter et al., 2019) or data from online booking platforms (Liu et al., 2021; Van der Zee & Bertocchi, 2018). In this last block there is a type of information that is worth highlighting, namely data from emerging co-creation models (Mohammadi et al., 2020). These data are collected on travel booking platforms that allow consumers to co-design their own travel experiences. It should be noted that while in the first block (operations) the data are structured (standard format and well-defined structure), in the other two blocks (devices and users) we can find semi-structured or unstructured data, which should be taken into account in their management and storage. The last two blocks are those that have experienced the greatest growth in the last decade and those with the greatest need for real-time management (Ranganathan et al., 2020).

In addition to the three blocks of data mentioned above, we can add data generated by public institutions. Here we find statistical data generated by specialized institutions, or open data provided by local or state administrations or international organizations, for example information on the level of occupancy of a destination, the origin of tourists, their expenditure and others. This type of open data has grown significantly in the last decade and is used in multiple applications (Bratucu & Cismaru, 2015). It is shared information that democratizes access to data for all public and private agents in the sector, as pointed out by Celdran-Bernabeu et al. (2018). In the tourism sector, different tourism intelligence systems have been developed in the last decade (Gajdosik, 2019), some at state administration level and others at local level, which collect and generate information of great interest.
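The classification described in this section can be sketched as a small data model. This is only an illustrative sketch: the source names are hypothetical, and treating public data as not requiring flexible storage is our assumption, since the text only states that device and user data may be semi-structured or unstructured.

```python
from dataclasses import dataclass
from enum import Enum


class Block(Enum):
    OPERATIONS = "operations"   # bookings, payments, transport, website visits
    DEVICES = "devices"         # mobile and IoT sensor data
    USERS = "users"             # user generated content (UGC)
    PUBLIC = "public"           # statistics and open data from institutions


@dataclass(frozen=True)
class DataSource:
    name: str     # hypothetical source name, for illustration only
    block: Block


# Blocks in which semi-structured or unstructured data can be expected.
# PUBLIC is left out as an assumption; the text does not classify it.
FLEXIBLE_BLOCKS = {Block.DEVICES, Block.USERS}


def needs_flexible_storage(source: DataSource) -> bool:
    """True when the source may arrive semi-structured or unstructured."""
    return source.block in FLEXIBLE_BLOCKS


sources = [
    DataSource("hotel_bookings", Block.OPERATIONS),
    DataSource("mobile_gps_traces", Block.DEVICES),
    DataSource("booking_platform_reviews", Block.USERS),
    DataSource("destination_occupancy_statistics", Block.PUBLIC),
]
flexible = [s.name for s in sources if needs_flexible_storage(s)]
```

Tagging each source with its block at registration time makes the later storage and ingestion decisions explicit rather than implicit in each pipeline.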

2.2 Data Management for Analytics

These different types of data sources must be stored and managed in order to apply analytical techniques and obtain insights. For this data management and storage layer we propose a Data Lakehouse architecture. Introduced by Armbrust et al. (2021), this architecture combines the transactions and data governance of enterprise data warehouses with the flexibility and cost-efficiency of data lakes to enable business intelligence and machine learning. It is currently being adopted by many companies, and we believe that the flexibility it provides is appropriate for the diversity of data sources identified in the previous section. A Data Lakehouse is the natural evolution of the Data Warehouse and the Data Lake (Harby & Zulkernine, 2022). Data Warehouses have been widely used in the tourism sector (Navarro & Rubio, 2000; Abdulaziz et al., 2015) and, more recently, so have Data Lakes (Sankaranarayanan & Lalchandani, 2017; Raju et al., 2018). One of the characteristics of a Data Lakehouse architecture is scalability, which is very useful given the significant growth rate of UGC or IoT sources in the tourism sector. In addition, the flexibility of this type of architecture makes it suitable for storing the structured, semi-structured and unstructured data sources identified in the previous section.

The data sources managed in the Data Lakehouse will pass through different storage areas. These areas are differentiated by the degree of elaboration of the data: there will be areas with raw information and areas with highly elaborated information, which will facilitate different types of data analysis. The processes that ingest the information into the Data Lakehouse from the original sources, and that carry out the treatment of the different data areas, are the ETL (Extract, Transform and Load) processes. These processes must support both batch ingestion with a defined periodicity (daily, weekly, monthly) and ingestion closer to real-time, depending on the type of source. For example, UGC data are generated at a very high speed and will require ingestion close to real-time, whereas open data published by an organization will have a specific publication frequency, for example monthly, and will be ingested in a monthly batch process.

In terms of infrastructure, this type of architecture can be implemented on the company's own servers or in a cloud infrastructure. We consider a cloud infrastructure to be appropriate in our case, as it allows companies to adapt better to market changes, and therefore to changes in the data to be managed, as well as to improve cost efficiency.
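The ingestion rules and storage areas described above can be sketched as follows. This is a minimal sketch under stated assumptions: the policy names, the `is_due` rule and the record fields (`booking_id`, `country`) are hypothetical and only illustrate the distinction between batch and near real-time sources and the step from a raw area to a more elaborated one.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Mode(Enum):
    BATCH = "batch"
    NEAR_REAL_TIME = "near_real_time"


class Period(Enum):
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"


@dataclass(frozen=True)
class IngestionPolicy:
    source: str                      # hypothetical source name
    mode: Mode
    period: Optional[Period] = None  # only meaningful for batch sources


def is_due(policy: IngestionPolicy, last_run: date, today: date) -> bool:
    """Return True when the source should be ingested again."""
    if policy.mode is Mode.NEAR_REAL_TIME:
        return True  # ingested continuously, so always eligible
    if policy.period is None:
        raise ValueError("batch policies need a period")
    if today <= last_run:
        return False
    if policy.period is Period.DAILY:
        return True
    if policy.period is Period.WEEKLY:
        return (today - last_run).days >= 7
    # MONTHLY: run once a new calendar month has started.
    return (today.year, today.month) != (last_run.year, last_run.month)


def to_curated(raw_records: list[dict]) -> list[dict]:
    """Promote raw records to a more elaborated area: keep only complete
    rows and normalise the country code (illustrative fields)."""
    curated = []
    for rec in raw_records:
        if rec.get("booking_id") is None or rec.get("country") is None:
            continue
        curated.append({"booking_id": rec["booking_id"],
                        "country": rec["country"].strip().upper()})
    return curated


ugc = IngestionPolicy("social_media_comments", Mode.NEAR_REAL_TIME)
open_data = IngestionPolicy("open_data_occupancy", Mode.BATCH, Period.MONTHLY)
```

Keeping the raw records untouched and deriving the curated area from them means the transformation rules can be changed and replayed without asking the original source for the data again.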

Finally, it should be noted that the storage and management of data in an organization must follow the rules, policies and processes defined at the Data Governance level. A Data Lakehouse architecture will facilitate Data Governance tasks. Moreover, these governance processes will facilitate the cataloguing of data and its sharing with third parties where necessary, using a semantic model that is as close as possible to the standards used in the tourism sector. This feature may be relevant if a company wants to integrate into the Gaia-X digital ecosystem (Gaia-X; Braud et al., 2021). This European initiative proposes an open and secure data infrastructure that complies with the highest standards of digital sovereignty while promoting innovation, and it can be of enormous interest to a company in the tourism sector. Thus, we consider that a management and storage architecture such as the one proposed allows a company to be prepared to integrate into the Gaia-X ecosystem in the future. In terms of good practices, data standards and interoperability, consideration should also be given to the Tourism Data Space project (Tourism Data Spaces), which proposes a data marketplace for sharing and accessing data at European level. Similarly, the European Data Spaces for Tourism project (DATES) focuses on the development of governance and business models, while providing a shared roadmap that will ensure the coordination of the tourism ecosystem stakeholders. Finally, another interesting reference to consider is the EU guide on data for tourist destinations (Smart Tourism Destination).
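As an illustration of how cataloguing supports governed sharing, a catalogue entry can be sketched as a simple record. All field names, dataset names and the sharing rule below are hypothetical assumptions of ours; they do not reproduce the Gaia-X specifications or any particular catalogue product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    dataset: str                  # hypothetical dataset name
    owner: str                    # accountable team or role
    contains_personal_data: bool
    shareable: bool               # approved for sharing with third parties


def shareable_datasets(catalog: list[CatalogEntry]) -> list[str]:
    """Names of datasets cleared for sharing: flagged as shareable and
    free of personal data (an illustrative rule, not a legal one)."""
    return sorted(e.dataset for e in catalog
                  if e.shareable and not e.contains_personal_data)


catalog = [
    CatalogEntry("monthly_occupancy", "revenue_team", False, True),
    CatalogEntry("guest_reviews_raw", "marketing_team", True, True),
    CatalogEntry("pricing_rules", "revenue_team", False, False),
]
cleared = shareable_datasets(catalog)
```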

2.3 Analytical Techniques and Business Intelligence

The Data Lakehouse architecture outlined in the previous section allows the use of different types of analytical techniques, from Business Intelligence (BI) to Machine Learning. Each technique will use the most appropriate data areas of the Data Lakehouse, depending on whether it requires raw or more elaborated information. Within the wide range of data analysis techniques that can be applied in the tourism sector, the following are the most commonly used. There are multiple BI use cases in the tourism sector, such as the BI architecture proposed by Bustamante et al. (2020), which integrates information from four collaborative sources (Twitter, OpenStreetMap, Tripadvisor and Airbnb). It is an example of an architecture that focuses only on BI and on certain sources, but is otherwise similar to the one proposed in this article. Complementary to classic BI are Data Discovery and self-service BI techniques, which give greater freedom when exploring the data. When tourist behaviour is analysed, another widely used analytical technique is clustering (Rodríguez et al., 2018). As mentioned in Sect. 2.1, an important block of data comes from mobile devices, generally with geo-referenced information that allows the application of geospatial analytics techniques (Yang et al., 2012). UGC-type data are also becoming increasingly important, and text analysis techniques such as Natural Language Processing are applied to them (Guerrero-Rodríguez et al., 2023). More advanced analytical techniques such as Machine Learning (Peng et al., 2020) or Deep Learning (Essien & Chukwukelu, 2022) are increasingly used in the tourism sector. Among Machine Learning techniques, one of the most widely used in the tourism sector is recommender systems (Esmaeili et al., 2020). Finally, it is worth mentioning the recent applications of Generative AI techniques to the tourism sector, in particular ChatGPT (Carvalho & Ivanov, 2023).
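To make the clustering of tourist behaviour mentioned above more concrete, the following is a minimal k-means sketch in plain Python. The data (nights stayed, daily spend) are invented for illustration, the deterministic initialisation is a simplification of ours, and this is not the method used in the cited studies.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic initialisation (first k points)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                new_centroids.append(centroids[c])  # leave empty clusters in place
                continue
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)))
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical tourists described by (nights stayed, daily spend in EUR).
# In practice the features would be scaled first, since spend dominates here.
tourists = [(2, 60.0), (3, 70.0), (2, 65.0),
            (10, 200.0), (12, 220.0), (11, 210.0)]
labels, centroids = kmeans(tourists, k=2)
```

On this toy data the two groups correspond to short, low-spend stays and long, high-spend stays; a production setting would typically rely on an established library and a curated data area of the Data Lakehouse as input.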

2.4 Business Insights

The final objective of the entire data cycle described in the previous sections is to obtain relevant insights for the different use cases of the tourism sector. The knowledge obtained has multiple uses, of which we highlight the most frequent. Lv et al. (2021) differentiate two main levels of business insights: the individual level (consumer behaviour and attitude) and the organizational level (marketing management and performance analysis of tourism organizations). Using this division, we first find that there are many studies that put tourists at the center and analyse their behaviour (Miah et al., 2017), their perception (Nave et al., 2018) and their satisfaction (Li et al., 2020). Similarly, with regard to tourism supply, other areas of research are the personalization or recommendation of products and services (Esmaeili et al., 2020) and the co-creation of experiences (Mohammadi et al., 2020). All these use cases try to cover the different phases of the travel lifecycle (before, during and after) and mostly use UGC-type data. At the organizational level, we find use cases in tourism destination management such as demand forecasting (Li & Jiao, 2020), planning and development, value proposition, resource management, sustainability management (De Marchi et al., 2022) or reputation analysis (Cillo et al., 2019), as well as multiple use cases in tourism companies such as marketing management or performance analysis (Bi et al., 2018) and the pricing of products and services (Sánchez-Lozano et al., 2021).

2.5 Conceptual Architecture of the Data Platform

Once the different layers of the data platform have been defined, Fig. 2 shows the complete conceptual architecture:

Fig. 2
A conceptual model comprises data architecture and governance, encompassing various data source types like operational and management data. Analytics processes involve the data lakehouse method and advanced models such as BI and clustering. The model ultimately delivers business insights.

Conceptual architecture for a tourism organization data platform

As a practical example with a similar scope to the proposed data platform, we have the case of the Destination Data Platform within the smart tourism ecosystem of the city of Gothenburg (Jansson et al., 2022).

3 Conclusions

In this study we have proposed a conceptual architecture for the data platform of a data-driven tourism organization. We believe that this architecture can help tourism organizations in their digital transformation and in making data-driven decisions. The proposed architecture is based on modern and flexible architectures and facilitates the management, storage and governance of data, taking into account the variety and growth of data types that currently exist in the tourism sector. In addition, this architecture enables organizations to be prepared to integrate into data ecosystems such as those proposed in the Gaia-X initiative. This will enable both the integration of data from the ecosystem into the organization and the sharing of the organization's own data in the ecosystem in a simple and governed way. Finally, it should be made clear that the proposed architecture is an ambitious one and probably not within the reach of all players in the tourism value chain. Large companies in the hotel or transport sector, or public institutions, may be able to tackle this type of architecture, but it may be beyond the reach of smaller companies, for example in the restaurant and leisure sector. The latter must approach their work with data in a different way. In terms of analytical techniques, these companies should consider Business Intelligence and those advanced analytical techniques that apply to them. In terms of data management, in order to avoid having to build and maintain a costly architecture, these smaller companies should connect to data platforms provided by public institutions, which offer a large amount of information that is already managed, organized and available as open data. Many tourist destinations have tourism intelligence systems or smart destination platforms that generate a lot of useful information for all agents in the tourism value chain, regardless of the size of the company.