Introduction

As part of the Open Data Directive, the European Commission has published a list of high-value datasets (HVDs) that public sector bodies must make available as open data. For each of the HVDs, the list also contains specific data items that must be included in these datasets. At the same time, the European Commission is in the process of creating Common European Data Spaces, domain-specific ecosystems where data producers and consumers can exchange data, ideally in an interoperable way. In the case of HVDs, no technical guidance is given on how to publish these datasets. As a result, each publisher will publish their HVDs in their own way, making the result non-interoperable. The same problem will arise in the Common European Data Spaces unless technical specifications are provided for each of the exchanged datasets.

STIRData, a project co-financed by the Connecting Europe Facility Programme of the European Union, investigates this issue and examines what exactly needs to be done to ensure the technical, semantic and legal interoperability of the datasets, and what the pitfalls are, using open data from business registries, which is one of the HVD topics.

Other approaches to integration of company data typically keep the source data as it is, i.e., noninteroperable for others, and try to build value for their project by ingesting and cleaning the data for profit. STIRData, on the other hand, aims to improve the datasets at their source, making the datasets interoperable for everyone. The STIRData approach to technical interoperability is based on linked data, and the approach to semantic interoperability is based on a common data specification that reuses the European Core Vocabularies. Finally, our approach to legal interoperability of the datasets is based on guidelines to establish the appropriate terms of use and an overview of the current state of terms of use of business registry datasets.

The main contributions of this paper, extending [1], are as follows:

  1. We demonstrate how to tackle technical and semantic interoperability in a given domain, using the example of company data.

  2. We summarise our legal interoperability framework for open data and apply it to the domain of data from business registers.

  3. We verify the approach on datasets from the business registers of 13 countries.

  4. By creating a user-orientated platform, we show an example of how anyone can build an application on top of this interoperable data.

  5. By providing sample data consumption pipelines, we show how the data can be re-used and combined with other data sources for further data analysis.

The remainder of the paper is structured as follows. In Sect. “Architecture” we introduce the data architecture that supports technically interoperable open data. In Sect. “STIRData: An RDF Model for Company Data” we introduce the STIRData specification that supports the semantic interoperability of company data. In Sect. “Legal Interoperability” we introduce the legal framework to address existing legal interoperability issues. In Sect. “Datasets” we introduce the datasets on which we verified our approach. Section “The STIRData Platform” shows the STIRData platform and describes the necessary data pre-processing steps. In Sect. “Platform Performance Evaluation” we evaluate our approach, in Sect. “Compliant Data Consumption” we show how the data can be re-used for further analysis, and in Sect. “Related Work” we discuss related work. In Sect. “Conclusions” we conclude.

Architecture

To achieve semantic data interoperability, there must be a common data format with clearly defined semantics. This is defined by the STIRData specification introduced later in Sect. “STIRData: An RDF Model for Company Data”. To achieve technical data interoperability for all potential data consumers, including the STIRData platform (Sect. “The STIRData Platform”), there must be a uniform data interface at each of the data providers.

This is in contrast to other approaches to company data integration, where technical interoperability was achieved by integrating data from various data sources into a centralised database using ad-hoc transformation scripts, as in the case of the euBusinessGraph project (see Sect. “Related Work”). The obvious disadvantage of the latter approach is the lack of sustainability and scalability. When another data source is to be added to the system, someone needs to create the transformation script and manually add the data source to the centralised system. When that person is no longer available, e.g. due to the end of project funding, the original data sources are left non-interoperable, and the invested effort goes to waste. A similar approach can be intentionally followed to create business value from being in control of the integrated data, as in the case of OpenCorporates (see Sect. “Related Work”); however, that is not the goal of high-value datasets, which should be open to everyone, preferably in an interoperable way.

Given the motivation mentioned above, the choice of RDF [2] and the principles of linked data as the way to publish company data on the Web was natural. Regarding the technical interface, since users will typically need to query the data, the SPARQL endpoint [3] interface was chosen. Even though, from a research point of view, this architecture is not novel, it is still rarely, if ever, applied to publishing interoperable public sector information. Since it is a perfect fit for the HVDs, we demonstrate in this paper how it can be applied, using the example of company data, and what the hurdles along the way are.

Ideally, each data provider would publish the company data according to the proposed STIRData specification in their own SPARQL endpoint. We denote this way of publishing the data as LD-STIR. However, this will happen gradually, one provider at a time. Until then, there are multiple options of where the company data can be hosted, in which data format, and how it gets to the desired data format and the SPARQL endpoint.

Fig. 1 STIRData data architecture

The options are shown in Fig. 1, where the compliant data is indicated by the green colour. In all cases, our aim is to simulate the ideal state. We prepare the necessary data transformation and load each resulting dataset into a separate SPARQL endpoint so that the solution can be taken over by the business registry data providers if and when they wish to do so. This should be the case for at least the Czech Business Registry, since the Ministry of Justice hosting the business registry dataset is part of the STIRData project’s Experts and Collaborators Group, and the Norwegian registry, since the Brønnøysund Register Centre is a partner in the STIRData project. For the transformations themselves, we use the open-source tools D2RML [4] and LinkedPipes ETL [5].

The ideal case is represented in Fig. 1 by the BR provider 6, who publishes its data primarily in the LD-STIR form. However, we anticipate that most data providers will have their data primarily in a non-linked data form and transform it to the LD-STIR form directly using one of our tools (BR providers 2, 3, 4, and 5), other tools (BR providers 7 and 9) or through another linked data form such as a national one using national RDF vocabularies (BR provider 1).

The next question is where the compliant data is hosted. This would ideally be at the data provider’s premises (BR providers 1, 3, 6, 7 and 8), at another provider’s premises (BR provider 9) or, in the worst case, at the premises of a third party with a time-limited responsibility (BR providers 2, 4 and 5), in which case the data will eventually become unavailable. However, at least the transformations themselves will remain available to be adopted in the future.

To access the published company data, the STIRData platform, as well as any other application searching for the data, only needs to know the URLs of the respective SPARQL endpoints. These can be found in a data catalogue listing the STIRData compliant datasets, e.g., in the Official Portal for European Data, via a SPARQL query searching for catalogue records linking to the STIRData specification. Part of the specification is a DCAT-AP record template that shows how to do that. Then, the platform, and any other potential data consumer, can implement both decentralised features, distributing queries to the individual SPARQL endpoints, and centralised features, such as precomputing statistics that would otherwise take longer than acceptable for real-time user interaction. Additionally, we illustrate how compliant data can be processed for further analysis in Sect. “Compliant Data Consumption”.

STIRData: An RDF Model for Company Data

To ensure semantic interoperability of company data, a data specification was created as an application profile of the EU Core Vocabularies, namely the Business and Location Core Vocabularies, by extending them and covering the requirements on company data from the High-Value Datasets. As an additional input to the specification, we also used the results of the analysis of the contents of European business registry datasets discussed in Sect. “Datasets”.

Fig. 2 STIRData conceptual model. Gray classes represent controlled vocabularies

The conceptual view of the specification is shown in Fig. 2. The central concept is a company. Apart from typical fields such as legal name and founding date, a company may have economic activities (primary, secondary, auxiliary), it may be split into units, and it may be located in sites that may correspond to different establishments, each having an address. Apart from standard information, addresses should include the administrative units to which they belong, to support region-based aggregations. The specification also supports several types of company identifiers (e.g. registry identifier, tax identifier). The LEI code field offers interoperability with company data published by GLEIF.

Here we provide an example of a basic compliant company representation in RDF Turtle, including legal name, identifier, registration date and registered address. The sketch uses the W3C RegOrg (rov:), Organization (org:) and Location (locn:) vocabularies underlying the Core Vocabularies; the company IRIs are illustrative placeholders:

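    @prefix rov:    <http://www.w3.org/ns/regorg#> .
    @prefix org:    <http://www.w3.org/ns/org#> .
    @prefix locn:   <http://www.w3.org/ns/locn#> .
    @prefix adms:   <http://www.w3.org/ns/adms#> .
    @prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
    @prefix schema: <http://schema.org/> .
    @prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

    # Company and identifier IRIs are illustrative placeholders; the founding
    # date property (schema:foundingDate) is an assumption of this sketch.
    <https://company.example.org/cz/12345678> a rov:RegisteredOrganization ;
        rov:legalName "Example Manufacturing s.r.o."@cs ;
        rov:registration <https://company.example.org/cz/12345678/identifier> ;
        schema:foundingDate "2005-03-14"^^xsd:date ;
        org:hasRegisteredSite [ a org:Site ;
            locn:address [ a locn:Address ;
                locn:fullAddress "Hlavní 1, 602 00 Brno, Czechia" ;
                locn:postCode "602 00" ] ] .

    <https://company.example.org/cz/12345678/identifier> a adms:Identifier ;
        skos:notation "12345678" ;
        adms:schemaAgency "Czech Business Registry" .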

In the following sections, we describe how, through the use of common vocabularies, the location of the company, encoded by administrative units, and the economic activity of the company can be seen as two core, hierarchically structured categorisation dimensions for companies. Finally, we show how, through guidance on metadata description, the discoverability of compliant datasets can be achieved.

Company Location—Administrative Units

For encoding administrative units, the specification prescribes the use of the controlled vocabularies of the Territorial Units for Statistics (NUTS) and the Local Administrative Units (LAU). Both are well-established EU-level controlled vocabularies suitable for the interoperable classification of location. NUTS is a four-level hierarchy, with levels NUTS-0 to NUTS-3, where the top level is the country level, and LAU further divides the lowest NUTS level into the most fine-grained local administrative units particular to each country. For example, the NUTS-0 code CZ0 represents Czechia, the NUTS-3 code CZ064 represents the South Moravian region, and the CZ0642 LAU code represents the city of Brno.

For NUTS, there is a SKOS-based linked data version published at the EU Vocabularies site. However, this is not yet the case for LAU. Therefore, the SKOS version of LAU was created within the STIRData project from the relevant data provided by Eurostat. The transformations of the raw data were made using techniques similar to those discussed in Sect. “Datasets”. Note that the STIRData specification requires that data items include only the lowest relevant level values, i.e. NUTS-3 and LAU level, since the upper levels can be inferred from them.
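
For illustration, a LAU concept in the created SKOS version can be linked to its parent NUTS-3 concept via skos:broader. The sketch below assumes the EU Vocabularies IRI scheme for NUTS; the LAU concept IRI is an illustrative placeholder:

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

    # LAU concept for Brno (IRI illustrative), linked to its parent NUTS-3 unit
    <https://lau.example.org/lau/CZ/CZ0642> a skos:Concept ;
        skos:prefLabel "Brno"@cs ;
        skos:broader <http://data.europa.eu/nuts/code/CZ064> .   # South Moravian region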

Here we provide an example of a company representation with a focus on NUTS and LAU usage; the stir: linking properties below are assumptions of this sketch, not quotations from the specification:

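    @prefix rov:  <http://www.w3.org/ns/regorg#> .
    @prefix org:  <http://www.w3.org/ns/org#> .
    @prefix locn: <http://www.w3.org/ns/locn#> .
    @prefix stir: <https://w3id.org/stirdata/vocabulary#> .   # prefix and property names assumed

    # Only the lowest relevant levels (NUTS-3 and LAU) are recorded; the upper
    # levels are inferred from the SKOS hierarchies. IRIs as in the sketches above.
    <https://company.example.org/cz/12345678> a rov:RegisteredOrganization ;
        org:hasRegisteredSite [ a org:Site ;
            locn:address [ a locn:Address ;
                locn:postCode "602 00" ;
                stir:nuts3 <http://data.europa.eu/nuts/code/CZ064> ;
                stir:lau   <https://lau.example.org/lau/CZ/CZ0642> ] ] .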

Company Activities

For company activities, the use of the Statistical Classification of Economic Activities in the European Community Rev. 2 (NACE Rev2) controlled vocabulary is prescribed, as a well-established controlled vocabulary suitable for the interoperable classification of company activities. However, each country typically extends the base NACE Rev2 classification with more fine-grained, country-specific economic activity categories. NACE Rev2 itself is a four-level hierarchy, and national extensions typically add a fifth level, but may extend it up to seven levels.

For NACE Rev2, there is a SKOS-based linked data version published at the EU Vocabularies site. However, this is not the case for the NACE national extensions. Therefore, their SKOS versions were created within the STIRData project from the data provided by each country’s relevant national authority. Transformations of the raw data were performed using techniques similar to those discussed in Sect. “Datasets”. Note that the STIRData specification requires that data items include only the lowest relevant level values, i.e. a national NACE extension code, or the lowest-level NACE code present in the data, since the upper levels can be inferred from them, given that they are mapped properly. The specification expects the lower national levels to be hierarchically mapped using skos:broader, and the corresponding levels overlapping with the global NACE codes to be mapped using skos:exactMatch.
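
For illustration, the following sketch (with assumed concept IRIs) shows how a fifth-level NACE-CZ code can be linked into the hierarchy:

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

    # National fifth-level code (IRI illustrative)
    <https://nace.example.org/nace-cz/27900> a skos:Concept ;
        skos:broader <https://nace.example.org/nace-cz/2790> .

    # The national fourth-level code overlaps with the EU-level NACE Rev2 code
    <https://nace.example.org/nace-cz/2790> a skos:Concept ;
        skos:exactMatch <http://data.europa.eu/ux2/nace2/27.90> .   # EU NACE IRI scheme assumed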

Here, we provide an example of a company representation with a focus on NACE usage, including national extensions. Note that NACE code 27.90 represents Manufacture of other electrical equipment; we use the Czech NACE-CZ extension here, which adds the fifth NACE level. The concept IRIs below are illustrative:

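    @prefix rov: <http://www.w3.org/ns/regorg#> .

    # The company records only its lowest-level (national) activity code; the
    # EU-level code 27.90 is inferred through the SKOS mappings sketched above.
    <https://company.example.org/cz/12345678> a rov:RegisteredOrganization ;
        rov:orgActivity <https://nace.example.org/nace-cz/27900> .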

Metadata of Compliant Datasets

The specification contains guidelines on how to describe published compliant datasets with metadata using DCAT-AP, an application profile of the Data Catalog Vocabulary (DCAT) [6] that serves as the metadata standard for data portals in Europe. The metadata typically flows from the data publisher to a national open data portal, from which the record is further harvested into the Official portal for European data, https://data.europa.eu (EDP), which exposes the harvested metadata through a SPARQL endpoint.

For the compliant datasets and their SPARQL endpoints to be discoverable in EDP, the specification prescribes that the dataset be marked as compliant with the STIRData specification using dcterms:conformsTo, and that the SPARQL endpoint data service, a distribution of the dataset, be marked as compliant with the SPARQL 1.1 Protocol specification. Finally, the specification prescribes that further information about licensing, data provenance and data freshness be present in the dataset’s SPARQL endpoint itself.

Here we provide an example of a minimal suitable DCAT-AP record (dataset and endpoint IRIs are illustrative):

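    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    # Dataset, distribution and endpoint IRIs are illustrative; the IRI of the
    # STIRData specification is assumed.
    <https://data.example.org/dataset/business-registry> a dcat:Dataset ;
        dct:title "Business registry"@en ;
        dct:conformsTo <https://stirdata.github.io/data-specification/> ;
        dcat:distribution <https://data.example.org/distribution/business-registry-sparql> .

    <https://data.example.org/distribution/business-registry-sparql> a dcat:Distribution ;
        dcat:accessURL <https://data.example.org/sparql> ;
        dcat:accessService [ a dcat:DataService ;
            dcat:endpointURL <https://data.example.org/sparql> ;
            dct:conformsTo <https://www.w3.org/TR/sparql11-protocol/> ] .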

Compliant records can then be found in EDP’s SPARQL endpoint using a query along the following lines:

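    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>

    # The IRI of the STIRData specification is assumed; in practice the query
    # also has to account for the representational differences discussed below.
    SELECT DISTINCT ?dataset ?endpoint WHERE {
      ?dataset a dcat:Dataset ;
               dct:conformsTo <https://stirdata.github.io/data-specification/> ;
               dcat:distribution ?distribution .
      ?distribution dcat:accessService ?service .
      ?service dct:conformsTo <https://www.w3.org/TR/sparql11-protocol/> ;
               dcat:endpointURL ?endpoint .
    }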

In this phase, we encountered an obstacle to achieving metadata interoperability. Although DCAT-AP is a European standard for data portals, each national data portal treats it differently, supports a different subset of DCAT-AP, and shapes the data in a different way. To highlight a few key differences: while the Czech National Open Data Portal allows us to shape the metadata exactly as in the specification, the National Data Catalog of Norway does not. It treats standards linked from the catalogue records through dcterms:conformsTo slightly differently on the RDF level. While the specification asks for:

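    @prefix dct: <http://purl.org/dc/terms/> .

    # A direct link from the dataset to the specification IRI (dataset IRI illustrative)
    <https://data.example.org/dataset/business-registry>
        dct:conformsTo <https://stirdata.github.io/data-specification/> .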

the Norwegian catalogue represents the same thing using Skolem IRIs, like this:

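    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # The object of dcterms:conformsTo is a Skolem IRI describing the standard;
    # the Skolem IRI and the property linking it to the specification URL are
    # illustrative here.
    <https://data.example.no/dataset/business-registry>
        dct:conformsTo <https://data.example.no/.well-known/skolem/0f3a42b1> .

    <https://data.example.no/.well-known/skolem/0f3a42b1> a dct:Standard ;
        rdfs:seeAlso <https://stirdata.github.io/data-specification/> .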

Both ways are allowed in DCAT-AP, and it is not feasible to change the implementation of the portal, as it is not technically incorrect, and the maintainer of the portal is a different organisation from the one publishing the dataset. Unfortunately, this makes the SPARQL discovery query more complex, as it has to account for all such differences. It also shows how freedom in data representation can hinder data interoperability.

Another difference is in the representation of data services such as SPARQL endpoints in connection to datasets. While the Czech portal contains all the information about the distributions of the datasets and the related data services in one SPARQL endpoint, the Norwegian portal links to a separate dataset for information about data services. Since EDP is unable to harvest the data service metadata from that separate dataset, the information is not available in the harvested metadata in data.europa.eu, and therefore it cannot be used to discover the SPARQL endpoint containing the STIRData dataset.

Legal Interoperability

Technical interoperability is not the only condition for successful integration and further use of open data, in our case, data from European business registers. Achieving legal interoperability is also essential. The ultimate goal is to maximize the legal certainty of data re-users so they can create new products and services without fear of legal repercussions. In particular, the data provider must ensure that there are no legal obstacles that would limit the further use of the provided data and must properly formulate the terms of use of the data. In this section, we will cover the point of view of a data publisher and their obligations to properly set the terms of use of open data.

The first question the data provider must answer when publishing their dataset, e.g. a business registry, is to what extent the data should be made public. There is no widely applicable legal instrument at the level of international or EU law that comprehensively regulates the obligation of public administration to provide data and information. The issue is largely governed by national law. Regarding our case of company data, the implementing regulation of the EU Open Data Directive 2019/1024, the Implementing act on a list of High-Value Datasets, sets out the obligation to provide business register data in the required scope. However, harmonisation is not complete here either, as this obligation only applies to data already held by Member States. The regulation does not impose an obligation to create data to the required extent. This means that the richness of the published business registry datasets differs greatly among EU Member States.

The second issue is rights to the data. A potential obstacle is the ownership of plain data, a matter of national civil law that is not regulated at the European or international level. If plain data were subject to property rights, it would be necessary for the provider to give permission for its use when it is published; otherwise, their right of ownership, which is in principle protected against everybody, would be infringed. In practice, however, we are not aware of any legislation in any European jurisdiction that protects plain data by property rights.

The next level of data rights is intellectual property rights, an area harmonised at the EU level, and therefore the conclusions presented here are relatively generally applicable. Simple data is generally not protected by copyright because copyright law imposes relatively high standards on the originality of the subject matter and on the fact that it is “the author’s own intellectual creation” [7]. In the context of public sector information, the mere fact that “mere intellectual effort and skill” were required for the creation of the intangible result is not relevant to its copyrightability. In general, therefore, it can be said that copyright will not constitute an obstacle to the further use of data from business registers, and thus it is not necessary to deal with it specifically.

The situation is different in the case of the legal protection of databases. The latter is established at the European level by Directive 96/9/EC and consists of copyright protection for the structure of the database and sui generis protection for the maker of the database [8]. Where database protection is present, as may be the case for published business register data (an issue that needs to be assessed case by case), it is essential that the provider of the register licenses the further use of its contents. The obligation to do so also follows directly from the wording of Article 1(6) of Directive 2019/1024 (Open Data Directive). An appropriate way to deal with any obstacles arising from copyright and database law is to use a CC BY 4.0 licence.

A final area to consider in terms of potential barriers to the publication and re-use of business register data is personal data protection. Business registers contain records of natural persons involved in registered companies and are therefore subject to EU Regulation 2016/679, the General Data Protection Regulation (GDPR). Although this is a harmonised European regulation, the final decision on whether selected personal data can be published as open data is up to the national legislator. Some Member States allow it (e.g. Czechia), others do not. Furthermore, in recent cases, the CJEU has set limits on the publication and re-use of public sector information containing personal data. In the Latvijas Republikas Saeima case, the CJEU held that the establishment of a legal obligation to make the register of penalty points imposed for road traffic offences public is contrary to the requirements of the GDPR. In the Vyriausioji tarnybinės etikos komisija case, the CJEU ruled that while the online publication of data from declarations of private interests is legitimate in relation to the fight against corruption, it does not meet the requirements of the balancing test, in particular taking into account general principles of data protection such as the principle of minimisation. Finally, in the Luxembourg Business Registers case, the CJEU decided that the obligation to disclose information on the beneficial owners of companies within the meaning of Article 30 of the amended Directive 2015/849 is in breach of the Charter of Fundamental Rights of the EU: although this regulation is an appropriate tool in the fight against corruption, it is not strictly necessary (there is a less invasive solution), and at the same time it does not offer sufficient guarantees for the effective protection of fundamental rights and therefore disproportionately interferes with them. These recent developments in the CJEU’s case law suggest relatively strict limits in European law on how Member States can regulate the obligation to disclose databases containing personal data, without which no further use of the data is possible.

In the absence of a legal mandate to publish personal data, the registry administrator cannot do so on a discretionary basis. In such a case, there is no choice but to anonymise the data. Here, however, it is necessary to draw attention to the need for thorough anonymisation, because the concept of “personal data” is interpreted very broadly and there is a risk of deanonymisation [9].

In summary, to ensure legal interoperability, it is essential that the data provider provides clear terms of use for the published data in the dataset. Theoretically, the terms of use may contain custom conditions that the data provider imposes on the recipient (and will therefore be a contract), but in practice, the possibilities of setting new restrictions are severely limited by the Open Data Directive. Therefore, we strongly advise against creating custom, binding terms of use, not least because their legal enforceability is highly problematic.

The main function of terms of use is to inform the data recipient of potential legal obstacles and their resolution. This makes it possible to ensure legal certainty for the data recipient, while at the same time the fragmentation of law across jurisdictions ceases to be a problem. Therefore, if the published register is protected by intellectual property rights, it is necessary that this information is present in the terms of use, together with a licence agreement allowing further use of the content. If personal data is present, it is appropriate to inform the recipient that by further use, they become the controller of the personal data with all the obligations that this entails [10]. If no legal obstacles are present, it is again appropriate to inform the recipient of the data of this fact, thus ensuring their legal certainty and avoiding a possible chilling effect that would negatively affect further use of the data [11].

In conclusion, to ensure legal interoperability, it is essential that the data provider knows what conflicting rights affect the provided data, resolves these obstacles where possible and does not create new ones. When providing the data, they then need to formulate the terms of use in such a way as to fully inform the recipient of the data and thus ensure their legal certainty in the re-use of the data. Although terms of use will generally not be legally enforceable, they are a key element for the effectiveness of the continued use of the data provided.

From a technical point of view, given the relative complexity of the national legislation governing data publication, simply stating “this dataset is licensed as CC BY 4.0”, which is a common practice, is actually not always sufficient, as this statement does not cover, with the necessary legal certainty, all the potentially applicable rights summarised above. This insufficient practice is often caused by a divide between people knowledgeable in copyright law and people implementing technical standards such as DCAT-AP in their data catalogues. For instance, the Czech terms of use are expressed in the necessary detail, covering the four main areas based on the relevant Czech legislation, as the following data structure in the dataset’s metadata record in RDF Turtle (translated to English):

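    @prefix terms: <https://data.gov.cz/terms/vocabulary#> .   # translated prefix, illustrative

    # A translated sketch of the Czech terms-of-use structure; the English
    # property and value IRIs render the Czech national vocabulary and are
    # therefore illustrative.
    <https://data.example.cz/dataset/business-registry>
        terms:specification [
            terms:copyrightedWorks           <https://data.gov.cz/terms/contains-no-copyrighted-works> ;
            terms:databaseAsCopyrightedWork  <https://data.gov.cz/terms/not-a-copyrighted-database> ;
            terms:suiGenerisDatabaseRights   <https://data.gov.cz/terms/not-protected-by-sui-generis-database-rights> ;
            terms:personalData               <https://data.gov.cz/terms/contains-no-personal-data> ] .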

The obvious downside of this legally proper approach is that it may be difficult for a foreign user of the data to properly understand the terms, as they will not be experts in Czech legislation. Even though this particular specification says that the data is free of any legal obstacles, it is still not easily comparable, in the purely legal sense, to, e.g., a CC0 licence.

Datasets

To verify our approach, we used open datasets from business registries of several European countries, with the purpose of transforming them to the LD-STIR form, i.e. linked data according to the STIRData specification, thus making them part of the data architecture of Fig. 1. These source datasets were obtained from either the European or the respective country’s open data portal, or directly from the respective business registry. Each dataset should be available directly for bulk download or accessible through an API that allows obtaining its full contents, so that transformation of the entire dataset into LD-STIR form and its subsequent publication in an RDF store could be possible. The datasets that satisfied these conditions and are currently incorporated into the STIRData platform are for the business registers of Belgium, Cyprus, Czechia, Estonia, Finland, France, Greece (Athens area), Latvia, Moldova, Netherlands, Norway, Romania and the UK.

These datasets, all of which are in non-LD form, are considerably heterogeneous in format and content. Most are available as single or multiple CSV files, but there are also spreadsheets, XML and JSON files. The size of the datasets also varies considerably, depending on the country and the encoded information. For example, the French dataset contains information on about 23 M main legal entities and 33 M establishments, the UK dataset contains more than 5 M main legal entities, while the Moldovan dataset contains about 235 K.

To prepare the datasets for transformation, the data fields and contents of each dataset were analysed and, eventually, each dataset was characterised along a set of dimensions relevant to the STIRData specification. These dimensions and brief descriptions are provided in Table 1. Following the analysis, to convert the source datasets from their original format to STIRData compliant linked data datasets, appropriate transformation workflows were developed and applied using two transformation tools: LinkedPipes ETL [5] and D2RML [4]. Both are powerful generic transformation tools capable of handling multiple data formats and performing complex data mappings to linked data representations.

Table 1 Dataset dimensions

The defined transformations, available on GitHub, had varying levels of complexity, depending on the structure of the original data representation. In this respect, we should note that, given the heterogeneity of the source data, not only in format but also in structure due to the different underlying legislations, several modelling and transformation issues had to be addressed. One such issue is, e.g., the structure of large companies. In most countries, a company appears as a single entry in the business registry datasets, where it has its identifier and a registered address, with no information about actual points of service, while in some countries, as in the French and Belgian business registry datasets, each company’s point of service has an individual entry in the business registry, possibly with a different identifier and set of economic activities. As mentioned in Sect. “STIRData: An RDF Model for Company Data”, such issues were taken into account when designing the STIRData data specification, to make it generic enough to cover different registration practices regarding the structure of companies.

An important part of the company data transformation process was also the mapping of relevant source data entry values to linked data resources of the appropriate vocabularies, as in the case of economic activity codes, which were mapped to the respective NACE national extension resources, and of addresses, from which the underlying administrative units, as NUTS and LAU resources, had to be inferred. Since NACE codes are closed code lists and were included in the data, their transformations were straightforward. For addresses, the mapping was achieved by exploiting company address postcode information in combination with data provided by GISCO. To enhance data interoperability, the mappings also produced LEI code properties using GLEIF open data. Finally, to reduce the size of the resulting datasets, when the original datasets contained long historical data of dissolved companies, as in the case of the French business registry, only the most recent data (after 2000) were kept.

The resulting LD-STIR datasets were published and made available through different SPARQL endpoints, one for each country (using Virtuoso Open Source Edition as the underlying triple store). Their size and information on which of the five basic dimensions they include are provided in Table 2. The SPARQL endpoints are https://stirdata-semantic.ails.ece.ntua.gr/api/content/xy/sparql, where xy is one of be, cy, ee, fi, fr, el, lv, md, nl, no, ro, uk, for each country in the order it appears in Table 2. For Czechia, the endpoint is https://obchodní-rejstřík.stirdata.opendata.cz/vsparql. These endpoints are currently managed by the project partners. However, they are ready to be taken over by the individual business registries, including the data transformations creating their content.

Table 2 Overview of published STIRData-compliant linked data datasets

As prescribed by the STIRData data specification, all published datasets include information about their provenance (the source dataset), licence, and date of last update. Given that the source datasets are updated periodically, the published STIRData-compliant datasets are also updated periodically after re-applying the transformations on the newer versions of the source datasets.

The STIRData Platform

To demonstrate the value of the proposed approach in a concrete way, we developed the STIRData platform which, on top of the data architecture described in Sect. “Architecture” and the published datasets described in Sect. “Datasets”, provides a user-friendly interface to explore all the business registry data in a uniform manner.

The platform in principle adopts a fully decentralised architecture. It assumes that each dataset resides in a separate remote SPARQL endpoint. Apart from some basic information about each dataset, it also centrally stores copies of the shared NUTS, LAU and NACE vocabularies. In addition, to improve the performance of the user-facing platform, centrally stored precomputed statistics and indexes have been added as extensions to the basic platform architecture, making it less dependent on the performance characteristics of the source SPARQL endpoints. We discuss this addition in greater detail later in the paper.

The STIRData compliant business registry datasets offered by the platform are discovered automatically by scheduled tasks that periodically check for new datasets in the Official portal for European data, as well as for updates of already included datasets. Datasets not yet available in the Official portal for European data can be registered manually; in either case, the only required information is a link to the respective SPARQL endpoint.

For the end-user, the platform offers access to the underlying data through three main types of queries: retrieval queries, search queries, and statistics queries.

Retrieval Queries

Retrieval queries are the simplest queries that retrieve information about specific legal entities. A legal entity is identified by its STIRData IRI; based on this, the platform identifies the corresponding business registry and issues an appropriate query to the respective SPARQL endpoint to get the legal entity details.
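
A minimal sketch of such a query is shown below; the company IRI is hypothetical, and in practice the platform issues a more elaborate query covering all the properties of the STIRData model:

    # Retrieve all direct properties of a legal entity from the SPARQL endpoint
    # of the business registry identified from its IRI (IRI hypothetical).
    SELECT ?property ?value WHERE {
      <https://company.example.org/cz/12345678> ?property ?value .
    }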

Fig. 3 Example company details page. The bottom levels of the relevant NUTS/LAU and NACE hierarchies are displayed, as well as Czech Trade Inspection Authority data

Apart from the details provided by the STIRData specification, the implementation of retrieval queries also allows for obtaining additional information about legal entities from other sources that have relevant data published as linked data. These sources can be added to the platform through a generic add-on mechanism. An example is the data of the Czech Trade Inspection Authority, which the platform uses to show the inspections that Czech legal entities have undergone. An example of a retrieval result page, also showcasing that feature, is shown in Fig. 3.

Search Queries

Search queries retrieve lists of companies that satisfy conditions based on location, economic activity, and registration date. For example, a user may request all companies registered in the Oslo area in Norway and in the Prague area in Czechia after a certain date that perform one of a specific set of economic activities. The conditions regarding location and economic activities are expressed using NUTS/LAU and NACE Rev2 vocabulary concepts, respectively.

Fig. 4 Example queries asking for all companies in the NO0 NUTS-1 region (Norge) performing a subactivity of the nace:46 class (Wholesale trade, except of motor vehicles and motorcycles)

Search queries are, in principle, federated queries, since they involve data residing in different SPARQL endpoints: those hosting the vocabularies and a different endpoint for the data of each involved country. Moreover, answering such queries involves SKOS hierarchy-based reasoning since, e.g., asking for companies in a certain region is actually asking for companies in any subregion thereof, and similarly for economic activities. These features pose significant challenges to query efficiency. One option is to directly use the SPARQL constructs by which such queries can be realised, i.e., federated SPARQL queries and path expressions. Figure 4a shows a direct formulation of an example SPARQL query using these two constructs. However, the evaluation of such complex queries on public endpoints may be problematic. Table 3 (first query column) shows the results of an experimental evaluation of the above query on three different triple stores. As we can see, one did not support that query, one failed to efficiently evaluate it, and only one was able to answer in an acceptable time.

To avoid such problems, our platform leverages domain knowledge and the closed form of such queries by expanding and splitting each query into a set of simpler queries addressed to the appropriate endpoints, so that they can be answered more efficiently. In particular, a query of the form of Fig. 4a is executed in three steps, corresponding to the three parts of the federated query. The first two steps, which we will call effective values computation, obtain the list of subactivities of the activity specified in the query by issuing a simple query to the respective endpoint, and do the same for the subregions of the region of interest; in the last step, a non-federated query is issued directly to the company data endpoint, explicitly listing the effective activity and region values using the VALUES SPARQL construct. This query is shown in Fig. 4b and the results of its evaluation in the second query column of Table 3. The effective values computation is more complex in the case of multiple conditions.
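
For illustration, the rewritten query has roughly the following shape; the company-to-region property path and the concrete IRIs are assumptions of this sketch, and the actual effective value lists are typically much longer:

    PREFIX rov:  <http://www.w3.org/ns/regorg#>
    PREFIX org:  <http://www.w3.org/ns/org#>
    PREFIX locn: <http://www.w3.org/ns/locn#>
    PREFIX stir: <https://w3id.org/stirdata/vocabulary#>   # prefix and property names assumed

    SELECT ?company WHERE {
      # effective values pre-computed from the NACE and NUTS SKOS hierarchies
      VALUES ?activity { <http://data.europa.eu/ux2/nace2/46.31>
                         <http://data.europa.eu/ux2/nace2/46.39> }
      VALUES ?region   { <http://data.europa.eu/nuts/code/NO081>
                         <http://data.europa.eu/nuts/code/NO082> }
      ?company rov:orgActivity ?activity ;
               org:hasRegisteredSite/locn:address/stir:nuts3 ?region .
    }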

Table 3 Evaluation of the queries of Fig. 4 on three triple stores

As a further example of data interoperability, in addition to the above functionality, the implementation of search (and retrieval) queries by the STIRData platform also allows one to use conditions based on NUTS/LAU territorial typologies as well as relevant Eurostat statistics. Therefore, in addition to the conditions described above, a user can ask, for example, for companies located in the more urbanised areas or coastal areas of a country, or in areas with high availability of tourist accommodation or high unemployment rates.

To be fully integrated with the platform, the implementation of this feature relies on the availability of such statistics as linked data using the RDF Data Cube Vocabulary [12], a vocabulary for publishing multi-dimensional data, such as statistics. Because Eurostat datasets are not currently available in this format (they are available only as TSV data), transformation and publication of selected Eurostat statistics has been done in a similar way to the business registry datasets, using D2RML transformations. Given that the source data files of all such datasets follow a common format, the mapping that transforms them into the RDF Data Cube Vocabulary is a generic mapping, taking only the code of the dataset as a parameter, which makes it very easy to add any additional dataset to the platform. Indicative Eurostat datasets that have been mapped to linked data form and are available as search criteria in the platform are “Households with access to the internet at home” (isoc_r_iacc_h), “Number of establishments and bed-places by NUTS 2 regions” (tgs00112), “Gross domestic product (GDP) at current market prices by NUTS 3 regions” (nama_10r_3gdp), “Unemployment by sex, age, educational attainment level and NUTS 2 regions” (lfst_r_lfu3pers), and “Population density by NUTS 3 region” (demo_r_d3dens).

At query time, constraints expressed using Eurostat statistics are translated, by issuing an appropriate SPARQL query to the platform’s endpoint holding the statistics, into an effective value list of the regions satisfying the constraints; that list is then used, as described above, in the VALUES construct of the final SPARQL query to the registries’ endpoints holding the actual business registry data. To resolve a selected statistic to particular regions, the user must supply any additional needed values; e.g. in the case of the “Unemployment by sex, age, educational attainment level and NUTS 2 regions” statistic, the user must specify an age class, an educational level and a sex, and also provide a value range for the number of unemployed persons. All statistics require the user to provide a value range and, if the statistic is multidimensional, to select specific values for the several dimensions; these are categorical values, and the user makes the selection through options lists. It should be noted that because, as already mentioned, the STIRData model keeps business location information at the NUTS-3 and LAU levels, when a query involves a coarser-grained Eurostat statistic (e.g. one available only at the NUTS-2 level), the effective location values list is built by assuming that all subregions of a larger region inherit the statistical properties of the larger region.
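
A sketch of such a translation query over the statistics endpoint follows; the dataset, dimension, and measure IRIs are illustrative:

    PREFIX qb:   <http://purl.org/linked-data/cube#>
    PREFIX stat: <https://stats.example.org/vocabulary#>   # illustrative

    # Select the NUTS-2 regions whose unemployment observation, for the
    # user-selected dimension values, falls within the user-supplied range.
    SELECT ?region WHERE {
      ?obs a qb:Observation ;
           qb:dataSet <https://stats.example.org/dataset/lfst_r_lfu3pers> ;
           stat:refArea ?region ;
           stat:age   <https://stats.example.org/code/age/Y20-64> ;
           stat:sex   <https://stats.example.org/code/sex/T> ;
           stat:isced <https://stats.example.org/code/isced11/TOTAL> ;
           stat:unemployedPersons ?persons .
      FILTER (?persons >= 10.0 && ?persons <= 50.0)   # range in thousands, user-supplied
    }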

A sample search query page, which also demonstrates the use of the Eurostat statistics feature, is shown in Fig. 5.

Fig. 5 Example search page, requesting Czech and Norwegian wholesale companies founded after 2020, located in intermediate urbanity level areas

Statistics Queries

Statistics queries are similar to search queries, but instead of lists of companies, they return aggregated statistical information, namely the number of companies satisfying the desired criteria, along with an analysis of the distribution of companies in the subregions and subactivities specified in the query. Statistics queries have been implemented similarly to search queries.

Because statistics queries provide useful, compact overviews of the underlying data, an important feature of the platform is that it allows users to browse through the location and/or the economic activity hierarchies, displaying the corresponding statistical information. However, because such browsing again requires the execution of multiple complex SPARQL queries against public triple stores containing potentially millions of RDF triples, their real-time computation would result in poor aggregate response times.

For this reason, the STIRData platform adopts the approach of pre-computing several of those statistics offline (for the location, economic activity, and registration date dimensions, and pairs thereof) and caching them in a database, so that they can be served instantly. The statistics precomputation process is activated each time a new business registry dataset is registered or already published data are updated. Pre-computation of the statistics for a country can take from a few minutes to several days, depending on the size and dimensions of the dataset. The results are stored in a PostgreSQL database. A sample page showing administrative units and economic activity statistics for Belgium is shown in Fig. 6.

Fig. 6 Interface for browsing country and economic activity statistics

Platform Performance Evaluation

The platform described in Sect. “The STIRData Platform” is fully operational and allows users to explore multiple business registries, provides interoperability with other datasets (e.g. Eurostat data), and is periodically updated with new versions of the business registry datasets. In this section, we discuss the performance optimisations that had to be implemented to overcome the inherent performance issues of the public SPARQL endpoint-based decentralised architecture.

As mentioned above, our architecture relies on possibly third-party managed SPARQL endpoints to get access to company data. This achieves the desired decentralised data interoperability, and the SPARQL language is expressive enough to meet complex data querying needs. However, as the complexity of the queries posed by a data consumer increases, and also depending on the size of the underlying dataset and the computational resources available to an endpoint, performance problems may arise that can result in poor response times.

Given the decentralised nature of our approach, where ideally each business registry provides its data through its own managed infrastructure, much in this respect depends on the software and computational resources that each business registry has at its disposal and on the continuous availability of the endpoints, which is beyond the control of the platform. We experienced such problems, which were overcome by increasing the resources (allocated memory) available to the SPARQL endpoints and by rewriting queries to use more efficient execution plans, as discussed in Sect. “The STIRData Platform”. As explained there, we also experimented with several software platforms that implement SPARQL endpoints, and their performance varied considerably, with some types of queries answered more efficiently by some of them than by others.

It is important to note that much of the complexity of the queries that leads to the above problems arises from the fact that, as discussed in Sect. “STIRData: An RDF Model for Company Data”, the STIRData specification requires keeping for each company only the lowest-level information about the administrative unit and economic activity hierarchies it belongs to. Although this is a sound data modelling assumption that reduces redundancy and is compliant with the linked data principles, it leads to the need to infer, at query time, the higher levels of the hierarchies a company belongs to, which may result in computationally demanding queries. In Sect. “The STIRData Platform” we showed how we alleviated the problem by performing the effective values computation, without compromising the fully decentralised approach and while insisting on an on-the-fly execution of SPARQL queries. We also discussed how we pre-computed certain statistics whose on-the-fly computation is problematic.

However, because not all statistics (e.g. for multiple conditions) can be pre-computed, and complex queries cannot be avoided without severely limiting the data exploration capabilities of the platform, another architectural compromise had to be implemented. As an alternative, a periodically updated, platform-managed materialised cache of the published data was introduced into the platform. It contains all the required inferred and pre-computed information, so that effective values computation on SKOS vocabulary hierarchies is not needed at query time. The cache has been implemented as an index, to allow for even faster query times.

Table 4 Response time for several statistics queries, using direct SPARQL evaluation, pre-computed statistics, and materialized indexed data

In particular, the index consists of several ElasticSearch indices, one for each business registry, which store for each company the fields on which search is expected, namely administrative units, economic activities, and registration dates, including materialised (i.e. SKOS hierarchy-based inferred) values. Some indicative results are shown in Table 4, which compares the response times for several statistics queries using direct SPARQL evaluation, pre-computed statistics, and the index. The sizes of the datasets on which the queries were executed are shown in Table 2. As expected, pre-computed data, when available, are served fastest, with performance comparable to the index, which is significantly faster than direct SPARQL query evaluation. Direct SPARQL query response times clearly depend on the size of the underlying data and the complexity of the queries (i.e., on the effective values computation time). Initially introduced to speed up the evaluation of statistics queries, the index, due to the big performance gain it delivers, has also been leveraged for answering standard search queries as an alternative to direct SPARQL query evaluation.

It is important to note that the statistics queries return both the number of companies satisfying the specified conditions and a distribution thereof in the relevant subregions and subactivities; this means that not a single but multiple queries have to be executed in each case, which explains the relatively long response times. It is also interesting to note that for the query involving Eurostat conditions, the index also appears to be relatively slow. This can be explained by the fact that in this case the first part of the query evaluation (the effective values computation) cannot be delegated fully to the index (which contains only SKOS hierarchy-based inferred values), since it requires issuing SPARQL queries to the endpoint holding the Eurostat statistics in order to get the effective values for the desired conditions to be used in the actual query to the index. This shows that in more general queries involving the combination of data from different sources, the index cannot always guarantee immediate response times. In conclusion, as expected, the performance improvement from using an index is significant, although queries relying (even partially) on SPARQL query evaluation may still suffer from slower response times.

Compliant Data Consumption

The STIRData platform described in Sect. “The STIRData Platform” and evaluated in Sect. “Platform Performance Evaluation” shows how an end-user-facing application can be developed on top of compliant and decentralised business registry datasets. In this section, we show examples of how the decentralised business registry datasets can be processed and combined with other linked data sources by a data expert, e.g. for further analyses. We demonstrate the process using LinkedPipes ETL [5], the same tool that is also used to transform and publish some of the business registry datasets.

Fig. 7 LinkedPipes ETL pipeline gathering statistics from business registry datasets and linking them with Wikidata

Figure 7 shows the pipeline computing the number of business entities in NUTS-3 regions and their population. First, the pipeline discovers all compliant STIRData datasets, including their SPARQL endpoints, in a chosen data catalogue such as data.europa.eu, as described in Sect. “Metadata of Compliant Datasets”. Next, using those endpoints and the STIRData specification described in Sect. “STIRData: An RDF Model for Company Data”, it computes the number of business entities in NUTS-3 regions. Finally, it connects those statistics with data from another linked data source, Wikidata, also available through a SPARQL endpoint, and extracts from there the population of the NUTS-3 regions. The resulting dataset, usable for further analysis, uses the RDF Data Cube Vocabulary [12]. An illustrative sample of such a dataset is presented here:

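    @prefix qb:  <http://purl.org/linked-data/cube#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <https://example.org/statistics/> .   # dataset, dimension and measure IRIs illustrative

    # One observation per NUTS-3 region; the numeric values are illustrative.
    ex:observation-CZ064 a qb:Observation ;
        qb:dataSet ex:companies-and-population-per-nuts3 ;
        ex:refArea <http://data.europa.eu/nuts/code/CZ064> ;
        ex:numberOfCompanies "331542"^^xsd:integer ;
        ex:population "1184568"^^xsd:integer .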

Additional data consumption examples, including a dataset profiling pipeline, can be found on GitHub.

Related Work

STIRData is, of course, not the first project dealing with the integration of data on companies from various data sources. OpenCorporates makes a business out of the integration and cleaning of company data, and the euBusinessGraph project built a marketplace for such datasets. However, both of these approaches have one thing in common: they keep the source data as it is, i.e., noninteroperable for others, and they build value for their project by ingesting and cleaning the data for profit. In contrast, STIRData aims at improving the datasets at their source, making the datasets interoperable for everyone, and showing how such interoperable datasets can be reused.

There is also the Business Registers Interconnection System (BRIS), which allows human users to manually search for companies in the integrated business registers using a web page. However, this is all that BRIS offers. It is a specialised information system connecting directly to the individual business registries, in a completely closed manner, and has nothing to do with open data or the Common European Data Spaces.

Conclusions

In this paper, an extended version of [1], we elaborate on the results of STIRData, a project that implements a linked data-based approach to the publication of open data from European business registries in an interoperable fashion. Our approach addresses various aspects of data interoperability, including the technical, semantic, and legal dimensions, and also demonstrates the reusability of the compliant data. The semantic interoperability approach is based on the European Core Vocabularies and their profiling, while the technical interoperability approach is based on linked data technologies such as RDF [2] and SPARQL [3]. Moreover, we suggest a legal interoperability framework for open data in general and emphasise the non-ideal situation of today’s data publishers and consumers regarding legal certainty when publishing and using open data.

Compared to other company data integration approaches, the main difference of STIRData is that we make the data interoperable on the publisher’s side, i.e. for everyone, not centrally, and not for profit. Next, we demonstrate a way to build applications on top of the interoperable data, including ways of overcoming the performance difficulties inherent to the linked data-based architecture, by presenting the STIRData platform for data browsing and analysis. Moreover, we showcase how data published using our approach can be re-used by data experts for further analysis. In addition, we see the need for a similar approach in the Common European Data Spaces, which are currently being established and which go beyond the scope of open data. As part of our future work, we will therefore investigate the possibilities of applying alternative linked data interfaces, such as Linked Data Fragments [13] and Linked Data Event Streams [14], to see whether they can help with this issue.