A sustainable process and toolbox for geographical linked data generation and publication: a case study with BTN100
We describe the process and tools that we have used to generate and publish the BTN100 Linked Dataset, based on the original data from the Spanish Topographic Base (1:100.000 scale) from the Spanish Instituto Geográfico Nacional. We have taken into account the limitations and lessons learned from our initial experience, in 2010, with the generation and publication of Linked Data from a range of geographical sources in Spain, and we have now refined the process in order to facilitate: declarative mappings for the transformations from existing open data (shapefiles), automation of transformations whenever there are changes in the original data sources, version control, and alignment with INSPIRE URIs. As a result of this transformation and publication process we have also updated the reference ontology for geographical features and aligned it with general ontologies such as GeoSPARQL.
Keywords: Geospatial data, Linked data, Linked dataset, Ontology
One of the activities of the Spanish Instituto Geográfico Nacional (IGN) is to produce geographical information for all the territorial entities in Spain. IGN is responsible for maintaining and making accessible cartographic and topographic databases for the representation of the Spanish territory. Its catalogs publish data related to transport networks, geodetic information, administrative units, etc., and anyone can download them from its data portal under an open data license compatible with CC BY 4.0.
Governments, via their many agencies and organizations, are constantly producing data that may be highly interrelated but in practice become isolated due to a lack of interoperability. Cartographic and topographic information from IGN may easily enrich data from other government entities, e.g. the National Institute of Statistics, the Institute of Cultural Heritage, the General Direction of Cadastre, the Geological and Mining Institute, etc.
However, the general lack of semantic standards in the descriptions of the data elements within the data sources makes it difficult to reuse them. Despite progress in data availability, there are still plenty of issues related to semantic interoperability, that is, the ability of information systems to exchange data with unambiguous, shared meaning.
There are several initiatives around the world that have focused on generating and publishing Linked Data from a range of geospatial data sources, and a W3C/OGC Working Group ran between 2015 and 2017 under the title “Spatial Data on the Web”, producing recommendations on how to publish different types of geospatial, sensor, and temporal data on the web in a principled manner. The LinkedGeoData initiative aimed to make the information collected by OpenStreetMap available as RDF and to interlink this data with other knowledge bases. Ordnance Survey Linked Data publishes a number of products from Great Britain’s national mapping agency as Linked Data and provides access to them through a SPARQL endpoint. The Swiss Linked Data Service publishes the geospatial datasets from the Swiss Federal Spatial Data Infrastructure via a Linked Data frontend, which provides a search, querying and visualization interface.
In Spain, there was also some pioneering work on producing geospatial Linked Data from a range of data sources (many of them from IGN), as described in [1, 2]. Additionally, an ontology of administrative units has been created and published at http://vocab.linkeddata.es/datosabiertos/def/sector-publico/territorio and some regions, such as Aragón, have produced Linked Data about their administrative units. In 2010, we worked on the GeoLinked Data initiative in order to enrich the web of data with Spanish Linked Data. We used as input several relational databases and Excel spreadsheets about administrative units, hydrography and statistical domains. Then we modeled an ontology to represent this data. Later, we generated the RDF with the Geometry2RDF plugin to deal with the geometrical transformations. The generated RDF was compliant with the WGS84 vocabulary and the GML ontology. We added some links between terms from each data source, published the resulting RDF in a triplestore and made it available via a visualization tool (Map4RDF). Nevertheless, our process had some limitations. Geometries did not make use of GeoSPARQL, since it was an emerging standard with little tool support. Our transformations relied on special access to Oracle Spatial databases instead of already published open data. No automation was included to deal with the evolution of the data sources, which made the Linked Data go stale quickly; updating it required manual intervention.
In this paper we describe the transformation of the Spanish Topographic Base at 1:100.000 scale (BTN100) catalog into Linked Data. We used GitHub for version control and as an archive to store all the RDF transformations. The process includes defining the semantic model for this geospatial data, generating the data transformations, publishing them through a SPARQL endpoint, and maintaining the Linked Dataset.
Our work represents a step forward in improving semantic interoperability in the geospatial domain. We make use of the open BTN100 dataset and define semantics for its data. Through the semantic model we represent complex geometrical shapes, e.g. multi-lines, polygons, multi-polygons, etc. This makes it possible to visualize these geometric entities and to infer assertions such as a geographical entity being contained within another entity. In addition, we provide an automatic way to deal with changes in the data source in order to keep the Linked Dataset up to date.
The paper is organized as follows: the “Methodology and results” section describes all the steps followed to generate and maintain the Linked Dataset. The “Conclusions and future directions” section presents the conclusions and future work.
Methodology and results
Generating Linked Data is a process that involves several activities and decisions in order to obtain a high-quality Linked Dataset. We followed the Methodological Guidelines for Publishing Government Linked Data [5]. These guidelines cover all the steps and details that are necessary for the activities involved. The activities described by the guidelines are: specification, modelling, generation, publication and exploitation. Each activity involves one or more tasks and some techniques for carrying them out.
The first task of the specification activity is the identification of the data sources, formats, information within the datasets and general requirements for the resulting Linked Dataset. In our case, we used the open BTN100 catalog as our data source. This data source is available from the Spanish National Center for Geographic Information (CNIG) as shapefiles in the ETRS89 and REGCAN95 Coordinate Reference Systems (CRS). The BTN100 catalog contains topographic and thematic geographic information; it was designed following the INSPIRE Directive. It groups the data into the following themes: administrative units, protected zones, buildings and population entities, transport networks, energy and conduction, geodetic vertices, altimetry and hydrography.
Defining the URIs
The final task of the specification activity is the definition of the license of the Linked Dataset. We decided to reuse the IGN license for the BTN100 Linked Dataset. It is a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
In order to represent all themes of the dataset, we generated an ontology, which replaces the former http://geo.linkeddata.es ontology. The ontology was developed following the LOT Methodology described on-line (originally proposed in ) and used for example in . Reusing ontologies was important throughout our development process. We focused our analysis on the common spatial vocabularies recommended by the W3C Working Group Note [6]. We decided to reuse the GeoSPARQL vocabulary to represent geospatial data, since it makes it possible to use specialized functions for geometries.
The GeoSPARQL vocabulary does not allow representing elements such as identifiers for resources, labels for geographical objects, altitude, etc. In order to address these shortcomings, we developed the btn100 ontology. This model represents all the geographical objects from our dataset. The btn100 ontology has links to SKOS thesauri that were developed to represent some categories of elements in our dataset, for instance, type of highway access, type of roadway, etc. These thesauri will be linked in the future to those maintained in the INSPIRE registry. All files generated during the ontology development, including the requirements, ontologies, thesauri and documentation, are available in a GitHub repository at https://github.com/oeg-upm/ontology-BTN100.
The model depicted in Fig. 2b presents the classes defined to represent all concepts from the administrative units and protected zones themes, two of the themes listed in the Specification subsection. The btn100 documentation in HTML format, including further details of the representation of the other themes and their diagrams, is available at https://datos.ign.es/def/btn100.
In order to deal with the transformation tasks we used GeoKettle. With this tool we created a transformation file for each shapefile and configured a workflow to perform the following activities. First, we cleaned the data, e.g. correcting malformed or incompatible datatypes. Then, we mapped the data to their corresponding SKOS concepts. Next, we converted ETRS89 and REGCAN95 coordinates into the WGS84 CRS in order to represent the data according to the GeoSPARQL standard. Last, we transformed the data, via the TripleGeo plugin, into triples according to the model defined in the btn100 ontology.
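The last step of the workflow above can be illustrated with a minimal Python sketch. It is not the GeoKettle/TripleGeo transformation itself; it only shows the shape of the GeoSPARQL triples that such a step would produce for one polygon feature. The base URI, the identifier scheme and the use of rdfs:label are assumptions made for illustration, and the coordinates are assumed to have been converted to WGS84 longitude/latitude already (the order GeoSPARQL expects when a WKT literal carries no explicit CRS URI).

```python
GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical base URI, used only for illustration.
BASE = "https://example.org/btn100/"


def polygon_wkt(ring):
    """Serialize one ring of (lon, lat) pairs as a WKT polygon, closing it if needed."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])
    coords = ", ".join(f"{lon} {lat}" for lon, lat in points)
    return f"POLYGON(({coords}))"


def escape_literal(text):
    """Escape the characters that N-Triples forbids inside a quoted literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def feature_to_ntriples(feature_id, label, ring, lang="es"):
    """Return N-Triples lines linking a feature to its label and WKT geometry."""
    feature = f"<{BASE}feature/{feature_id}>"
    geometry = f"<{BASE}geometry/{feature_id}>"
    wkt = escape_literal(polygon_wkt(ring))
    return [
        f"{feature} <{RDF_TYPE}> <{GEO}Feature> .",
        f'{feature} <{RDFS_LABEL}> "{escape_literal(label)}"@{lang} .',
        f"{feature} <{GEO}hasGeometry> {geometry} .",
        f"{geometry} <{RDF_TYPE}> <{GEO}Geometry> .",
        f'{geometry} <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


if __name__ == "__main__":
    for line in feature_to_ntriples("1", "Madrid", [(0, 0), (1, 0), (1, 1)]):
        print(line)
```

Keeping the geometry as a separate resource linked through geo:hasGeometry follows the GeoSPARQL feature/geometry pattern, which is what allows spatial functions to be applied to the WKT literal.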
Finally, for the linking task we used the owl:sameAs relationship to align our resources with DBpedia and with other resources from the Spanish government open data portal. All files generated during the linking task are also available in the GitHub repository.
The publication step aims to provide access to the resulting dataset. We stored the RDF files in a Virtuoso triplestore. Virtuoso provides a SPARQL endpoint, available at https://datos.ign.es/sparql. Some use cases with their SPARQL queries are available at https://datos.ign.es/casos-de-uso.html.
We also provided a web interface to the SPARQL endpoint via Pubby. A web portal about this work is available at https://datos.ign.es; it offers a single entry point to all the resources (e.g. queries, ontologies, SKOS thesauri, etc.).
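As a small illustration of what the linking files contain, the following Python sketch serializes pairs of aligned resources as owl:sameAs statements in N-Triples. How the pairs are found is outside the scope of the sketch, and the local URI in the usage example is hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(alignments):
    """Return one owl:sameAs N-Triples line per distinct (local, external) URI pair."""
    return [
        f"<{local_uri}> <{OWL_SAME_AS}> <{external_uri}> ."
        for local_uri, external_uri in sorted(set(alignments))
    ]


if __name__ == "__main__":
    pairs = [
        # Hypothetical local URI aligned with a DBpedia resource.
        ("https://example.org/btn100/feature/1", "http://dbpedia.org/resource/Madrid"),
    ]
    print("\n".join(same_as_triples(pairs)))
```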
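A SPARQL endpoint such as this one can be queried over plain HTTP. The sketch below, in Python with only the standard library, builds a SPARQL Protocol GET request and parses a JSON result. The query is an assumption: it presumes that geometries are exposed through geo:hasGeometry and geo:asWKT, as the GeoSPARQL vocabulary suggests, and it has not been checked against the published dataset.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://datos.ign.es/sparql"

# Assumed query shape: features with a GeoSPARQL geometry serialized as WKT.
QUERY = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?feature ?wkt WHERE {
  ?feature geo:hasGeometry ?geometry .
  ?geometry geo:asWKT ?wkt .
} LIMIT 5
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(ENDPOINT, QUERY):
        print(row["feature"]["value"], row["wkt"]["value"][:60])
```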
Although a maintenance activity is not included in the guidelines we followed, we consider it important to ensure that the dataset always reflects the most current version of the data source. In order to automate the generation and updating of the Linked Dataset we developed a Python script.
The script, which will be executed periodically, starts by downloading the BTN100 data source and then compares the downloaded source with the previous one. If a change is detected, the script identifies the elements that need to be updated. Then it generates the new RDF files via GeoKettle and publishes the updates in the SPARQL endpoint. Finally, the script also updates the thesauri and sameAs files in the triplestore.
As mentioned above, GitHub is our environment for file versioning and storage. However, GitHub does not allow pushing files larger than 100 MB. For this reason, the script also checks whether any of the updated files is larger than 90 MB. If so, the script breaks the file down into files of up to 90 MB each and then uploads the resulting files to GitHub.
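The comparison step can be sketched as follows. The text does not say how the script detects changes, so this Python example assumes one plausible approach: hashing every file of the previous and the newly downloaded copy with SHA-256 and reporting the files whose digest differs. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file's path (relative to the directory) to its content digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(previous, current):
    """Return (new or modified names, removed names) between two snapshots."""
    updated = sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
    removed = sorted(name for name in previous if name not in current)
    return updated, removed
```

Only the files returned in the first list would need to be transformed again, which keeps each periodic run proportional to the size of the change rather than to the size of the catalog.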
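The splitting step could look like the following Python sketch. It assumes that the RDF files are line-oriented (for example N-Triples, one triple per line), so that cutting at line boundaries keeps every part a valid file on its own; the part naming scheme and the reading of 90 MB as 90 × 1024 × 1024 bytes are also assumptions.

```python
from pathlib import Path

MAX_BYTES = 90 * 1024 * 1024  # assumed interpretation of the 90 MB threshold


def split_by_lines(source, max_bytes=MAX_BYTES):
    """Split a line-oriented file into numbered parts of at most max_bytes each.

    A single line longer than max_bytes is written to its own part unsplit.
    Returns the list of part paths, in order.
    """
    source = Path(source)
    parts = []
    out = None
    written = 0
    try:
        with open(source, "rb") as handle:
            for line in handle:
                # Start a new part for the first line or when this line would overflow.
                if out is None or (written and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    name = f"{source.stem}.part{len(parts) + 1:03d}{source.suffix}"
                    part = source.with_name(name)
                    out = open(part, "wb")
                    parts.append(part)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

A caller would first check `Path(source).stat().st_size > MAX_BYTES` and only split the files that exceed the threshold.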
Conclusions and future directions
In this paper we have described our updated approach to the previous Spanish GeoLinked Data work, specifically for the representation of the BTN100 catalog. We have presented the process to generate and publish BTN100 as Linked Data. The dataset has been generated using the btn100 ontology, which reuses the GeoSPARQL vocabulary. This ontology provides complex geospatial representations and makes the data more interoperable with other similar datasets. Our dataset has been tested against competency questions posed by domain experts; it is modular and therefore easily extensible.
In this work we have relied on GitHub throughout the process, as a collaborative, distributed and version-controlled development environment. Our work also provides an automatic script that performs the whole process, from data extraction to publication. This script updates the dataset whenever a change is detected in the data source.
Our model represents all the BTN100 themes; however, we are only generating Linked Data for territorial units. We are using Virtuoso as the technology behind our endpoint because of our previous experience with it. However, Virtuoso does not fully support GeoSPARQL functions, so it would be important to complement our work with another triplestore in this domain. Our work addresses a specific need from the IGN, and it is available in Spanish. However, the approach is applicable to other scenarios in this domain and possibly in other languages.
We acknowledge the work done by developers who have contributed parts of the software used: Lissete Moscoso, Francisco Siles, Victor Saquicela and Luis Vilches.
This work has been funded by Centro Nacional de Información Geográfica and DATOS 4.0: RETOS Y SOLUCIONES - UPM Spanish national project (TIN2016-78011-C4-4-R).
Availability of data and materials
All files generated during this work are available in our GitHub repositories. Ontologies and SKOS thesauri files are available at https://github.com/oeg-upm/ontology-BTN100. Original and transformed files, plugin and scripts are available at https://github.com/oeg-upm/btn100.
All authors participated fully in this work from inception to completion. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
- 1. de León A, Saquicela V, Vilches LM, Villazón-Terrazas B, Priyatna F, Corcho O. Geographical linked data: a Spanish use case. In: Proceedings of the In I-SEMANTICS ’10 6th International Conference on Semantic Systems. New York: ACM: 2010.
- 2. Atemezing G, Corcho O, Garijo D, Mora J, Poveda-Villalón M, Rozas P, Vila-Suero D, Villazón-Terrazas B. Transforming meteorological data into linked data. Semant Web. 2013; 4(3):285–90.
- 3. Corcho O, Pérez IS, Lafuente H, Portolés D, Cano C, Peris A, Subero JM. Publishing linked statistical data: Aragon, a case study. In: Joint Proceedings of the International Workshops on Hybrid Statistical Semantic Understanding and Emerging Semantics, and Semantic Statistics (HybridSemStats). Aachen: 2017.
- 4. de León A, Wisniewki F, Villazón-Terrazas B, Corcho O. Map4rdf - faceted browser for geospatial datasets. In: Using Open Data: policy modeling, citizen empowerment, data journalism. 2012.
- 5. Villazón-Terrazas B, Vilches-Blázquez LM, Corcho O, Gómez-Pérez A. Methodological guidelines for publishing government linked data. In: Linking Government Data. New York: Springer: 2011. p. 27–49.
- 6. Spatial Data on the Web Best Practices. 2017. https://www.w3.org/TR/sdw-bp/.
- 7. Technical Interoperability Standard for the Reuse of Information Resources. 2013. https://administracionelectronica.gob.es/pae_Home/dam/jcr:a8d2c143-ce9a-4fc7-afe7-ef5d9ba7c4a1/ENGLISH_Interoperability_Agreement_for%20the%20Reuse%20%20of%20Information%20Resources.pdf.
- 8. Poveda-Villalón M. A reuse-based lightweight method for developing linked data ontologies and vocabularies. In: 9th Extended Semantic Web Conference (ESWC). Berlin: Springer: 2012. p. 833–7.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.