1 Introduction

The Location Index (LocI) is the spatial one of three data ‘spines’ (the others being business and people) the Australian government is creating to better enable reliable machine linking, integration and processing of government data with the goal of improving policy advice. LocI will enable government agencies to better geospatially integrate and analyze data across government portfolios and information domains.

Almost all government data contain some location information, because ‘everything happens somewhere’. Data on service delivery to citizens, business productivity, population demographics, transport infrastructure, grants programs, and weather & climate all contain information about location. Similarly, location information is almost always contained in new and emerging ‘big data’ streams such as sensor technologies embedded in just about everything around us. Information from these sources will become increasingly important as we transform towards digital government services.

Joining data for analysis requires the ability for objects represented in multiple data sets to be identified uniquely and unambiguously. For example, data for Australian businesses is collected and identified using a unique Australian Business Number (ABN). This well-governed identifier is used across multiple data sets (Fig. 1).

Fig. 1. A project brochure image showing the position of LocI, the Location Spine, with respect to Australian government Environment, Society and Economy data

Spatial features (features that exist in geographic space; e.g. local government areas, properties, rivers) can often have multiple identifiers. For example, a Local Government Area may have one identifier in the Australian Bureau of Statistics’ Australian Statistical Geography Standard, another in Commonwealth administrative data and yet another in State data sets. As very few spatial features have well-governed identifiers, it is not possible to use them to join data sets together with certainty. This lack of shared unique identifiers means that every user must address integration challenges every time data are joined. These processes are often manual and unrepeatable, and do not add value to the data assets.

So far, LocI distributes three major Australian spatial datasets as Linked Data: the Geocoded National Address File (GNAF), the Australian Hydrological Geospatial Fabric (Geofabric) and the Australian Statistical Geography Standard (ASGS). These distributions draw from authoritative non-Semantic Web datasets and are converted to Linked Data using ontologies specific to each dataset, together with generic upper ontologies for multi-dataset consistency. All objects modelled by the various ontologies – from whole datasets to individual dataset objects and also the ontologies themselves – are assigned persistent HTTP URIs managed as part of Australian government interoperability efforts.

The project has also delivered 7 dataset crosswalks – Linksets – which join spatial objects with spatial relations. LocI implements stand-alone Linksets to ensure crosswalks are able to be independently governed.

Several Linked Data-specific clients are currently deployed or under development: an identifier downloader, a query builder and a data processor. These are being built in response to challenges around the use of spatial data as a mechanism for integrating other data, and will offer LocI end users new ways of working. Also under development are RDF validation and inference tools.

This project and its products are an evolution of earlier work, known as the Spatial Identifiers Reference Framework (SIRF) [5], to deliver Linked Data identifiers for spatial data in Australia. For LocI, Linked Data was again chosen as the technical approach due to its perceived ability to deliver open data in a consistent and application-independent way across the Internet. Semantic Web modelling was chosen because it offers multiple interoperable and modern models.

LocI is also investing heavily in the design of inter-departmental ‘social architecture’ (enabling social and institutional mechanisms) to support the continued generation of Linked Data and ongoing infrastructure maintenance.

Section 2 details the LocI project requirements, Sect. 3 addresses governance issues, Sect. 4 describes the LocI systems and Sect. 5 lists some of the project tools developed. A short concluding note is given in Sect. 6.

2 Requirements

2.1 Initial Datasets

LocI is required to deliver an interoperable set of foundational Australian spatial data products initially consisting of three major datasets created by different agencies:

  • Australian Statistical Geography Standard (ASGS)

    • published by the Australian Bureau of Statistics (ABS)Footnote 1

    • contains a hierarchy of approximately 500,000 geographies with which to aggregate census statistics

    • available via a seriesFootnote 2 of public Web Feature Service (WFS)Footnote 3 services

  • Australian Hydrological Geospatial Fabric (Geofabric)

    • owned by the Australian Bureau of Meteorology (BoM)Footnote 4

    • contains data on 35,000 surface hydrology features (lakes, rivers, catchments) extracted from survey maps and remotely sensed digital elevation models. Of prime importance is the hierarchy of hydrological catchments, which are monitored by river and stream gauges

    • available via a single, public, WFSFootnote 5

  • Geocoded National Address File (GNAF)

    • published by PSMA Australia Ltd. (PSMA)Footnote 6

    • contains at least one entry for every one of the approx. 14.5 million street addresses in Australia as well as address aliases and relations (units within lots; addresses within localities)

    • available as a downloadable databaseFootnote 7

A fourth dataset, the Australian Place Names Gazetteer (PNGaz)Footnote 8, is partially deployed within the LocI project.

2.2 Future Datasets

The project has identified many more datasets to be added in future phases of LocI:

  • national or state cadastral datasets

  • irrigation areas and other socio-environmental administrative areas

  • electoral boundaries and other social geographies

  • electrical energy use data

Preparation for these datasets has commenced but, as of the date of this publication, none has yet been incorporated into the published set.

2.3 Data Governance

Establishing appropriate institutional arrangements to sustain LocI has been identified as a priority activity. An analysis of existing spatial data supply chains and current governance arrangements for the spatial and Linked Data domains is being undertaken. This will inform the co-design of a future state of trusted spatial Linked Data supply chains with cross-departmental identifier governance arrangements. The aim is to ensure that LocI-published datasets (together with their constituent spatial features) are seen to carry the authority, and therefore the trust, of the original non-Linked Data datasets. This will assist adoption, as authority, persistence and trust are necessary precursors for community uptake. To this end, while technical delivery of the three core datasets as Linked Data products was mostly conducted by Australia’s research agency, CSIRO, approval to publish the products was gained from the original owner agencies. Section 3 details some of the institutional arrangements in place to assist with LocI system management and data publication.

The requirement to implement trust-ensuring measures stems from the lessons of a previous project (SIRF [5]), where Linked Data versions of products were not operationalised because users perceived them to lack legitimacy and authority.

2.4 Data Delivery

In addition to delivering spatial data as Linked Data online, LocI also delivers spatial data object identifiers for offline use within secure Australian government data analysis systems, such as the Multi-Agency Data Integration Project (MADIP)Footnote 9 and Business Longitudinal Analysis Data Environment (BLADE)Footnote 10.

3 Identifier Governance

LocI resources are designed to be identified both authoritatively and persistently by HTTP URIs. All LocI data publishers have previously joined the Australian Government Linked Data Working Group (LDWG)Footnote 11, which is a technical advice group made up of members from both Australian Commonwealth and State governments. The LDWG has a semi-formal mandate to advise on Linked Data matters and manages the use of the domain linked.data.gov.au for Linked Data URIs. The persistence of this domain, which is agency-neutral, is currently protected by a Memorandum of UnderstandingFootnote 12 between Linked Data-publishing agencies and the technical owner of the domain, the Australian Digital Transformation AgencyFootnote 13. Continued persistence of this domain is critical for LocI’s long-term stability, and more formal governance arrangements are being explored through the social architecture component of LocI.

The LDWG formalises how and what URIs may be requested with GuidelinesFootnote 14 based on academic and government URI management work [6, 13] and adjusted over 5 years of operation in Australia to their current form. The current guidelines require multi-agency URI proposal submission, review and acceptance, based on the publication standard ISO11179 [9]. For the LocI project, URIs have been created for datasets, linksets and ontologies using dataset, linkset & def namespace path segments, for example, http://linked.data.gov.au/dataset/geofabric for the Geofabric dataset (see below).
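The namespace pattern described above can be sketched in a few lines of Python. This is an illustration only: the helper name `loci_uri` and the rule for what counts as a valid token are assumptions made here, and only the domain, the three path segments and the Geofabric example come from the text.

```python
# Sketch: composing persistent URIs from the LDWG namespace path segments.
BASE = "http://linked.data.gov.au"

# Path segments named in the text: dataset, linkset and def (ontologies).
NAMESPACES = {"dataset", "linkset", "def"}


def loci_uri(kind: str, token: str) -> str:
    """Return a URI of the form http://linked.data.gov.au/<kind>/<token>.

    `loci_uri` is a hypothetical helper, not part of any LocI tool.
    """
    if kind not in NAMESPACES:
        raise ValueError(f"unknown namespace path segment: {kind!r}")
    if not token or "/" in token or " " in token:
        # Assumed rule: a token is a single, non-empty path segment.
        raise ValueError(f"invalid token: {token!r}")
    return f"{BASE}/{kind}/{token}"


geofabric = loci_uri("dataset", "geofabric")
```

Here `geofabric` reproduces the Geofabric dataset URI quoted above; any other token passed to the helper would be a made-up example rather than a registered URI.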

4 Architecture

LocI has developed a component architecture, as shown in Fig. 2. Fewer central systems are implemented than in predecessors such as SIRF [5]; instead, datasets are distributed and published independently, although they may also be cached for particular client use. This ensures that LocI datasets can be used individually or in any combination, and not just as originally demonstrated in this first LocI project phase.

Fig. 2. Major LocI architectural components

4.1 Ontologies

The over-arching “LocI Ontology”Footnote 15, shown in Fig. 2, is used to structure LocI data for governance purposes and specializes multiple well-known ontologies to convey familiar data patterns: DCAT [7], VoID [2] and GeoSPARQL [11]. All LocI datasets must be published as DCAT’s dcat:Distribution objects related to conceptual dcat:Dataset objects representing the original (likely non-RDF) datasets. Spatial elements within all LocI datasets must be geo:Feature objects published in reg:Registers and are expected, but not required, to relate to geo:Geometry objects. Instances of void:Linkset – a specialised void:Dataset for the “description of RDF links between [other] datasets” [2] – are published that join LocI datasets and, when published, must contain provenance information at the link level, which was not originally supported by VoID. The LocI ontology overview diagram is shown in Fig. 3.
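The class pattern in this paragraph can be made concrete with a small sketch that represents RDF triples as plain Python tuples. Every identifier under the `ex:` prefix is a hypothetical placeholder, and the two helper functions are assumptions made for illustration. The property linking a feature to its register is left out because the text does not name it.

```python
# Sketch: the LocI Ontology pattern as (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"

# All "ex:" identifiers are hypothetical placeholders.
triples = [
    ("ex:geofabric", RDF_TYPE, "dcat:Dataset"),
    ("ex:geofabric-ld", RDF_TYPE, "dcat:Distribution"),
    ("ex:geofabric", "dcat:distribution", "ex:geofabric-ld"),
    ("ex:catchments", RDF_TYPE, "reg:Register"),
    ("ex:catchment-1", RDF_TYPE, "geo:Feature"),
    ("ex:catchment-1", "geo:hasGeometry", "ex:catchment-1-geom"),
    ("ex:catchment-1-geom", RDF_TYPE, "geo:Geometry"),
]


def instances_of(cls, graph):
    """Return the sorted subjects typed as `cls`."""
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o == cls)


def features_without_geometry(graph):
    """List features lacking a geometry: expected, but not required."""
    with_geometry = {s for s, p, _ in graph if p == "geo:hasGeometry"}
    return [f for f in instances_of("geo:Feature", graph)
            if f not in with_geometry]


missing = features_without_geometry(triples)
```

Because a geometry is expected but not required, a non-empty `missing` list would be a warning to report rather than an error that blocks publication.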

Each LocI dataset is published according to at least one specialised Web Ontology Language (OWL) [14] ontology to convey the specifics of the data it contains, e.g. the Geofabric models its contents using the HY_Features ontology. The three LocI datasets published to date, and the partly published PNGaz, have their specialized ontologies indicated in Table 1.

Table 1. LocI datasets and the ontologies used to publish them

In addition to the LocI Ontology for governance, a “GeoSPARQL Extensions Ontology”Footnote 16 has also been created to describe spatial relations not well handled in existing ontologiesFootnote 17. So far, properties for transitive spatial overlaps, spatial within/contains inverses and spatial resolution have been modelled, with more likely to be added.

4.2 Dataset Publication

LocI datasets have been modelled as dcat:Distribution instances of their underlying dcat:Dataset objects delivered in both Linked Data and SPARQL service [8] forms. These forms are seen as information-equivalent to the non-Linked Data versions, even though significant gains are to be had from their substantially altered formats and availability.

Fig. 3. Major LocI ontology classes and their relationships

Dataset publication as Linked Data here means that distributions are available not only in RDF but also in HTML, with web pages for the overall dataset objects, each major register of objects within datasets (as Registry OntologyFootnote 18 reg:Register objects) and each object within datasets. The Geofabric dataset’s River Region 9400216 object is shown in Fig. 4. The various dataset, register and object representations are available via URIs derived from the namespace URI of the dataset, and different formats can be requested via HTTP Content Negotiation as well as via specialised Query String Arguments. Examples of the URIs for the GNAF are given in Table 2.

Fig. 4. A River Region object’s HTML landing page from the Geofabric dataset

Linked Data URIs for spatial objects can also be used offline and in other online, non-Linked Data systems, such as OGC Web Feature Services (WFS)Footnote 19. Spatial objects’ information can be accessed via their URIs and reprocessed for delivery according to the WFS standard, allowing for more traditional Geographic Information Systems use.

Objects in datasets may be published according to multiple ontologies. For example, the GNAF’s gnaf:Address instances are published according to the GNAF Ontology, ISO16160 (see above) and Schema.orgFootnote 20. RDF (and sometimes HTML) representations conforming to these ontologies are available independently by requesting a particular view of a resource, selected from a list of available views. This provides a way of obtaining LocI objects conforming to particular standards and aligns with new W3C work to standardize Content Negotiation by Profile [12]. Views and URIs for a gnaf:Address object are shown in Table 3.
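The sketch below illustrates the two request styles described in this section: selecting a format through the HTTP Accept header, and selecting a view and format through query string arguments. The parameter names `_view` and `_format` and the helper `build_request` are assumptions made for illustration. The `alternates` view name and the Geofabric dataset URI are the only values taken from this paper. No request is sent; the code only constructs it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(uri, media_type=None, view=None, fmt=None):
    """Build (but do not send) an HTTP request for one representation."""
    params = {}
    if view:
        params["_view"] = view    # assumed query string argument name
    if fmt:
        params["_format"] = fmt   # assumed query string argument name
    url = f"{uri}?{urlencode(params)}" if params else uri
    headers = {"Accept": media_type} if media_type else {}
    return Request(url, headers=headers)


dataset = "http://linked.data.gov.au/dataset/geofabric"

# Style 1: HTTP content negotiation via the Accept header.
by_header = build_request(dataset, media_type="text/turtle")

# Style 2: the same choice expressed as query string arguments,
# here asking for the list of available views in Turtle.
by_query = build_request(dataset, view="alternates", fmt="text/turtle")
```

Either request object could then be passed to `urllib.request.urlopen`. Which views a given object actually offers would have to be read from its list of available views rather than assumed.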

As required at LocI project establishment, URIs acting as identifiers for all LocI datasets’ elements are available for download for use in offline environments. To facilitate this, client software for the datasets’ Linked Data publications has been writtenFootnote 21. This allows for the extraction of all elements’ URIs via the registers containing them and for their delivery in either Comma Separated Values or Microsoft Excel formats. All LocI datasets may be downloaded in this way at any time with this or any other client software. However, because full downloads are large, a static copy of the first 1,000 items of each dataset’s registers is also available for demonstration purposes: https://github.com/CSIRO-enviro-informatics/loci-dataset-download.
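A minimal sketch of this register-harvesting idea follows. It is not the project's client software: the paging contract of `fetch_page`, the in-memory stand-in for a register and the example.org URIs are all assumptions made so that the example runs without network access.

```python
import csv
import io


def iter_register_uris(fetch_page, register_uri, per_page=100):
    """Yield every member URI of a register, one page at a time.

    `fetch_page(register_uri, page, per_page)` is a caller-supplied function
    returning a list of URIs; an empty list marks the end of the register.
    """
    page = 1
    while True:
        uris = fetch_page(register_uri, page, per_page)
        if not uris:
            return
        yield from uris
        page += 1


def write_csv(uris, out):
    """Write the URIs to `out` as a one-column CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["uri"])
    for uri in uris:
        writer.writerow([uri])


# Stand-in for a network call: a made-up register with 250 members.
REGISTER = "http://example.org/dataset/demo/address/"
MEMBERS = [f"{REGISTER}{i}" for i in range(250)]


def fake_fetch(register_uri, page, per_page):
    start = (page - 1) * per_page
    return MEMBERS[start:start + per_page]


buffer = io.StringIO()
write_csv(iter_register_uris(fake_fetch, REGISTER), buffer)
rows = buffer.getvalue().splitlines()
```

Replacing `fake_fetch` with a function that requests one page of a real register and extracts its member URIs would turn the sketch into a working harvester, and the generator keeps memory use flat even for registers with millions of members.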

4.3 Linkset Publication

LocI publishes multiple links between spatial datasets. Since these links sometimes take effort to generate (typically by geo-processing) and the results are often not shared, publishing them will reduce duplication of analysts’ work. As of the date of this paper, seven Linksets have been published; see https://github.com/CSIRO-enviro-informatics?q=linkset. They have been produced using various methods including spatial intersections using GIS software, database processing, SPARQL CONSTRUCT queries and data conversion from offline data. Unlike LocI datasets, Linksets are not yet published as Linked Data, only as RDF resources which can be loaded for use within RDF databases.

Table 2. GNAF dataset Linked Data URIs
Table 3. Views for a GNAF gnaf:Address object

The Addresses/Catchments LinksetFootnote 22 links gnaf:Address objects in the GNAF to geofabric:ContractedCatchment objects in the Geofabric and was produced using spatial intersections. This Linkset contains as many individuals (links) as the total number of gnaf:Address objects (14.5M) and in RDF Turtle form is a file of approximately 1 GB in size. It is also delivered as a size-reduced Comma Separated Values file (0.5 MB).

In RDF form, this Linkset’s structure extends the basic VoID Linkset structure by using reification to associate further facts with each link. Whereas in VoID each link is a regular triple (subject, predicate, object), here each triple’s information is published as an rdf:Statement object containing not only a subject, a predicate and an object but also a loci:hadGenerationMethod property, which indicates how the link was produced by pointing to an object containing instructions. Each published Linkset contains extensive methodology notes, and a register of LocI Linkset creation methods will be published in time.
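To make the reified structure concrete, the sketch below prints one link as Turtle. It uses rdf:Statement, the reification class of the RDF vocabulary, together with the loci:hadGenerationMethod property named above. Everything else is an assumption made for illustration: the example.org URIs, the placeholder `loci:` namespace, and the choice of the GeoSPARQL property `geo:sfWithin` as the link predicate.

```python
# Sketch: one Linkset link written as a reified statement in Turtle.
PREFIXES = (
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
    "@prefix geo: <http://www.opengis.net/ont/geosparql#> .\n"
    "@prefix loci: <http://example.org/loci#> .  # placeholder namespace\n"
)


def reified_link(link_id, subject, predicate, obj, method):
    """Return a Turtle snippet describing one link as an rdf:Statement."""
    return (
        f"<{link_id}> a rdf:Statement ;\n"
        f"    rdf:subject <{subject}> ;\n"
        f"    rdf:predicate {predicate} ;\n"
        f"    rdf:object <{obj}> ;\n"
        f"    loci:hadGenerationMethod <{method}> .\n"
    )


link = reified_link(
    "http://example.org/linkset/demo/link-1",          # hypothetical
    "http://example.org/dataset/demo/address/1",       # hypothetical
    "geo:sfWithin",                                    # assumed predicate
    "http://example.org/dataset/demo/catchment/9",     # hypothetical
    "http://example.org/method/spatial-intersection",  # hypothetical
)
turtle = PREFIXES + "\n" + link
```

A plain VoID link would need one triple, while the reified form needs five, which helps explain why a Linkset with a link for every address becomes a large file.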

So far, Linksets are published to crosswalk all LocI datasets and all versions of them. In time, multiple Linksets between the same datasets will be published to allow for different crosswalking methodologies to be selected for use by clients.

4.4 Data Validation

At present, LocI performs only basic data validation. However, when caches of data from Datasets and Linksets are made for client consumption in mid-2019, SHACL [10] constraint language templates will be created for the LocI Ontology, the GeoSPARQL Extensions Ontology and other upper ontologies, with which to validate instance data. For this task, we have provisioned a new stand-alone SHACL validator tool (see RDF Validator below).

Given that LocI operations require data to conform to multiple ontologies, some of which may be in a derivation hierarchy, the new Profiles Ontology [3] is being tested for its ability to model such dependencies and the validating artefacts related to ontologies.

It is expected that third parties may want to contribute datasets and Linksets to the pool of LocI resources and, when they do, validation systems will need to be in place. For this, tooling has been developed; see RDF Validator below.
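The kind of rule such validation would enforce can be illustrated with a plain-Python check. This is neither SHACL nor pySHACL; it is a hand-written stand-in for a set of "at least one value" constraints, applied to the link-level provenance requirement for Linksets. The dictionary layout and the `ex:` identifiers are assumptions made for illustration.

```python
# Sketch: a hand-rolled check that every link carries the required properties.
REQUIRED = (
    "rdf:subject",
    "rdf:predicate",
    "rdf:object",
    "loci:hadGenerationMethod",
)


def validate_links(links):
    """Return a list of (link id, missing property) violations."""
    violations = []
    for link_id, props in links.items():
        for prop in REQUIRED:
            if not props.get(prop):
                violations.append((link_id, prop))
    return violations


# Hypothetical links: the second one lacks its generation method.
links = {
    "ex:link-1": {
        "rdf:subject": "ex:address-1",
        "rdf:predicate": "geo:sfWithin",
        "rdf:object": "ex:catchment-9",
        "loci:hadGenerationMethod": "ex:spatial-intersection",
    },
    "ex:link-2": {
        "rdf:subject": "ex:address-2",
        "rdf:predicate": "geo:sfWithin",
        "rdf:object": "ex:catchment-9",
    },
}
report = validate_links(links)
```

In a SHACL-based workflow the `REQUIRED` tuple would be replaced by declarative shapes, so that the rules can be published and versioned alongside the ontologies they constrain.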

4.5 Graph Expansion

Graph expansion – generation of inferred knowledge from ontological rules – is expected to be important for LocI as new business rules are added on top of the total data holdings in the form of ontological axioms. So far, inferred data has not been published by LocI, but both small- and large-scale inferencing capability is currently under test (see OWL Reasoners below).
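As a small illustration of graph expansion, the sketch below applies a single rule, transitivity of one property, until no new triples appear. It is a hand-written fixed-point loop rather than an OWL reasoner, and the `ex:within` property and the four region names are hypothetical.

```python
# Sketch: expanding a graph with the triples implied by one transitive property.
def expand_transitive(triples, predicate):
    """Return the triples plus all links implied by transitivity."""
    graph = set(triples)
    changed = True
    while changed:
        changed = False
        edges = [(s, o) for s, p, o in graph if p == predicate]
        for a, b in edges:
            for c, d in edges:
                if b == c and (a, predicate, d) not in graph:
                    graph.add((a, predicate, d))
                    changed = True
    return graph


# Hypothetical containment chain: address < block < district < state.
asserted = [
    ("ex:address-1", "ex:within", "ex:block-7"),
    ("ex:block-7", "ex:within", "ex:district-3"),
    ("ex:district-3", "ex:within", "ex:state-1"),
]
expanded = expand_transitive(asserted, "ex:within")
inferred = expanded - set(asserted)
```

The three asserted triples yield three inferred ones, including the direct address-to-state link that a client could otherwise obtain only by walking the chain. At the scale of the LocI holdings this pairwise loop would be far too slow, which is one reason dedicated reasoners are being tested.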

4.6 Clients

While data owners and system maintainers will publish ontologies, datasets and linksets, validate data and expand graphs, LocI end users will most likely not perform these actions. Data analysts and government policy advisers are expected, at least initially, to use specialised Linked Data clients and Linked Data-supplied element landing pages (see Fig. 4). Extensive stakeholder engagement is underway to ascertain client tooling requirements and, so far, 3 clients for LocI are being designed to meet the needs discovered:

  1. IderDown

    (a) for offline environment identifier use, see Sect. 4.2

  2. Genquery

    (a) point-and-click browsing visual graph explorer for LocI ontologies, establishing paths joining classes and creating SPARQLFootnote 23 queries based on the paths to execute against RDF data

  3. Excelerator

    (a) a data re-apportioning tool, uses dataset crosswalks (Linksets) to reapportion tabular data
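The re-apportioning idea behind the third client can be sketched as follows. The text does not describe how Excelerator weights its crosswalks, so the per-link weights used here (for example, the share of a source region's area falling in each target region) are an assumption, as are the function name and the sample figures.

```python
from collections import defaultdict


def reapportion(values, crosswalk):
    """Redistribute values from source regions to target regions.

    values: {source_region: number}
    crosswalk: iterable of (source_region, target_region, weight) links,
    where the weights for each source region are assumed to sum to 1.
    """
    totals = defaultdict(float)
    for source, target, weight in crosswalk:
        if source in values:
            totals[target] += values[source] * weight
    return dict(totals)


# Hypothetical table: population counted by statistical region.
population = {"region-A": 1000, "region-B": 400}

# Hypothetical crosswalk from statistical regions to catchments.
crosswalk = [
    ("region-A", "catchment-1", 0.75),
    ("region-A", "catchment-2", 0.25),
    ("region-B", "catchment-2", 1.0),
]
by_catchment = reapportion(population, crosswalk)
```

When the weights for each source region sum to 1, the grand total is preserved, which gives a simple sanity check on any crosswalk before it is used.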

4.7 Graph Caches

LocI clients work directly with the Linked Data APIs, SPARQL services or downloadable forms of the Datasets & Linksets or, for performance reasons, with cached graphs of content. Clients are built to prefer public access where possible, ensuring open data principles are honored. Cached graphs of some or all LocI information will likely also be available as LocI assets for general use, depending on provisioning costs.

So far, the cached graph tests are storing approx. 50M objects (all Dataset & Linkset objects) with approx. 40 triples per object, so approximately 2 billion triples. The feasibility of supporting open access to this resource has not yet been determined.

5 Tools

LocI has built and extended a number of tools to achieve its data publication so far. A non-exhaustive list follows.

pyLDAPI - Linked Data API 

LocI has tested and extended the Python Linked Data API, pyLDAPIFootnote 24, which has been added to Python’s rdflibFootnote 25 family of RDF manipulation tools. In addition to the content negotiation by profile supported by this tool (see Sect. 4.2), the mechanics it uses to list objects’ available views by providing an alternates view for every object it delivers, are informing the design of Content Negotiation by Profile [12].

Fig. 5. Components within the GNAF dataset’s Linked Data delivery system

Figure 5 shows pyLDAPI as deployed for the GNAF, with its internal components and their relation to the other parts of the dataset’s delivery system.

pyLDAPI client 

The pyLDAPI client softwareFootnote 26 was created for LocI and is being extended to act as a general-purpose Linked Data Registry harvester.

Persistent ID systems 

LocI’s precursor SIRF [5] and the LDWG’s persistent ID services used a custom “PID Service” toolFootnote 27, which LocI has now replaced with web server proxy tools. These toolsFootnote 28 are extremely simple and rely only on a common web serverFootnote 29 and its redirect module for operation. PID establishment is assisted by a series of redirection testing scripts which validate all deployed redirects. Redirect routing and API responses are tested multiple times a day to ensure all LocI assets perform.

RDF Validator 

A new SHACL validator tool, pySHACLFootnote 30, has been created and it too has been added to Python’s rdflib family of tools. Until the completion of this tool, there was no freely available SHACL validator software for mainstream programming languages like Python, which is required for integration within mainstream web architectures such as those likely to be implementable by non-research agencies. So far, this tool is the second most feature-complete SHACL tool, after the specification author’s own.

OWL reasoners 

To assist with inference experimentation, the LocI team has updated the OWL-RL inferencing toolFootnote 31 and published it as a software package within the rdflib family. It will be used alongside RDF databases capable of various forms of inference. It is also necessary for RDF validation, as pySHACL depends on it.

6 Conclusions

LocI represents a major investment in Semantic Web technologies to create a national-scale Spatial Data Infrastructure bridging spatial and observation data. It is a 2nd or 3rd generation Linked Data project benefiting from previous attempts in Australia and elsewhere to create a distributed yet interoperable collection of spatial datasets with accessible spatial objects. It is implementing new technical and social mechanisms to overcome issues, such as fragmented data governance and lack of authority, that have led to poor adoption of Linked Data outside research agencies. Critically, the project is being implemented collaboratively with operational data delivery agencies (including ABS, GA and DoAWR) to ensure alignment with individual organisational drivers, buy-in at the early stages of development, and capacity uplift, and so to smooth the transition of these prototype capabilities to their eventual organisational homes. Many new yet simple tools have been developed to allow for Linked Data work in common web infrastructures, further easing the path to adoption.

LocI has delivered its expected milestones to date (dataset delivery as Linked Data and Linkset generation) and the next test (by mid 2019) is full client testing. Subsequent years (2020, 2021) will see LocI stabilise operational systems and grow the pool of datasets.

LocI has also redeveloped Linked Data API tooling (now deployed 7+ times) and client tooling (deployed 3 times), and has produced the most feature-complete open source SHACL validatorFootnote 32 and an updated version of OWL-RL. The project thus contributes substantially to Linked Data capacity generally.