1 Introduction

\(\textit{JSON}\) (JavaScript Object Notation [1]) has become the standard format for representing data published and shared on the Internet. Although XML (eXtensible Mark-up Language [2]) was originally meant to be the universal format for the Internet, \(\textit{JSON}\) has become very popular, arguably more popular than XML.

However, \(\textit{JSON}\) (like XML) is a generic format. To represent specific data or documents, it is necessary to define specific document structures: GeoJSON [3] is one of the most famous examples of a well-structured format that relies on the \(\textit{JSON}\) syntax. GeoJSON was designed by the GIS (Geographical Information Systems) community to represent “information layers”, i.e., pools of spatial entities that, all together, provide homogeneous spatial information. Examples are roads, green areas, municipalities, and so on. The key aspect of GeoJSON information layers is that all the contained entities are “geo-tagged”, i.e., they are provided with their spatial description, also called “geometry”, which relies on a coordinate system such as WGS-84 [4].

After this brief presentation, the reader might think that collecting and managing GeoJSON documents is easy, since the advent of NoSQL (Not-only SQL) database technology [5] for storing \(\textit{JSON}\) data sets should provide significant help. Specifically, we are thinking of the category of “\(\textit{JSON}\) document stores” (i.e., databases whose data model is not the classical relational one), which are able to store and query “collections of \(\textit{JSON}\) documents”. Among them, MongoDB [6] is nowadays widely popular.

Unfortunately, things are not so easy, for various reasons. (i) A GeoJSON information layer is a single, possibly giant, \(\textit{JSON}\) document; consequently, querying it is not immediate, since spatial entities must first be unnested from within this single giant document. (ii) \(\textit{JSON}\) document stores are usually designed to deal with a large number of small \(\textit{JSON}\) documents. The result is that, often, large GeoJSON documents cannot even be stored within the \(\textit{JSON}\) document store. (iii) Spatial querying in \(\textit{JSON}\) document stores, when supported, can be performed only if \(\textit{JSON}\) documents are spatially indexed in advance, thus strongly limiting the flexibility of querying documents on the fly.

These considerations are, more or less, the same ones that inspired the idea of developing a novel framework to easily manage collections of \(\textit{JSON}\) documents, possibly stored within \(\textit{JSON}\) document stores. The J-CO Framework (see [7]) relies on a query language (named J-CO-QL \(^{+}\)) that provides high-level and declarative statements, which are specifically designed to perform complex transformations on collections of \(\textit{JSON}\) documents. The language is being extended with soft-querying capabilities (see [8]), by means of the evaluation of membership degrees of documents to fuzzy sets. Thus, the J-CO-QL \(^{+}\) language, when applied to GeoJSON documents, would provide analysts with the capability of soft querying entities in GeoJSON information layers; however, due to their intrinsically-complex structure, queries would turn out to be long and tedious to write.

In [9, 10], we argued that a Domain-Specific Language (DSL), specifically designed to query GeoJSON documents as a whole, based on an SQL-like syntax and natively dealing with soft querying, could significantly help analysts to perform complex soft queries on GeoJSON documents. In [9], we showed that a first version of this language could be easily translated into J-CO-QL \(^{+}\) queries, but it was limited to working on one single GeoJSON document. After that preliminary investigation, we named the language GeoSoft [10], extended it to support the join operation on GeoJSON documents and carefully redesigned it; we also implemented a translator of GeoSoft queries into J-CO-QL \(^{+}\) scripts.

This paper provides a comprehensive presentation of GeoSoft. The first contribution is the semantic model on which GeoSoft is built. The second contribution is the GeoSoft language itself: specifically, we present its syntax and clauses through a couple of information layers downloaded from a real Open-Data portal. The third contribution is to show that executing GeoSoft queries is possible; in particular, we show how to translate them into J-CO-QL \(^{+}\) scripts, which can actually be executed. The final contribution is the evaluation of the GeoSoft proposal from three different points of view, namely “flexibility”, “accessibility” and “efficiency”; as far as this last point is concerned, we present a comparison with the classical PostgreSQL/PostGIS solution.

The paper is organized as follows. Section 2 reports relevant related work. Section 3 presents the background of this work, i.e., NoSQL databases for storing \(\textit{JSON}\) documents (Sect. 3.1) and the J-CO Framework (Sect. 3.2). Section 4 introduces the GeoJSON format, by showing its adoption for representing two Geographical Information Layers that will be exploited throughout the paper. Section 5 presents the GeoSoft language, by discussing syntax and semantics through several examples. Section 6 addresses the problem of executing GeoSoft queries, by translating them into J-CO-QL \(^{+}\) scripts; thus, Sect. 6.1 briefly introduces J-CO-QL \(^{+}\) scripts, while Sect. 6.2 shows the translation strategy and the translation algorithm from GeoSoft to J-CO-QL \(^{+}\). Section 7 evaluates the GeoSoft proposal from several points of view; in particular, Sect. 7.1 evaluates “flexibility”, Sect. 7.2 evaluates “accessibility” and Sect. 7.3 evaluates “efficiency”. Finally, Sect. 8 draws the conclusions and highlights possible future work.

The Appendix completes the paper, for the interested readers, with detailed presentations of specific topics. In particular, Appendix A presents the \(\textit{JSON}\) format, Appendix B presents the GeoJSON format and Appendix C briefly reports basic concepts about fuzzy sets; Appendix D reports details about the JOIN operator in the GeoSoft language; Appendix E extensively discusses the J-CO-QL \(^{+}\) script obtained by translating the more complex GeoSoft query presented in this paper.

2 Related Work

In Sect. 4, we will show that a GeoJSON document is a particular case of \(\textit{JSON}\) document, with a specific structure: indeed, apart from the \(\textit{JSON}\) syntax, a GeoJSON document actually represents a set of spatial features. Consequently, it is reasonable to devise a query language specifically designed to query GeoJSON features; however, to the best of our knowledge, there are no other proposals in the literature, apart from the preliminary version of GeoSoft (see [9]). Consequently, we had to refer to proposals designed for working on standard \(\textit{JSON}\) documents. Two distinctive characteristics of GeoSoft are soft querying and geographical querying. The above considerations motivate the three sub-topics addressed in this section: (i) languages for expressing soft queries; (ii) languages for querying \(\textit{JSON}\) data; (iii) languages for querying geographical data.

This paper assumes that readers are familiar with basic concepts concerning fuzzy sets. For readers who are not, Appendix C reports a brief introduction to fuzzy sets.

2.1 Languages for Expressing Soft Queries

Soft querying relies on Fuzzy-Set Theory, an extension of classical Set Theory that was introduced by Zadeh in [11]: the idea is that items can belong to a set only partially (see Appendix C). As an effect, vague concepts can be expressed as fuzzy sets; thus, “soft selection conditions” can be expressed as fuzzy selection conditions [12], because the membership degree to a fuzzy set quantifies the closeness of the information carried by an item with respect to the condition. Possibility Theory [11, 13], together with the concept of linguistic variable defined within Fuzzy-Set Theory [14], provides a valuable formal framework for managing imprecise, vague and uncertain information [15].

Historically, the first attempts to develop languages for soft querying in databases were developed in the context of relational databases. Specifically, two main approaches were followed: (i) preserving the classical relational data model; (ii) extending the data model towards “fuzzy-relational data models”.

The rationale behind the first approach was to provide soft-querying capabilities on top of already-existing relational databases, which are still widely used within information systems. Thus, many extensions of SQL towards soft querying were proposed, which provide capabilities of soft querying table rows through Fuzzy-Set Theory. Here, we mention only a few of them, in particular SQLf (see [16, 17]) and the attempt to implement it (in [18]); the second proposal we mention is FQUERY for Access (see [19, 20]), which was designed to work within Microsoft Access; finally, we mention the proposal named SoftSQL (see [21,22,23]), which also covers the definition of user-defined “linguistic predicates” through a dedicated statement, to be used in soft selection conditions within the SELECT statement.

The second approach was to define a “fuzzy-relational data model” (refer to [24, 25]), i.e., an extension of the classical relational data model that was able to natively represent uncertain values and data within the database. Here, we mention FSQL, presented in [26, 27], which is the most remarkable proposal, in our opinion.

To the best of our knowledge, the most recent paper on this topic is [28]. In this work, the authors briefly present a library for PostgreSQL, named PostgreSQLf, which provides capabilities for dealing with fuzzy values within PostgreSQL. Although the description is concise, the library is still available for download. In our opinion, its main strength is also its weakness: it does not extend the query language, but provides functions to be called within classical SQL queries.

2.2 Languages for Querying \(\textit{JSON}\) Data

Many database systems specifically designed to store \(\textit{JSON}\) documents have appeared; they do not rely on the classical relational data model, but manage collections of \(\textit{JSON}\) documents, independently of their structure. Among them, the most popular is MongoDB, which is used for storing and querying large collections of small \(\textit{JSON}\) documents. Another important representative is CouchDB [29].

The advent of these kinds of database systems has led to the definition of many query languages for \(\textit{JSON}\) data sets. Popular representatives are Jaql [30], SQL++ [31] (adopted, for instance, by Couchbase), JSONiq [32] and the query language provided by MongoDB [5], which is generally called MQL. Certainly, we can say that the most used of them is MQL, due to the popularity of MongoDB.

In recent works, we can observe attempts to revive the ideas explored in the area of soft querying on top of relational databases (topic (i) in Sect. 2.1), by adapting them to the new context of \(\textit{JSON}\) document stores.

For example, the work [33] proposed fMQL, an extension of MQL (the MongoDB Query Language). The authors worked under the hypothesis that \(\textit{JSON}\) documents are previously labeled with “fuzzy labels”.

Another recent work [34] proposed an extension of the MongoDB data model towards a fuzzy \(\textit{JSON}\) document store, supporting fuzzy values in single fields.

In our extension of the J-CO-QL \(^{+}\) language towards soft querying [35, 36], we are substantially following the first approach, i.e., the query language is built on top of the standard \(\textit{JSON}\) data model (the reader will see more in the remainder of the paper).

2.3 Languages for Querying Geographical Data

GeoJSON is a standard format proposed by the community working on “Geographical Information Systems” (GIS); consequently, it is necessary to take a look at this research area as well. In this domain, much research work has been done, in particular addressing issues concerned with data storage, indexing and query optimization [37]; indeed, a GIS is supposed to store huge amounts of complex data, so as to graphically represent them. As a consequence, many extensions of the relational data model and of the SQL language have been proposed [38, 39]: the goal is to make Data Base Management Systems (DBMSs) able to store and retrieve spatial information, i.e., geo-referenced geometries that represent the spatial shape of entities described by data. Clearly, extensions of SQL are appreciated by users who are familiar with writing SQL queries, while they are not suitable for analysts who lack such a skill. Consequently, other proposals were made: tabular approaches [40] (that extend QBE, Query By Example), graphical languages [41, 42], visual languages [43], and “hypermaps” [44] (integrated into hypermedia techniques [45]) provide users with powerful graphical approaches to specify queries by directly operating on maps. An interesting and recent survey the reader can refer to is [46], which contains a comprehensive literature review of approaches to spatial-data management in non-relational databases.

However, \(\textit{JSON}\), and GeoJSON in particular, are changing the panorama. In fact, if geographical information is stored as GeoJSON documents, most systems provide little (or no) capability to query them in a geographical way. Indeed, very few query languages specifically target GeoJSON. We can cite GeoPQLJ [47]: it is a pictorial query language, which provides drawing facilities (relying on an underlying algebra) to formulate complex queries; this way, it should be possible to remove semantic ambiguity. However, although it works on GeoJSON documents, it does not explicitly exploit the peculiarities of GeoJSON documents, as GeoSoft does.

An early work that somehow anticipates GeoSoft is [48]. The authors proposed an extension of SQL for soft querying geographical data; the idea inspired the SoftSQL proposal [22] and, many years later, the J-CO-QL \(^{+}\) language [7, 8, 49] and, finally, the GeoSoft proposal presented in this paper.

3 Background

The goal of this section is to present the background on which GeoSoft relies. First of all, we present basic notions about NoSQL databases for natively storing \(\textit{JSON}\) data sets; then, we will introduce the J-CO Framework.

3.1 \(\textit{JSON}\) and NoSQL Databases

\(\textit{JSON}\) is the acronym for JavaScript Object Notation [1]. It was born in the context of the JavaScript object-oriented programming language, as a notation for object literals. However, its simplicity and immediacy have made it independent of the JavaScript language, as a powerful format to represent data with complex structures in a simple and easy-to-process way. Currently, \(\textit{JSON}\) is used in a multitude of applications and has become very popular.

In this paper, we assume that the reader is familiar with the \(\textit{JSON}\) format and the related terminology. Readers who are not can find a brief introduction to the \(\textit{JSON}\) format in Appendix A.

The term NoSQL stands for “Not only SQL” [50]. It denotes a variety of approaches to database systems that do not rely on the classical relational model (thus, they do not provide the SQL query language).

Among all NoSQL approaches, the category of “document database” is relevant for our work. A database system that falls into this category is able to store and retrieve “structured documents” (i.e., not plain text) whose structure is not fixed but can freely vary. Typically, such systems adopt \(\textit{JSON}\) as the format to represent documents, so they are named “\(\textit{JSON}\) document stores” (or simply “\(\textit{JSON}\) stores”).

The most popular \(\textit{JSON}\) store in the world is MongoDB [51], but other \(\textit{JSON}\) stores have become popular too, such as CouchDB [29] (which has been adopted as the database system for the HyperLedger Fabric permissioned blockchain platform [52]). Similarly, Elasticsearch [53] is not exactly a database system but an information-retrieval system; however, it receives and provides data as \(\textit{JSON}\) documents, so it can somehow be considered as a \(\textit{JSON}\) store.

Since MongoDB is very popular, its data model is popular too and is considered the reference data model for \(\textit{JSON}\) stores; it is based on the following two notions.

  • A “collection” is an unordered multi-set of heterogeneous documents, i.e., it can contain multiple copies of the same document. No limitation is posed on the structure of the contained documents: heterogeneous documents can be contained within the same collection.

  • A “database” is a “set of collections”. Each collection has a unique name within the database.

3.2 The J-CO Framework

Fig. 1: Organization of the J-CO Framework

Currently, there is no official standard for \(\textit{JSON}\) stores; as a result, data models are not perfectly compatible and, above all, query languages (when provided) are very different from each other. Consequently, analysts and data engineers have to face a very difficult scenario when they have to deal with multiple \(\textit{JSON}\) stores based on different technologies and query languages, in order to integrate data sets.

These considerations inspired us to develop the J-CO Framework [49, 54, 55]. It is a pool of software tools designed to provide analysts with a powerful support for gathering, integrating, transforming and querying collections of \(\textit{JSON}\) data sets. The core of the framework is its query language, named J-CO-QL \(^{+}\), which has been extensively presented in [8, 56,57,58]. The organization of the framework is depicted in Fig. 1; hereafter, we illustrate it.

  • J-CO-QL \(^+\) Engine. This component executes J-CO-QL \(^{+}\) queries (or scripts). It retrieves data from \(\textit{JSON}\) stores (for example, managed by MongoDB or Elasticsearch) and saves results into them; it also retrieves data sets from Web sources.

  • J-CO-UI. This is the user interface, by means of which J-CO-QL \(^{+}\) scripts are interactively written and submitted to the J-CO-QL \(^+\) Engine.

  • J-CO-DS. This component is a \(\textit{JSON}\) document store specifically designed to store large single documents [59], so as to overcome limitations of other \(\textit{JSON}\) stores (such as MongoDB, which does not accept documents that are larger than 16 MB in its BSON internal binary format). J-CO-DS does not provide computational capabilities such as a query language, because it is a component of the J-CO Framework, whose query language is J-CO-QL \(^{+}\). J-CO-DS has been recently upgraded [60] with new functionalities, to manage three different types of collections. (i) “Static collections” are persistently stored in the database. (ii) “Virtual collections” are not materialized within the database; they are defined by means of a pool of URLs that provide the data sets; when the collection is accessed, its content is dynamically built by calling the associated URLs. (iii) “Dynamic collections” are persistently stored in the database, but their content is dynamically acquired; a pool of URLs is associated with the collection, so that the engine periodically acquires new versions of the data (this way, data sets remain accessible to users even when the original sources are temporarily off-line). As a major result, J-CO-DS provides a database view of Web sources, which is a unique feature in comparison with other \(\textit{JSON}\) stores.

  • J-CO-BATCH has been recently added to the J-CO Framework [60]. Its goal is to provide “batch execution” of J-CO-QL \(^{+}\) scripts. By means of this tool, it is possible to create parametric templates of J-CO-QL \(^{+}\) scripts, whose execution can be scheduled, either one-shot or repeatedly in time.

The J-CO Framework originated from our participation in the Urban Nexus project [61], whose goal was to integrate Big Data coming from various sources, such as Open-Data portals and social media, so as to study citizens’ mobility. In that project, we developed several tools for integrating and analyzing data [62,63,64,65]. During that project, it became clear that the support provided by \(\textit{JSON}\) stores for managing \(\textit{JSON}\) data sets was too raw for analysts; indeed, we felt that higher-level tools were necessary. These are the origins of the J-CO Framework.

4 Geographical Information Layers as GeoJSON Documents

GeoJSON is an interchange format for spatial data. It is designed to represent geographical spatial entities (called “features”) and their non-spatial properties. GeoJSON adopts \(\textit{JSON}\) as its host syntactic format: this characteristic allows for processing GeoJSON documents as any other \(\textit{JSON}\) document. Specifically, it is suitable for encoding “geographical information layers”, i.e., aggregations of spatial entities that are somehow homogeneous or strongly correlated (e.g., roads, buildings, rivers, and so on). GeoJSON is independent of any geographical Coordinate Reference System (CRS); however, most GeoJSON documents implicitly adopt the World Geodetic System 1984 (WGS-84) and decimal-degree units.

Another advantage is that GeoJSON is a human-readable format, since it is a plain-text format; however, this characteristic makes it verbose. As a consequence, GeoJSON documents can become much larger than other formats for representing spatial data, such as Shapefile or GeoPackage.

For readers that are not familiar with the GeoJSON format, Appendix B provides a brief introduction to it. Essentially, a GeoJSON document, in its complete form, can be seen as a “collection of features”, where a “feature” has “registry properties” and a “geometry” (that is the spatial footprint of the feature). Figure 2 reports an excerpt of a GeoJSON document: the features field is an array that contains \(\textit{JSON}\) documents, one for each feature; each nested document contains the properties field and the geometry field.
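Since the figures reproducing GeoJSON documents are rendered as images, the following minimal skeleton recalls the structure just described; the property name and the coordinates are invented for illustration only, and an actual information layer, such as the one in Fig. 2, contains many features with much richer properties and geometries.

   { "type": "FeatureCollection",
     "features": [
       { "type": "Feature",
         "properties": { "name": "Sample entity" },
         "geometry": { "type": "Point", "coordinates": [9.19, 45.46] }
       }
     ]
   }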

Fig. 2: Excerpt of GeoJSON document representing municipalities (towns)

In order to provide a case study for presenting GeoSoft, we now present two information layers described by two GeoJSON documents, that will be exploited in the remainder of the paper.

Fig. 3: Towns in Lombardy Region. The thick lines denote the region borders. The medium-thick lines denote borders of provinces. The thin lines denote borders of cities. The black box, on the left, reports the properties contained in the GeoJSON feature representing the city of Milan (whose name is in red)

Information Layer 1

The GeoJSON document reported in Fig. 2 was downloaded from the Regione Lombardia Open-Data portal. It represents the information layer of the 1506 municipalities (or towns) in Lombardy, the Italian region that includes Milan and Bergamo. Consequently, each feature represents a municipality. Notice the root-level features field, which is an array of documents describing features; each feature contains properties (the properties field) and the geometry (the geometry field). To help the reader recognize the structure of the GeoJSON document, we enclosed each feature in a blue-border box and, inside each feature, we enclosed the properties and geometry fields in red-dashed-border boxes. An analogous highlighting strategy is used in the rest of the paper for the subsequent examples.

The reader can notice that each geometry field represents a MultiPolygon (see Appendix B), since, in the general case, the territory of a municipality may not be contiguous; implicitly, the WGS-84 Coordinate Reference System is adopted, as usual in GeoJSON.

Each properties field contains a wide variety of fields, including the name of the municipality (the nome_com field), the name of the region (the nome_reg field) and the name of the province (the nome_pro field) to which the municipality belongs; other fields report the length of the borders (the shape_len field, expressed in meters) and the area of the municipality (the shape_area field, expressed in square meters). For the sake of simplicity, we do not describe the other fields.

Finally, notice how the property names are quite unclear: partly in Italian and partly in English, often in contracted form. Notice also that all numerical field values are expressed as strings: this denotes poor data design and makes the values hard to exploit.
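To make the description concrete, the following fragment sketches one single feature of this layer; all the values are hypothetical and the geometry is heavily abridged, but the fragment reproduces the structural choices just discussed (in particular, numbers encoded as strings).

   { "type": "Feature",
     "properties": { "nome_com": "MILANO",
                     "nome_pro": "MILANO",
                     "nome_reg": "LOMBARDIA",
                     "shape_len": "51236.7",
                     "shape_area": "181673916.2" },
     "geometry": { "type": "MultiPolygon",
                   "coordinates": [ [ [ [9.04, 45.38], [9.28, 45.38],
                                        [9.28, 45.54], [9.04, 45.38] ] ] ] }
   }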

Fig. 4: Excerpt of GeoJSON document representing highways

Figure 3 shows the content of the GeoJSON document drawn on a map. The reader can notice the multitude of polygons that actually constitute municipalities.

Information Layer 2

The GeoJSON document reported in Fig. 4 describes highways in Lombardy (Italy); it was downloaded from the Regione Lombardia Open-Data portal as a Shapefile, and then converted into a GeoJSON document by means of the QGIS tool. The document describes 94 highway sections, each for one single direction, including highway junctions as well. Unlike the GeoJSON document that describes municipalities (in Fig. 2), at the root level we also find the name and the crs fields (see Appendix B): their goal is to give a name to the information layer and to explicitly set the Coordinate Reference System to WGS-84.

Each feature describes a highway; more precisely, it describes the section of the highway that traverses Lombardy. Notice that the geometry field represents a MultiLineString geometry (see Appendix B), since, in the general case, the path of a highway may not be contiguous (the same highway could exit and re-enter the region multiple times). The properties field provides registry data about highways, such as the identifier (the COD_PE field), the name (the NOME_PERCO field) and the total length in Lombardy (the SHAPE_LEN field, expressed in meters) of the highway section. Again, for the sake of simplicity, we do not describe the other fields.

From Fig. 4, the reader can notice that field names are again badly designed: partly in Italian and partly in English, with contracted forms; we also remark that, this time, the SHAPE_LEN field has a numerical value.
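Similarly, a minimal sketch of the root of this second document could look as follows; the values, as well as the exact encoding of the name and crs fields, are assumptions made for illustration (the actual ones are visible in Fig. 4).

   { "type": "FeatureCollection",
     "name": "highways",
     "crs": { "type": "name",
              "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
     "features": [
       { "type": "Feature",
         "properties": { "COD_PE": "A1",
                         "NOME_PERCO": "AUTOSTRADA A1",
                         "SHAPE_LEN": 56123.4 },
         "geometry": { "type": "MultiLineString",
                       "coordinates": [ [ [9.20, 45.41], [9.28, 45.33] ] ] }
       }
     ]
   }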

Fig. 5: Highways in Lombardy Region. The black box in the upper-left corner reports the properties contained in the GeoJSON feature representing the highway highlighted in purple in the center-bottom of the map

Figure 5 depicts the content of the presented GeoJSON document by drawing, in green, all the highway segments on a map. The reader can notice the purple-highlighted highway in the center-bottom of the map, whose properties are reported in the black box in the upper-left corner of the figure: this is the section of the highway named "A1" that traverses Lombardy, whose final destination is Naples; notice that the reported total length, in the SHAPE_LEN field, is about 56 km, while the actual total length of the "A1" highway is about 760 km.

5 GeoSoft

The language we propose is named GeoSoft. To make it easy for analysts to use, we adopted the well-known syntax of the SQL SELECT statement, although queries work on features within GeoJSON documents. In Sect. 5.1, we first present the semantic model on which the language relies. Then, in the remainder of this section, we present the GeoSoft language by exploiting the information layers introduced in Sect. 4.

5.1 Semantic Model

We now present the first contribution of the paper, i.e., the semantic model on which the GeoSoft language relies. Specifically, we distinguish between the “external semantic model” (i.e., how a query behaves on GeoJSON documents, independently of their storage or provenance) and the “internal semantic model” (i.e., how the query behaves internally); then, since we have to consider the application scope of the language, we complete the semantic model by defining its relation with databases managed by \(\textit{JSON}\) stores.

Fig. 6: Internal model of crisp and soft GeoJSON documents, with generic features, as they are managed by GeoSoft queries

5.1.1 External Semantic Model

We start by considering the external semantic model, i.e., how a GeoSoft query gsq is expected to behave, independently of how it is internally structured.

Definition 1

GeoJSON Feature. A (crisp) GeoJSON feature is a tuple

       \(f = \langle {\texttt {properties}}, {\texttt {geometry}} \rangle \)

where properties and geometry (see Sect. 4 and Appendix B) are \(\textit{JSON}\) documents describing, respectively, the properties and the geometry of the spatial entity described by the feature.

A feature is the elementary item a GeoSoft query is expected to work on.

Definition 2

GeoJSON Document. A (crisp) GeoJSON document gd is modeled as a possibly-empty set of features, i.e., \(gd = \{ f_1, \dots , f_n\}\). The domain of GeoJSON documents is denoted as GJ.

Figure 6a shows an example of (crisp) GeoJSON document, as it is intended in our semantic model. Notice that features are represented as tuples; the overall document is represented as a set of tuples.

Based on these definitions, it is possible to define the “external semantics” of a GeoSoft query, i.e., how the query is expected to behave outside the query itself.

Definition 3

External Query. Consider the domain GJ of GeoJSON documents. An external GeoSoft query gsq is a function

       \(gsq: GJ^n \rightarrow GJ\)

where \(GJ^n = GJ \times \dots \times GJ\) is the n-ary Cartesian product (with \(n > 0\)) on the GJ domain.

In the simplest form, i.e., with \(n=1\), the gsq query takes one single GeoJSON document as input; in the complex form, i.e., \(n > 1\), the gsq query works on multiple input GeoJSON documents. The gsq query always generates one single GeoJSON document as output.

5.1.2 Internal Semantic Model

We now define the semantic model that is internally followed by GeoSoft queries. Clearly, this semantic model takes soft querying into account.

Definition 4

Soft Feature. A soft feature \(\overline{f}\) is a feature for which membership degrees to some fuzzy sets are considered. A soft feature is defined as a tuple

       \(\overline{f}= \langle {\texttt {properties}}, {\texttt {geometry}}, {\texttt {fuzzysets}} \rangle \)

where properties and geometry are defined as in Definition 1. Each feature can belong to several fuzzy sets, each with a specific membership degree. In the \(\overline{f}\) tuple, the fuzzysets member is a key/value map that associates a fuzzy-set name fsn (the key) with the membership degree (the value) of the \(\overline{f}\) feature to the fuzzy set fsn.

The reader can notice that the universe of fuzzy sets is the domain of spatial features; in other words, we consider the membership of a spatial feature to a fuzzy set as a whole.

Introducing the concept of soft feature allows us to evolve the concept of GeoJSON document into the concept of “soft GeoJSON document”.

Definition 5

Soft GeoJSON Document. A soft GeoJSON document \(\overline{gd}\) is modeled as a possibly-empty set of soft features, i.e., \(\overline{gd} = \{ \overline{f}_1, \dots , \overline{f}_n\}\). The domain of soft GeoJSON documents is denoted as \(\overline{GJ}\).

As an example, Fig. 6b–d show different instances of soft GeoJSON documents, as intended in the semantic model. Notice that the generic feature in Fig. 6b has no membership degree (the fuzzysets map is empty); the generic feature in Fig. 6c has one membership degree (one entry in the fuzzysets map); finally, the generic feature in Fig. 6d has two membership degrees (two entries in the fuzzysets map).
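In a \(\textit{JSON}\)-like notation (used here only for illustration; it is not necessarily the internal encoding adopted by the engine), the three generic soft features of Fig. 6b–d could be written as follows, with hypothetical fuzzy-set names and membership degrees:

   { "properties": { ... }, "geometry": { ... }, "fuzzysets": { } }

   { "properties": { ... }, "geometry": { ... },
     "fuzzysets": { "MediumLengthHighways": 0.85 } }

   { "properties": { ... }, "geometry": { ... },
     "fuzzysets": { "MediumLengthHighways": 0.85,
                    "NotMediumLengthHighways": 0.15 } }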

Consequently, we argue that a GeoSoft query should work on soft GeoJSON documents internally, while externally it must work on crisp GeoJSON documents (as defined by Definition 3). Hereafter, the formal semantic framework is extended.

Definition 6

Internal Query. Consider the domain \(\overline{GJ}\) of soft GeoJSON documents. An internal GeoSoft query \(\overline{gsq}\) is a function

       \(\overline{gsq}: \overline{GJ}^n \rightarrow \overline{GJ}\)

where \(\overline{GJ}^n = \overline{GJ} \times \dots \times \overline{GJ}\) is the n-ary Cartesian product (with \(n > 0\)) of the \(\overline{GJ}\) domain.

Thus, it is necessary to create a bridge between the external and the internal semantics: crisp input GeoJSON documents gd must be transformed into soft GeoJSON documents \(\overline{gd}\) for the internal query, while the output soft GeoJSON document produced by the internal query must be translated back into a crisp GeoJSON document. This is done by the following definition.

Definition 7

Consider the domain GJ of crisp GeoJSON documents \(gd \in GJ\) and the domain \(\overline{GJ}\) of soft GeoJSON documents \(\overline{gd} \in \overline{GJ}\). An external GeoSoft query \(gsq: GJ^n \rightarrow GJ\) (with \(n > 0\)) is defined as

\(gsq \left( gd_1, \dots , gd_n\right) = toCrisp \left(\overline{gsq}\left(toSoft\left( gd_1\right) , \dots , toSoft\left( gd_n\right) \right)\right)\).

We make use of \(toCrisp: \overline{GJ} \rightarrow GJ\), which is a function that removes the fuzzysets member from soft features, thus obtaining crisp features.

We also make use of \(toSoft: GJ \rightarrow \ \overline{GJ}\), which is a function that extends features with an empty fuzzysets member.

The reader can notice that input GeoJSON documents are automatically “fuzzified”, i.e., translated to the domain \(\overline{GJ}\) of soft GeoJSON documents; in contrast, output soft GeoJSON documents are automatically “de-fuzzified”, i.e., translated to the domain GJ of crisp GeoJSON documents.

Figure 6b shows an example of soft GeoJSON document immediately after “fuzzification”: the generic feature has the empty fuzzysets map.
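As a minimal worked example of Definition 7, with \(n = 1\) and an invented membership degree, consider a crisp document \(gd = \{ \langle p_1, g_1 \rangle \}\). Then \(toSoft(gd) = \{ \langle p_1, g_1, \{\} \rangle \}\); suppose that the internal query \(\overline{gsq}\) evaluates a membership degree of 0.85 to a fuzzy set named MediumLengthHighways, thus returning \(\{ \langle p_1, g_1, \{ {\texttt {MediumLengthHighways}} \mapsto 0.85 \} \rangle \}\); finally, toCrisp removes the fuzzysets member, so that \(gsq(gd) = \{ \langle p_1, g_1 \rangle \}\).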

5.1.3 Queries and Database

To complete the semantic model, it is necessary to consider the database. Indeed, we consider a scope in which input GeoJSON documents are stored within a \(\textit{JSON}\) document store and output GeoJSON documents are saved into a \(\textit{JSON}\) document store. Thus, first of all we define the notion of database.

Definition 8

Consider a \(\textit{JSON}\) document store. A “collection” c is a multi-set \(c = \{ d_1, \dots , d_n \}\), where multiple instances of the same document are possible. A collection c has a property c.name.

A “database” db is a set of collections \(db = \{ c_1, \dots , c_h \}\), such that the name \(c_i.name\) (for a collection \(c_i \in db\)) uniquely identifies \(c_i\) (in other words, there cannot be two collections \(c_i, c_j \in db\) such that \(c_i.name =c_j.name\)).

The external GeoSoft query gsq was defined (see Definition 3) to operate on the domain GJ of (crisp) GeoJSON documents. However, the GeoSoft language must be defined in such a way that it takes GeoJSON documents from one or more databases and saves the output GeoJSON document into a database. The following definition introduces the final concept of “database query” dbq.

Definition 9

Database Query. Consider a “collection descriptor” \(cd = \langle cname, dbname, docname \rangle \), where cname is the name of a collection and dbname is the name of a database; if docname has a non-null value, it denotes the name of a GeoJSON document. A “database query” is defined as

       \(dbq\left( to, from_1, \dots , from_n\right) =\)

       \(= save\left( to, gsq\left(get\left( from_1\right) , \dots , get\left( from_n\right) \right)\right) \)

where to and \(from_1\), \(\dots \), \(from_n\) are collection descriptors. dbq is a procedure that receives \(n + 1\) (with \(n > 0\)) collection descriptors: to denotes the collection which the output GeoJSON document must be saved into, while \(from_1\) to \(from_n\) denote the collections from which the input GeoJSON documents must be acquired.

We make use of \(get(from_i)\) (with \(0 < i \le n\)), which is a function that actually accesses the \(from_i.dbname\) database and gets all GeoJSON documents from the \(from_i.cname\) collection. Specifically, non-GeoJSON documents (possibly contained in the collection) are not considered, while all GeoJSON documents actually present in \(from_i.cname\) are fused into one single GeoJSON document. Furthermore, if the \(from_i.docname\) has a non-null value, only GeoJSON documents whose name field (see Appendix B) equals \(from_i.docname\) are considered.

We also make use of \(save(to, gd)\), which is a procedure that actually saves the gd GeoJSON document into the to.cname collection in the to.dbname database (by dropping the previous instance of the collection, if present). Furthermore, if to.docname has a non-null value, the name field of the saved GeoJSON document (see Appendix B) is set to to.docname.

Through the above definition of database query dbq, we fully define the application scope of the GeoSoft language. Notice that the issue that source GeoJSON documents are contained within collections of \(\textit{JSON}\) documents is addressed by applying Definition 2: the set-oriented view of a GeoJSON document provides the way to handle multiple GeoJSON documents stored within the same input collection, since the features within them are united into one single GeoJSON document.
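For instance, a database query that acquires the GeoJSON documents stored in the highways collection of the geosoftDb database and saves its output into a collection named ProperHighways of the same database (these names anticipate the example in Sect. 5.2.1) can be written, in the notation of Definition 9, as

       \(dbq\left( \langle {\texttt {ProperHighways}}, {\texttt {geosoftDb}}, d \rangle , \langle {\texttt {highways}}, {\texttt {geosoftDb}}, null \rangle \right) \)

where d is the (possibly null) document name to be assigned to the output GeoJSON document.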

5.2 Simple Queries

We can now introduce the second main contribution of the paper, i.e., the GeoSoft language. Here, we present simple queries, i.e., queries that take one single GeoJSON document as input and produce a new GeoJSON document, without nested queries or joins.

5.2.1 Selection and Projection

Suppose that a user wants to query the GeoJSON document that describes highways (see Information Layer 2), which is depicted in Fig. 4. The goal of the user is: generating a new GeoJSON document that contains only highways whose length is greater than 3 km. Listing 1 reports the (simple) GeoSoft query. We present it hereafter.

First of all, notice the GET CONTEXT directive, by means of which it is possible to load the specified file, whose content is the “execution context”. This is a preamble written in J-CO-QL \(^{+}\): indeed, a GeoSoft query is translated into a J-CO-QL \(^{+}\) script and then executed by the J-CO-QL \(^+\) Engine; the notion of execution context allows us to exploit fragments of scripts written in J-CO-QL \(^{+}\), without replicating their constructs in GeoSoft.

Specifically, the GET CONTEXT directive in Listing 1 loads the jcoContextDb.jco file, whose content is reported in Listing 2: it declares the connection to the database (notice that the database is managed by J-CO-DS, the \(\textit{JSON}\) document store provided by the J-CO Framework, remember from Sect. 3.2).

The SELECT statement actually performs the query, as explained hereafter; a sketch of the whole query is reported after the following clause-by-clause description. Remember that the goal of the query is “generating a novel GeoJSON document that contains only features describing highways whose length is greater than 3 km”.

[Listing 1]
[Listing 2]
  • The FROM clause specifies the collection in the database where the document is stored. Specifically, in the query we refer to the highways collection, which is contained in the geosoftDb database. Notice the WITH NAME option: when specified, it allows us to focus the acquisition only on GeoJSON documents whose root-level name field (see Appendix B) has the specified value (discarding the other documents that are possibly present in the same collection).

  • The WHERE clause selects features in the GeoJSON document, by means of a classical Boolean condition. Specifically, features having the value for the SHAPE_LEN property greater than 3 km (the length is in meters) are selected.

  • The list that follows the SELECT keyword (in the following, we will refer to it as “the SELECT clause”) specifies the list of properties of interest; we can say that “features are projected on a subset of properties”. Specifically, features are projected on the COD_PE property (which is renamed as highwayId by the subsequent alias) and on the SHAPE_LEN property (which is renamed as highwayTotalLength).

  • Finally, the SAVE AS clause saves the resulting GeoJSON document into a database collection. Specifically, the document is saved into the ProperHighways collection in the geosoftDb database. Notice the optional SETTING NAME specification: if present, it adds the root-level name field to the output GeoJSON document (see Appendix B), with the specified value.
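Since Listing 1 is rendered as an image, the following plain-text sketch shows how such a query might look, based on the clauses described above; the keyword spelling, the clause layout and the values of the WITH NAME and SETTING NAME options are assumptions (in particular, the ON DATABASE notation used to denote the geosoftDb database is our guess), so the sketch must not be taken as a verbatim copy of the actual listing.

   GET CONTEXT "jcoContextDb.jco";

   SELECT .COD_PE    AS highwayId,
          .SHAPE_LEN AS highwayTotalLength
   FROM   highways ON DATABASE geosoftDb WITH NAME "highways"
   WHERE  .SHAPE_LEN > 3000
   SAVE AS ProperHighways ON DATABASE geosoftDb
        SETTING NAME "ProperHighways";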

The reader can notice that properties are referred to through the dot notation, e.g., “.SHAPE_LEN”. Remember that properties are within the nested properties sub-document; properties can be, in turn, nested sub-documents. Consequently, the first dot is rooted within the properties sub-document (member of the feature, see Definition 4).

As a final remark, the choice of relying on the basic SQL syntax is oriented towards users: indeed, they do not have to learn a completely new language.

5.2.2 Soft Querying

The query reported in Listing 1 does not exploit soft querying, while the query in Listing 3 does. As the reader can see, the basic structure of the query is the same; clearly, novel clauses are used to deal with fuzzy sets.

[Listing 3]

Suppose that a user wants to query the GeoJSON document describing highways (Information Layer 2), with the following goal: generating a novel GeoJSON document that contains features describing highways that are more or less medium-length highways; both the degree with which they satisfy the requirement and the degree with which they do not satisfy it are of interest. Hereafter, we show how the GeoSoft query reported in Listing 3 expresses this soft query; a sketch of the query is reported after the following clause-by-clause description.

  • The FROM clause specifies again the highways collection in the geosoftDb database. Remember from Definition 7 that features in the source GeoJSON document are converted into soft features (Definitions 4 and 5) by adding the fuzzysets key-value map.

  • The WHERE clause selects those features whose value of the SHAPE_LEN property is greater than 3000 meters.

  • The subsequent USING clause works on the set of features selected by the WHERE clause and evaluates “soft conditions”, so as to compute the membership degree of features to specific fuzzy sets. The general syntax of this clause is as follows:

       USING softCondition FOR FUZZY SET fuzzySetName
          (, softCondition FOR FUZZY SET fuzzySetName)*

    i.e., the clause can specify several branches. In each branch, softCondition is a condition whose terms are fuzzy-set names and “fuzzy operators” (user-defined tools that evaluate membership degrees from properties; they will be presented hereafter); the resulting membership degree is associated with the fuzzySetName specified after the FOR FUZZY SET keywords. Specifically, two soft conditions are specified in Listing 3: the first one evaluates the membership degree to the MediumLengthHighways fuzzy set, to obtain the degree of satisfaction of the request; the second one evaluates the membership degree to the NotMediumLengthHighway fuzzy set, to obtain the degree of non-satisfaction; notice that both soft conditions exploit the J-CO-QL \(^{+}\) MediumLengthHighwayOp fuzzy operator, which is defined in the jcoContextOp1.jco file (reported in Listing 4 and discussed hereafter). Referring to Definition 4, the key fuzzySetName is added to the fuzzysets map; the value is the one obtained by evaluating the softCondition. Figure 6b, c show such an evolution of the fuzzysets map.

  • The ALPHACUT clause specifies a minimum threshold for the given fuzzy-set name(s), so as to select only those features whose membership degree is no less than the specified threshold(s). Specifically, only features whose membership degree to the MediumLengthHighways fuzzy set is no less than 0.8 are selected. This way, also highways that are not evaluated as fully medium-length, but to a sufficiently high degree, are selected.

  • The SELECT clause projects features only on the properties of interest, by renaming them. Notice the wanted property, whose value is the membership degree to the MediumLengthHighways fuzzy set; the value is obtained through the MEMBERSHIP_TO built-in function; in the same way, the notWanted property is derived by taking the membership degree to the NotMediumLengthHighway fuzzy set, to denote the degree of non-satisfaction of the request. In fact, when the output GeoJSON document is generated, features are “de-fuzzified”; however, the membership degree is useful to know the relevance of the feature with respect to the query, thus the only way to preserve it is to add a property with its value.

  • Finally (remember Definition 7), the GeoJSON document is “de-fuzzified” (i.e., the fuzzysets map is removed from features) and the resulting crisp GeoJSON document is saved into the MediumLengthHighways collection in the geosoftDb database. Again, the SETTING NAME option defines the value for the root-level name field in the output GeoJSON document.
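As for Listing 1, we sketch how the query in Listing 3 might look; again, the keyword spelling, the negation in the second USING branch, the ALPHACUT notation and the SETTING NAME value are assumptions made for illustration.

   GET CONTEXT "jcoContextDb.jco";
   GET CONTEXT "jcoContextOp1.jco";

   SELECT .COD_PE    AS highwayId,
          .SHAPE_LEN AS highwayTotalLength,
          MEMBERSHIP_TO(MediumLengthHighways)   AS wanted,
          MEMBERSHIP_TO(NotMediumLengthHighway) AS notWanted
   FROM   highways ON DATABASE geosoftDb
   WHERE  .SHAPE_LEN > 3000
   USING  MediumLengthHighwayOp(.SHAPE_LEN)
              FOR FUZZY SET MediumLengthHighways,
          NOT MediumLengthHighwayOp(.SHAPE_LEN)
              FOR FUZZY SET NotMediumLengthHighway
   ALPHACUT 0.8 ON MediumLengthHighways
   SAVE AS MediumLengthHighways ON DATABASE geosoftDb
        SETTING NAME "MediumLengthHighways";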

The novel USING and ALPHACUT clauses are optional, as is the WHERE clause: if they are present, they operate on the soft GeoJSON document generated by the previous clause (the WHERE clause, if present, otherwise the FROM clause). In particular, notice the clear distinction between crisp conditions (in the WHERE clause) and soft conditions (in the USING clause): this way, there is no semantic ambiguity, because crisp conditions keep their well-known crisp behavior; furthermore, soft conditions are not extensions of crisp conditions, but are specified by a novel construct, clearly distinct from crisp conditions.

Thus, through basic queries it is possible to specify complex soft queries on GeoJSON documents, even based on the evaluation of multiple membership degrees to fuzzy sets.

[Listing 4]

Defining Fuzzy Operators. In J-CO-QL \(^{+}\), the key to evaluating membership degrees of \(\textit{JSON}\) documents from their fields is the concept of “fuzzy operator”: it is a user-defined operator whose goal is to compute a membership degree, on the basis of the needs of the users. In GeoSoft, we decided to rely on fuzzy operators as they are provided by J-CO-QL \(^{+}\): this way, we avoid introducing a construct that would exactly replicate the J-CO-QL \(^{+}\) construct for defining fuzzy operators. Clearly, this decision is justified by the fact that a GeoSoft query is translated into a J-CO-QL \(^{+}\) script, to be executed.

As far as a GeoSoft query is concerned, the definitions of fuzzy operators are part of its execution context, as are the connections to databases. Consequently, a file specified in the GET CONTEXT directive can contain J-CO-QL \(^{+}\) definitions of fuzzy operators. In Listing 3, the second GET CONTEXT directive loads the jcoContextOp1.jco file, which contains the definition of the MediumLengthHighwayOp fuzzy operator, as reported in Listing 4. Hereafter, we present it in detail.

The operator is named MediumLengthHighwayOp, because it will be used to evaluate if and how much a document (feature) describes a medium-length highway.

  • The PARAMETER clause defines input formal parameters; their actual value will be provided by expressions on document fields.

  • The PRECONDITION clause expresses a condition to be met before evaluating the membership degree; if the condition is not met, the evaluation is stopped. Specifically, the precondition is met if the length parameter is greater than 0.

  • The EVALUATE clause evaluates a mathematical expression on parameters; the resulting value will constitute the basis for obtaining the actual membership degree. In line 2, the expression is very simple: it just contains the length parameter as it is.

  • The last clause is the POLYLINE clause: it defines a membership function whose co-domain is [0, 1]. Figure 7 depicts the polyline, for the sake of clarity: given the x-axis value provided by the EVALUATE clause, the corresponding y-axis value is returned as membership degree. If the x-axis value is less than (respectively, greater than) the minimum (respectively, maximum) x-axis breakpoint, the y-axis value corresponding to the minimum (respectively, maximum) x-axis value is returned.

Fig. 7: Membership function of the MediumLengthHighwayOp fuzzy operator (defined in Listing 4)

GeoSoft has inherited fuzzy operators from J-CO-QL \(^{+}\). In our previous works [8, 36, 56, 60], we presented several examples of fuzzy operators that rely on complex polylines. Traditionally, triangular and trapezoidal functions are considered in the literature, because they are intuitive to use; however, we argued that providing users with the possibility to define more complex (and not necessarily convex) shapes could increase the degree of flexibility for power users. As an example, in [56], we exploited this feature to compensate for anomalous behaviours of the Jaro–Winkler string-similarity metric. Clearly, users who are familiar with trapezoidal functions can still define them.
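For readers who cannot see Listing 4, the following sketch gives the flavor of a J-CO-QL \(^{+}\) fuzzy-operator definition with the four clauses described above; the statement keyword, the clause layout and, above all, the polyline breakpoints are invented for illustration and do not reproduce the actual operator (whose membership function is the one shown in Fig. 7).

   CREATE FUZZY OPERATOR MediumLengthHighwayOp
     PARAMETER    length
     PRECONDITION length > 0
     EVALUATE     length
     POLYLINE     [ (3000, 0.0), (10000, 1.0), (20000, 1.0), (30000, 0.0) ];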

5.3 Nested Queries

When queries become more and more complex, “nested queries” can significantly help either to keep complexity under control or to specify sophisticated transformations (as in classical SQL). GeoSoft allows nested queries to be specified only in the FROM clause.

The query reported in Listing 5 derives from the one presented in Listing 3; consequently, the addressed problem is the same.

The query relies on two nested queries. The inner-most query is a crisp query, since no membership degree is evaluated for features; in contrast, the intermediate one is a soft query. This difference is only apparent, since the semantic model is the same: when loaded, features become soft features, so they have the fuzzysets map; however, this map remains empty when no membership degree is evaluated. The FROM clause of the outer query receives a soft GeoJSON document from the nested query, independently of the fact that the nested query is crisp or not.

[Listing 5]

Hereafter, we explain the query reported in Listing 5; a structural sketch of the query is given after the following list.

  • The innermost query retrieves the highways GeoJSON document from the database; features are projected on the highwayId field and on the highwayTotalLength (obtained by renaming the SHAPE_LEN property). The fuzzysets map of features in the output GeoJSON document is empty.

  • The intermediate query evaluates the membership degrees to the MediumLengthHighways and to the NotMediumLengthHighways fuzzy sets; the resulting soft GeoJSON document contains all the features selected by the WHERE clause; as a result, the fuzzysets map of each feature has two entries.

  • The outermost query receives the soft features produced by the intermediate query and selects those of interest, i.e., those having a membership value no less than 0.8, and generates the output crisp GeoJSON document.
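The following skeleton (with assumed keyword layout and with the property lists elided) illustrates the nesting structure of Listing 5; only the shape of the query matters here, not the specific keywords.

   SELECT ...
   FROM ( SELECT ...
          FROM ( SELECT .COD_PE    AS highwayId,
                        .SHAPE_LEN AS highwayTotalLength
                 FROM   highways ON DATABASE geosoftDb )
          WHERE .highwayTotalLength > 3000
          USING MediumLengthHighwayOp(.highwayTotalLength)
                    FOR FUZZY SET MediumLengthHighways, ... )
   ALPHACUT 0.8 ON MediumLengthHighways
   SAVE AS ... ON DATABASE geosoftDb;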

Again, the semantic model is still the same. We can see that the internal semantics \(\overline{gsq}\) corresponds to the GeoSoft clauses from SELECT to ALPHACUT. When a collection name is specified in the FROM clause, the input crisp GeoJSON document is implicitly transformed into a soft GeoJSON document. Finally, the SAVE AS clause is allowed only in the dbq database query, because (i) it transforms the output soft GeoJSON document into a crisp GeoJSON document and (ii) it saves the output crisp GeoJSON document into the database.

We can show that composition is encompassed by the semantic model. Indeed, if we denote the innermost, intermediate and outermost internal queries as \(\overline{gsq}_i\), \(\overline{gsq}_m\) and \(\overline{gsq}_o\), respectively, the external query can be written as (starting from Definition 7):

\(gsq (gd) = toCrisp(\overline{gsq}_o(\overline{gsq}_m(\overline{gsq}_i(toSoft(gd)))))\).

Considering the overall database query (Definition 9), the full semantic expression is the following one:

\(dbq(to, from) =\)

\(= save(to, toCrisp(\overline{gsq}_o(\overline{gsq}_m(\overline{gsq}_i(toSoft(get(from)))))))\).

5.4 JOIN Queries

A key success factor for a query language on GeoJSON documents is certainly the capability to integrate them. The construct we introduced to this end is the JOIN operator. This operator is a completely novel contribution, in comparison to our previous works on GeoSoft (see [9, 10]).

Syntax The JOIN operator is allowed in the FROM clause (as in the classical SQL). The syntax is as follows:


source\(_1\) AS \(n_1\) JOIN source\(_2\) AS \(n_2\)

   ON [GEOMETRY spatialCond]

      [PROPERTIES propertyCond]

   SET GEOMETRY geometrySpec

        [SET FUZZY SETS fuzzySetSpec]

that we explain hereafter.

  • With source\(_1\) and source\(_2\) we denote either a collection name or a nested query. A collection name can be aliased by specifying a name after the AS keyword; a nested query must necessarily be aliased.

  • The ON keyword is followed by two distinct join conditions: the first one is introduced by the GEOMETRY keyword and specifies a join condition on geometries of features (it is denoted as spatialCond); the second one is introduced by the PROPERTIES keyword and specifies the join condition on properties of features (it is denoted as propertyCond). One of the two conditions can be omitted, but at least one of them must be specified.

  • The SET GEOMETRY clause specifies how to determine the geometry of the output features (denoted as geometrySpec). We will say more in Appendix D.2.

  • The SET FUZZY SETS clause is optional; if present, it provides a way to choose which membership degrees to keep from those of the source features. Since it is quite complex, it is not fully described here; Appendix D.2 presents it in detail.

The semantics of the crisp JOIN operator is presented in detail in Appendix D. The interested reader can refer to it for a full understanding.

Example Listing 6 reports a simple GeoSoft query that illustrates how to exploit soft spatial functions. The query looks for pairs of municipalities that share a significant part of their border. The query is explained hereafter; a sketch of how it might be written is reported after the following list.

  • The FROM clause joins the GeoJSON document stored within the towns collection (see Information Layer 1) with itself. The resulting features pair two municipalities if their borders touch. The condition on properties that is specified after the PROPERTIES keyword is necessary to avoid coupling of a municipality with itself.

  • The membership degrees to two fuzzy sets are evaluated by the SET FUZZY SETS clause. They make use of the HOW_MEET function, to obtain the degrees with which the left (respectively, right) border meets the right (respectively, left) border. These fuzzy sets cannot be evaluated later, because the HOW_MEET function works on the source geometries, which will be lost after the FROM clause. Notice that HOW_MEET is an example of fuzzy spatial relationship between geometries.

  • Neither the WHERE clause nor the USING clause is present in the query; indeed, the ALPHACUT clause suffices to select the features of interest, i.e., those features whose membership degree to the BorderingLeft fuzzy set is no less than 0.4.

  • The SELECT clause projects on the names of the two municipalities, as well as on the membership degrees to the two evaluated fuzzy sets, renamed as SharedBorder1 and SharedBorder2. Finally, the output crisp GeoJSON document is saved into the BorderingTowns collection.
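We also sketch how the query in Listing 6 might be written, by instantiating the JOIN syntax given above; the MEET predicate, the arguments of HOW_MEET, the name of the BorderingRight fuzzy set, the UNION geometry choice and the keyword layout are all assumptions made for illustration.

   SELECT .T1.nome_com AS Town1,
          .T2.nome_com AS Town2,
          MEMBERSHIP_TO(BorderingLeft)  AS SharedBorder1,
          MEMBERSHIP_TO(BorderingRight) AS SharedBorder2
   FROM   towns ON DATABASE geosoftDb AS T1
          JOIN towns ON DATABASE geosoftDb AS T2
          ON  GEOMETRY   MEET
              PROPERTIES .T1.nome_com <> .T2.nome_com
          SET GEOMETRY   UNION
          SET FUZZY SETS BorderingLeft  = HOW_MEET(T1, T2),
                         BorderingRight = HOW_MEET(T2, T1)
   ALPHACUT 0.4 ON BorderingLeft
   SAVE AS BorderingTowns ON DATABASE geosoftDb;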

Semantics The semantics of the JOIN operator is as follows.

  • Consider two soft GeoJSON documents \(\overline{gd}_1\) and \(\overline{gd}_2\). The JOIN operator generates a new soft GeoJSON document, here denoted as \(\overline{gd}_3\). Then, consider two features \(\overline{f}_1 \in \overline{gd}_1\) and \(\overline{f}_2 \in \overline{gd}_2\). A new \(\overline{f}_3\) feature is generated.

  • \(\overline{f}_3.\)properties contains two properties: the \(n_1\) (alias of the first source GeoJSON document) property has the value \(\overline{f}_1.\)properties; the \(n_2\) (alias of the second source GeoJSON document) property has the value \(\overline{f}_2.\)properties.

  • If spatialCond is specified (after the GEOMETRY keyword), the spatial condition is evaluated on the two source geometries \(\overline{f}_1.\)geometry (i.e., left geometry) and \(\overline{f}_2.\)geometry (i.e., right geometry). Several predicates are admitted, such as INTERSECT, MEET, and so on. A detailed introduction is reported in Appendix D.

  • If propertyCond is specified (after the PROPERTIES keyword), it is evaluated on the properties of the novel \(\overline{f}_3\) feature, i.e., on \(\overline{f}_3.\)properties. Classical comparison predicates and mathematical expressions are admitted.

  • \(\overline{f}_3.\)geometry assumes the value specified by geometrySpec after the SET GEOMETRY keywords, such as INTERSECTION, UNION, and so on (see Appendix D for a detailed introduction).

  • \(\overline{f}_3\).fuzzysets is populated as specified by the SET FUZZY SETS clause. GeoSoft explicitly addresses the membership degrees associated with the features in the soft GeoJSON documents that are input to the JOIN operator. Specifically, the JOIN operator has to deal with the following issues.

    • Dealing with joined features that belong to the same fuzzy set. When the joined features belong to the same fuzzy set (possibly with different membership degrees), this situation must be managed so as to assign the proper membership degree to the output feature.

    • Selecting the fuzzy sets of interest. Not all the fuzzy sets for which the joined features have known membership degrees are necessarily of interest for the output features; only a subset of them should be retained.

    • Deriving membership degrees to novel fuzzy sets. Fuzzy spatial relationships that concern the coupling of two features must be evaluated when the features are paired. For example, the degree of inclusion of one geometry within the other cannot be evaluated later, because the source geometries are lost.

    Appendix D.2 presents details about the soft part of the JOIN operator; the interested reader can refer to it for a full understanding.

  • The novel \(\overline{f}_3\) feature is inserted into the output \(\overline{gd}_3\) soft GeoJSON document if all the specified conditions (i.e., either spatialCond or propertyCond or both) are satisfied. If at least one of spatialCond and propertyCond is false, \(\overline{f}_3\) is not inserted into \(\overline{gd}_3\).

Notice that the semantics of the JOIN operator is perfectly integrated with the semantic model introduced in Sect. 5.1. Indeed, independently of whether the FROM clause contains one single source or a JOIN of sources, its output is always a soft GeoJSON document, on which the subsequent clauses are evaluated. A minimal sketch of this pairwise semantics is reported below.
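The following Python sketch is our own illustration of this pairwise semantics (the dict-based representation of features and all names are assumptions, not the engine code): for every pair of input features, the specified crisp conditions are evaluated and, only if they all hold, an output feature is built whose properties nest the two source property sets under the aliases n1 and n2.

# Illustrative sketch of the crisp part of the JOIN semantics described above.
def join(gd1, gd2, n1, n2, spatial_cond, property_cond, geometry_spec):
    # gd1, gd2: lists of features, each a dict with 'properties' and 'geometry'.
    gd3 = []
    for f1 in gd1:
        for f2 in gd2:
            # properties of the candidate output feature: two nested objects
            props = {n1: f1["properties"], n2: f2["properties"]}
            ok = True
            if spatial_cond is not None:        # condition on the source geometries
                ok = ok and spatial_cond(f1["geometry"], f2["geometry"])
            if property_cond is not None:       # condition on the new properties
                ok = ok and property_cond(props)
            if ok:                              # all specified conditions hold
                gd3.append({
                    "properties": props,
                    "geometry": geometry_spec(f1["geometry"], f2["geometry"]),
                })
    return gd3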

5.5 A Complete Example

To conclude the introduction to GeoSoft, we show a complete example that exploits the main characteristics of the language.

Suppose that the user wants to find out those municipalities in the province of Milan that can be considered “medium towns”, whose territory is crossed by “medium-length highways”, such that the segment that traverses the municipality is a significant part of the overall highway. The GeoSoft query that solves this problem is reported in Listing 7.

As usual, the beginning of Listing 7 includes the GET CONTEXT directives. Notice that now all previous context files are specified, together with two novel files, named jcoContextOp2.jco (reported in Listing 8) and jcoContextOp3.jco (reported in Listing 9).

Listing 8 defines the RelevantPortionOp fuzzy operator. It receives two formal parameters, named lengthInTown and totalLength: the former is the length of the highway fragment that traverses a given municipality, the latter is the total length of the highway in the region. Once the precondition is satisfied, the EVALUATE clause computes the percentage of lengthInTown over totalLength; this value is the x-axis value used to obtain the final membership degree from the membership function defined by the POLYLINE clause. The membership function is depicted in Fig. 8: notice that it is 0 up to \(1\%\), then it starts increasing, reaching 1 at \(5\%\) (meaning that we consider a highway fragment as fully interesting if its length within a municipality is at least \(5\%\) of the overall length in the region).
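To illustrate how such a polyline membership function can be evaluated, the following Python sketch linearly interpolates between break points; the break points used here only mirror the textual description above (0 up to 1%, rising to 1 at 5%) and are an assumption, not the exact content of Listing 8.

# Illustrative sketch: piecewise-linear (polyline) membership function.
def polyline_membership(x, points):
    # points: list of (x, degree) pairs sorted by x; degrees are clamped outside the range.
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)   # linear interpolation

# Shape described in the text for RelevantPortionOp: 0 up to 1%, 1 from 5% onwards.
relevant_portion = [(0.0, 0.0), (1.0, 0.0), (5.0, 1.0)]

percentage = 100 * 1200 / 40000   # lengthInTown / totalLength, as a percentage
print(polyline_membership(percentage, relevant_portion))   # 3% -> 0.5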

Listing 9 reports the content of the jcoContextOp3.jco file. It contains the definition of a third fuzzy operator, named MediumTownOp. It is a simple fuzzy operator, which receives the area of a municipality to evaluate whether and to what extent the municipality is a “medium town”; Fig. 9 depicts the polyline defined as its membership function. Again, remember that the GET CONTEXT directives exploit J-CO-QL \(^{+}\) scripts as “execution context”, in particular database connections and declarations of fuzzy operators; clearly, the content of these files is not part of the GeoSoft language, but it will be inserted into the script obtained by translating the query into a J-CO-QL \(^{+}\) script (see Sect. 6 and Appendix E.1).

Now, we are ready to explain the GeoSoft query.

Fig. 8 Membership function for the fuzzy operator named RelevantPortionOp (in Listing 8)

Fig. 9 Membership function for the fuzzy operator named MediumTownOp (in Listing 9)

Fig. 10 Excerpt of the GeoJSON document produced by the GeoSoft query reported in Listing 7

  • The blue box highlights the first nested query. Its goal is to work on the features in the GeoJSON document describing towns (presented in Information Layer 1 and depicted in Fig. 2), stored within the towns collection in the geosoftDb database. The fuzzy operator named MediumTownOp is called in the USING clause, to evaluate the membership degree to the fuzzy set named MediumTowns. In short, the goal of the nested query is to establish whether the municipalities in the province of Milan are likely to be considered medium towns. Thus, the output soft features have a fuzzysets map with one single entry, i.e., the membership degree to the MediumTowns fuzzy set. The output soft GeoJSON document is aliased as t.

  • The green box highlights the second nested query. It starts from the GeoJSON document describing highways (presented in Information Layer 2 and depicted in Fig. 4), which is stored within the highways collection in the geosoftDb database. The nested query evaluates whether and to what extent a feature describes a medium-length highway (see the two evaluated fuzzy sets, i.e., MediumLengthHighways and NotMediumLengthHighways). Consequently, the soft features in the output soft GeoJSON document have two entries in the fuzzysets member. The output soft GeoJSON document is aliased as h.

  • The red dashed box highlights the JOIN expression. First of all (see the ON GEOMETRY clause), features in the two soft GeoJSON documents (aliased as t and h) are joined based on their geometries: features are paired if their geometries intersect. If the pairing succeeds, the intersection of the source geometries becomes the geometry of the novel feature (as specified by the SET GEOMETRY clause). Remember that the properties in the novel feature are t (containing all properties coming from the left feature) and h (containing all properties coming from the right feature). Finally, the SET FUZZY SETS clause specifies which membership degrees to fuzzy sets the novel feature must have: the membership degree to the MediumLengthHighways fuzzy set is taken from the right feature (the h soft GeoJSON document), the membership degree to the MediumTowns fuzzy set is taken from the left feature (the t soft GeoJSON document), while the membership degree to the NotMediumLengthHighways fuzzy set is taken from the right feature (see Appendix D for details about the SET FUZZY SETS clause). This way, the pool of soft features to evaluate in the subsequent root-level clauses has been computed.

  • The WHERE clause selects those composite features that describe municipalities (coming from the first nested query, in the province of Milan) that are crossed by a true highway, i.e., a highway segment whose length is greater than 3000 m (so as to discard connections, whose length is a few hundred meters). This selection can be performed only in a crisp way, thus it is specified in the WHERE clause.

  • The USING clause actually performs the soft query. Specifically, there are two branches, because the membership degrees to two distinct fuzzy sets are evaluated. The first fuzzy set evaluated by the USING clause is named RelevantSegments; evaluated through the RelevantPortionOp fuzzy operator (reported in Listing 8), the resulting membership degree denotes whether a significant portion of the highway crosses the municipality. The second branch evaluates the membership degree to the Wanted fuzzy set: given a feature, its membership degree denotes the degree with which it satisfies the request. The soft condition is a fuzzy AND among three previously computed fuzzy sets, i.e., MediumTowns, MediumLengthHighways and RelevantSegments: this way, the soft condition linguistically expresses what the user is looking for, i.e., features that represent a medium town crossed by a medium-length highway, such that a significant portion of the highway crosses the town territory (the resulting membership degree is the minimum among the three mentioned fuzzy sets; a minimal sketch of this combination is reported after this list).

  • The ALPHACUT clause selects only those features whose satisfaction degree is no less than 0.8, so as to keep only those features with a very-high satisfaction degree.

  • Finally, the SELECT clause flattens the structure of properties. Furthermore, it adds two extra properties: the first one is named highwayLengthInTown and its value is provided by the GEOMETRY_LENGTH built-in function (which returns the length, in meters, of the geometry); the second one is named wanted and its value is the membership degree to the Wanted fuzzy set, obtained through the MEMBERSHIP_TO built-in function.

  • The SAVE AS clause transforms the final soft GeoJSON document into a crisp one and saves it into the collection named highwayTowns, within the geosoftDb database.
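A minimal Python sketch of this min-based combination, followed by the alpha-cut applied next, is reported below; the feature values are invented for illustration and only the fuzzy-set names mirror those used in Listing 7.

# Illustrative sketch: fuzzy AND as minimum, followed by an alpha-cut at 0.8.
features = [
    {"name": "A", "fuzzysets": {"MediumTowns": 0.9, "MediumLengthHighways": 0.85,
                                "RelevantSegments": 0.95}},
    {"name": "B", "fuzzysets": {"MediumTowns": 0.9, "MediumLengthHighways": 0.6,
                                "RelevantSegments": 0.95}},
]

for f in features:
    fs = f["fuzzysets"]
    # membership to Wanted = fuzzy AND (minimum) of the three fuzzy sets
    fs["Wanted"] = min(fs["MediumTowns"], fs["MediumLengthHighways"], fs["RelevantSegments"])

selected = [f for f in features if f["fuzzysets"]["Wanted"] >= 0.8]   # ALPHACUT at 0.8
print([f["name"] for f in selected])   # ['A']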

Fig. 11 GeoJSON document produced by the GeoSoft query reported in Listing 7, drawn on a map. The black box in the upper-left corner reports the properties contained in the GeoJSON feature representing the highway segment highlighted in purple in the upper-left part of the map

The reader can notice that, although the query is not trivial, it is not so complicated: in particular, the choice to rely on the well-known syntax of the SQL SELECT statement, as well as the key semantic choice of viewing a GeoJSON document as a set of features, greatly simplifies conceiving and writing the query for analysts who are used to writing SQL queries.

Figure 10 reports an excerpt of the GeoJSON document produced by the GeoSoft query in Listing 7, when applied to the data sets described in Information Layer 1 and Information Layer 2, while Fig. 11 shows the same GeoJSON document drawn on a map. In Fig. 11, the reader can notice that the highlighted (purple) feature is a non-continuous line. This is the effect of intersecting a line representing a highway with the area of a town, which can have a non-convex shape. The GeoJSON format supports these cases by means of the MultiLineString geometry type (see Appendix B).

6 From GeoSoft to J-CO-QL \(^{+}\)

GeoSoft works on a specific type of \(\textit{JSON}\) documents, i.e., GeoJSON documents. J-CO-QL \(^{+}\) (the query language of the J-CO Framework) is actually able to manage GeoJSON documents, since it is designed for manipulating any kind of \(\textit{JSON}\) document; however, doing so is quite complex, and this is why we conceived the idea of developing GeoSoft. Indeed, the advantage provided by GeoSoft is that it is explicitly designed to query features within GeoJSON documents; consequently, queries are much easier to write than the corresponding J-CO-QL \(^{+}\) scripts. The J-CO Framework thus becomes the engine that executes GeoSoft queries, provided that these are translated into J-CO-QL \(^{+}\) scripts.

In this section, we present the translation technique we implemented in the GeoSoft compiler to translate GeoSoft queries into J-CO-QL \(^{+}\) scripts. To this end, it is necessary to introduce the data and execution models of J-CO-QL \(^{+}\).

6.1 Brief Introduction to J-CO-QL \(^{+}\)

In order to fully understand how a J-CO-QL \(^{+}\) script works, it is worth briefly introducing the underlying data and execution models.

6.1.1 Data Model

J-CO-QL \(^{+}\) works on collections of standard \(\textit{JSON}\) documents (see Appendix A). However, a special meaning is given to fields whose names begin with a reserved character: such field names are fully compliant with \(\textit{JSON}\) naming rules, so documents with such fields can be saved into \(\textit{JSON}\) stores without any problem. Currently, two special root-level fields are managed by J-CO-QL \(^{+}\): the fuzzysets field and the geometry field.

  • The fuzzysets field works as a map fsn\( \rightarrow md\), where fsn is a fuzzy-set name and md is the corresponding membership degree; this way, the membership degrees of a document to multiple fuzzy sets can be represented simultaneously.

  • The geometry field represents the geometries of the spatial entities possibly represented by \(\textit{JSON}\) documents [7, 55]. We chose to rely on the same format used for geometries in GeoJSON, which was described in Sect. 4. The advantage of this choice is straightforward: it is a standard, world-wide adopted format that can be easily managed. As far as this paper is concerned, it is the same format to deal with when querying GeoJSON features and their geometry fields. An example of a document complying with this data model is sketched below.
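A hypothetical example of a document complying with this data model is sketched below as a Python dict; all values are invented, and the two special root-level fields are written simply as fuzzysets and geometry, as in the rest of this section.

# Illustrative shape of a "soft" document in the J-CO-QL+ data model (invented values).
soft_document = {
    "name": "Milano",                     # ordinary content fields
    "area": 181.67,
    # special field: membership degrees to (possibly several) fuzzy sets
    "fuzzysets": {"MediumTowns": 0.2, "LargeTowns": 1.0},
    # special field: geometry, in the same format used by GeoJSON
    "geometry": {"type": "Point", "coordinates": [9.19, 45.46]},
}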

6.1.2 Execution Model

We now briefly introduce the execution model of J-CO-QL \(^{+}\) scripts.

A query \(q = (i_1, \dots , i_n )\) is a sequence of n (with \(n > 0\)) instructions \(i_j\) (with \(1 \le j \le n\)). Each instruction \(i_j\) takes a query-process state \(s_{(j-1)}\) as input and generates a query-process state \(s_j\) as output.

A query-process state is a tuple

\(s = \langle tc, IR, DBS, FO, UDF \rangle \).

Hereafter, we explain each member.

  • The tc member is named temporary collection and contains the current collection of \(\textit{JSON}\) documents to process.

  • The IR member is a local and volatile database, where the query can temporarily save Intermediate Results.

  • The DBS member is the set of database descriptors, so as to connect to databases to retrieve/store collections of \(\textit{JSON}\) documents.

  • The FO member is the pool of fuzzy operators defined throughout the query, which are used to evaluate membership degrees of documents to fuzzy sets (see Sect. 5.2.2).

  • The UDF member is the set of user-defined functions (written either in JavaScript or Java), defined to empower the query with additional computational capabilities (see [58]).

This execution model allows for writing complex and long queries, in a way that preserves the natural order in which transformations on collections are conceived by human beings. A minimal sketch of this state-based model is reported below.
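The following Python sketch is our own illustration (not the actual engine code) of this execution model: a script is a sequence of instructions, each mapping the current query-process state to a new one.

# Illustrative sketch of the J-CO-QL+ execution model described above.
from dataclasses import dataclass, field

@dataclass
class QueryProcessState:
    tc: list = field(default_factory=list)    # temporary collection of JSON documents
    IR: dict = field(default_factory=dict)    # intermediate-results database
    DBS: dict = field(default_factory=dict)   # database descriptors
    FO: dict = field(default_factory=dict)    # fuzzy operators defined so far
    UDF: dict = field(default_factory=dict)   # user-defined functions

def run_script(instructions, state=None):
    # Each instruction is a function s_(j-1) -> s_j; the script applies them in order.
    state = state or QueryProcessState()
    for instruction in instructions:
        state = instruction(state)
    return state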

6.1.3 Brief Description of the J-CO-QL \(^{+}\) Script

The J-CO-QL \(^{+}\) script reported in Listings from 10 to 14 is obtained by translating the GeoSoft query reported in Listing 7. In principle, it could have been written by hand directly as a J-CO-QL \(^{+}\) script, but clearly using GeoSoft is much simpler. Hereafter, we provide a brief description, by shortly introducing the statements. This brief description is necessary to allow the reader to understand the translation strategy and algorithm; the reader who is interested in understanding in depth how the J-CO-QL \(^{+}\) script works can find a complete description in Appendix E.

Databases and Fuzzy Operators Listing 10 reports the preliminary part of the script, i.e., the definitions of the connections to databases and the definitions of the fuzzy operators used later in the script. They are the same previously shown in the execution contexts of GeoSoft queries.


Loading a GeoJSON Document Listing 11 corresponds to the first nested query in Listing 7.

  • The first three instructions (lines from 5 to 7) are necessary to acquire the GeoJSON document and adapt it to the J-CO-QL \(^{+}\) data model. Specifically, the GET COLLECTION instruction retrieves the content of the specified database collection and makes it the current temporary collection.

  • Then, the EXPAND instruction unnests \(\textit{JSON}\) documents from within the features array, so as to obtain a single \(\textit{JSON}\) document for each feature.

  • Finally, the FILTER instruction adds the special geometry field, using the former geometry field of the features. Notice how the CASE WHERE crisp selection condition is used to specify which documents to work on, while the GENERATE clause contains all the transformations performed on the selected documents to generate the output ones (SETTING GEOMETRY specifies the geometry of the document, while BUILD restructures the output documents).

This is the general sequence of operations to perform to load GeoJSON documents; indeed, the first three instructions of Listing 12 (which corresponds to the second nested query in Listing 7) are identical.
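The effect of this acquisition pattern can be illustrated by the following Python sketch, which mimics, in a simplified way, what GET COLLECTION, EXPAND and the first FILTER achieve: the features array of a GeoJSON document is unnested into one document per feature, with the geometry promoted to the special geometry field (the helper name and the dict-based representation are our own assumptions, not J-CO-QL \(^{+}\) code).

# Illustrative sketch (not J-CO-QL+ code): unnesting a GeoJSON FeatureCollection
# into one document per feature, as done by GET COLLECTION + EXPAND + FILTER.
def unnest_features(geojson_document):
    documents = []
    for feature in geojson_document.get("features", []):    # EXPAND-like step
        documents.append({
            "properties": feature.get("properties", {}),
            "geometry": feature.get("geometry"),             # promoted special field
        })
    return documents

layer = {"type": "FeatureCollection",
         "features": [{"type": "Feature",
                       "properties": {"name": "Milano"},
                       "geometry": {"type": "Point", "coordinates": [9.19, 45.46]}}]}
print(len(unnest_features(layer)))   # 1 document per feature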

Crisp and Soft Conditions We now consider how J-CO-QL \(^{+}\) deals with crisp and soft conditions on \(\textit{JSON}\) documents.

  • The FILTER instruction on line 8 of Listing 11 actually selects the features of interest and evaluates (through the CHECK FOR clause) the membership degrees to fuzzy sets (by adding the special fuzzysets field to the documents).

  • The SAVE instruction on line 9 saves the temporary collection into the database of Intermediate Results, to exploit it later.

Clearly, in Listing 12, which corresponds to the second nested query in Listing 7, lines 13 and 14 are identical, apart from the fact that the membership degrees to two fuzzy sets are evaluated.


Joining Documents In Listing 13, line 15 actually joins the documents generated by Listings 11 and 12.

  • The ON GEOMETRY clause specifies the spatial condition, while the SET GEOMETRY clause specifies how to derive the geometries of the resulting documents. Specifically, given a document l in the left source collection and a document r in the right source collection, the output document o contains two fields named after the source aliases, whose values are the source l (respectively, r) documents.

  • The ADD clause adds extra fields to the o document: in this case, the properties field is added, so as to be coherent with the semantic model of features in GeoSoft.

  • The SET FUZZY SETS clause evaluates membership degrees to fuzzy sets by exploiting both membership degrees already evaluated for source documents and spatial fuzzy relationships.

  • Then, the CASE WHERE clause selects the documents of interest and the CHECK FOR clause evaluates membership degrees to fuzzy sets, as in the previous FILTER instructions. The ALPHACUT clause selects the documents having a membership degree to the specified fuzzy set no less than the specified threshold.

  • Finally, output documents are restructured (by the BUILD action).


The temporary collection generated by Listing 13 contains documents that correspond to the features to put into the final GeoJSON document. This document is generated by the tail of the script.

Generating the Output GeoJSON Document


The final task to perform is to build the output GeoJSON document. This is done by the tail of the script, reported in Listing 14 (a simplified sketch of its effect is shown after the following list).

  • Line 16 in Listing 14 prepares the documents to comply with the structure of features in the GeoJSON standard (see Appendix B). Documents are also “de-fuzzified”, i.e., the fuzzysets field is dropped, thus removing any reference to fuzzy sets.

  • The GROUP instruction on line 17 groups together all documents, so as to obtain a unique document with the features array field.

  • The SAVE instruction on line 18 actually saves the output GeoJSON document into a database.
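A simplified Python sketch of what this tail achieves is reported below: documents are de-fuzzified (the fuzzysets field is dropped) and aggregated back into a single FeatureCollection; the helper name and the dict-based representation are our own assumptions, and only the GeoJSON field names follow the standard.

# Illustrative sketch of the tail of the script: de-fuzzify and re-aggregate features.
def build_output_layer(documents):
    features = []
    for doc in documents:
        doc = dict(doc)
        doc.pop("fuzzysets", None)            # de-fuzzification: drop membership degrees
        features.append({"type": "Feature",
                         "properties": doc.get("properties", {}),
                         "geometry": doc.get("geometry")})
    return {"type": "FeatureCollection", "features": features}   # GROUP-like aggregation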

6.2 Translation Strategy and Algorithm

We can now present how a GeoSoft query is translated into a J-CO-QL \(^{+}\) script. Hereafter, we first present the translation strategy, which provides the rationale behind the algorithm.

6.2.1 Translation Strategy

The translation algorithm is presented in Sect. 6.2.2. Before presenting it, we illustrate the rationale behind the translation strategy.

  1. Source collections stored within the database are retrieved by J-CO-QL \(^{+}\) through a GET COLLECTION instruction; then, features within the features array field must be unnested, so as to transform the source GeoJSON document into a collection of \(\textit{JSON}\) documents, one for each feature. This is the general pattern to follow to acquire GeoJSON documents from the database.

  2. Soft features are natively represented by \(\textit{JSON}\) documents, enriched by means of the fuzzysets field (see Sect. 6.1.1); furthermore, the original geometry member of features is translated into the geometry field (see Sect. 6.1.1), which represents geometries in the data model of J-CO-QL \(^{+}\).

  3. A nested query is translated into a sequence of J-CO-QL \(^{+}\) instructions. The output temporary collection becomes the input of the instructions that correspond to the clauses that follow the FROM clause in the outer query.

  4. In the case of a FROM clause that contains the JOIN operator, the operands are preliminarily processed; since there is only one temporary collection in the query-process state of a J-CO-QL \(^{+}\) script (see Sect. 6.1.2), it is necessary to save the temporary collection that contains the soft features provided by an operand into the Intermediate Results (IR) database. The J-CO-QL \(^{+}\) instruction that actually performs the join (named JOIN OF COLLECTIONS, see Sect. 6.1.3) will later refer to this collection saved into the IR database.

Consequently, the general translation model is the following.

  1. First of all, the “execution context” contained in the files specified by the GET CONTEXT directives is loaded as a general preamble. Remember that the execution context defines database connections and fuzzy operators, through J-CO-QL \(^{+}\) instructions.

  2. Input GeoJSON documents are acquired and their features are unnested; indeed, a GeoJSON document is a collection of features; this collection is represented as a collection of \(\textit{JSON}\) documents that have the properties, geometry and fuzzysets fields; in particular, fuzzysets is automatically added when the first membership degree to a fuzzy set is evaluated.

  3. Nested GeoSoft queries are translated into a sequence of J-CO-QL \(^{+}\) instructions, whose output temporary collection represents a collection of soft features.

  4. The outermost GeoSoft query has to generate again a GeoJSON document; consequently, the documents in the temporary collection that represent the output soft features are de-fuzzified and aggregated into one single GeoJSON document (this is the tail of the J-CO-QL \(^{+}\) script).

6.2.2 Translation Algorithm

The translation algorithm is reported in Algorithm 1. As far as the general syntax is concerned, we adopted a pseudo-code inspired by the Pascal programming language. Hereafter, we present it.

  • The RewriteDBQuery function is the entry point of the algorithm. As denoted by its name, it actually deals with the full database query dbq. It is organized as explained hereafter.

    • The function receives one single parameter named dbq; it is a structured object that represents the database query (see Definition 9) to translate. The function returns the string containing the output J-CO-QL \(^{+}\) script.

    • Line D.1 calls the RewriteInternalQuery function, which actually translates the outermost internal query (from the SELECT clause to the ALPHACUT clause), by generating a string with the corresponding J-CO-QL \(^{+}\) script. The string is assigned to the T variable.

    • Line D.2 calls the GenTail function, whose goal is to generate the last part of the J-CO-QL \(^{+}\) script. An example of tail is reported in Listing 14 (we discussed it in Section E.4). The string containing the tail is concatenated with the content of the T variable and assigned to the new R variable, whose value is returned by line D.3. Since the work performed by the GenTail function is straightforward, we do not report it in Algorithm 1.

  • The RewriteInternalQuery function is responsible for translating an internal query, independently of its nesting level. We explain it hereafter.

    • The function receives one single parameter, named iq: this is a structured object that describes the structure of the internal query. The function returns a string with the corresponding J-CO-QL \(^{+}\) script.

    • The If instruction on line I.1 discriminates whether the FROM clause of the iq internal query contains a JOIN expression, because the translation strategy changes.

      • If the FROM clause of the iq internal query does not contain a JOIN expression, then the J-CO-QL \(^{+}\) instructions corresponding to the translation of the FROM clause generate a temporary collection that can be processed by a J-CO-QL \(^{+}\) FILTER instruction (see, for example, Sect. 6.1.3 and Section E.2, with Listing 11 and Listing 12). Consequently, line I.2 calls the RewriteSource function to obtain the translation of the source (either a collection or a nested internal query), to which the "FILTER" constant string is appended; its CASE WHERE block will be generated by line I.4.

      • If the FROM clause contains a JOIN expression, its translation must save temporary collections into the IR database and must exploit the J-CO-QL \(^{+}\) JOIN OF COLLECTIONS statement (see Sect. 6.1.3 and Listing 13). The corresponding translation is generated by calling the RewriteJoinExpression function on line I.3.

      The RewriteSource function and the RewriteJoinExpression function will be explained later.

    • Independently of the specific case dealt with by lines from I.1 to I.3, the translation of the clauses from WHERE to ALPHACUT is the same, as illustrated in Sect. 6.1.3, with Listing 11 and Listing 13: a CASE WHERE block must be generated and appended to the translation obtained so far. This is done by calling the GenCaseWhere function on line I.4.

    • Finally, line I.5 returns the translation.

    Since the work performed by the GenCaseWhere function is straightforward (although it is not trivial), we do not report it in Algorithm 1.

  • The RewriteSource function generates the translation of an s source specification in a FROM clause. It returns the string containing the generated translation. The function is explained hereafter.

    • If the source is not a nested internal query, i.e., it is a collection name, line S.2 calls the GenAcquisition function. It generates a sequence of J-CO-QL \(^{+}\) instructions that acquire the content of the collection and expand the nested documents (see Sect. 6.1.3 and Listing 11).

    • If the s source is a nested internal query, the RewriteInternalQuery function is recursively called by line S.3 to generate the translation of the nested internal query.

    Since the work performed by the GenAcquisition function is straightforward (although it is not trivial), we do not report it in Algorithm 1.

  • The RewriteJoinExpression function translates a je JOIN expression and returns the corresponding string. It is explained hereafter.

    • Line J.1 generates two temporary names, which are assigned to the \(Name_1\) and \(Name_2\) variables. Remember from Section E.2 that, in the case of a JOIN expression, it is necessary to save the temporary collections describing soft GeoJSON documents into the IR database; the two temporary names will be used for this purpose.

    • In line J.2, by calling the RewriteSource function, the left operand of the JOIN expression is translated; the resulting J-CO-QL \(^{+}\) script is concatenated with the SAVE AS instruction that saves the temporary collection into the IR database. Line J.3 performs the same task on the right operand of the je JOIN expression. Notice that the translations are assigned to the \(T_1\) and \(T_2\) variables.

    • Line J.4 calls the GenJoinOfCollections function, whose goal is to generate the first part of the J-CO-QL \(^{+}\) JOIN OF COLLECTIONS instruction that actually performs the soft join (see Sect. 6.1.3 and the part of Listing 13 that precedes the CASE WHERE clause). The translation is assigned to the \(T_J\) variable. Since the work performed by this function is straightforward (although not trivial), its code is not reported in Algorithm 1.

    • Finally, the three translations are concatenated by line J.5 and returned by line J.6.

Algorithm 1 shows that translating a (possibly complex) GeoSoft query is feasible; the translation process is linear and well structured. A minimal sketch of its recursive call structure is reported below.
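The sketch below, written in Python on a simplified dict-based representation of queries, reproduces only the call pattern of Algorithm 1; the Gen* helpers are replaced by stubs that emit placeholder text rather than actual J-CO-QL \(^{+}\) instructions, and all data structures are our own assumptions.

# Skeleton of the translation algorithm: only the call pattern of Algorithm 1.
import itertools

_names = itertools.count(1)

def new_temp_name():
    return f"tmp{next(_names)}"

# Trivial stubs for the Gen* helpers (placeholder text, not real J-CO-QL+ code).
def gen_tail(dbq): return "<tail of the script>\n"
def gen_case_where(iq): return "<CASE WHERE / CHECK FOR / ALPHACUT block>\n"
def gen_acquisition(name): return f"<acquire and expand collection {name}>\n"
def save_to_ir(name): return f"<save temporary collection {name} into IR>\n"
def gen_join_of_collections(n1, n2): return f"<JOIN OF COLLECTIONS of {n1} and {n2}>\n"

def rewrite_db_query(dbq):                                  # lines D.1-D.3
    return rewrite_internal_query(dbq["query"]) + gen_tail(dbq)

def rewrite_internal_query(iq):                             # lines I.1-I.5
    src = iq["from"]
    if src["kind"] == "join":
        T = rewrite_join_expression(src)                    # line I.3
    else:
        T = rewrite_source(src) + "FILTER "                 # line I.2
    return T + gen_case_where(iq)                           # line I.4

def rewrite_source(s):                                      # lines S.1-S.3
    if s["kind"] == "collection":
        return gen_acquisition(s["name"])                   # line S.2
    return rewrite_internal_query(s)                        # line S.3 (recursion)

def rewrite_join_expression(je):                            # lines J.1-J.6
    n1, n2 = new_temp_name(), new_temp_name()               # line J.1
    T1 = rewrite_source(je["left"]) + save_to_ir(n1)        # line J.2
    T2 = rewrite_source(je["right"]) + save_to_ir(n2)       # line J.3
    return T1 + T2 + gen_join_of_collections(n1, n2)        # lines J.4-J.6

# Example: a query joining two nested queries over the towns and highways collections.
query = {"query": {"from": {"kind": "join",
                            "left":  {"kind": "query",
                                      "from": {"kind": "collection", "name": "towns"}},
                            "right": {"kind": "query",
                                      "from": {"kind": "collection", "name": "highways"}}}}}
print(rewrite_db_query(query))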

In order not to bore the reader, we do not explain in detail how the GeoSoft query reported in Listing 7 is translated. The interested reader can find an extensive description in Appendix E.

7 Evaluation

Before concluding the paper, we summarize the results by evaluating three critical aspects, i.e., “flexibility”, “accessibility” and “efficiency”.

7.1 Flexibility

First of all, we consider “flexibility”. What kind of flexibility is provided by GeoSoft?

  • Typically, query languages that rely on fuzzy sets and linguistic predicates are considered “flexible query languages”, in the sense that queries expressed by means of them are tolerant to imprecision and vagueness. From this point of view, GeoSoft is certainly a flexible query language.

  • The flexibility provided by GeoSoft is focused at the level of features: it does not consider single fuzzy values; in contrast, it considers fuzzy sets defined on the universe of spatial features. As an effect, a spatial feature can belong to multiple fuzzy sets (of spatial features).

  • Fuzzy operators are used to evaluate membership degrees of spatial features to fuzzy sets, on the basis of values of feature properties. This approach has been inherited from J-CO-QL \(^{+}\): in particular, a distinctive characteristic is the adoption of a polyline as membership function, in place of more traditional triangular and trapezoidal functions (that can be easily defined by means of polylines). This choice further increases the level of flexibility, because complex shapes can be exploited by analysts to deal with complex situations (as we did in [56]).

  • Currently, GeoSoft does not provide “quantifiers”: a quantifier is an aggregator of several membership degrees, whose goal is to express the quantification of linguistic predicates. The reason why quantifiers are not considered in GeoSoft is that they are not provided by J-CO-QL \(^{+}\) either; we are aware of this lack, but we decided to delay addressing this problem, waiting for the right time in the development of the language. Indeed, we are currently addressing the issue of defining user-defined fuzzy aggregators (and quantifiers can be seen as specific fuzzy aggregators). When this concept is available in J-CO-QL \(^{+}\), we will promptly introduce it in GeoSoft, together with the GROUP BY clause.

  • GeoSoft accompanies the spatial join with specific functions to evaluate fuzzy spatial relationships among geometries. This way, through a spatial join, the membership degrees of the resulting features to fuzzy sets based on fuzzy spatial relationships can be evaluated at once.

  • In our opinion, another factor contributing to flexibility is the clear separation between crisp conditions and soft conditions. Indeed, previous proposals usually modify the semantics of the classical WHERE clause from crisp conditions to soft conditions. In our opinion, this approach is not the right one: it is true that crisp predicates can be seen as soft predicates whose membership degree is either 0 or 1, but often the crisp and soft worlds should not be mixed. This is why GeoSoft provides both the WHERE clause (crisp selection condition) and the USING clause (soft conditions that evaluate membership degrees to fuzzy sets).

Nonetheless, we can consider another meaning of flexibility, i.e., the ability to simplify data management and analysis tasks. In this sense, GeoSoft is the most flexible solution for analyzing GeoJSON information layers stored within a \(\textit{JSON}\) document store, for several reasons.

  • It works directly on the source GeoJSON documents and generates GeoJSON documents.

  • It does not require tedious transfers of GeoJSON documents into a different DBMS, such as PostgreSQL/PostGIS, fighting with import/export tools.

  • It is completely independent of the specific database technology and of its specific query language.

7.2 Accessibility

The second aspect we consider to evaluate GeoSoft is “accessibility”, which can be intended as the capability of the proposed tool to remove obstacles to managing GeoJSON information layers.

  • Section 6 presented the translation method we implemented to translate GeoSoft queries into J-CO-QL \(^{+}\) scripts. The length of the J-CO-QL \(^{+}\) scripts obtained from (possibly short) GeoSoft queries clearly shows how GeoSoft greatly improves the accessibility of GeoJSON querying for users; the choice of adopting a SQL-like syntax for GeoSoft further improves accessibility.

  • In this paper, we did not consider other query languages for \(\textit{JSON}\) documents, such as MQL (the MongoDB Query Language), but we can say a few words. Syntactically, an MQL query is a JavaScript method call, where the query is specified as a \(\textit{JSON}\) document provided as a parameter: this choice makes it quite hard to directly work on documents nested within array fields, so a GeoJSON document should be preliminarily split into single \(\textit{JSON}\) documents that must be saved into the database itself, with the effect of making the database dirty. Furthermore, users coming from SQL are not familiar with this approach; moreover, MQL does not support soft querying. Consequently, we can state that GeoSoft provides users with a significantly higher degree of accessibility for querying GeoJSON documents.

  • Would it be possible to perform the same kind of analysis with MongoDB? The answer is “no”, because MongoDB supports only “range queries” on previously-indexed \(\textit{JSON}\) documents (not on GeoJSON information layers) on the basis of their geo-tagging.

  • Would it be possible to perform similar queries with PostgreSQL/PostGIS? The answer is “yes”, because PostGIS is quite complete. Nonetheless, queries cannot be performed directly on GeoJSON documents, but on tables; this means that features within GeoJSON documents must be preliminarily imported into tables, processed and exported again as GeoJSON documents. GeoSoft avoids all these steps, enabling users to work directly on GeoJSON information layers stored within \(\textit{JSON}\) data stores.

Table 1 Execution times for each J-CO-QL \(^{+}\) instruction reported in Listings from 10 to 14

7.3 Efficiency

The last aspect we consider is “efficiency”. To evaluate this aspect, we performed experiments with the real data within the GeoJSON documents presented in Information Layer 1 and Information Layer 2. Experiments were conducted on a common laptop powered by an Intel quad-core i7-8550U processor, running at 1.80 GHz, equipped with 16 GB of RAM and a 250 GB solid-state drive. The J-CO-QL \(^+\) Engine is implemented in the Java language.

GeoSoft executed via J-CO-QL\(^{+}\) script The GeoSoft query reported in Listing 7 is translated into the J-CO-QL \(^{+}\) script reported in Listings from 10 to 14, which was summarized in Sect. 6.1.3 and is extensively presented in Appendix E.

Specifically, the Highways information layer describes 94 highways; the Towns information layer describes 1506 municipalities. This data set is named Full Data Set, to distinguish it from the other two data sets that will be discussed later.

Table 1 reports the results of our experiments. For each data set, the execution times measured for each single J-CO-QL \(^{+}\) instruction are reported in the left column (labeled as Partial), while the sum of the execution times is reported in the right column (labeled as Total).

The total execution time for the Full Data Set is 26 sec; looking at the single instructions, it is possible to see that the JOIN OF COLLECTIONS instruction takes 1.5 sec, so there are other instructions that determine the overall execution time. These instructions are those from 5 to 7: they correspond to the acquisition of the Towns GeoJSON document and its transformation into single \(\textit{JSON}\) documents; in particular, the EXPAND instruction is the slowest, because it has to unnest documents with very complex geometries.

In order to better evaluate this behavior, we built a second data set, named Milan Data Set; in this data set, we selected only the 133 municipalities in the province of Milan; so the input Towns GeoJSON document now contains about one-tenth of the initial features.

Looking at Table 1, it is possible to see that now the overall query is executed in 4.3 sec; in particular, the critical instructions (from 5 to 7) now take 2 sec, i.e., about ten times faster (as expected).

Finally, we further restricted the municipalities in the Towns document, considering only the 22 municipalities, in the province of Milan, whose area is between 15 \(km^2\) and 70 \(km^2\) (which corresponds to the range of areas for which a municipality has full membership to the MediumTowns fuzzy set, according to the MediumTownOp fuzzy operator defined in Listing 9). The Highways document still contains all the initial features. This data set is named Medium Data Set.

The overall time is now 2.5 sec; in particular, notice that the JOIN OF COLLECTIONS instruction now takes only 0.4 sec. Considering that the implementation of the JOIN OF COLLECTIONS statement is not optimized by spatial indexing, it behaves satisfactorily, not being the bottleneck of the process.

Thus, the bottleneck is the loading of large GeoJSON documents in which features have very complex geometries, such as borders of municipalities.

Comparison with PostgreSQL/PostGIS In order to compare GeoSoft (and its execution through J-CO-QL \(^{+}\)) with traditional solutions, we created an object-relational database managed by PostgreSQL and its extension for spatial data, named PostGIS. The two information layers were uploaded, both in the Full Data Set version and in the Medium Data Set version; two PostGIS tables were created for each version, having one row for each feature. A join query based on the spatial intersection of geometries was performed, so as to spatially intersect towns and highways; this query is comparable with the work performed by the JOIN OF COLLECTIONS instruction in the J-CO-QL \(^{+}\) script, even though the PostGIS join does not evaluate soft conditions (as the J-CO-QL \(^{+}\) JOIN OF COLLECTIONS does).

Processing the Full Data Set, PostGIS takes 1.2 sec, against the 1.6 sec taken by the J-CO-QL \(^{+}\) JOIN OF COLLECTIONS.

Processing the Medium Data Set, PostGIS takes 0.2 sec, against 0.4 sec taken by the J-CO-QL \(^{+}\) JOIN OF COLLECTIONS.

It is possible to observe that, in absolute terms, PostGIS is faster than J-CO-QL \(^{+}\), in particular if the overall script is considered. However, this experiment does not consider how to upload information layers into PostGIS tables and how to export PostGIS tables as GeoJSON documents. To this end, several tools can be used, but they require the user to execute them manually, thus resulting in tedious activities that, more or less, can take (when things go fine at the first attempt) at least the same time needed by the overall J-CO-QL \(^{+}\) script on the Full Data Set, i.e., 26 sec. In contrast, the GeoSoft query works directly on the information layers stored within \(\textit{JSON}\) document stores: users do not have to waste time in other activities and can concentrate on the analysis they want to perform.

8 Conclusions and Future Works

The contribution of the paper is the GeoSoft proposal: it is conceived as a domain-specific language for writing sophisticated queries on features described by GeoJSON documents. Hereafter, we summarize the major points that characterize this work.

  • The choice of the GeoJSON format is motivated by its popularity. Furthermore, moving from the high-level view of a GeoJSON document as a “set of features”, it was straightforward (for us) to conceive adopting the same approach on which SQL relies for querying features of GeoJSON documents, so as to obtain other GeoJSON documents.

  • The idea of defining GeoSoft originated from the work we are performing on the J-CO Framework. It is a pool of software tools built around the J-CO-QL \(^{+}\) language, designed to operate on \(\textit{JSON}\) data sets at a high level of abstraction, if compared with other languages for querying \(\textit{JSON}\) documents (see [7] for a detailed comparison). In particular, the native capabilities of managing geo-tagging and of soft querying \(\textit{JSON}\) documents make J-CO-QL \(^{+}\) a unique proposal. Furthermore, the independence from any specific \(\textit{JSON}\) store makes the J-CO Framework particularly suitable for integrating data sets coming from multiple \(\textit{JSON}\) document stores.

  • However, any general-purpose language (even if it is very powerful) may fall into situations that can be solved only by writing long and complex sequences of instructions. This happens with J-CO-QL \(^{+}\) as well, in particular when dealing with formats based on many nesting levels, such as GeoJSON: writing J-CO-QL \(^{+}\) scripts for querying features in GeoJSON documents is usually cumbersome, even though such J-CO-QL \(^{+}\) scripts are effective. The approach followed for GeoSoft, i.e., devising a query language specifically designed for features in GeoJSON documents, which exploits the syntax of the classical SELECT statement in SQL, dramatically reduces the effort to write complex queries.

  • Nevertheless, the distinctive features provided by J-CO-QL \(^{+}\) are still precious, in particular native management of geo-tagging and soft querying capabilities: GeoSoft exploits them. Consequently, the J-CO Framework plays the role of execution engine of GeoSoft queries, because they are translated into J-CO-QL \(^{+}\) scripts.

Both the J-CO Framework and the GeoSoft compiler are publicly available on a GitHub repository.

As a future work, several research lines could be followed, which concern the J-CO Framework in general and possibly affect GeoSoft.

  • The J-CO-QL \(^{+}\) language is undergoing a continuous evolution, in particular as far as support for soft querying is concerned. Specifically, we are going to address the problem of defining complex fuzzy aggregators, so as to enable soft queries that rely on complex soft aggregations. Once defined, we will address, in GeoSoft, both the definition of quantifiers and the introduction of the GROUP BY clause.

  • Soft Web Intelligence is the concept we are exploring in a parallel research work [66]. It can be considered as a modern interpretation of the 20-year-old notion of Web Intelligence. So far, we have demonstrated that the J-CO Framework can play a key role in envisioning the World-Wide Web as a giant data store, supporting its exploitation by means of soft computing and soft querying. A novel language could be derived from GeoSoft, providing analysts with the abstractions needed to easily write complex (fuzzy) aggregations on data sets coming from the Web.

  • From the point of view of applications, the line of developing domain-specific languages for particular formats of \(\textit{JSON}\) documents can be replicated for other formats. A possible application context could be “linked data”: for example, the JSON-LD format [67] is meant to replace the Resource Description Framework (RDF) [68] in the context of the Semantic Web.

  • As we said, the J-CO Framework and J-CO-QL \(^{+}\) are general purpose, thus they could be applied to many different application contexts. We have already experimented with their use in the context of geographical-data integration [56]. The results we obtained encourage us to explore novel application contexts, such as building a product search-engine based on queries written in natural language over product reviews posted by users and customers (moving from our past experience [69, 70]).