FLASc: a formal algebra for labeled property graph schema

Contemporary labeled property graph databases are either schema-less or schema-optional to support frequent changes in the structure of data found in domains requiring high flexibility. However, the lack of structure impacts data transformation and loading operations from heterogeneous sources into graph databases. We present a formal algebra FLASc for specifying and generating graph schema for labeled property graph databases. We formally define FLASc and demonstrate the use of FLASc generated graph schemas to systematically transform and load data-sets related to domains of cyber-physical systems, big data analytics and tourism. Findings from three disparate case studies show that FLASc-generated schemas assist in enforcing integrity constraints that reduce the chance of data corruption, hence assuring data consistency and integrity.


Introduction
Labeled property graph databases (henceforth graph databases) are storage systems that allow modeling of real-world entities as nodes and relationships between entities as edges (Angles et al. 2018). Nodes and edges in a graph database have associated labels. Data is stored inside nodes and edges as properties that exist in the form of key-value pairs (Angles et al. 2017; Angles and Gutierrez 2008). Graph databases are efficient in storing and managing highly interconnected data-sets related to domains such as transportation networks, social media, bioinformatics, chemistry and astronomy (Angles and Gutierrez 2008; Angles 2012; Angles et al. 2017; Bell et al. 2009; Tetko et al. 2016). Graph databases suit big data applications as they provide a better alternative for modeling and handling complex information (Rodriguez and Neubauer 2010, 2012). Graph databases are more efficient than relational databases for extracting information from highly interconnected data-sets (Sharma et al. 2019; Sharma and Sinha 2019; Sharma 2020; Sharma et al. 2021).
The interconnections between data represent the underlying meaning of a graph data-set. Therefore, maintaining data consistency and integrity is vital in graph databases (Angles and Gutierrez 2008; Kunii 1987). Obtaining a database that is sound and consistent requires embracing good database modeling principles (Badia and Lemire 2011). In contrast to relational databases, modeling principles for graph databases are ad-hoc and not well-grounded (Park et al. 2014). Contemporary graph databases lack mechanisms to ensure data consistency and integrity, especially when the data being stored comes from multiple heterogeneous sources (Reina et al. 2020). A primary reason is that graph databases are either schema-less or schema-optional (Reina et al. 2020). A schema represents the overall structure of a data-set and assists in understanding data semantics (Pokornỳ 2016). Furthermore, schemas aid in defining integrity constraints that are sets of rules for ensuring consistency and integrity in the database that conforms to the schema (Codd 2002; Ghrab et al. 2014). The lack of schema and integrity constraints poses significant challenges in ensuring data consistency and integrity (Khan et al. 2012), in performing advanced analytics (Sharma 2021) and achieving data interoperability (Sciore et al. 1994), and for data integration, query optimization and processing (Frozza et al. 2020).
Traditional database modeling consists of three stages: conceptual, logical and physical modeling (Badia and Lemire 2011). In graph databases, the conceptual modeling stage represents gathering the requirements of a given problem domain, which are then used for defining entities and the relationships between them. The logical modeling stage represents the enforcement of integrity constraints, including mandatory, optional and unique properties associated with the entities and relationships defined in the conceptual modeling stage. The physical modeling stage represents the realization of the graph schema formulated at the conceptual and logical modeling stages into database creation scripts.
An open problem in graph database design is that practitioners do not have proper guidelines for designing conceptual models (Pokornỳ 2016; Badia and Lemire 2011) that can facilitate systematic transformation and loading of data from heterogeneous sources into graph databases. The conceptual modeling stage is not used in the majority of graph database solutions (Fitzgerald et al. 1999; Brodie and Liu 2010). Graph databases lack abstraction tools (Angles and Gutierrez 2008), and most current research is primarily focused on logical and physical modeling (Reina et al. 2020; Pokornỳ et al. 2017; Pokorny 2017). These observations lead us to the following research questions:

RQ1 What are the key strengths and limitations of existing approaches used for modeling graph databases?
RQ2 What mechanisms can be designed to formulate conceptual and logical graph schemas for labeled property graph databases?
RQ3 In order to ensure data consistency, how can the graph schema generated in answer to RQ2 be used to systematically import data from heterogeneous sources into a labeled property graph database?
RQ3.1 How can the Extract-Transform-Load design pattern be extended in order to support loading data-sets from heterogeneous sources into a graph database?
We answered these research questions using a mixed-methods research methodology (Johnson et al. 2007). Firstly, for addressing RQ1, a literature review was carried out to identify existing evidence and gaps in the literature related to the research question. We addressed RQ2 by proposing an algebra, FLASc, which is based on the conceptual graphs introduced by Sowa (1992, 1999, 2008). The three operators JOIN, DETACH and DELETE_NODE provided by FLASc serve as mechanisms for formulating conceptual graph schemas, which are further extended to logical graph schemas. The three FLASc operators presented in this research paper can be used for designing schema generation and manipulation algorithms. Hence, a major utility of FLASc is that it serves as a formal basis for designing future data definition languages for graph databases. For addressing RQ3 and RQ3.1, we illustrate the integration of FLASc with the well-known Extract-Transform-Load (ETL) design pattern. The graph schemas generated by FLASc can be used to enforce integrity constraints and assist in the systematic generation of database creation scripts, hence ensuring data consistency. To demonstrate the utility of our approach, we consider three distinct case studies related to industrial cyber-physical systems (Sharma et al. 2019), big data analytics (Khalajzadeh et al. 2019, 2020) and tourism (Airbnb 2018; Sharma and Sinha 2019). We generate graph schemas for the heterogeneous data-sets provided in the three case studies and produce database creation scripts using the FLASc-integrated ETL design pattern.
The formalism for labeled property graph schemas presented in Sharma and Sinha (2019) and Sharma et al. (2021) is foundational for designing our algebra FLASc. The work presented in this research paper empowers users of FLASc to design robust graph schemas for labeled property graph databases.
The rest of this article is organized as follows. Section 2 presents background information and related work. The gaps identified in Sect. 2 are used to build FLASc, which is presented in Sect. 3. In Sect. 4 we illustrate how the conceptual and logical graph schemas formulated using FLASc can be used to enforce several integrity constraints in the Neo4j graph database. In Sect. 5 we present the integration of FLASc with the ETL design pattern and experimentally demonstrate its use for data transformation and loading of heterogeneous data-sets into the Neo4j graph database. Finally, in Sect. 6 we summarize our major findings, key contributions and future directions of this work.

Background and related work
This section enables us to address RQ1. We present a brief survey of the existing approaches that have been proposed for modeling graph databases.

Graph database design and modeling
Graph databases use graphs consisting of nodes and edges as elementary data structures for modeling any problem domain (Angles 2012; Angles et al. 2017; Angles and Gutierrez 2008). All graph databases use slight variations of the basic graph data structure. For example, graph databases proposed in academia such as GOOD (Gyssens et al. 1994), Gram (Amann and Scholl 1993), GraphDB (Güting 1994), GDM (Hidders 2003; Paredaens et al. 1995) and (Graves et al. 1995) use directed labeled graphs. Graph databases such as Hyperlog (Levene and Poulovassilis 1990; Levene and Loizou 1995) use hyper node and hyper edge based graphs. The Resource Description Framework (RDF) proposed by W3C (W3C 2021) uses directed labeled graphs, while Neo4j (2021) and Oracle (2021) use directed, labeled and attributed graphs, which are also known as property graphs (Angles 2018). There are three main stages of modeling a graph database: conceptual, logical and physical.

Conceptual modeling
Conceptual modeling represents the initial stage in which knowledge is collected in the form of requirements and specifications related to a problem domain. Using graphs for representing knowledge was first proposed by Sowa (1976, 1992, 1999, 2008). Subsequent works (Kunii 1987; Chein and Mugnier 2008; Mugnier and Chein 1992) also propose the use of graphs to represent knowledge at the conceptual modeling stage. Graphs provide a natural and intuitive interface for understanding the semantics of data (Sowa 2008; Badia and Lemire 2011). Knowing the semantics of data is vital for understanding the overall structure of the database (Pokornỳ 2016) that aids in creating, modifying and retrieving data. Schemas created at the conceptual modeling stage provide a level of abstraction that aids in the natural modeling of data (Angles 2012). Conceptual graph schemas are used to define entities that belong to the database and relationships between those entities (Badia and Lemire 2011). Moreover, determining nodes, edges, and the direction of edges is vital for conceptual modeling (Griffith 1982).

Physical modeling
Physical modeling represents the realization of the graph schema designed during conceptual and logical modeling into an actual database (Finkelstein et al. 1988). There are two approaches discussed in the literature for physical modeling: integrated and layered (Šestak et al. 2016). In the integrated approach, mechanisms to support the enforcement of integrity constraints are directly deployed on the database. These mechanisms are developed by altering and/or modifying the source code of a database system. In the layered approach, APIs specific to the database platform are used to create an additional layer that communicates with the database. This layer consists of wrappers written in programming languages such as Java and Python that contain database creation scripts and logic to enforce the integrity constraints.

Integration of logical and physical modeling
There exist many studies that support the integration of logical and physical modeling aspects of graph databases. For instance, Ghrab et al. (2016) follow a layered approach and propose the construction of a wrapper that can be used to enforce integrity constraints, including graph and path pattern constraints, over the Neo4j graph database. An integrated approach that extends the source code of OrientDB to support the enforcement of integrity constraints, including uniqueness, key, cardinality, and edge degree constraints, has been studied in Reina et al. (2020). Similarly, the extension of the Cypher query language to support additional integrity constraints such as uniqueness, node property, edge pattern, and mandatory properties is presented in Pokornỳ et al. (2017) and de Sousa and Cura (2018). A layered approach to demonstrate the enforcement of the uniqueness integrity constraint on two different graph databases, Neo4j and Apache TinkerPop, is proposed in Šestak et al. (2016). The use of the integrated and layered approaches together to perform graph database manipulation operations on the Neo4j graph database is proposed in Barik et al. (2016). Authors in Daniel et al. (2016) propose a model-driven engineering based approach for converting and loading UML diagrams into TinkerPop Blueprints (footnote 1).

Gaps in current literature
Several studies have been proposed in the last decade that address the problem of modeling graph databases. These studies mainly focus on the integration of logical and physical modeling aspects. A primary reason for this is the emergence of several graph data models such as resource description framework (RDF) (Lassila et al. 1998; Pérez et al. 2006) and labeled property graphs (LPG) (Angles 2018; Sharma et al. 2019; Sharma and Sinha 2019; Sharma 2020, 2021), and the creation of query languages such as SPARQL (2013), Cypher (Neo4j) (2021), Gremlin (Apache) (2021), PGQL (Oracle) (2021) and GSQL (TigerGraph) (2020) to support data modeling and retrieval. More recently, projects such as ISO/IEC 39075 (footnote 2), openCypher (2018) and the Linked Data Benchmark Council (LDBC) (Alex and Norbert 2013) have been proposed for developing a standard query language for the labeled property graph data model. Most of these studies focus on extending the existing query languages to support logical and physical modeling, while conceptual modeling is done in an ad-hoc manner. Authors in Ghrab et al. (2016), Roy-Hubara et al. (2017) and Hartig and Hidders (2019) present a formal approach for logical modeling of graph databases. However, physical modeling is not discussed in detail in these research papers (Šestak et al. 2021), and application of the proposed formalisms on real-world data-sets is considered as future work.
1 https://github.com/tinkerpop/blueprints
2 https://www.iso.org/standard/76120.html
To obtain a robust graph database that captures the semantics of the problem domain, the conceptual modeling stage is vital. A sound conceptual graph schema ensures that the logical and physical modeling stages are also robust (Mior et al. 2017). The graph data modeling approaches proposed so far do not provide the means to create robust conceptual graph schemas. Authors in Park et al. (2014), Roy-Hubara et al. (2017) and Daniel et al. (2016) propose the use of existing visual modeling tools such as entity relationship diagrams (ERD) and the unified modeling language (UML) for creating conceptual and logical graph schemas. The schemas generated by visual models such as UML diagrams are based on node-labeled graphs (Sharma and Sinha 2019) where only the nodes can have properties associated with them. According to Chen (1976), ERDs are based on node and edge labeled graphs where edges are also attributed. However, in order to support the creation of relational databases, attributed edges in ERDs have to be represented as strong and weak entities (or attributed nodes) (footnote 3). Modeling tools such as ERD and UML are generic, and while they can be used to model LPG schemas, they do not capture subtleties like edge labels and attributes without carefully considered extensions. Our algebra directly supports LPG schemas that have labels and properties associated with nodes and edges (Sharma and Sinha 2019; Sharma et al. 2021). Both UML and ERD are semi-formal modeling tools, whereas FLASc provides a formal basis for LPG schemas. This opens up the opportunity to define a FLASc-driven schema-generation language based on formal languages such as conjunctive queries and first-order logic (Sharma 2021). However, such extensions of FLASc are not in the scope of this research paper.
In this research, we present FLASc, a formal tool that assists in the formulation of robust conceptual and logical graph schemas, which is an advancement over existing studies in graph database modeling. The majority of integrity constraints presented in the existing studies can be specified in graph schemas generated by FLASc. Furthermore, the syntax and semantics of FLASc presented in this study assist in its implementation at the physical modeling stage. FLASc assists in the integration of the conceptual, logical and physical modeling stages, which is currently lacking in graph database research.

FLASc: formal algebra for conceptual and logical graph schema
This section addresses RQ2. We present the formal algebra FLASc that assists in formulating conceptual and logical graph schemas for labeled property graph databases. We use the concepts from Sowa's conceptual graphs identified in Sect. 2.1.1 to propose the operators of FLASc. We use a formal approach for constructing FLASc, which assures the robustness of its design (Marciniak 1994; Clarke and Wing 1996). FLASc has a sound mathematical basis that enables a user to precisely define: (i) connections between entities of a graph database (intensional information) and (ii) properties associated with entities and relations in a graph database (extensional information) (Sowa 1976, 1992, 1999, 2008).
We consider a data-set from Airbnb (Sharma and Sinha 2019; Sharma et al. 2021) as our first case study, related to the tourism domain, which assists in illustrating the various definitions and concepts of FLASc. This data-set consists of three CSV files that contain information related to property listings, reviews and calendar data. This data-set is highly interconnected, making it a prime candidate for graph database design and implementation (Sharma et al. 2021; Sharma 2021).

Basic terminology
Definition 1 (Directed Multigraph) A directed multigraph $G = (N, E, S, T)$ is a tuple where $N$ is a set of nodes and $E$ is a set of edges. Two associated functions, $S : E \to N$ and $T : E \to N$, map each edge to its source and target nodes, respectively.
Each edge in a directed multigraph has unique source and target nodes. Edges with the same source and target nodes are allowed (hence the term multigraph). We use the shorthand $n_i \to n_j$ to represent an edge $e_k$ where $S(e_k) = n_i$ and $T(e_k) = n_j$.
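Definition 1 can be read as a small data structure. The following is a minimal sketch in Python, not part of the formalism itself; the class name `DirectedMultigraph`, the method names and the string identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DirectedMultigraph:
    """G = (N, E, S, T) with explicit source and target maps (illustrative names)."""
    nodes: set = field(default_factory=set)      # N
    edges: set = field(default_factory=set)      # E (edge identifiers)
    source: dict = field(default_factory=dict)   # S : E -> N
    target: dict = field(default_factory=dict)   # T : E -> N

    def add_edge(self, edge_id, src, dst):
        """Add edge_id with S(edge_id) = src and T(edge_id) = dst."""
        if edge_id in self.edges:
            raise ValueError(f"edge {edge_id!r} already exists")
        self.nodes.update((src, dst))
        self.edges.add(edge_id)
        self.source[edge_id] = src
        self.target[edge_id] = dst

    def edges_between(self, src, dst):
        """Return all edges n_i -> n_j, sorted by identifier."""
        return sorted(
            e for e in self.edges
            if self.source[e] == src and self.target[e] == dst
        )


g = DirectedMultigraph()
g.add_edge("e1", "n1", "n2")
g.add_edge("e2", "n1", "n2")  # parallel edge: same source and target as e1
g.add_edge("e3", "n2", "n3")
print(g.edges_between("n1", "n2"))
```

Keeping edge identifiers separate from their endpoints is what allows parallel edges: `e1` and `e2` are distinct elements of $E$ even though $S$ and $T$ agree on them.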
Graphs can contain labels over nodes and edges. Let $L_N$ be a set of node labels and $L_E$ be a set of edge labels such that $L_N \cap L_E = \emptyset$. A labeling is simply a map $f : S_1 \to S_2$ such that for every element $a \in S_1$ there is a unique element $f(a) \in S_2$. We can define an edge-labeled graph as follows.
Definition 2 (Edge-Labeled Graph) A graph $G = (N, E, \lambda, S, T)$ is called an edge-labeled graph if there exists a labeling $\lambda : E \to L_E$ which maps all edges to labels in a set of edge labels $L_E$. We use the shorthand $e_k = n_i \xrightarrow{l} n_j$ for any $e_k \in E$ with $\lambda(e_k) = l$.
Similarly, we can define a node-labeled graph.
Definition 3 (Node-Labeled Graph) A graph $G = (N, E, \eta, S, T)$ is called a node-labeled graph if there exists a labeling $\eta : N \to L_N$ which maps all nodes to labels in a set of node labels $L_N$: for any $n_i \in N$ and $l \in L_N$, if $l$ is mapped to $n_i$ then $\eta(n_i) = l$.

Conceptual graph schema
A conceptual graph schema is used to capture intensional information. Conceptual models are easier for users to understand and contribute to. Therefore, a conceptual graph schema must be close to the semantics of natural languages like English. It must reflect real-world entities, and relations that are not directly represented by the conceptual graph schema must be easy to infer (Sowa 1992; Mugnier and Chein 1992). As discussed in Sharma and Sinha (2019), to define relationships we use the (subject, predicate, object) format from the Semantic Web (Berners-Lee et al. 2001), where the subject can be a noun, the predicate can be a verb, and the object can also be a noun.
Definition 4 (Conceptual graph schema) Given a set of node labels $L_N$ and a set of edge labels $L_E$, a conceptual graph schema $G_s$ is a tuple $(N_s, E_s, \eta_s, \lambda_s, L_N, L_E, S_s, T_s)$ where:
• $N_s$ is a finite set of nodes and $E_s$ is a finite set of edges of the graph schema.
We use the shorthand notation $G_s = (N_s, E_s, \eta_s, \lambda_s, S_s, T_s)$ to represent the conceptual graph schema.

Example 1
The conceptual graph schema generated for the Airbnb case study, as discussed in Sharma and Sinha (2019), is presented in Fig. 1. The graph schema consists of six node labels and four edge labels, as shown in Fig. 1. In the Airbnb data-set (2018), a person using the Airbnb service can write a review for a listing that they recently visited. A conceptual graph schema in such a scenario consists of entities such as user and review. Relationships can be of the form (users, wrote, review), where users is the subject, wrote is the verb and review is the object.
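The tuple of Definition 4 and the (subject, predicate, object) reading of an edge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ConceptualSchema`, its field names and the node and edge identifiers `n1`, `n2`, `e1` are hypothetical, and only the user, wrote and review labels come from the example above.

```python
from dataclasses import dataclass


@dataclass
class ConceptualSchema:
    node_label: dict  # node labeling: node id -> label in L_N
    edge_label: dict  # edge labeling: edge id -> label in L_E
    source: dict      # S_s : E_s -> N_s
    target: dict      # T_s : E_s -> N_s

    def __post_init__(self):
        # L_N and L_E must be disjoint.
        shared = set(self.node_label.values()) & set(self.edge_label.values())
        if shared:
            raise ValueError(f"labels used for both nodes and edges: {sorted(shared)}")
        # Every edge must start and end at a declared node.
        for edge in self.edge_label:
            for endpoint in (self.source[edge], self.target[edge]):
                if endpoint not in self.node_label:
                    raise ValueError(f"edge {edge!r} refers to unknown node {endpoint!r}")

    def triples(self):
        """Read every edge as a (subject, predicate, object) triple of labels."""
        return [
            (
                self.node_label[self.source[e]],
                self.edge_label[e],
                self.node_label[self.target[e]],
            )
            for e in sorted(self.edge_label)
        ]


schema = ConceptualSchema(
    node_label={"n1": "user", "n2": "review"},
    edge_label={"e1": "wrote"},
    source={"e1": "n1"},
    target={"e1": "n2"},
)
print(schema.triples())
```

The `triples` view is what makes the schema readable in near-natural language, which is the motivation the text gives for the conceptual stage.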

Basic conceptual graph schema
Basic conceptual graph schemas are a restricted form of conceptual graph schemas. They serve as building blocks for formulating conceptual graph schemas. Formally, basic conceptual graph schemas are defined as follows.
Definition 5 (Basic conceptual graph schema) Given sets of node and edge labels either be a singleton set or an empty set.
• $(N_b, E_b, S_b, T_b)$ is a restricted form of directed multigraph supporting only one directed edge between two nodes.
Example 2 The Airbnb data-set consists of several basic conceptual graph schemas, including $G_{b1}$ and $G_{b2}$. These basic conceptual graph schemas represent the intensional information that a review was written by a user and that a review was written for a listing.
Basic conceptual graph schemas serve as a starting point for a database designer and assist in conceptual modeling.A basic conceptual graph schema can contain nodes that are not connected to one another by an edge.A designer can create separate basic conceptual graph schemas for each requirement and/or use case.We now present our algebra FLASc for creating robust conceptual graph schemas from basic conceptual graph schemas.

Syntax and semantics of FLASc
An algebra consists of sets, constants that belong to the sets, and functions or operators that are used to manipulate the data stored inside the sets (Tucker and Stephenson 2003). Our algebra FLASc is defined as follows:
Definition 6 (FLASc) An algebra defined over a finite set of basic conceptual graph schemas $G_B$ is a tuple $\langle G_B, G, F \rangle$ where:
• $G$ is the set of all conceptual graph schemas over $G_B$, with $G_B \subset G$.
• $F$ is a set containing three operators:
- JOIN$(G_1, G_2)$ is a conceptual graph schema formed by the union of two conceptual graph schemas $G_1$ and $G_2$.
- DETACH$(G_1, G_2)$ is a conceptual graph schema formed by applying the ring sum over the edge sets of $G_1$ and $G_2$, where $L_{N_2}$ is a set of node labels and $L_{E_2}$ is a set of edge labels associated with $G_2$. The resultant conceptual graph schema consists of all the nodes present in graphs $G_1$ and $G_2$, that is $(N_1 \cup N_2)$, while the ring sum operator is applied only over the edge sets of the two graphs, that is $(E_1 \oplus E_2)$.
- DELETE_NODE$(G_1, G_d)$ is a conceptual graph schema formed by applying the ring sum over the node sets of $G_1$ and $G_d$. The resultant conceptual graph schema consists of the nodes formed by applying the ring sum over the node sets of the two graphs, that is $(N_1 \oplus N_d)$.
In each case, the source and target functions of the resulting schema (such as $T_3$) are defined piecewise from those of the operand schemas.
FLASc provides the JOIN, DETACH and DELETE_NODE operators over basic conceptual graph schemas to formulate composite conceptual graph schemas. We can now discuss the semantics of these three operators and provide some examples.
JOIN is used to combine two or more conceptual graph schemas. We follow a notion of join-compatible mapping similar to that discussed in Angles et al. (2017), Castro and Soto (2017) and Pérez et al. (2006). Two conceptual graph schemas are join compatible if they share common nodes, that is, $(N_1 \cap N_2) \neq \emptyset$.

Example 3
The basic conceptual graph schemas presented in Example 2 are join compatible because both graphs share a common node $n_2$ that has the node label review. Figure 2 shows the result of applying the JOIN operator over the basic conceptual graph schemas $G_{b1}$ and $G_{b2}$, that is, $G_{b3} = \mathrm{JOIN}(G_{b1}, G_{b2})$. Graphs $G_{b1}$ and $G_{b2}$ are join compatible because the target node of edge $e_1 \in E_1$, that is $T_1(e_1)$, and the source node of edge $e_2 \in E_2$, that is $S_2(e_2)$, are the same. Moreover, the node labels associated with these two nodes are also the same, that is, $\eta_1(T_1(e_1)) = \eta_2(S_2(e_2)) = \text{review}$.
Two join compatible conceptual graphs share common nodes.This assists in connecting smaller graphs.When two conceptual graph schemas are not join compatible, then application of the JOIN operator creates a union of two disconnected conceptual graph schemas.
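The JOIN semantics described above can be sketched with plain set union. This is a minimal illustration rather than a reference implementation: the `Schema` class, the encoding of nodes as (identifier, label) pairs and of edges as (source, label, target) triples are assumptions, and the edge label `written_for` is a hypothetical placeholder for the review-to-listing relationship.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    nodes: frozenset  # pairs (node_id, node_label)
    edges: frozenset  # triples (source_id, edge_label, target_id)


def make_schema(nodes, edges=()):
    return Schema(frozenset(nodes), frozenset(edges))


def join_compatible(g1, g2):
    """True when the schemas share a node with the same identifier and label."""
    return bool(g1.nodes & g2.nodes)


def join(g1, g2):
    """JOIN: union of the node sets and of the edge sets."""
    return Schema(g1.nodes | g2.nodes, g1.edges | g2.edges)


g_b1 = make_schema([("n1", "user"), ("n2", "review")], [("n1", "wrote", "n2")])
g_b2 = make_schema(
    [("n2", "review"), ("n3", "listing")],
    [("n2", "written_for", "n3")],  # hypothetical edge label
)
g_b3 = join(g_b1, g_b2)
print(join_compatible(g_b1, g_b2), sorted(g_b3.edges))
```

Because nodes carry their label in this encoding, two schemas only count as sharing a node when both the identifier and the label agree, which matches the label condition stated in Example 3. Joining schemas with no common node simply yields two disconnected components.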
DETACH is used to delete edges from a conceptual graph schema. This operator is useful if a database designer wishes to delete existing relationships in a conceptual graph schema. The graph produced after applying the DETACH operator over two conceptual graph schemas contains the nodes from both graphs, while the edges of the new conceptual graph schema are calculated by applying the ring sum operator over the edges of the conceptual graph schemas that are provided as input to the DETACH operator. If one graph is a sub-graph of the other conceptual graph schema, then applying the DETACH operator over such graphs represents the set difference of the edge sets. An edge can only be deleted using DETACH if $(E_1 \cap E_2) \neq \emptyset$, which means that both conceptual graph schemas must share some common edges. Furthermore, the labels associated with these edges must be the same, that is, $\exists e_1 \in E_1$ and $\exists e_2 \in E_2$ such that $\lambda_1(e_1) = \lambda_2(e_2)$. The application of DETACH removes existing edges from a conceptual graph schema. The resulting conceptual graph schema after the application of DETACH may contain disconnected nodes.

Example 4
Edges can be deleted from a conceptual graph schema by using DETACH. As shown in Fig. 3, applying DETACH between conceptual graph schemas $G_{b1}$ and $G_{b3}$ results in conceptual graph schema $G_{b4}$ that only contains an edge between nodes $n_2$ and $n_3$, that is, $G_{b4} = \mathrm{DETACH}(G_{b1}, G_{b3})$. Furthermore, node $n_1$ is neither the source nor the target of any edge in the resulting conceptual graph schema.
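The DETACH step of Example 4 can be reproduced with a symmetric difference, which is what the ring sum amounts to on sets. The sketch below is illustrative only: the `Schema` encoding and the `written_for` edge label are assumptions, not notation from the definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    nodes: frozenset  # pairs (node_id, node_label)
    edges: frozenset  # triples (source_id, edge_label, target_id)


def detach(g1, g2):
    """DETACH: keep all nodes, take the ring sum (symmetric difference) of edges."""
    return Schema(g1.nodes | g2.nodes, g1.edges ^ g2.edges)


def disconnected_nodes(g):
    """Identifiers of nodes that are neither source nor target of any edge."""
    connected = {end for (src, _, dst) in g.edges for end in (src, dst)}
    return {node_id for (node_id, _) in g.nodes if node_id not in connected}


g_b1 = Schema(
    frozenset({("n1", "user"), ("n2", "review")}),
    frozenset({("n1", "wrote", "n2")}),
)
g_b3 = Schema(
    frozenset({("n1", "user"), ("n2", "review"), ("n3", "listing")}),
    frozenset({("n1", "wrote", "n2"), ("n2", "written_for", "n3")}),  # hypothetical label
)
g_b4 = detach(g_b1, g_b3)
print(sorted(g_b4.edges), disconnected_nodes(g_b4))
```

Since $G_{b1}$ is a sub-graph of $G_{b3}$ here, the ring sum reduces to a set difference on the edges, and $n_1$ is left without any incident edge.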
DELETE_NODE is used to delete disconnected nodes in a conceptual graph schema. This operator is useful if a database designer wishes to delete existing nodes that are not connected to any other nodes in a conceptual graph schema, that is, nodes that are neither the source nor the target of any edge in the conceptual graph schema. As mentioned in Definition 6, the set of nodes in the resulting schema is the ring sum of the node sets, $(N_1 \oplus N_d)$. A node can therefore only be deleted if $(N_1 \cap N_d) \neq \emptyset$, which means that both graphs must share common nodes. Furthermore, a node $n_i \in N_1$ and the node $n_d \in N_d$ that deletes it must have the same node label, that is, $\eta_1(n_i) = \eta_d(n_d)$. Otherwise, all nodes in $N_d$ shall be added to the conceptual graph schema resulting from DELETE_NODE$(G_1, G_d)$.
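A hedged sketch of DELETE_NODE follows, using the schema names of Fig. 4. The `Schema` encoding and the `written_for` edge label are assumptions, and the guard that refuses to delete a node with incident edges is an added safety check that reflects the stated purpose of the operator; it is not part of Definition 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    nodes: frozenset  # pairs (node_id, node_label)
    edges: frozenset  # triples (source_id, edge_label, target_id)


def delete_node(g1, gd):
    """DELETE_NODE: ring sum (symmetric difference) over the node sets."""
    removed = g1.nodes & gd.nodes
    connected = {end for (src, _, dst) in g1.edges for end in (src, dst)}
    blocked = sorted(node_id for (node_id, _) in removed if node_id in connected)
    if blocked:
        # Assumption: only disconnected nodes may be deleted.
        raise ValueError(f"nodes still have incident edges: {blocked}")
    return Schema(g1.nodes ^ gd.nodes, g1.edges | gd.edges)


g_b4 = Schema(
    frozenset({("n1", "user"), ("n2", "review"), ("n3", "listing")}),
    frozenset({("n2", "written_for", "n3")}),  # hypothetical edge label
)
g_d = Schema(frozenset({("n1", "user")}), frozenset())
g_b7 = delete_node(g_b4, g_d)
print(sorted(g_b7.nodes))
```

If `g_d` contains a node that is not in the first schema, the symmetric difference adds it instead of deleting anything, which is the "otherwise" case described in the paragraph above.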
Example 5 Disconnected nodes can be deleted from a conceptual graph schema by using DELETE_NODE. As shown in Fig. 4, applying the DELETE_NODE operator between conceptual graph schemas $G_{b4}$ and $G_d$ results in a conceptual graph schema $G_{b7}$ that only consists of nodes $n_2$ and $n_3$ and an edge connecting them, where $\eta(n_2) = \text{review}$ and $\eta(n_3) = \text{listing}$. The resulting graph does not contain any disconnected node.
Using JOIN and DETACH together is helpful if the label and/or direction of edges in a conceptual graph schema has to be changed. These operators, when used together, enable a designer to alter the intensional information stored in a conceptual graph schema.

Example 6
Suppose a designer wishes to alter the label and direction of the edge between node $n_1$ labeled as user and node $n_2$ labeled as review in the conceptual graph schema $G_{b3}$ presented in Example 3. As shown in Fig. 5, the designer can apply DETACH between graphs $G_{b1}$ and $G_{b3}$, which results in graph $G_{b4} = \mathrm{DETACH}(G_{b1}, G_{b3})$. The designer can now define a basic conceptual graph schema $G_{b5}$ where $\eta(n_1) = \text{user}$ and $\eta(n_2) = \text{review}$. Applying the JOIN operator between graphs $G_{b4}$ and $G_{b5}$ results in conceptual graph schema $G_{b6} = \mathrm{JOIN}(G_{b4}, G_{b5})$, as shown in Fig. 5.
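The DETACH-then-JOIN pattern of Example 6 can be wrapped in a small helper. This is a sketch under assumptions: the `Schema` encoding and the helper `replace_edge` are illustrative, and the labels `written_for` and `written_by` are hypothetical stand-ins for the labels shown in the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    nodes: frozenset  # pairs (node_id, node_label)
    edges: frozenset  # triples (source_id, edge_label, target_id)


def join(g1, g2):
    return Schema(g1.nodes | g2.nodes, g1.edges | g2.edges)


def detach(g1, g2):
    return Schema(g1.nodes | g2.nodes, g1.edges ^ g2.edges)


def replace_edge(schema, old_edge, new_edge):
    """Swap one edge for another by DETACH followed by JOIN."""
    if old_edge not in schema.edges:
        raise ValueError(f"{old_edge!r} is not an edge of the schema")
    labels = dict(schema.nodes)  # node_id -> node_label

    def basic(edge):
        # Basic schema holding a single edge and its two endpoints.
        src, _, dst = edge
        return Schema(
            frozenset({(src, labels[src]), (dst, labels[dst])}),
            frozenset({edge}),
        )

    return join(detach(schema, basic(old_edge)), basic(new_edge))


g_b3 = Schema(
    frozenset({("n1", "user"), ("n2", "review"), ("n3", "listing")}),
    frozenset({("n1", "wrote", "n2"), ("n2", "written_for", "n3")}),
)
g_b6 = replace_edge(g_b3, ("n1", "wrote", "n2"), ("n2", "written_by", "n1"))
print(sorted(g_b6.edges))
```

The explicit membership check matters because the ring sum would add the old edge rather than remove it if it were not already present in the schema.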

Logical graph schema
A logical graph schema is used to capture extensional information about the entities and relations stored in a graph database. A logical graph schema is formed by enforcing integrity constraints on a conceptual graph schema. Label uniqueness constraints are automatically enforced in the logical graph schema since the node and edge labels used in the conceptual graph schema are unique. For defining property-based constraints, we first define the properties that can exist in graph databases. Properties in graph databases exist as key-value pairs, where property values are atomic entities and have an associated data type. A logical graph schema stores properties in a key-type format. Properties can be mandatory or optional, and some properties, such as ids, must be unique. This information must be stored in a logical graph schema.
Let $\mathcal{K}_s$ be an infinite set of keys (e.g., id, name, age) and $\mathcal{D}_s$ be a finite set of data types (e.g., String, Integer). We define a set of properties $\mathcal{P}_s \subseteq (\mathcal{K}_s \times \mathcal{D}_s)$. The property set is of two types, (i) the mandatory property set $\mathcal{P}_m$ and (ii) the optional property set $\mathcal{P}_o$, such that $\mathcal{P}_s = \mathcal{P}_m \cup \mathcal{P}_o$. The mandatory property set can contain properties whose values must be unique. Let $\mathbb{B}$ be the set of Boolean values. We define a uniqueness function $U : \mathcal{P}_m \to \mathbb{B}$ that maps each element of the mandatory property set to TRUE or FALSE, signifying whether the values associated with that mandatory property must be unique.
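The property sets and the uniqueness function can be illustrated with ordinary Python sets and a dictionary. This is a minimal sketch, assuming the example keys and types named above; the helper names and the mapping from schema types to Python types are hypothetical.

```python
PYTHON_TYPES = {"String": str, "Integer": int}  # assumed mapping of schema data types

mandatory = {("id", "Integer")}                      # mandatory property set
optional = {("name", "String"), ("age", "Integer")}  # optional property set
unique = {("id", "Integer"): True}                   # uniqueness function U


def violations(record, mandatory, optional):
    """List the ways a property map breaks the declared key-type pairs."""
    declared = dict(mandatory | optional)  # key -> data type name
    problems = [
        f"missing mandatory property {key}"
        for key, _ in sorted(mandatory)
        if key not in record
    ]
    for key, value in sorted(record.items()):
        if key not in declared:
            problems.append(f"undeclared property {key}")
        elif not isinstance(value, PYTHON_TYPES[declared[key]]):
            problems.append(f"property {key} is not of type {declared[key]}")
    return problems


def duplicated_unique_keys(records, unique):
    """Return the unique-flagged keys whose values repeat across records."""
    duplicated = set()
    for (key, _), must_be_unique in unique.items():
        values = [r[key] for r in records if key in r]
        if must_be_unique and len(values) != len(set(values)):
            duplicated.add(key)
    return duplicated


print(violations({"name": "Ann"}, mandatory, optional))
print(duplicated_unique_keys([{"id": 1}, {"id": 1}], unique))
```

Storing key-type pairs rather than key-value pairs keeps the schema independent of any concrete data, which is the distinction the text draws between schema properties and database properties.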
Edges in a graph schema also have associated semantic information such as cardinality, which refers to the total number of edges that can exist between any two given nodes of a graph database. The cardinality of an edge represents a range, where the minimum value of the cardinality refers to the minimum number of edges that must exist between any two nodes of a graph database. Similarly, the maximum value of the cardinality refers to the maximum number of edges that can exist between any two nodes in a graph database.
Let $C_{min} \subseteq \mathbb{W}$ represent a minimum cardinality set, which is drawn from the set of whole numbers. Let $C_{max} \subseteq \mathbb{N}$ represent a maximum cardinality set, which is drawn from the set of natural numbers. We define a set of cardinalities as $\mathcal{C} \subseteq (C_{min} \times C_{max})$ with the condition that if $min \in C_{min}$ and $max \in C_{max}$ then $min \leq max$. This means that the minimum cardinality can never be greater than the maximum cardinality. The minimum cardinality is a whole number, which means that it can be zero. The maximum cardinality, on the other hand, is a natural number, so the smallest value that can be associated with it is 1. Furthermore, when the maximum cardinality is 1, the minimum cardinality can be either 0 or 1.
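The conditions on a cardinality pair can be checked mechanically. The sketch below is illustrative: the class name `Cardinality` and the method `allows` are assumptions, and which edges are counted before calling `allows` is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cardinality:
    minimum: int  # whole number, may be 0
    maximum: int  # natural number, at least 1

    def __post_init__(self):
        if self.minimum < 0:
            raise ValueError("minimum cardinality must be a whole number (>= 0)")
        if self.maximum < 1:
            raise ValueError("maximum cardinality must be a natural number (>= 1)")
        if self.minimum > self.maximum:
            raise ValueError("minimum cardinality cannot exceed maximum cardinality")

    def allows(self, edge_count):
        """True when edge_count lies in the closed range [minimum, maximum]."""
        return self.minimum <= edge_count <= self.maximum


optional_single = Cardinality(0, 1)
print(optional_single.allows(0), optional_single.allows(2))
```

Rejecting invalid pairs at construction time means every `Cardinality` value in a schema already satisfies $min \leq max$, so later checks only need to compare counts against the range.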
A logical graph schema extends the conceptual graph schema discussed in Definition 4 by labeling the nodes and edges with mandatory and optional properties. Moreover, in a logical graph schema edges are labeled with cardinality values. Formally, a logical graph schema is defined as follows:
Definition 7 (Logical graph schema) A logical graph schema extends the conceptual graph schema $G_s$ with the following labeling functions:
• The mandatory property labeling function maps all nodes and edges to a non-empty subset of the mandatory property set, that is, to an element of $P^{+}(\mathcal{P}_m)$, where $P^{+}(\mathcal{P}_m)$ represents the powerset of the mandatory property set excluding the empty set.
• The edge cardinality function labels each edge with a cardinality $(min, max)$. Given an edge $e \in E_s$, let $n_i, n_j \in N_s$ such that $S_s(e) = n_i$ and $T_s(e) = n_j$; then the following conditions hold:
- The minimum number of edges belonging to the edge label $\lambda_s(e)$ that can exist between nodes of label $\eta_s(n_i)$ and $\eta_s(n_j)$ is $min$.
- The maximum number of edges belonging to the edge label $\lambda_s(e)$ that can exist between nodes of label $\eta_s(n_i)$ and $\eta_s(n_j)$ is $max$.
- The total number of edges belonging to the edge label $\lambda_s(e)$ that can exist between nodes of label $\eta_s(n_i)$ and $\eta_s(n_j)$ in a graph database must not be less than $min$ or more than $max$.
Example 7 By using Definition 7 the logical graph schema generated for Airbnb case study is presented in Fig. 6.The logical graph schema's topology is the same as the conceptual graph schema presented in Fig. 1.
Based on Definition 7, we can observe that a logical graph schema extends the conceptual graph schema by defining the property labeling functions over the nodes and edges of the conceptual graph schema. Therefore, the intensional information captured in the conceptual graph schema is maintained in the logical graph schema. Additionally, the logical graph schema contains extensional information in the form of unique, mandatory and optional properties and edge cardinalities (Angles et al. 2021). Furthermore, the data type associated with each property is also captured in the logical graph schema.
Example 8 Figure 6 shows the properties associated with nodes and edges of the logical graph schema.For instance, the node labeled as host consists of a mandatory and an optional property.The mandatory property host_id is of data type Integer and must be unique.The value associated with the Boolean flag being TRUE signifies the uniqueness constraint.The optional property name is of data type String and does not contain the uniqueness constraint.As discussed in Definition 7 edges of the logical graph schema contain information about the cardinality.For instance, the edge between node labeled as host and listing is labeled as owns and the cardinality associated in (1,n).This means that a host can own multiple listings and a listing can be associated with a single host.In the cardinality n represents a place holder for a natural number that can be calculated while creating the database creation script.
In our approach, the combination of conceptual and logical graph schema modeling stages represent the four steps of database design as suggested by Chen (1976).Information such as entity set, relationship set and organization of data into entities and relationships is covered in conceptual graph schema modeling stage (Angles et al. 2021).In the logical graph schema modeling stage semantic information such as cardinality of edges and properties associated with nodes and edges are defined (Angles et al. 2021).

FLASc operators for designing logical graph schemas
The three operators JOIN, DETACH and DELETE_NODE can also be used for designing and manipulating the logical graph schema. As mentioned in Definition 7, a logical graph schema is an extension of a conceptual graph schema. Therefore, the node and edge labeling functions as well as the source and target functions are valid in a logical graph schema. The semantics associated with these functions are also the same.
A logical graph schema consists of additional functions such as the mandatory and optional property labeling functions and the edge cardinality function. The use of the JOIN, DETACH and DELETE_NODE operators is constrained due to these additional labeling functions at the logical graph schema modeling stage. We now discuss the application of the operators for logical graph schema modeling.
JOIN: The application of JOIN on two given logical graph schemas works in a similar manner as for the source, target, node and edge labeling functions presented in Definition 6. Additional mappings are required for the property and cardinality labeling functions, which are discussed as follows:
Definition 8 (JOIN on Logical Graph Schema) Given two logical graph schemas The node and edge labeling functions, source and target functions are defined as in Definition 6.
The notion of two logical graph schemas being join compatible is the same as for conceptual graph schemas, discussed in Sect. 3.2.2. With respect to the properties, two logical graph schemas are join compatible if nodes have the same mandatory and optional properties, that is, ∃n_1 ∈ N_s1 and ∃n_2 ∈ N_s2 such that Δ_m1(n_1) = Δ_m2(n_2) and Δ_o1(n_1) = Δ_o2(n_2). In such a scenario we say that nodes n_1 and n_2 of the two logical graph schemas are join compatible.
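The property condition for join compatibility can be sketched in Python as follows. The SchemaNode type and the join_compatible function are hypothetical names introduced only for illustration; the sketch also assumes that the conceptual-level condition of Sect. 3.2.2 amounts to the two nodes sharing a label.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SchemaNode:
    """A schema node with its label and its mandatory and optional property keys."""
    label: str
    mandatory: FrozenSet[str]
    optional: FrozenSet[str]


def join_compatible(n1: SchemaNode, n2: SchemaNode) -> bool:
    """Same label (assumed conceptual condition) and identical property sets."""
    return (n1.label == n2.label
            and n1.mandatory == n2.mandatory   # mandatory property sets agree
            and n1.optional == n2.optional)    # optional property sets agree
```

Two host nodes that both declare host_id as mandatory and name as optional are join compatible under this sketch; if one of them declares an additional optional property, they are not.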
DETACH: The DETACH operator can be utilized by a database designer to delete an existing edge from a logical graph schema. Deleting an existing edge from a logical graph schema requires checking that the two conceptual graphs share some common edge with the same labels, as discussed in Sect. 3.2.2. Additionally, deleting edges in logical graph schemas requires that the edge properties and cardinalities are the same. In order to formalize the notion of the DETACH operator at the logical schema level, we further divide the sets of mandatory and optional properties into node and edge properties. The mandatory property set is the union of two sets containing the mandatory properties specific to nodes and to edges, respectively. Similarly, the optional property set is the union of two sets containing the optional properties specific to nodes and to edges, respectively.
Definition 9 (DETACH on logical graph schema) Given two logical graph schema The node and edge labeling functions, source and target functions are defined as in Definition 6.
In order to delete existing edges by using the DETACH operator, there must exist some edges that are common between the two logical graph schemas, that is, (E_s1 ∩ E_s2) ≠ ∅. This means that the labels of both edges must be the same. Additionally, the properties and cardinalities associated with the common edges must be the same as well, that is, ∃e_1 ∈ E_s1 and ∃e_2 ∈ E_s2 such that Δ_m1(e_1) = Δ_m2(e_2), Δ_o1(e_1) = Δ_o2(e_2) and the cardinalities of e_1 and e_2 are equal.

DELETE_NODE:
The DELETE_NODE operator can be utilized by a database designer to delete disconnected nodes from a logical graph schema. As discussed in Sect. 3.2.2, in order to delete an existing disconnected node the two logical graph schemas must contain common nodes. As mentioned in Definition 6, the node labeling must be the same. Additionally, the mandatory and optional properties must be the same as well.
Definition 10 (DELETE_NODE on logical graph schema) Given two logical graph schemas where:
• N_s3, E_s3, S_s3, T_s3 and the node and edge labeling functions of s3 form a conceptual graph schema as discussed in Definition 4. The node and edge labeling functions, source and target functions are defined as in Definition 6.
In order to delete existing nodes by using the DELETE_NODE operator, there must exist some nodes that are common between the two logical graph schemas, that is, (N_s1 ∩ N_s2) ≠ ∅. This means that the labels of both nodes must be the same. Additionally, the mandatory and optional properties associated with the common nodes must be the same as well, that is, ∃n_1 ∈ N_s1 and ∃n_2 ∈ N_s2 such that Δ_m1(n_1) = Δ_m2(n_2) and Δ_o1(n_1) = Δ_o2(n_2).

Axiomatic specifications of operators
The axiomatic specifications of any algebra enable us to check its completeness (Tucker and Stephenson 2003). In order to show the axiomatic specification we use infix notation for the operators in FLASc. As such we use the (⊔) notation for the JOIN operator, the (◊) notation for the DETACH operator and the (∇) notation for the DELETE_NODE operator.
The axiomatic specification of FLASc operators is presented in Table 1. For defining the identity axiom, we define an identity graph I_G = (∅, ∅), which means that the identity graph does not contain any nodes or edges. We can observe that the JOIN, DETACH and DELETE_NODE operators follow the associativity, commutativity, idempotence and identity axioms.
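As an informal illustration of the four axioms for JOIN only, the following Python sketch models a schema as a pair of node and edge sets and JOIN as a component-wise union. This union model is an assumption made for the example and stands in for Definition 6; it says nothing about DETACH or DELETE_NODE, and the names Graph, IDENTITY and join are hypothetical.

```python
from typing import FrozenSet, Tuple

Edge = Tuple[str, str, str]                      # (source node, edge label, target node)
Graph = Tuple[FrozenSet[str], FrozenSet[Edge]]   # (nodes, edges)

# The identity graph I_G contains neither nodes nor edges.
IDENTITY: Graph = (frozenset(), frozenset())


def join(g1: Graph, g2: Graph) -> Graph:
    """JOIN modeled as the union of the node sets and of the edge sets (assumption)."""
    return (g1[0] | g2[0], g1[1] | g2[1])
```

Under this model, associativity, commutativity, idempotence and identity follow directly from the corresponding properties of set union, for example join(g, IDENTITY) returns g unchanged.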
The distributive axioms for the JOIN, DETACH and DELETE_NODE operators are presented in Table 2. The axiomatic specification of FLASc operators enables us to use FLASc for generating new graph schemas from existing logical and conceptual graph schemas. The integrity constraints that can be enforced by a logical graph schema presented in Definition 7 include graph entity integrity constraints such as property uniqueness, label uniqueness, property data type and mandatory property constraints. The enforcement of these constraints, and of semantic constraints such as edge pattern, graph pattern and path pattern constraints, can be done at the physical modeling stage by using database-specific query languages. Following the graph schema to generate database creation scripts at the physical modeling stage ensures data consistency.

Schema instance consistency
Schema instance consistency is used to ensure that the labeled property graph database constructed at the physical modeling stage adheres to the logical graph schema generated by using FLASc. A labeled property graph database uses a graph structure for storing and managing data, allowing the modeling of real-world entities as nodes and edges (Angles et al. 2018; Sharma and Sinha 2019). Nodes are used to store data, and relationships or interactions between nodes are stored as edges (Angles et al. 2017; Sharma et al. 2019). Nodes and edges in a graph database can have properties associated with them. Let 𝕂_d be an infinite set of keys (e.g., id, name, age, etc.), 𝕍_d be an infinite set of values (e.g., 345, James, 33, etc.) and 𝕋_d be a finite set of data types (e.g., String, Integer etc.). We define a function Υ : 𝕍_d → 𝕋_d that maps values in the set 𝕍_d to their respective data types in the set 𝕋_d. The set of properties associated with the nodes and edges of a graph database is defined as ℙ_d ⊆ (𝕂_d × 𝕍_d) such that each p_d ∈ ℙ_d is a key-value pair where each value has a data type. To accommodate the existence of mandatory and optional properties, the set of properties can be further written as ℙ_d = ℙ_dm ∪ ℙ_do. Formally a labeled property graph database is defined as follows:
• For each n_i ∈ N_d (or e_i ∈ E_d), there exists a mandatory property with key k_dm and value v_d, and a corresponding schema property where k_sm ∈ 𝕂_s and t_s ∈ 𝕋_s.
  - Then, k_sm × t_s = k_dm × Υ(v_d), that is, the key and the data type of the value stored in a node (or edge) of the graph database are the same as the key and data type of the node (or edge) in the graph schema.
• For each n_i ∈ N_d (or e_i ∈ E_d), there exists an optional property with key k_do and value v_d, and a corresponding schema property where k_so ∈ 𝕂_s and t_s ∈ 𝕋_s.
  - Then, k_so × t_s = k_do × Υ(v_d), that is, the key and the data type of the value stored in a node (or edge) of the graph database are the same as the key and data type of the node (or edge) in the graph schema.
• The total number of edges of a certain label generated in the labeled property graph database must be between the minimum and maximum cardinality values associated with edges of the same label in the graph schema.
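The property conditions above can be sketched as a small Python check on a single node or edge. The function conforms and the mapping PYTHON_TYPES are hypothetical; the sketch assumes that the schema stores a data type name per property key and that keys absent from the schema are rejected, which the conditions above do not state explicitly.

```python
from typing import Any, Dict

# Assumed correspondence between schema data types and Python types.
PYTHON_TYPES = {"Integer": int, "String": str, "Boolean": bool}


def has_type(value: Any, type_name: str) -> bool:
    """Plays the role of the function that maps a value to its data type."""
    return type(value) is PYTHON_TYPES[type_name]  # exact match, so True is not an Integer


def conforms(properties: Dict[str, Any],
             mandatory: Dict[str, str],
             optional: Dict[str, str]) -> bool:
    """Check the stored key-value pairs of one node or edge against its schema entry."""
    for key, type_name in mandatory.items():
        # Every mandatory key must be present with a value of the declared type.
        if key not in properties or not has_type(properties[key], type_name):
            return False
    for key, value in properties.items():
        if key in mandatory:
            continue
        # Remaining keys must be declared optional and carry the declared type.
        if key not in optional or not has_type(value, optional[key]):
            return False
    return True
```

With the host node of Fig. 6 as an example (host_id mandatory and of type Integer, name optional and of type String), a stored node {host_id: 345, name: "James"} conforms, while a node without host_id or with host_id stored as the string "345" does not.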
Cardinality can be enforced programmatically at the physical modeling stage by using the logical graph schema generated by FLASc. Similarly, adherence to the node and edge labeling and to the property (optional and mandatory) labeling can be enforced at the physical modeling stage. The logical graph schema is independent of the underlying implementations. Moreover, the graph schema can be used in both the integrated and the layered physical modeling approaches. To support our claim, in the following two sections we experimentally demonstrate the use of the graph schema to transform and load data-sets by using both approaches for physical modeling of graph databases. However, while demonstrating the integrated approach we do not make any changes to the source code of the graph database system and consider this as future work.

Using FLASc to enforce integrity constraints
In this section, we demonstrate the use of the graph schema generated by FLASc for enforcing integrity constraints, which are essential for ensuring data consistency in graph databases. We illustrate the manual integration of the conceptual, logical and physical modeling stages. We design the database creation scripts using the logical graph schema generated by FLASc for the Airbnb data-set, as shown in Fig. 6. We do not make any changes to the source code of Neo4j; however, the formulation of the database creation scripts in Cypher is driven by the logical graph schema. We then execute these scripts directly over the Neo4j graph database.
As discussed in Sharma and Sinha (2019), the Airbnb data-set consists of three CSV files containing information related to listings, reviews and calendar data. The listings file contains information such as the hosts that own the listings, the amenities provided in the listings, the location of the listings, etc. The reviews file contains information related to the users who have stayed in the listings and provided feedback in reviews. The calendar file contains information related to booking details such as pricing and occupancy. These files contain multiple lines (rows) of data, where each row contains a comma-separated list of values. For instance, a CSV file containing information related to listings from Airbnb's data is shown in Table 3.

Manual generation of database creation scripts
The logical graph schema generated by FLASc for the Airbnb data-set contains intensional and extensional information that assists a database designer in enforcing integrity constraints in the database scripts.

Enforcement of graph entity integrity constraints
Graph entity integrity constraints are used to enforce restrictions on the properties associated with nodes and edges in a graph database. The extensional information captured in the logical graph schema, as discussed in Definition 7, is used to enforce such constraints. We discuss the enforcement of graph entity integrity constraints for transforming and loading the Airbnb data-set into the Neo4j graph database by using the Cypher query language.
Node property uniqueness constraint The sample listing file shown in Table 3 has a Listing ID associated with each listing. Furthermore, in the logical graph schema shown in Fig. 6 the uniqueness flag of the listing_id field is set to True, which means that the listing_id must be unique. Therefore, before creating the listing nodes in the Neo4j graph database, the uniqueness constraint must be established to reduce the chance of data corruption. This is achieved by running Query 1. The uniqueness constraint specified in Query 1 ensures that multiple nodes with the same listing_id are not created in the Neo4j graph database. The IF NOT EXISTS clause ensures that the constraint is created at most once. The next constraints to be enforced are the mandatory node and edge property constraints.
Mandatory node property constraint The sample listing file also contains information about the host_id, and in the logical graph schema shown in Fig. 6 the host_id is a mandatory field. Therefore, additional constraints must be enforced on the listing nodes. This can be achieved by running Query 2 in Cypher. The node property existence constraint specified in Query 2 ensures that listing nodes always have a value assigned to the property host_id. The ASSERT EXISTS clause is used to enforce this condition.
Mandatory edge property constraint Mandatory property constraints can also be specified on the edges that have to be created in the graph database. The logical graph schema discussed in Definition 7 helps in enforcing this constraint in two ways: first, it provides details about the edge labels; second, it provides details about the mandatory, unique and optional properties associated with the edges. For example, as shown in Fig. 6, the edge labeled as owns has a mandatory property since, which can be enforced by running Query 3 in Cypher. The mandatory edge property constraint shown in Query 3 is used to ensure that there is always a value assigned to id of every edge labeled as OWNS in the graph database.
Node key constraint This constraint can be applied over a set of node properties. It combines the functionality provided by the uniqueness and mandatory property constraints. For example, the node labeled as host has two mandatory and unique properties, user_id and name. This constraint can be enforced in the Neo4j graph database by using Query 4.
Property data type constraint The logical graph schema is used to enforce the property data type constraint over the node and edge properties. As discussed in Definition 7, a logical graph schema contains properties that have a data type associated with them. Therefore, the database creation scripts are designed by utilizing this information. For instance, in the logical graph schema shown in Fig. 6 listing_id and host_id are of Integer data type. The Cypher query to enforce this constraint is presented as Query 5.

Query 5 Cypher query to enforce property data type and edge pattern constraint
The property data type constraint is enforced by using the inbuilt toInteger() function in Cypher, as shown in lines 7 and 10 of Query 5. This function is used because the logical graph schema specifies that listing_id and host_id must be of Integer data type. In Query 5 the Cypher MERGE clauses in lines 7 and 10 represent the creation of two nodes, that is, a listing node and a host node. This also illustrates the combination of the conceptual and logical modeling stages, where a basic conceptual graph schema containing two disconnected nodes, as discussed in Definition 5, is further labeled with node properties, representing the use of the node labeling function discussed in Definition 7. Additionally, Cypher supports the use of CASE statements, as illustrated in lines 7-12 of Query 5. The CASE statements are used to ensure that if some values are missing in the CSV files, they are loaded as user-defined values, such as 'System' in our case.
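The same idea, converting fields to the data type required by the schema and substituting a user-defined value for missing ones, can be sketched outside Cypher. The Python fragment below is illustrative only: to_integer only loosely mirrors the intent of toInteger(), the field name host_name and the row values are hypothetical, and the choice to apply the default only to non-integer fields is an assumption.

```python
from typing import Dict, Optional, Set


def to_integer(text: Optional[str]) -> Optional[int]:
    """Return the integer value of text, or None when it cannot be converted."""
    try:
        return int(text.strip())
    except (AttributeError, ValueError):  # text is None or not an integer literal
        return None


def clean_row(row: Dict[str, Optional[str]],
              integer_fields: Set[str],
              default: str = "System") -> Dict[str, object]:
    """Coerce schema-declared Integer fields and fill missing text fields with a default."""
    cleaned: Dict[str, object] = {}
    for key, value in row.items():
        if key in integer_fields:
            cleaned[key] = to_integer(value)       # None signals a missing or invalid integer
        elif value is None or value.strip() == "":
            cleaned[key] = default                 # user-defined replacement value
        else:
            cleaned[key] = value
    return cleaned
```

For example, clean_row({"listing_id": "241032", "host_id": "956883", "host_name": ""}, {"listing_id", "host_id"}) converts both identifiers to integers and loads the empty host_name as 'System'.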
Other graph entity integrity constraints, such as node and edge label uniqueness, are maintained by default by the logical graph schema generated using FLASc. By Definition 7 a node or edge can have only one label associated with it. On the other hand, Neo4j allows a node to be associated with more than one label (Bonifati et al. 2018; Neo4j 2021). FLASc does not support this for the sake of simplicity. Such features are not present in all graph database systems and tend to make the definitions of graph schemas and graph databases complex (Angles et al. 2019, 2020). Constraints such as edge property uniqueness can be specified in FLASc; however, such constraints cannot be enforced in Neo4j.
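To make the constraints of this subsection concrete, the following Python sketch builds the corresponding Cypher statements as strings from a label and its property keys. It is not the text of Queries 1 to 4. The function names are hypothetical, and the statements follow the ASSERT-style constraint syntax that the clauses mentioned above (IF NOT EXISTS, ASSERT EXISTS) suggest; later Neo4j releases changed this syntax, so the strings should be checked against the Cypher manual of the version in use.

```python
from typing import Sequence


def unique_constraint(label: str, prop: str) -> str:
    """Node property uniqueness constraint."""
    return f"CREATE CONSTRAINT IF NOT EXISTS ON (n:{label}) ASSERT n.{prop} IS UNIQUE"


def node_exists_constraint(label: str, prop: str) -> str:
    """Mandatory node property constraint."""
    return f"CREATE CONSTRAINT IF NOT EXISTS ON (n:{label}) ASSERT EXISTS (n.{prop})"


def edge_exists_constraint(label: str, prop: str) -> str:
    """Mandatory edge property constraint."""
    return f"CREATE CONSTRAINT IF NOT EXISTS ON ()-[r:{label}]-() ASSERT EXISTS (r.{prop})"


def node_key_constraint(label: str, props: Sequence[str]) -> str:
    """Node key constraint over a set of properties (unique and mandatory together)."""
    keys = ", ".join(f"n.{p}" for p in props)
    return f"CREATE CONSTRAINT IF NOT EXISTS ON (n:{label}) ASSERT ({keys}) IS NODE KEY"
```

Calling unique_constraint("listing", "listing_id"), for instance, yields a statement that asks the database to reject a second listing node with an already used listing_id.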

Enforcement of semantic integrity constraints
Semantic integrity constraints are used to enforce topological restrictions on the graph database. The intensional information captured in the graph schema during the conceptual modeling stage becomes useful for enforcing semantic integrity constraints.
Edge pattern constraint To enforce the edge pattern constraint, the topological information stored in the logical graph schema is used while creating the database creation scripts. For instance, Query 5 is also used to create edges between nodes of label host and listing. Each edge created by using Query 5 is labeled as owns and represents a valid edge in the logical graph schema shown in Fig. 6. According to the Neo4j Cypher manual, the MERGE clause serves as a combination of the MATCH and CREATE clauses. Therefore, in Query 5 the MERGE clause in lines 7 and 10 is used to match the host and listing nodes, creating them if they do not already exist. The WITH clause presented in line 13 of Query 5 allows query parts to be chained together; therefore, the host and listing nodes created in lines 7-12 are passed on by using the WITH clause to facilitate the creation of edges between the host and listing node types, that is, in lines 13-17 of Query 5. The DISTINCT clause along with the WITH clause is used to ensure the removal of duplicate nodes in Query 5. The WHERE clause in lines 14-16 is used to define constraints that filter results based on the values obtained from the CSV files. The CREATE clause at line 17 in Query 5 represents the creation of a graph containing two nodes and an edge connecting them, as discussed in Definition 5. The edge of the graph is further labeled with edge properties, representing the use of the edge labeling function discussed in Definition 7.
Graph pattern constraint Enforcing graph pattern constraints requires knowledge about the topology of the data-set, which is captured by the logical graph schema. These constraints check for the existence of a certain graph structure in the database before any new node or edge can be created. A graph pattern constraint in Cypher is presented as Query 6, which ensures that listing nodes that have been reviewed by a user are attached to booking_detail nodes by edges that are labeled as has.
Query 6 Cypher query to enforce graph pattern constraint
In Query 6 the MATCH clause in line 3 is used to check whether the graph pattern exists. This graph pattern (Angles et al. 2017) is built by using the intensional information in the logical graph schema presented in Fig. 6, which assists in formulating valid graph patterns for enforcing such constraints. The MATCH clause in this query connects two graph patterns which are join compatible (Sharma et al. 2021). The CREATE clause in line 6 is used to combine the graph obtained from the MATCH clause with a logical graph schema specified in the CREATE clause. This represents the use of the JOIN operator. The two logical graph schemas are join compatible since they share the node l labeled as listing. Query 6 also illustrates the use of the :auto USING PERIODIC COMMIT clause in line 1, which is used to handle the large amount of data being processed.
Path pattern constraint These constraints check for the existence of certain paths in a graph database before a new node or edge can be created. Query languages for graph databases use the formalisms of conjunctive two-way regular path queries (C2RPQs) and nested regular expressions (NREs) to express and then search for path patterns (Florescu et al. 1998; Wood 2012; Angles et al. 2014; Bagan et al. 2015; Barceló et al. 2011, 2012, 2016; Reutter 2013). Furthermore, other expressive formalisms such as conjunctive queries and unions of conjunctive queries extended with Tarski's relation algebra (CQT/UCQT), proposed in Sharma et al. (2021), can also be used to enforce path constraints. In these formalisms, regular expressions defined over the edge labels of the graph database are used to describe path patterns (Angles et al. 2017). The intensional information captured in the logical graph schema assists in creating valid path patterns. Query 7 illustrates the enforcement of a path pattern constraint in Cypher. As in Query 6, the CREATE clause in the query represents the use of the JOIN operator to combine the graph obtained from the MATCH clause at line 3 with the logical graph schema specified in the CREATE clause in line 6.
In Query 7 the path pattern constraint is specified in the MATCH clause, which represents the regular expression (wrote.review_for) formed by applying the concatenation operator over the edge labels wrote, review_for and has. Other regular expression operators such as union and Kleene star can also be used to form more expressions. However, Cypher provides only limited support for regular expressions, as the use of the Kleene star operator over the concatenation of two or more edge labels is not allowed in Cypher (Angles et al. 2017; Sharma et al. 2021). Further modifications can be made to the query language by using formalisms such as Tarski's algebra instead of regular expressions to increase its expressiveness (Sharma et al. 2021).
Other constraints such as schema instance consistency are ensured since the generation of the database creation scripts is driven by the logical graph schema. Constraints such as functional dependencies are not easy to enforce in graph databases (Angles and Gutierrez 2008); however, in order to enforce functional dependencies while modeling graph databases, a designer can follow the approach proposed in Park et al. (2014). This approach states that every non-key property must only provide information about the associated nodes and edges. Constraints such as edge identity uniqueness and cardinality constraints cannot be directly enforced in Neo4j. However, such constraints can be enforced by writing a wrapper in a programming language such as Java or Python, which can be used, for example, to ensure that edge ids are unique.
The logical graph schema generated by FLASc enables us to enforce several practical integrity constraints. FLASc assists in the generation of robust conceptual and logical graph schemas. FLASc can be integrated with an existing Extract-Transform-Load process for ensuring data consistency when data from heterogeneous sources is being loaded into a graph database such as Neo4j. The manual approach presented in this section has limitations. Firstly, this approach requires a database designer to possess knowledge of a graph database query language such as Cypher. Secondly, creating the database creation scripts manually can be cumbersome and error-prone, making the process less maintainable, scalable and manageable. Finally, Cypher does not support loading data from heterogeneous sources into the Neo4j graph database. Therefore, to mitigate such limitations, in the next section we present our layered approach.
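To illustrate what a regular expression over edge labels expresses, including a Kleene star over a concatenation, the following Python sketch matches the sequence of edge labels along a path against such an expression. It runs in client code and is not a Cypher feature; the helper names are hypothetical, and the encoding assumes that edge labels never contain the ';' character.

```python
import re
from typing import Sequence


def encode(labels: Sequence[str]) -> str:
    """Encode a path as its edge labels, each terminated by ';'."""
    return "".join(f"{label};" for label in labels)


def concat(*labels: str) -> str:
    """Regular expression for the concatenation of the given edge labels."""
    return "".join(re.escape(label) + ";" for label in labels)


def star(expr: str) -> str:
    """Kleene star over an arbitrary sub-expression, including a concatenation."""
    return f"(?:{expr})*"


def path_matches(labels: Sequence[str], expr: str) -> bool:
    """True when the whole label sequence of the path is described by expr."""
    return re.fullmatch(expr, encode(labels)) is not None
```

Here path_matches(["wrote", "review_for"], concat("wrote", "review_for")) holds, and star(concat("wrote", "review_for")) additionally accepts any number of repetitions of that two-edge pattern.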
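A wrapper of the kind described above could look like the following Python sketch. The class EdgeGuard and its methods are hypothetical and do not talk to Neo4j; the sketch only shows the bookkeeping that would run before an edge is written, assuming that cardinality is counted per edge label and that identifiers are generated with the standard uuid module.

```python
import uuid
from collections import Counter
from typing import Dict, List, Set, Tuple


class EdgeGuard:
    """Issues unique edge ids and tracks per-label edge counts against (min, max)."""

    def __init__(self, cardinalities: Dict[str, Tuple[int, int]]) -> None:
        self._cardinalities = cardinalities
        self._counts: Counter = Counter()
        self._ids: Set[str] = set()

    def admit(self, label: str) -> str:
        """Return a fresh edge id, or raise if the label's maximum is already reached."""
        _, maximum = self._cardinalities[label]
        if self._counts[label] >= maximum:
            raise ValueError(f"maximum cardinality reached for edge label {label!r}")
        edge_id = uuid.uuid4().hex
        while edge_id in self._ids:       # regenerate in the unlikely case of a clash
            edge_id = uuid.uuid4().hex
        self._ids.add(edge_id)
        self._counts[label] += 1
        return edge_id

    def unmet_minimums(self) -> List[str]:
        """Labels whose current edge count is still below the minimum cardinality."""
        return [label for label, (minimum, _) in self._cardinalities.items()
                if self._counts[label] < minimum]
```

A loader would call admit before creating each edge and store the returned id as an edge property, then call unmet_minimums once loading finishes to report labels that did not reach their minimum cardinality.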

A layered approach for data transformation and loading using FLASc
Graph databases are schema-less or schema-optional; therefore, maintaining data consistency and integrity is not easy. A graph database can be easily altered unless the database's underlying source code is amended to support the enforcement of all integrity constraints. Hence, in this section we propose a layered approach that incorporates the development of an additional wrapper to ensure data consistency. While following the layered approach, we use the APIs provided by Neo4j to access the graph database. We illustrate how FLASc can be used to assist the transformation and loading of data from heterogeneous sources into graph databases, hence addressing RQ3 and RQ3.1.

Schema driven layered approach
Overview The overall physical view of our layered approach is presented in Fig. 7, which consists of three main components: (i) FLASc, which serves as a graph schema generator, (ii) an importing subsystem and (iii) a graph database such as Neo4j.
The importing subsystem takes source files and a graph schema generated by FLASc as inputs. The subsystem then creates database creation scripts in Cypher by following the intensional and extensional information captured in the graph schema. The subsystem then interacts with the Neo4j graph database by using the APIs and executes the database creation scripts on the graph database.
Importing subsystem design The importing subsystem is based on the Extract-Transform-Load (ETL) design pattern. As shown in Fig. 8, the extract stage is used to fetch data from a source and consolidate it into a repository. The transform stage is used to apply appropriate transformation rules over the repository data. The transform stage uses the graph schema generated by FLASc to apply the transformation rules and create the database creation scripts. The load stage is finally used to execute the scripts on the database. In the load stage, the database is accessed by using the specific API calls.
Technology stack The subsystem is developed as a Java Maven project where the front end is designed using the Java Swing library. The subsystem uses Neo4j libraries for establishing a connection with the Neo4j graph database. Maven is used for handling API-specific external dependencies. Neo4j's Cypher language is used for querying and creating the database.
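The three stages can be outlined with a short Python sketch. It is not the Java subsystem described here: the function names, the query text and the execute callable (which stands in for a database API call) are all hypothetical, and rows with a missing or non-integer mandatory key are simply skipped as one possible transformation rule.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Sequence, TextIO, Tuple

Statement = Tuple[str, Dict[str, object]]  # (parameterized Cypher text, parameters)

# Hypothetical statement that creates host and listing nodes and the owns edge.
OWNS_QUERY = (
    "MERGE (h:host {host_id: $host_id}) "
    "MERGE (l:listing {listing_id: $listing_id}) "
    "MERGE (h)-[:owns]->(l)"
)


def extract(source: TextIO) -> Iterable[Dict[str, str]]:
    """Extract stage: read CSV rows into a repository of dictionaries."""
    return csv.DictReader(source)


def transform(rows: Iterable[Dict[str, str]],
              integer_keys: Sequence[str] = ("listing_id", "host_id")) -> List[Statement]:
    """Transform stage: apply schema rules (mandatory Integer keys) and build statements."""
    statements: List[Statement] = []
    for row in rows:
        try:
            params = {key: int(row[key]) for key in integer_keys}
        except (KeyError, TypeError, ValueError):
            continue  # a mandatory key is missing or is not an integer
        statements.append((OWNS_QUERY, params))
    return statements


def load(statements: Iterable[Statement],
         execute: Callable[[str, Dict[str, object]], None]) -> int:
    """Load stage: hand each statement to the database API and count them."""
    count = 0
    for query, params in statements:
        execute(query, params)
        count += 1
    return count
```

In use, load(transform(extract(file)), execute) would be called with execute bound to whatever call the chosen database driver provides for running a parameterized query.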

Airbnb case study
Transforming and loading data in CSV format is straightforward in Neo4j and Cypher. Furthermore, the Airbnb data-set exists in the form of denormalized relational tables; as such, connections between nodes can be established based on primary key and foreign key relationships. As shown in Query 6, the clause LOAD CSV WITH HEADERS FROM represents the extract stage. In Query 6 the data is fetched from the Airbnb website, as shown in line 2. The data is stored in a repository represented by the "row" variable in the query. The transform stage in Query 6 is represented in lines 3-6, where the MATCH clause is used to search for the existence of patterns, the WHERE clause is used to restrict the result set based on some conditions and the WITH clause serves as a medium to deliver the data (listings and row) to the CREATE clause. Finally, the CREATE clause is used to create the edge between the nodes labeled as listing and booking_details. The transform stage is also responsible for ensuring that the integrity constraints are enforced, which is done by using the graph schema. In a layered approach, the load stage is responsible for creating a connection with the Neo4j graph database by making appropriate API calls. Finally, the additional wrapper written in Java is used to execute the entire query on Neo4j.
The main advantage of using the layered approach is that additional logic can be written to ensure data consistency. For instance, Cypher does not provide inbuilt mechanisms to enforce uniqueness constraints on edges. A layered approach is beneficial in such scenarios, as additional logic can be written in a programming language to generate unique values for a particular edge property. The advantage of the layered approach is evident when data in formats other than CSV are to be loaded into the Neo4j graph database. To illustrate this, we present the use of our layered approach to transform and load a data-set related to a big data analytics case study.

BiDaML case study
Implementing large-scale big data projects requires ongoing collaboration and monitoring by multiple stakeholders who have differing concerns. BiDaML (Big Data Analytics Modelling Languages) (Khalajzadeh et al. 2019) is a domain-specific language for planning, specifying, monitoring and designing big data analytics projects. The BiDaML suite presents different graph-based diagrams with highly interrelated data. This case study considers five diagrams, namely brainstorming, process, technique, data and deployment, which provide different levels of abstraction. These diagrams are generated for the National Bowel Cancer Screening Program (NBCSP) in Australia (AGD of Health 2017).
The BiDaML suite currently lacks the automation and tooling required to allow individual users to view customised information specific to their needs and preferences within these diagrams. Importing data-sets from highly structured tools, such as the current HTML-based implementation of BiDaML diagrams, into graph databases such as Neo4j is a challenge, because Neo4j does not provide clauses for importing HTML data. We illustrate the use of our schema-driven approach for transforming and loading BiDaML diagrams into Neo4j.

BiDaML diagrams data-set
The data-set consists of five diagrams generated by the BiDaML suite. The brainstorming diagram provides an overview of a data analytics project and all the tasks and sub-tasks involved in designing the solution at a very high level. Users can include comments and extra information for the other stakeholders. The process diagram specifies the analytics process, which includes sequencing the tasks identified in the brainstorming diagram and relating these tasks to participants or stakeholders. Technique diagrams show how tasks from the brainstorming and process diagrams are elaborated further by applying specific techniques. Data diagrams document the data and artefacts produced in each of the above diagrams at a low level, i.e. the technical AI-based layer. They also define the outputs associated with different tasks, such as output information, reports, results, visualisations and outcomes. Finally, deployment diagrams depict the run-time configuration, i.e. the system hardware, the software installed on it and the middle-ware connecting different machines for development-related tasks.
The graph schema generated by using FLASc for BiDaML diagrams is presented in Fig. 9, where the node labeled as TASK allows edges that are available in different diagrams, including outgoing edges to other tasks allowed in brainstorming, process and technique diagrams. These edges are distinguished from each other via additional edge labels. For instance, edges between task nodes in brainstorming diagrams are labeled as TT. Edges between task nodes in process diagrams are labeled as PR. The schema also allows other node labels such as ROOT in brainstorming diagrams, START, END and CONDITION in process diagrams and INFRASTRUCTURE in deployment diagrams. In BiDaML, technique and data diagrams can have techniques and data artefacts that are used as nodes in deployment diagrams. To keep the graph schema simple, we classify techniques, artefacts, etc., as nodes of label OTHER. As shown in Fig. 9, the graph schema also captures extensional information such as the mandatory, unique and optional properties related to the nodes and edges of BiDaML diagrams. For example, the node labeled as TASK has nine associated properties, where id, diagram_type and name are mandatory properties. The id property must be unique, and properties including type, activity_type and organization are optional.

Importing subsystem for diagrams data-set
To transform and load diagram data-set into Neo4j we still use the same ETL design pattern with slight modification to each stage.As shown in Fig. 10 data files in HTML format are passed to the Extract stage that consists of two processes: Parse-HTML and Data builder.The HTML file contains information about nodes and edges of graphs using map tags as well as additional properties such as id, name, type, sub-type, activity-type, stakeholder, comments and organization.Parse-HTML process reads the entire HTML file by using the JSoup library Hedley (2020) and creates a repository containing all the nodes and edges, which is then passed on to the Data Builder for further processing.
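The role of the Parse-HTML process can be sketched with Python's standard html.parser module instead of JSoup. The sketch assumes, purely for illustration, that each diagram element appears as an area tag inside a map tag and that properties such as id, name and type are plain attributes of that tag; the actual layout of the diagram files is not specified here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class MapAreaCollector(HTMLParser):
    """Collects the attributes of every <area> element found inside a <map> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_map = False
        self.elements: List[Dict[str, str]] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag == "map":
            self._in_map = True
        elif tag == "area" and self._in_map:
            # Attributes without a value are stored as empty strings.
            self.elements.append({name: value or "" for name, value in attrs})

    def handle_endtag(self, tag: str) -> None:
        if tag == "map":
            self._in_map = False


def parse_elements(html: str) -> List[Dict[str, str]]:
    """Return the repository of elements described by the map tags of one HTML file."""
    collector = MapAreaCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```

The resulting list of attribute dictionaries plays the part of the repository that is handed to the next process for duplicate removal.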
The Data builder process first removes duplicate elements in the repository.The builder then converts the repository into a list of edges (and nodes) that need to be stored in the graph database.In the Transform stage, the Cypher Query Builder takes the edge list from the extract stage and graph schema generated using FLASc as inputs to generate queries for loading data into Neo4j.This stage also ensures that appropriate integrity constraints captured in the graph schema are enforced.
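The two Extract processes can be sketched as follows. The subsystem described here uses JSoup; this sketch swaps in Python's standard html.parser module instead. The area elements and their data-* attribute names are assumptions, since the exact layout of the HTML files is not given.

```python
from html.parser import HTMLParser


class AreaCollector(HTMLParser):
    """Collect the attributes of every <area> element inside a <map>."""

    def __init__(self):
        super().__init__()
        self.in_map = False
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag == "map":
            self.in_map = True
        elif tag == "area" and self.in_map:
            self.elements.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "map":
            self.in_map = False


def parse_html(text):
    """Parse-HTML stand-in: build a repository of raw elements."""
    collector = AreaCollector()
    collector.feed(text)
    collector.close()
    return collector.elements


def remove_duplicates(elements):
    """Data Builder stand-in: keep the first element seen for each id."""
    seen, unique = set(), []
    for element in elements:
        if element.get("id") not in seen:
            seen.add(element.get("id"))
            unique.append(element)
    return unique


# Hypothetical input; the attribute names are illustrative only.
SAMPLE = (
    '<map name="diagram">'
    '<area id="t1" data-name="Collect data" data-type="task">'
    '<area id="t1" data-name="Collect data" data-type="task">'
    '<area id="t2" data-name="Clean data" data-type="task">'
    "</map>"
)
```

Calling remove_duplicates(parse_html(SAMPLE)) yields the de-duplicated repository that the Transform stage would consume.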
The final load stage consists of a Database Connector process and a Neo4j graph database interface.The Database Connector process establishes a connection with the Neo4j graph database using the Neo4j interface.A session is created between the subsystem and the Neo4j database.The Cypher query constructed in the transform stage is packaged into a create query and then executed.This process also ensures that nodes are not duplicated, especially if some of the imported nodes were already present in the database.The time at which each node or edge is created during the ETL operations or during subsequent editing of the diagrams, is stored as a time stamp attribute within each updated element.Additional information, such as clustering of tasks in brainstorming diagrams and mapping tasks to specific stakeholders, is all stored as attributes of the corresponding nodes.
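A minimal sketch of how the Transform stage might build load queries that avoid duplicate nodes and record time stamps is given below. It uses the Cypher MERGE clause rather than a plain CREATE, which is one possible way to obtain the behaviour described above; the text does not state which clause the subsystem uses. The function names and the created_at and updated_at property names are hypothetical.

```python
import re

# Labels and property keys are spliced into the query text, so they are
# restricted to plain identifiers; values travel as query parameters.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _safe(name):
    if not IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def merge_node_query(label, props):
    """Build a parameterised MERGE keyed on the node's id property."""
    _safe(label)
    for key in props:
        _safe(key)
    query = (
        f"MERGE (n:{label} {{id: $id}}) "
        "ON CREATE SET n += $props, n.created_at = timestamp() "
        "ON MATCH SET n += $props, n.updated_at = timestamp()"
    )
    return query, {"id": props["id"], "props": props}


def merge_edge_query(source_label, edge_label, target_label):
    """Build a parameterised MERGE for an edge between two existing nodes."""
    for name in (source_label, edge_label, target_label):
        _safe(name)
    return (
        f"MATCH (a:{source_label} {{id: $source}}), "
        f"(b:{target_label} {{id: $target}}) "
        f"MERGE (a)-[r:{edge_label}]->(b) "
        "ON CREATE SET r.created_at = timestamp()"
    )
```

The returned query string and parameter map could then be handed to whatever database session the Load stage maintains.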

P2660.1 case study
Designing robust Industrial Cyber-Physical Systems (ICPS) largely depends upon identifying industrial agents that provide complex and harmonious control mechanisms at the software level. These industrial agent practices are used to develop more extensive and feature-rich ICPS. IEEE standardization projects such as P2660.1 aim at identifying industrial agent practices that can suit the requirements of future ICPS. A key challenge with this project is the identification of industrial agent practices based on user-defined criteria. This case study is based on a tool (IASelect) developed for the IEEE standardization project P2660.1 (P2660.1 2020) that assists in selecting best-fit industrial agent practices for ICPS (Sharma et al. 2019).

P2660.1 data-set
The P2660.1 data-set consists of two practices, OnDevice and Hybrid. Each practice is of two types, tightly-coupled and loosely-coupled. Practices have an associated set of qualities, which make these practices suitable to use in specific contexts. Hence, selecting the best-fit practices requires identifying the associated qualities. The P2660.1 working group identifies four kinds of qualities, such as Domain.
The graph schema generated by using FLASc for the P2660.1 data-set is presented in Fig. 11, which consists of two practice nodes and four quality nodes. Each practice node is connected to a quality node by an edge labeled has_score. This signifies that every practice to be stored in the graph database must connect with a quality, which represents the intensional information associated with the data-set. The extensional information is captured by node and edge properties. All nodes and edges have an associated property id, which is mandatory, is of Integer data type and must hold a unique value. A property such as type is mandatory but need not be unique. All edges have a unique and mandatory property id. The score property is mandatory but not unique, because the same score value can be assigned to different practice-quality pairs by an ICPS expert. All edges contain an optional property assignedOn with an associated data type date-time.
The Parse-AM process reads the entire XLS file by using the Apache POI library (Apache 2020) and converts it into a repository. The other processes required to transform and load the P2660.1 data-set into Neo4j are similar to the processes used in the diagram case study presented in Sect. 5.3.2.
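The practice-quality constraints described above can be checked with a short sketch before any data is loaded. The node labels PRACTICE and QUALITY, the dictionary layout and the function name check_practices are assumptions for illustration; only the has_score label and the id and score properties come from the schema description.

```python
def check_practices(nodes, edges):
    """Check the practice-quality constraints on an in-memory edge list.

    nodes maps a node id to its label; edges is a list of dicts with the
    keys id, source, target, label and score.
    """
    problems = []
    # Edge ids are mandatory and unique.
    edge_ids = [edge["id"] for edge in edges]
    if len(edge_ids) != len(set(edge_ids)):
        problems.append("edge id values must be unique")
    # score is mandatory; repeated score values are allowed.
    for edge in edges:
        if edge.get("score") is None:
            problems.append(f"edge {edge['id']} lacks the mandatory score property")
    # Every practice must reach at least one quality through has_score.
    scored = {
        edge["source"]
        for edge in edges
        if edge["label"] == "has_score" and nodes.get(edge["target"]) == "QUALITY"
    }
    for node_id, label in sorted(nodes.items()):
        if label == "PRACTICE" and node_id not in scored:
            problems.append(f"practice {node_id} is not connected to any quality")
    return problems
```

An empty result means the edge list respects the intensional rule that every practice connects to a quality, along with the id and score property constraints.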

Lessons learned from the case studies
The formal basis for FLASc and its integration with the ETL design pattern suggest that data from heterogeneous sources can be transformed and loaded into several graph databases by using our approach. We consider three case studies related to tourism, big data analytics and cyber-physical systems, as presented in Sects. 5.2, 5.3 and 5.4 respectively. The only factor that differs in loading these three diverse data-sets is the parse process of the Extract stage.
As shown in Figs. 13 and 14, the parse process uses different APIs for reading data from heterogeneous sources. All other stages for loading data into the Neo4j graph database remain the same. Similarly, if data has to be transformed and loaded into a database other than Neo4j, only the Load stage needs to be altered so that APIs specific to the database platform can be utilized. The Transform stage remains the same in all the scenarios mentioned above. This demonstrates the generalizability of our approach, since the FLASc-integrated ETL design pattern can be used to load data-sets from heterogeneous sources into a graph database. Furthermore, our approach is not limited to a specific data-set format or a particular graph database. The use of FLASc for loading data-sets from heterogeneous sources becomes more evident when using the layered approach. As shown in Table 4, only a limited number of integrity constraints can be enforced in a layered approach without using FLASc. As shown in Table 3, structured data-sets such as the one provided in the Airbnb case study exist in the form of CSV files and contain intensional information as primary and foreign keys. However, semi-structured data provided in the diagrams and P2660.1 data-sets require predefined structural information for systematic transformation and loading. This intensional information is supplied by FLASc, hence ensuring data consistency and integrity while using the layered approach.
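The observation that only the parse process changes between data-sets can be sketched as a pipeline with a pluggable parser. The function names are hypothetical, and JSON stands in for a second source format purely for illustration; the case studies themselves use CSV, HTML and XLS files.

```python
import csv
import io
import json


def parse_csv(text):
    """Parse process for CSV input."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def parse_json(text):
    """Parse process for JSON input (a stand-in for another source format)."""
    return json.loads(text)


def transform(records, label):
    """Shared Transform stage: one parameterised query per record."""
    query = f"MERGE (n:{label} {{id: $id}}) SET n += $props"
    return [(query, {"id": record["id"], "props": record}) for record in records]


def run_etl(raw, parse, label, load):
    """Run Extract, Transform and Load; only `parse` depends on the source."""
    queries = transform(parse(raw), label)
    for query, params in queries:
        load(query, params)
    return len(queries)


CSV_TEXT = "id,name\n1,Loft\n2,Cabin\n"
JSON_TEXT = '[{"id": "1", "name": "Loft"}, {"id": "2", "name": "Cabin"}]'
```

Passing parse_csv or parse_json to run_etl produces the same queries for equivalent input, and replacing the load callable is all that would be needed to target a different database.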

Discussion, conclusion and future work
In this research, we present a formal algebra FLASc for generating robust graph schemas for labeled property graph databases. We illustrate the integration of FLASc with the Extract-Transform-Load design pattern that assists in systematic transformation and loading of data-sets from heterogeneous sources into graph databases such as Neo4j. Graph schemas generated by FLASc assist in specifying integrity constraints in the database creation scripts, ensuring data consistency and integrity.
Our approach presents the integration of conceptual, logical and physical modeling stages for graph databases. FLASc enables users to capture requirements of any given problem domain as basic conceptual graph schemas. The JOIN, DETACH and DELETE_NODE operators provided by FLASc can then be used to construct robust conceptual graph schemas from basic conceptual graph schemas. Properties associated with nodes and edges of a graph schema are specified at the logical modeling stage. Finally, in the physical modeling stage, the enforcement of integrity constraints and the design of database creation scripts are driven by FLASc-generated graph schemas.
The integration of FLASc with the Extract-Transform-Load design pattern illustrates the practical application of our approach. This is demonstrated by three diverse case studies related to cyber-physical systems, big data analytics and tourism, which also illustrate the generalizability of our approach. The intensional and extensional information captured in the graph schema assists in the Transform stage of the data loading process. This information can be used to enforce several integrity constraints on the data-sets being loaded into a graph database.
As shown in Table 4, FLASc facilitates the enforcement of several integrity constraints. We can observe that FLASc-generated graph schemas are useful in enforcing semantic constraints because such constraints require knowledge of relationships between entities in data-sets. Semantic constraints such as edge, graph and path pattern constraints cannot be enforced without knowledge about relationships in the data-set. As shown in Table 4, graph entity integrity constraints such as the edge property uniqueness constraint cannot be enforced in the integrated approach due to limitations in the Neo4j graph database. Furthermore, FLASc-generated logical graph schemas also enable a database designer to specify cardinality constraints on the edges of a graph schema. However, due to limitations in the Neo4j graph database, cardinality constraints cannot be enforced in the integrated approach. Such challenges can be mitigated in the layered approach by writing additional logic in programming languages such as Java or Python for specifying edge uniqueness and cardinality constraints.
The use of FLASc for loading data from heterogeneous sources becomes more evident while using the layered approach. As shown in Table 4, only a limited number of integrity constraints can be enforced in a layered approach without using FLASc. Support for integrity constraints such as node property uniqueness and mandatory node and edge property constraints is provided by Neo4j by default. Other constraints cannot be enforced without the intensional and extensional information captured in the graph schemas generated by FLASc. In the absence of a robustly defined graph schema, the capability to enforce integrity constraints depends on the underlying engine associated with a graph database.
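The additional logic mentioned for the layered approach could look like the following sketch, which checks edge property uniqueness and an upper bound on outgoing edges before any query is sent to the database. The function names, the edge dictionary layout and the max_out parameter are assumptions for illustration and do not reproduce the cardinality notation of the logical graph schema.

```python
from collections import Counter


def duplicate_edge_values(edges, key):
    """Return the values of `key` that occur on more than one edge."""
    counts = Counter(edge[key] for edge in edges if key in edge)
    return sorted(value for value, count in counts.items() if count > 1)


def cardinality_violations(edges, edge_label, max_out):
    """Return source nodes with more than `max_out` outgoing `edge_label` edges."""
    out_degree = Counter(
        edge["source"] for edge in edges if edge["label"] == edge_label
    )
    return sorted(source for source, count in out_degree.items() if count > max_out)
```

Both functions operate on the edge list produced by the Extract stage, so a violation can be reported before the Load stage runs.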

Limitations
As shown in Table 4, graph schemas generated by FLASc provide the ability to enforce several useful integrity constraints. However, other constraints such as relationship types are not covered in our approach. Relationship types represent the nature of relationships, such as inheritance, association, composition and realisation, between nodes of a graph database. The enforcement of such constraints is not supported by FLASc in its current state. Furthermore, FLASc cannot be compared with other conceptual modeling tools such as entity-relationship diagrams (ERD) and unified modeling language (UML) diagrams, as these tools support the specification of relationship types.
The main motive of FLASc is to assist in the design of robust conceptual graph schemas so that the soundness of logical and physical graph schemas can be ensured. FLASc-generated conceptual graph schemas can precisely capture the intensional information. Relationship types are edge-related properties (Angles 2018); hence they can be classified as extensional information. These properties can be easily captured in the logical graph schema. For instance, by altering Definition 7, the logical graph schema can be enriched to support extensional information such as relationship types.

Conclusion and future work
The scope of our study is limited to the Neo4j graph database. Therefore, the performance evaluation of using our approach for transforming and loading data-sets into other graph databases is not discussed. We consider this as future work, where FLASc can be utilised for evaluating the coverage of integrity constraints offered by other graph databases provided by vendors such as Oracle (2021), Apache Tinkerpop (2021) and TigerGraph (2020). We intend to work on extending FLASc to support other integrity constraints such as relationship types and functional dependencies. The support of such constraints can enable FLASc to represent visual models expressed in languages such as Entity Relationship Diagram (ERD), Unified Modeling Language (UML) and System Modeling Language (SysML).
Moreover, using the FLASc-extended ETL design pattern, visual models expressed as ERD, UML or SysML diagrams related to software development projects can be imported into graph databases. Storing software development visual models in graph databases provides the additional advantages of tractability and efficient database manageability, such as automatically identifying inconsistencies across all project diagrams.
In its current state, our formal algebra FLASc supports the creation of robustly defined graph schemas that capture the intensional and extensional information. A natural extension to this work is the proposal of a formal schema creation language. We intend to combine our novel query language proposed in Sharma et al. (2021) with FLASc to propose a graph schema creation language. In Sharma et al. (2021) we propose the novel formalisms of conjunctive queries and union of conjunctive queries extended with Tarski's algebra (CQT/UCQT) for extracting data stored in a graph database. This language can be further combined with FLASc for creating a novel graph schema creation language. A main advantage of such an approach is the ability to use a restricted form of first-order logic (conjunctive queries) while defining a graph schema, which also makes our approach compatible with the object role modeling language proposed in Halpin (2005). This will further assist in the industry-wide initiative of standardizing query languages for graph databases.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1 Conceptual graph schema generated for Airbnb case study

Fig. 2 Fig. 3
Fig. 2 The application of JOIN operator to connect two conceptual graph schemas

Fig. 4 The application of DELETE_NODE operator to delete a node from a conceptual graph schema

Fig. 5 The application of JOIN and DETACH operators to alter an existing edge

Fig. 6 Logical graph schema generated for Airbnb case study

Query 2
Cypher query to enforce mandatory node property constraint
CREATE CONSTRAINT listing_host_id IF NOT EXISTS ON (list:listing) ASSERT EXISTS (list.hostid)

Query 4
Cypher query to enforce node key property constraint
CREATE CONSTRAINT ON (u:user)
ASSERT (u.user_id, u.name) IS NODE KEY
As shown in Query 4, the IS NODE KEY keywords used along with the ASSERT clause enforce that the combination of the properties user_id and name is unique and that both properties must have a value associated with them in the graph database.
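Constraint statements of the kind shown in these queries can be derived mechanically from the property sets of a schema node. The sketch below is illustrative only: the function name and argument layout are assumptions, and the generated text mirrors the ON ... ASSERT form used in the queries here, which newer Neo4j releases replace with FOR ... REQUIRE.

```python
def constraint_statements(label, mandatory=(), unique=(), node_key=()):
    """Generate Cypher constraint statements for one node label.

    mandatory and unique are property names; node_key is the group of
    properties that together form a node key (empty for none).
    """
    statements = []
    for prop in unique:
        statements.append(
            f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"
        )
    for prop in mandatory:
        statements.append(
            f"CREATE CONSTRAINT ON (n:{label}) ASSERT EXISTS (n.{prop})"
        )
    if node_key:
        keys = ", ".join(f"n.{prop}" for prop in node_key)
        statements.append(
            f"CREATE CONSTRAINT ON (n:{label}) ASSERT ({keys}) IS NODE KEY"
        )
    return statements
```

Each returned string is one statement of a database creation script, so the script stays in step with the schema whenever the property sets change.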

Fig. 7 Physical view of Schema driven layered approach

Fig. 10 ETL stages shown as Data flow diagram to upload diagram data-set into Neo4j

Fig. 13
Fig. 12 ETL stages shown as Data flow diagram to upload P2660.1 data-set into Neo4j
Fig. 14 ETL stages shown as Data flow diagram to upload P2660.1 data-set into Neo4j

Let $G_1 = (N_1, E_1, \lambda_1, \mu_1, S_1, T_1)$ where $L_{N_1}$ is a set of node labels and $L_{E_1}$ is a set of edge labels associated with $G_1$. Let $G_2 = (N_2, E_2, \lambda_2, \mu_2, S_2, T_2)$ where $L_{N_2}$ is a set of node labels and $L_{E_2}$ is a set of edge labels associated with $G_2$.
• The optional property labeling function maps all nodes and edges to the powerset, represented as $\mathcal{P}(\Sigma_o)$, of the optional property set.
• $\gamma_s : E_s \to \Gamma_s$ is a cardinality labeling function that maps all edges to a set of cardinalities.

Table 1
Axiomatic specifications of operators in FLASc

• $N_d$ is a finite set of nodes and $E_d$ is a finite set of edges of the graph database.
• $(N_d, E_d, S_d, T_d)$ is a directed multigraph as discussed in Definition 1.
• $\Sigma_{dm}$ and $\Sigma_{do}$ are mandatory and optional property sets associated with the graph database.
• $\lambda_d : N_d \to L_N$ is a node labeling function which maps all nodes to labels in the set of node labels $L_N$.
• $\mu_d : E_d \to L_E$ is an edge labeling function which maps all edges to labels in the set of edge labels $L_E$.
• $\Delta_{dm} : (N_d \cup E_d) \to \mathcal{P}^{+}(\Sigma_{dm})$ is a property labeling function which maps all nodes and/or edges to all subsets (excluding the empty set) of the mandatory property set $\Sigma_{dm}$.
• $\Delta_{do} : (N_d \cup E_d) \to \mathcal{P}(\Sigma_{do})$ is a property labeling function which maps all nodes and/or edges to all subsets (including the empty set) of the optional property set $\Sigma_{do}$.
Let $G_d = (N_d, E_d, \Sigma_{dm}, \Sigma_{do}, \lambda_d, \mu_d, S_d, T_d, \Delta_{dm}, \Delta_{do})$ be a graph database as defined in Definition 11 and $G_l = (N_s, E_s, \Sigma_{sm}, \Sigma_{so}, \lambda_s, \mu_s, \Gamma_s, S_s, T_s, \Delta_{sm}, \Delta_{so}, \gamma_s)$ a labeled property graph schema as defined in Definition 7. We say that $G_d$ is consistent with $G_l$ when:
• For each node $n \in N_d$, there must exist a corresponding node $n' \in N_s$ in the graph schema such that $\lambda_d(n) = \lambda_s(n')$.
• For each edge $e_i \in G_d$, there must exist a corresponding edge in the graph schema.
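The consistency conditions can be illustrated with a small sketch. The node condition follows the statement above directly. The edge condition is cut off in the text, so the check used here (a schema edge with the same label between the same source and target labels) is an assumption, as are the function name and the data layout.

```python
def is_consistent(db_nodes, db_edges, schema_nodes, schema_edges):
    """Check a graph database against a graph schema.

    db_nodes maps node id to label; db_edges is a list of
    (source id, edge label, target id) triples; schema_nodes is a set of
    node labels; schema_edges is a set of
    (source label, edge label, target label) triples.
    """
    # Node condition: every database node label occurs in the schema.
    for label in db_nodes.values():
        if label not in schema_nodes:
            return False
    # Edge condition (assumed form): the schema has a matching edge
    # between the labels of the edge's endpoints.
    for source, edge_label, target in db_edges:
        if (db_nodes[source], edge_label, db_nodes[target]) not in schema_edges:
            return False
    return True
```

For example, a database with a TASK node pointing to another TASK node through a PR edge is consistent with a schema that contains the triple (TASK, PR, TASK), and inconsistent with one that does not.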

Table 3
Sample data from listing.csv in the Airbnb data-set

Query 1
Cypher query to enforce node property uniqueness constraint

Table 4
Coverage of integrity constraints